Date: Sat, 31 Mar 2012 16:19:53 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: fast hashes on GPU

myrice -

Here's an extremely rough estimate for the desired speed.

On CPU we get:

On Sat, Mar 31, 2012 at 12:17:54PM +0400, Solar Designer wrote:
> CPU OpenMP 8 threads (probably about 3.7 GHz, the sensor reports 124 W):
>
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [64/64]... (8xOMP) DONE
> Many salts:     10846K c/s real, 1364K c/s virtual
> Only one salt:  8169K c/s real, 1018K c/s virtual

On GPU with some tuning of your code, I got:

On Sat, Mar 31, 2012 at 02:56:38PM +0400, Solar Designer wrote:
> PLAINTEXT_LENGTH 12 gives:
>
> Many salts:     32011K c/s real, 32011K c/s virtual
> Only one salt:  23086K c/s real, 23086K c/s virtual

Now, on CPU we run 64-bit ALU code (we don't use SIMD yet) with 8
hardware threads (this CPU actually has separate ALUs per thread).
On this GPU, we have 32-bit ALUs and 480 threads.  So the expected speed
for GPU code that would be as (in)optimal as our current CPU code (not
considering the SIMD and XOP potential, which is irrelevant to this
estimate) is roughly:

11 / 8 / 3700 * 480 * 1600 / 2 = 142 million c/s

This is consistent with the published speed for oclHashcat-* on an
equivalent card - around 270M for SHA-256 - assuming that SHA-512 runs
at roughly half the speed of SHA-256 on a 32-bit architecture.

We currently get up to 32 million (and that's with dirty hacks).  So a
reasonable goal is for you to make the code 4 to 5 times faster.

...Oh, I just got it to:

Many salts:     38062K c/s real, 38062K c/s virtual
Only one salt:  26270K c/s real, 26270K c/s virtual

by simply adding "#pragma unroll 64" before the last loop in
sha512_block().  A further 3.7x speedup is still desired. ;-)

The next change to make may be to remove the init/update/final and to
invoke the compression function directly.  Also, hard-code the initial
value assignments right in there.
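
For reference, the rough estimate above can be reproduced with a few
lines of Python.  This is only a sketch of the arithmetic: the function
and parameter names are illustrative, 1600 is read here as the GPU clock
in MHz, and the final division by 2 is read as the assumed penalty for
doing 64-bit SHA-512 operations on 32-bit ALUs.

```python
# Back-of-the-envelope GPU target, using only the rough figures quoted
# in the message.  Names and the interpretation of each factor are
# assumptions made for illustration.

def estimate_gpu_mcps(cpu_mcps, cpu_threads, cpu_mhz,
                      gpu_threads, gpu_mhz, width_penalty):
    """Scale per-thread, per-MHz CPU throughput up to the GPU.

    width_penalty is the assumed slowdown from running 64-bit
    arithmetic on 32-bit ALUs (2 in the message).
    """
    per_thread_per_mhz = cpu_mcps / cpu_threads / cpu_mhz
    return per_thread_per_mhz * gpu_threads * gpu_mhz / width_penalty


# ~11M c/s on 8 CPU threads at ~3700 MHz; 480 GPU threads at 1600 MHz.
target = estimate_gpu_mcps(11, 8, 3700, 480, 1600, 2)

# Speeds measured on the GPU so far (many salts), in millions of c/s.
before_unroll = 32.011
after_unroll = 38.062

print(f"target:                     {target:.1f}M c/s")
print(f"speedup needed from 32011K: {target / before_unroll:.2f}x")
print(f"speedup needed from 38062K: {target / after_unroll:.2f}x")
```

The two ratios correspond to the "4 to 5 times faster" goal and to the
"further 3.7x" still desired after the unroll change.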

And indeed you need to stop re-transferring the same candidate passwords
when only the salt has changed.

I think the desired 142M c/s may fit in the current formats interface
for the "many salts" case.  We're currently getting twice that speed for
CRC-32 on CPU:

Benchmarking: CRC-32 [32/64]... (4xOMP) DONE
Many salts:     281067K c/s real, 70174K c/s virtual
Only one salt:  56844K c/s real, 14271K c/s virtual

or even:

Benchmarking: CRC-32 [32/64]... (8xOMP) DONE
Many salts:     378031K c/s real, 47420K c/s virtual
Only one salt:  37877K c/s real, 4734K c/s virtual

So the interface allows for that.

...after many of these optimizations, we may also proceed to reverse the
last few rounds of SHA-512 itself, including on CPU.

Alexander