john-dev - Re: fast hashes on GPU

Message-ID: <20120331121953.GA1498@openwall.com>
Date: Sat, 31 Mar 2012 16:19:53 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: fast hashes on GPU

myrice -

Here's an extremely rough estimate for the desired speed.  On CPU we get:

On Sat, Mar 31, 2012 at 12:17:54PM +0400, Solar Designer wrote:
> CPU OpenMP 8 threads (probably about 3.7 GHz, the sensor reports 124 W):
> 
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [64/64]... (8xOMP) DONE
> Many salts:     10846K c/s real, 1364K c/s virtual
> Only one salt:  8169K c/s real, 1018K c/s virtual

On GPU with some tuning of your code, I got:

On Sat, Mar 31, 2012 at 02:56:38PM +0400, Solar Designer wrote:
> PLAINTEXT_LENGTH 12 gives:
> 
> Many salts:     32011K c/s real, 32011K c/s virtual
> Only one salt:  23086K c/s real, 23086K c/s virtual

Now, on CPU we run 64-bit ALU code (we don't use SIMD yet) with 8
hardware threads (this CPU actually has separate ALUs per thread).
On this GPU, we have 32-bit ALUs and 480 threads.  So the expected speed
for GPU code that would be as (sub)optimal as our current CPU code
(not considering the SIMD and XOP potential, which is irrelevant to this
estimate) roughly is:

11 / 8 / 3700 * 480 * 1600 / 2 = 142 million c/s

This is consistent with the published speed for oclHashcat-* on an
equivalent card - around 270M for SHA-256 - assuming that SHA-512 runs
at roughly half the speed of SHA-256 when we're on a 32-bit architecture.

We currently get up to 32 million (and that's with dirty hacks).  So a
reasonable goal is for you to make the code 4 to 5 times faster.
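
To make the arithmetic explicit, here is a minimal sketch in C.  The
function and parameter names are made up for illustration, and it
assumes that the 3700 and 1600 in the formula are the CPU and GPU clock
rates in MHz and that the final division by 2 is the penalty for doing
64-bit operations on 32-bit ALUs:

```c
/* Scale a measured CPU rate (in millions of c/s) to a GPU:
   per-thread, per-MHz throughput times the GPU's thread count and
   clock, divided by a penalty for emulating 64-bit ops on 32-bit ALUs.
   All parameter meanings are assumptions of this sketch. */
static double estimate_gpu_mcps(double cpu_mcps, int cpu_threads,
                                double cpu_mhz, int gpu_threads,
                                double gpu_mhz, double width_penalty)
{
    double per_thread_per_mhz = cpu_mcps / cpu_threads / cpu_mhz;

    return per_thread_per_mhz * gpu_threads * gpu_mhz / width_penalty;
}
```

With the figures above, estimate_gpu_mcps(11.0, 8, 3700.0, 480, 1600.0,
2.0) comes to between 142 and 143, which is about 4.5 times the
32 million c/s measured so far.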

...Oh, I just got it to:

Many salts:     38062K c/s real, 38062K c/s virtual
Only one salt:  26270K c/s real, 26270K c/s virtual

by simply adding "#pragma unroll 64" before the last loop in
sha512_block().
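
For illustration only, this is the kind of placement involved.  The loop
below is not copied from the actual sha512_block(); it is the standard
SHA-512 message schedule expansion, used as a stand-in because it also
has a fixed trip count of 64.  The function name sha512_expand and the
preprocessor guards (there so that the same sketch also builds as plain
C) are assumptions of this sketch:

```c
#include <stdint.h>

#define ROTR(x, n) (((x) >> (n)) | ((x) << (64 - (n))))

/* Expand the 16 message words in w[0..15] into the 80-word schedule. */
static void sha512_expand(uint64_t w[80])
{
    int i;

    /* Fixed trip count of 64, so the compiler can unroll it fully.
       nvcc and clang accept "#pragma unroll N"; GCC 8+ uses its own
       spelling.  Other compilers just get the plain loop. */
#if defined(__CUDACC__) || defined(__clang__)
#pragma unroll 64
#elif defined(__GNUC__) && __GNUC__ >= 8
#pragma GCC unroll 64
#endif
    for (i = 16; i < 80; i++) {
        uint64_t s0 = ROTR(w[i - 15], 1) ^ ROTR(w[i - 15], 8) ^ (w[i - 15] >> 7);
        uint64_t s1 = ROTR(w[i - 2], 19) ^ ROTR(w[i - 2], 61) ^ (w[i - 2] >> 6);

        w[i] = w[i - 16] + s0 + w[i - 7] + s1;
    }
}
```

Whether full unrolling helps depends on the compiler and on register
pressure, so it is worth benchmarking with and without the pragma.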

A further 3.7x speedup is still desired. ;-)

The next change to make may be to remove the init/update/final and to
invoke the compression function directly.  Also, hard-code the initial
value assignments right in there.
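
A minimal sketch of what that could look like in plain C, assuming the
salt plus password is shorter than 112 bytes so that everything fits in
one 128-byte block.  The name sha512_one_block is hypothetical, this is
not the code from the patch, and a GPU kernel would more likely build
w[] directly rather than go through a byte buffer first:

```c
#include <stdint.h>
#include <string.h>

#define ROTR(x, n) (((x) >> (n)) | ((x) << (64 - (n))))

static const uint64_t K[80] = {
    0x428a2f98d728ae22ULL, 0x7137449123ef65cdULL, 0xb5c0fbcfec4d3b2fULL, 0xe9b5dba58189dbbcULL,
    0x3956c25bf348b538ULL, 0x59f111f1b605d019ULL, 0x923f82a4af194f9bULL, 0xab1c5ed5da6d8118ULL,
    0xd807aa98a3030242ULL, 0x12835b0145706fbeULL, 0x243185be4ee4b28cULL, 0x550c7dc3d5ffb4e2ULL,
    0x72be5d74f27b896fULL, 0x80deb1fe3b1696b1ULL, 0x9bdc06a725c71235ULL, 0xc19bf174cf692694ULL,
    0xe49b69c19ef14ad2ULL, 0xefbe4786384f25e3ULL, 0x0fc19dc68b8cd5b5ULL, 0x240ca1cc77ac9c65ULL,
    0x2de92c6f592b0275ULL, 0x4a7484aa6ea6e483ULL, 0x5cb0a9dcbd41fbd4ULL, 0x76f988da831153b5ULL,
    0x983e5152ee66dfabULL, 0xa831c66d2db43210ULL, 0xb00327c898fb213fULL, 0xbf597fc7beef0ee4ULL,
    0xc6e00bf33da88fc2ULL, 0xd5a79147930aa725ULL, 0x06ca6351e003826fULL, 0x142929670a0e6e70ULL,
    0x27b70a8546d22ffcULL, 0x2e1b21385c26c926ULL, 0x4d2c6dfc5ac42aedULL, 0x53380d139d95b3dfULL,
    0x650a73548baf63deULL, 0x766a0abb3c77b2a8ULL, 0x81c2c92e47edaee6ULL, 0x92722c851482353bULL,
    0xa2bfe8a14cf10364ULL, 0xa81a664bbc423001ULL, 0xc24b8b70d0f89791ULL, 0xc76c51a30654be30ULL,
    0xd192e819d6ef5218ULL, 0xd69906245565a910ULL, 0xf40e35855771202aULL, 0x106aa07032bbd1b8ULL,
    0x19a4c116b8d2d0c8ULL, 0x1e376c085141ab53ULL, 0x2748774cdf8eeb99ULL, 0x34b0bcb5e19b48a8ULL,
    0x391c0cb3c5c95a63ULL, 0x4ed8aa4ae3418acbULL, 0x5b9cca4f7763e373ULL, 0x682e6ff3d6b2b8a3ULL,
    0x748f82ee5defb2fcULL, 0x78a5636f43172f60ULL, 0x84c87814a1f0ab72ULL, 0x8cc702081a6439ecULL,
    0x90befffa23631e28ULL, 0xa4506cebde82bde9ULL, 0xbef9a3f7b2c67915ULL, 0xc67178f2e372532bULL,
    0xca273eceea26619cULL, 0xd186b8c721c0c207ULL, 0xeada7dd6cde0eb1eULL, 0xf57d4f7fee6ed178ULL,
    0x06f067aa72176fbaULL, 0x0a637dc5a2c898a6ULL, 0x113f9804bef90daeULL, 0x1b710b35131c471bULL,
    0x28db77f523047d84ULL, 0x32caab7b40c72493ULL, 0x3c9ebe0a15c9bebcULL, 0x431d67c49c100d4cULL,
    0x4cc5d4becb3e42b6ULL, 0x597f299cfc657e2aULL, 0x5fcb6fab3ad6faecULL, 0x6c44198c4a475817ULL
};

/* Initial hash values, compiled in rather than set up by an init() call. */
static const uint64_t IV[8] = {
    0x6a09e667f3bcc908ULL, 0xbb67ae8584caa73bULL, 0x3c6ef372fe94f82bULL, 0xa54ff53a5f1d36f1ULL,
    0x510e527fade682d1ULL, 0x9b05688c2b3e6c1fULL, 0x1f83d9abfb41bd6bULL, 0x5be0cd19137e2179ULL
};

/* SHA-512 of a message with len < 112 (caller must guarantee this):
   one call to the compression function, no init/update/final. */
static void sha512_one_block(const unsigned char *msg, size_t len, uint64_t out[8])
{
    unsigned char block[128] = {0};
    uint64_t w[80], a, b, c, d, e, f, g, h;
    int i, j;

    /* Padding inline: message, 0x80, zeros, big-endian bit length. */
    memcpy(block, msg, len);
    block[len] = 0x80;
    block[126] = (unsigned char)((len * 8) >> 8);
    block[127] = (unsigned char)(len * 8);

    for (i = 0; i < 16; i++) {          /* load big-endian words */
        w[i] = 0;
        for (j = 0; j < 8; j++)
            w[i] = (w[i] << 8) | block[8 * i + j];
    }
    for (i = 16; i < 80; i++) {         /* message schedule */
        uint64_t s0 = ROTR(w[i - 15], 1) ^ ROTR(w[i - 15], 8) ^ (w[i - 15] >> 7);
        uint64_t s1 = ROTR(w[i - 2], 19) ^ ROTR(w[i - 2], 61) ^ (w[i - 2] >> 6);
        w[i] = w[i - 16] + s0 + w[i - 7] + s1;
    }

    a = IV[0]; b = IV[1]; c = IV[2]; d = IV[3];
    e = IV[4]; f = IV[5]; g = IV[6]; h = IV[7];

    for (i = 0; i < 80; i++) {          /* 80 rounds */
        uint64_t t1 = h + (ROTR(e, 14) ^ ROTR(e, 18) ^ ROTR(e, 41))
                    + ((e & f) ^ (~e & g)) + K[i] + w[i];
        uint64_t t2 = (ROTR(a, 28) ^ ROTR(a, 34) ^ ROTR(a, 39))
                    + ((a & b) ^ (a & c) ^ (b & c));
        h = g; g = f; f = e; e = d + t1;
        d = c; c = b; b = a; a = t1 + t2;
    }

    out[0] = IV[0] + a; out[1] = IV[1] + b; out[2] = IV[2] + c; out[3] = IV[3] + d;
    out[4] = IV[4] + e; out[5] = IV[5] + f; out[6] = IV[6] + g; out[7] = IV[7] + h;
}
```

The point of this shape is that there is no context structure, no
buffering logic and no length bookkeeping left at run time: for short
inputs all of that collapses into the fixed padding at the top.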

And indeed you need to stop re-transferring the same candidate passwords
when only the salt has changed.
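
One simple way to do that is a dirty flag: mark the keys as changed in
set_key() and upload them in crypt_all() only if the flag is set.  The
sketch below uses simplified versions of the set_key/set_salt/crypt_all
methods, a plain memcpy() as a stand-in for the host-to-device copy, and
a 4-byte salt; all of these are assumptions made for illustration:

```c
#include <string.h>

#define MAX_KEYS 4
#define PLAINTEXT_LENGTH 12
#define SALT_SIZE 4             /* assumed salt size for this sketch */

static char host_keys[MAX_KEYS][PLAINTEXT_LENGTH + 1];
static char dev_keys[MAX_KEYS][PLAINTEXT_LENGTH + 1];  /* stands in for GPU memory */
static char dev_salt[SALT_SIZE];
static int keys_dirty;          /* set when a candidate changed since the last upload */
static int key_uploads;         /* counts simulated host-to-device key copies */

static void set_key(const char *key, int index)
{
    strncpy(host_keys[index], key, PLAINTEXT_LENGTH);
    host_keys[index][PLAINTEXT_LENGTH] = 0;
    keys_dirty = 1;
}

static void set_salt(const char *salt)
{
    /* The salt is tiny, so sending it on every change is cheap. */
    memcpy(dev_salt, salt, SALT_SIZE);
}

static void crypt_all(int count)
{
    if (keys_dirty) {
        /* Real code would do a host-to-device copy here. */
        memcpy(dev_keys, host_keys, (size_t)count * sizeof(host_keys[0]));
        key_uploads++;
        keys_dirty = 0;
    }
    /* ... launch the kernel on dev_keys and dev_salt here ... */
}
```

Calling crypt_all() twice with a set_salt() in between then uploads the
keys only once; a new set_key() triggers the next upload.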

I think the desired 142M c/s may fit in the current formats interface
for the "many salts" case.  We're currently getting about twice that
speed for CRC-32 on CPU:

Benchmarking: CRC-32 [32/64]... (4xOMP) DONE
Many salts:     281067K c/s real, 70174K c/s virtual
Only one salt:  56844K c/s real, 14271K c/s virtual

or even:

Benchmarking: CRC-32 [32/64]... (8xOMP) DONE
Many salts:     378031K c/s real, 47420K c/s virtual
Only one salt:  37877K c/s real, 4734K c/s virtual

So the interface permits that.

...after many of these optimizations, we may also proceed to reverse the
last few rounds of SHA-512 itself, including on CPU.

Alexander
