john-dev - Re: fast hashes on GPU


Message-ID: <20120331105638.GA1158@openwall.com>
Date: Sat, 31 Mar 2012 14:56:38 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: fast hashes on GPU

myrice -

On Sat, Mar 31, 2012 at 12:17:54PM +0400, Solar Designer wrote:
> CPU OpenMP 8 threads (probably about 3.7 GHz, the sensor reports 124 W):
> 
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [64/64]... (8xOMP) DONE
> Many salts:     10846K c/s real, 1364K c/s virtual
> Only one salt:  8169K c/s real, 1018K c/s virtual
> 
> Your GPU code (on the GTX-570 1600 MHz):
> 
> $ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=xsha512-cuda
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [64/64]... DONE
> Many salts:     6397K c/s real, 6397K c/s virtual
> Only one salt:  5901K c/s real, 5901K c/s virtual

With cuda_xsha512.h edited to use:

#define BLOCKS 1024
#define THREADS 480

I got the speed up to:

Many salts:     11720K c/s real, 11811K c/s virtual
Only one salt:  9830K c/s real, 8192K c/s virtual

which is finally (very) slightly faster than our current CPU code
(although something like a 3x speedup should be possible on CPU by
computing two hashes at once in SIMD vectors and by using XOP's bit
rotates).

BTW, somehow cuda_xsha512_fmt.c uses DOS linefeeds.  Are you developing
on Windows?

There's a bug in init() - sizeof(xsha512_key) is used in place of
sizeof(xsha512_hash).
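
To illustrate why the wrong sizeof() matters, here is a minimal sketch in C.
The struct layouts, the PLAINTEXT_LENGTH value, and the helper names
out_bytes_buggy() and out_bytes_fixed() are assumptions made up for this
illustration; they are not copied from cuda_xsha512.h or init().  The point
is only that a buffer meant to hold hashes should be sized by the hash type,
since the two types generally differ in size.

```c
#include <stddef.h>
#include <stdint.h>

#define PLAINTEXT_LENGTH 107 /* hypothetical value, for illustration only */
#define BINARY_SIZE 64       /* a full SHA-512 digest is 64 bytes */

/* Hypothetical layouts; the real definitions may differ. */
typedef struct {
    uint8_t length;
    char v[PLAINTEXT_LENGTH + 1];
} xsha512_key;

typedef struct {
    uint64_t v[BINARY_SIZE / 8];
} xsha512_hash;

/* Buggy pattern: the output (hash) buffer is sized by the input (key) type. */
size_t out_bytes_buggy(size_t count)
{
    return count * sizeof(xsha512_key);
}

/* Fixed pattern: the output buffer is sized by the type it actually holds. */
size_t out_bytes_fixed(size_t count)
{
    return count * sizeof(xsha512_hash);
}
```

With these assumed layouts the buggy size is larger than needed, which would
waste memory and possibly transfer time; with a shorter key struct it would
be too small instead, which is worse.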

In xsha512.cu, I had to comment out #include "cuPrintf.cu".

With the sizeof() bug corrected and PLAINTEXT_LENGTH reduced to 20
(arbitrary), I got the speed further up to:

Many salts:     19114K c/s real, 19114K c/s virtual
Only one salt:  15639K c/s real, 15639K c/s virtual

Oh, I also removed the duplicate definitions of PLAINTEXT_LENGTH,
BINARY_SIZE, and SALT_SIZE from cuda_xsha512_fmt.c (they are already
defined in cuda_xsha512.h).

Then I reduced BINARY_SIZE to 8 (a lower value that still makes false
positives very unlikely) and made the corresponding changes to the
definition of struct xsha512_hash, the loop in cmp_one(), and the loop
in CUDA xsha512().  This reduces the amount of data transferred from
the GPU.  The speed went up to:

Many salts:     29212K c/s real, 29212K c/s virtual
Only one salt:  21767K c/s real, 21767K c/s virtual
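
As a rough check on how much result data this saves, the arithmetic can be
sketched as follows.  BLOCKS and THREADS are the values given earlier in this
message; that the number of hashes per kernel invocation equals
BLOCKS * THREADS, and the helper name result_bytes(), are assumptions for
this sketch.

```c
#include <stddef.h>

#define BLOCKS 1024
#define THREADS 480

/* Assumption: one hash is computed per GPU thread per invocation. */
#define KEYS_PER_CRYPT ((size_t)BLOCKS * THREADS)

/* Bytes of hash output copied back from the GPU per invocation. */
size_t result_bytes(size_t binary_size)
{
    return KEYS_PER_CRYPT * binary_size;
}
```

Under that assumption, result_bytes(64) is 31457280 (30 MiB) per invocation,
while result_bytes(8) is 3932160 (3.75 MiB), an 8x reduction in the
device-to-host transfer.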

To implement this properly, we need to transfer just 4 or 8 bytes per
hash initially, but if cmp_all() detects potential matches, then it or
cmp_one() should request the remaining bytes from the GPU (or maybe it
should simply re-compute the hash on the CPU).  (This is assuming we
don't try to offload hash comparisons onto the GPU yet.)
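
The two-stage comparison described above might look roughly like this on the
host side.  This is only a sketch: the function names, the flat array of
8-byte partial results, and the idea that the caller supplies the full
64-byte hash for stage 2 (whether fetched from the GPU or re-computed on the
CPU) are assumptions, not the format's actual interface.  It also assumes
the partial values have the same byte layout as the first 8 bytes of the
stored binary hash.

```c
#include <stdint.h>
#include <string.h>

#define FULL_SIZE 64    /* full SHA-512 digest */
#define PARTIAL_SIZE 8  /* bytes transferred per hash in stage 1 */

/* Stage 1: does any candidate's partial hash match the target's prefix? */
int cmp_all_partial(const uint8_t *target, const uint64_t *partial, int count)
{
    uint64_t want;
    memcpy(&want, target, PARTIAL_SIZE);
    for (int i = 0; i < count; i++)
        if (partial[i] == want)
            return 1;
    return 0;
}

/* Stage 1, single candidate: compare only the 8-byte prefix. */
int cmp_one_partial(const uint8_t *target, const uint64_t *partial, int index)
{
    uint64_t want;
    memcpy(&want, target, PARTIAL_SIZE);
    return partial[index] == want;
}

/*
 * Stage 2: confirm a potential match against the complete hash, which the
 * caller has either requested from the GPU or re-computed on the CPU.
 */
int cmp_exact_full(const uint8_t *target, const uint8_t *full)
{
    return memcmp(target, full, FULL_SIZE) == 0;
}
```

Stage 2 only runs for the rare candidates that pass stage 1, so its cost
should be negligible as long as the partial size keeps false positives
unlikely.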

Some further experiments:

PLAINTEXT_LENGTH 16 gives:

Many salts:     30358K c/s real, 30358K c/s virtual
Only one salt:  22407K c/s real, 22407K c/s virtual

PLAINTEXT_LENGTH 12 gives:

Many salts:     32011K c/s real, 32011K c/s virtual
Only one salt:  23086K c/s real, 23086K c/s virtual

(of course, this is a nasty limitation for actual use).

Alexander
