Date: Sat, 31 Mar 2012 14:56:38 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: fast hashes on GPU

myrice -

On Sat, Mar 31, 2012 at 12:17:54PM +0400, Solar Designer wrote:
> CPU OpenMP 8 threads (probably about 3.7 GHz, the sensor reports 124 W):
>
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [64/64]... (8xOMP) DONE
> Many salts:     10846K c/s real, 1364K c/s virtual
> Only one salt:  8169K c/s real, 1018K c/s virtual
>
> Your GPU code (on the GTX-570 1600 MHz):
>
> $ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=xsha512-cuda
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [64/64]... DONE
> Many salts:     6397K c/s real, 6397K c/s virtual
> Only one salt:  5901K c/s real, 5901K c/s virtual

With cuda_xsha512.h edited to use:

#define BLOCKS 1024
#define THREADS 480

I got the speed to:

Many salts:     11720K c/s real, 11811K c/s virtual
Only one salt:  9830K c/s real, 8192K c/s virtual

which is finally (very) slightly faster than our current CPU code
(although by computing two hashes in SIMD vectors and by using XOP's bit
rotates something like a 3x speedup should be possible on CPU).

BTW, somehow cuda_xsha512_fmt.c uses DOS linefeeds.  Are you developing
on Windows?

There's a bug in init() - sizeof(xsha512_key) is used in place of
sizeof(xsha512_hash).

In xsha512.cu, I had to comment out #include "cuPrintf.cu".

With the sizeof() bug corrected and PLAINTEXT_LENGTH reduced to 20
(arbitrary), I got the speed further up to:

Many salts:     19114K c/s real, 19114K c/s virtual
Only one salt:  15639K c/s real, 15639K c/s virtual

Oh, I also removed the duplicate definitions of PLAINTEXT_LENGTH,
BINARY_SIZE, and SALT_SIZE from cuda_xsha512_fmt.c (they are already
defined in cuda_xsha512.h).

Then I reduced BINARY_SIZE to 8 (a lower value that still makes false
positives very unlikely) and made other relevant changes to struct
xsha512_hash definition, the loop in cmp_one(), and the loop in CUDA
xsha512().
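
To make the BINARY_SIZE change concrete, here is a minimal C sketch of a truncated hash struct and the matching comparison loop.  It is only an illustration: the struct layout, cmp_one_truncated() and transfer_bytes() are hypothetical and not taken from cuda_xsha512.h or cuda_xsha512_fmt.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FULL_DIGEST_SIZE 64 /* SHA-512 digest size in bytes */
#define BINARY_SIZE 8       /* bytes kept per hash after truncation */

/* Hypothetical layout; the real struct lives in cuda_xsha512.h. */
typedef struct {
    uint64_t v[BINARY_SIZE / sizeof(uint64_t)];
} xsha512_hash;

/* Compare one computed (truncated) hash against the loaded binary,
 * looping over however many 64-bit words BINARY_SIZE leaves. */
int cmp_one_truncated(const xsha512_hash *computed, const uint64_t *binary)
{
    for (size_t i = 0; i < BINARY_SIZE / sizeof(uint64_t); i++)
        if (computed->v[i] != binary[i])
            return 0;
    return 1;
}

/* Bytes copied back from the GPU for `count` hashes of `bytes_per_hash`
 * bytes each (hypothetical helper, for the arithmetic only). */
size_t transfer_bytes(size_t count, size_t bytes_per_hash)
{
    /* Guard against size_t overflow in the multiplication. */
    assert(bytes_per_hash == 0 || count <= SIZE_MAX / bytes_per_hash);
    return count * bytes_per_hash;
}
```

Assuming one hash per thread (the message does not state this), BLOCKS * THREADS = 1024 * 480 = 491520 hashes per batch.  transfer_bytes(491520, FULL_DIGEST_SIZE) is then 31457280 bytes (30 MiB), while transfer_bytes(491520, BINARY_SIZE) is 3932160 bytes (3.75 MiB).
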
This reduces the amount of data to transfer from GPU.  The speed went
up to:

Many salts:     29212K c/s real, 29212K c/s virtual
Only one salt:  21767K c/s real, 21767K c/s virtual

To implement this properly, we need to transfer just 4 or 8 bytes per
hash initially, but if cmp_all() detects potential matches, then it or
cmp_one() should request the remaining bytes from the GPU (or maybe it
should simply re-compute the hash on CPU).  (This is assuming we don't
try to offload hash comparisons onto GPU yet.)

Some further experiments:

PLAINTEXT_LENGTH 16 gives:

Many salts:     30358K c/s real, 30358K c/s virtual
Only one salt:  22407K c/s real, 22407K c/s virtual

PLAINTEXT_LENGTH 12 gives:

Many salts:     32011K c/s real, 32011K c/s virtual
Only one salt:  23086K c/s real, 23086K c/s virtual

(of course, this is a nasty limitation for actual use).

Alexander
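
The two-stage comparison suggested above (transfer a few bytes per hash first, then obtain the full hash only for potential matches) might look roughly like the following C sketch.  All names here (cmp_all_partial(), cmp_one_full(), full_hash_fn, fake_full_hash(), two_stage_demo()) are hypothetical and not part of the existing code.  The "digest" used by the demo is a stand-in, not SHA-512, and the callback abstracts over whether the full hash comes from the GPU or is recomputed on the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FULL_BINARY_SIZE 64 /* SHA-512 digest size in bytes */
#define PARTIAL_SIZE 8      /* bytes copied back per hash in the first pass */

/* Hypothetical callback: writes the full digest of candidate `index` to
 * `out`, by reading it back from the GPU or recomputing it on the CPU. */
typedef void (*full_hash_fn)(size_t index, uint8_t out[FULL_BINARY_SIZE]);

/* First pass: return 1 if any partial hash equals the target's leading
 * PARTIAL_SIZE bytes, 0 otherwise. */
int cmp_all_partial(const uint8_t *target, const uint8_t *partials,
                    size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (memcmp(target, partials + i * PARTIAL_SIZE, PARTIAL_SIZE) == 0)
            return 1;
    return 0;
}

/* Second pass: confirm candidate `index` against all 64 bytes.  The full
 * digest is requested only if the cheap partial comparison succeeded. */
int cmp_one_full(const uint8_t *target, const uint8_t *partials,
                 size_t index, full_hash_fn get_full)
{
    uint8_t full[FULL_BINARY_SIZE];

    assert(get_full != NULL);
    if (memcmp(target, partials + index * PARTIAL_SIZE, PARTIAL_SIZE) != 0)
        return 0;
    get_full(index, full);
    return memcmp(target, full, FULL_BINARY_SIZE) == 0;
}

/* Stand-in for the GPU/CPU: candidate i's "digest" is 64 copies of byte i.
 * This is NOT SHA-512; it only keeps the sketch self-contained. */
void fake_full_hash(size_t index, uint8_t out[FULL_BINARY_SIZE])
{
    memset(out, (int)index, FULL_BINARY_SIZE);
}

/* Returns the number of candidates confirmed against a target equal to
 * candidate 2's fake digest, or -1 if the first pass finds nothing. */
int two_stage_demo(void)
{
    enum { COUNT = 4 };
    uint8_t partials[COUNT * PARTIAL_SIZE];
    uint8_t target[FULL_BINARY_SIZE];
    int confirmed = 0;

    /* Simulate the truncated results transferred in the first pass. */
    for (size_t i = 0; i < COUNT; i++)
        memset(partials + i * PARTIAL_SIZE, (int)i, PARTIAL_SIZE);
    fake_full_hash(2, target);

    if (!cmp_all_partial(target, partials, COUNT))
        return -1;
    for (size_t i = 0; i < COUNT; i++)
        confirmed += cmp_one_full(target, partials, i, fake_full_hash);
    return confirmed;
}
```

In this sketch the full 64 bytes are requested only for candidates that already matched on the first 8, so the common no-match case costs just the partial transfer.
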