Date: Fri, 28 Aug 2015 12:18:00 +0300
From: Solar Designer <>
Subject: Re: LWS and GWS auto-tuning

On Thu, Aug 27, 2015 at 06:51:20PM +0200, magnum wrote:
> That new CPU driver in well looks good. Here's our CPU format with WPAPSK:
> Benchmarking: wpapsk, WPA/WPA2 PSK [PBKDF2-SHA1 256/256 AVX2 8x]... 
> (8xOMP) DONE
> Raw:	13116 c/s real, 1649 c/s virtual
> Here's OpenCL -dev=0
> $ ../run/john -test -form=wpapsk-opencl -dev=0
> Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
> Benchmarking: wpapsk-opencl, WPA/WPA2 PSK [PBKDF2-SHA1 OpenCL]... DONE
> Raw:	13540 c/s real, 1699 c/s virtual
> The device asked for scalar code so we gave it that, then it was 
> auto-vectorized and actually faster than our intrinsics.
> Here's forcing 8x vector source code:
> $ ../run/john -test -form=wpapsk-opencl -dev=0 -force-vector=8
> Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
> Benchmarking: wpapsk-opencl, WPA/WPA2 PSK [PBKDF2-SHA1 OpenCL 8x]... DONE
> Raw:	13320 c/s real, 1668 c/s virtual
> Slightly slower than auto-vectorized in this case, but still faster than 
> our intrinsics :)

Yes, it's surprisingly good.  And I am still getting these speeds after
the system reinstall.  However, somehow md5crypt-opencl speed is now
halved, and I can't figure out why (and why wpapsk-opencl isn't).  I'd
suspect that I mixed up Intel package versions and lost AVX2 support
somewhere, but then it would have impacted wpapsk-opencl as well.


Returning to the topic of auto-tuning, there's now some weirdness seen
for md5crypt-opencl on Titan X:

[ run]$ ./john -test -form=md5crypt-opencl -dev=4 -v=4
Device 4: GeForce GTX TITAN X
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=262162 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Calculating best GWS for LWS=32; max. 250ms single kernel invocation.
gws:      1024      448788 c/s   448788000 rounds/s   2.281ms per crypt_all()!
gws:      2048      889865 c/s   889865000 rounds/s   2.301ms per crypt_all()+
gws:      4096     1697005 c/s  1697005000 rounds/s   2.413ms per crypt_all()+
gws:      8192     2740048 c/s  2740048000 rounds/s   2.989ms per crypt_all()+
gws:     16384     3257163 c/s  3257163000 rounds/s   5.030ms per crypt_all()+
gws:     32768     2308052 c/s  2308052000 rounds/s  14.197ms per crypt_all()
gws:     65536     2837811 c/s  2837811000 rounds/s  23.093ms per crypt_all()
Calculating best LWS for GWS=16384
Testing LWS=32 GWS=16384 ... 17.168ms+
Testing LWS=64 GWS=16384 ... 17.135ms
Testing LWS=96 GWS=16320 ... 17.234ms
Testing LWS=128 GWS=16384 ... 17.334ms
Testing LWS=160 GWS=16320 ... 18.372ms
Testing LWS=192 GWS=16320 ... 17.198ms
Testing LWS=224 GWS=16352 ... 18.444ms
Testing LWS=256 GWS=16384 ... 17.224ms
Testing LWS=512 GWS=16384 ... 19.622ms
Testing LWS=1024 GWS=16384 ... 17.458ms
Calculating best GWS for LWS=32; max. 500ms single kernel invocation.
gws:       768      375657 c/s   375657000 rounds/s   2.044ms per crypt_all()!
gws:      1536      751314 c/s   751314000 rounds/s   2.044ms per crypt_all()+
gws:      3072     1467082 c/s  1467082000 rounds/s   2.093ms per crypt_all()+
gws:      6144     2604060 c/s  2604060000 rounds/s   2.359ms per crypt_all()+
gws:     12288     3460581 c/s  3460581000 rounds/s   3.550ms per crypt_all()+
gws:     24576     2880428 c/s  2880428000 rounds/s   8.532ms per crypt_all()
gws:     49152     2916298 c/s  2916298000 rounds/s  16.854ms per crypt_all()
gws:     98304     2922221 c/s  2922221000 rounds/s  33.640ms per crypt_all()
gws:    196608     2950827 c/s  2950827000 rounds/s  66.628ms per crypt_all()
gws:    393216     2988516 c/s  2988516000 rounds/s 131.575ms per crypt_all()
gws:    786432     3010473 c/s  3010473000 rounds/s 261.232ms per crypt_all()
Local worksize (LWS) 32, global worksize (GWS) 12288
Raw:    1228K c/s real, 1228K c/s virtual

Notice how it was up to 3.46M c/s during auto-tuning, but only 1.2M c/s
when the tuned settings were benchmarked.  And:

[ run]$ LWS=32 GWS=12288 ./john -test -form=md5crypt-opencl -dev=4 
Device 4: GeForce GTX TITAN X
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw:    1228K c/s real, 1228K c/s virtual

[ run]$ LWS=32 GWS=98304 ./john -test -form=md5crypt-opencl -dev=4 
Device 4: GeForce GTX TITAN X
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw:    3276K c/s real, 2457K c/s virtual

It's unclear why GWS=12288 appears to be fast during auto-tuning.
Are we using a different mix of password lengths there or something?
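
As a sanity check on the figures above, the c/s column in the auto-tune
output appears to be simply GWS divided by the reported time per
crypt_all().  That relation is inferred from the numbers in the log, not
taken from the source code, and the helper names below are made up for
illustration.  A minimal Python sketch that also inverts the relation
for the benchmark result:

```python
def cps(gws, ms_per_call):
    """Candidates per second if one crypt_all() call handles gws candidates."""
    return gws / (ms_per_call / 1000.0)


def implied_ms(gws, cps_value):
    """Time per crypt_all() call implied by a c/s figure at a given GWS."""
    return gws / cps_value * 1000.0


# Auto-tune line from the log: gws 12288, 3.550 ms, reported 3460581 c/s
tuned = cps(12288, 3.550)

# Benchmark result at the same GWS: 1228K c/s (rounded in the output)
bench_ms = implied_ms(12288, 1228e3)

print(f"auto-tune: {tuned:.0f} c/s")
print(f"benchmark implies about {bench_ms:.1f} ms per call")
```

Under that assumption, the benchmark figure corresponds to roughly 10 ms
per call at GWS=12288, versus 3.55 ms measured during auto-tuning.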

Actual cracking at length 8:

LWS=32 GWS=12288:
0g 0:00:00:09  0g/s 0p/s 4824Kc/s 4824KC/s GPU:45°C util:98% fan:22% aaaaaaaa..pesaaaaa

LWS=32 GWS=98304:
0g 0:00:00:35  0g/s 0p/s 4386Kc/s 4386KC/s GPU:62°C util:99% fan:22% aaaaaaaa..xkpfaaaa

So 12288 is in fact faster for fixed length 8.

BTW, what if one of the speeds measured in the first GWS pass (before
LWS tuning) turned out to be higher than all of those in the second
pass?  Would we revert to that GWS (and the original LWS)?  Perhaps we
should, since otherwise we're not making full use of the benchmarks
made.  I expect this to rarely be the case, yet it might happen.
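
The idea of keeping the best result across both passes can be sketched
as follows.  This is only an illustration in Python, not the actual
auto-tune code; the function names and the (lws, gws, c/s) tuple layout
are hypothetical, and the samples are copied from the log above.

```python
# (lws, gws, c/s) samples from the first GWS pass (before LWS tuning)
first_pass = [
    (32, 1024, 448788), (32, 2048, 889865), (32, 4096, 1697005),
    (32, 8192, 2740048), (32, 16384, 3257163), (32, 32768, 2308052),
    (32, 65536, 2837811),
]

# Samples from the second GWS pass (after LWS tuning kept LWS=32)
second_pass = [
    (32, 768, 375657), (32, 1536, 751314), (32, 3072, 1467082),
    (32, 6144, 2604060), (32, 12288, 3460581), (32, 24576, 2880428),
    (32, 49152, 2916298), (32, 98304, 2922221), (32, 196608, 2950827),
    (32, 393216, 2988516), (32, 786432, 3010473),
]


def pick(first, second):
    """Return the fastest (lws, gws, c/s) seen in either pass."""
    return max(first + second, key=lambda sample: sample[2])


lws, gws, speed = pick(first_pass, second_pass)
print(f"LWS={lws} GWS={gws} ({speed} c/s)")
```

With the data from this run the second pass wins anyway, so the choice
stays at LWS=32 GWS=12288.  The fallback would only matter in a run
where a first-pass sample happened to beat every second-pass sample.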

