Message-ID: <20150828091800.GA25596@openwall.com>
Date: Fri, 28 Aug 2015 12:18:00 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

On Thu, Aug 27, 2015 at 06:51:20PM +0200, magnum wrote:
> That new CPU driver in well looks good. Here's our CPU format with WPAPSK:
>
> Benchmarking: wpapsk, WPA/WPA2 PSK [PBKDF2-SHA1 256/256 AVX2 8x]...
> (8xOMP) DONE
> Raw: 13116 c/s real, 1649 c/s virtual
>
> Here's OpenCL -dev=0
>
> $ ../run/john -test -form=wpapsk-opencl -dev=0
> Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
> Benchmarking: wpapsk-opencl, WPA/WPA2 PSK [PBKDF2-SHA1 OpenCL]... DONE
> Raw: 13540 c/s real, 1699 c/s virtual
>
> The device asked for scalar code so we gave it that, then it was
> auto-vectorized and actually faster than our intrinsics.
>
> Here's forcing 8x vector source code:
>
> $ ../run/john -test -form=wpapsk-opencl -dev=0 -force-vector=8
> Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
> Benchmarking: wpapsk-opencl, WPA/WPA2 PSK [PBKDF2-SHA1 OpenCL 8x]... DONE
> Raw: 13320 c/s real, 1668 c/s virtual
>
> Slightly slower than auto-vectorized in this case, but still faster than
> our intrinsics :)

Yes, it's surprisingly good. And I am still getting these speeds after
the system reinstall. However, somehow md5crypt-opencl speed is now
halved, and I can't figure out why (and why wpapsk-opencl isn't). I'd
suspect that I mixed up Intel package versions and lost AVX2 support
somewhere, but then it would have impacted wpapsk-opencl as well.
Whatever.

Returning to the topic of auto-tuning, there's now some weirdness seen
for md5crypt-opencl on Titan X:

[solar@...er run]$ ./john -test -form=md5crypt-opencl -dev=4 -v=4
Device 4: GeForce GTX TITAN X
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]...
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=262162 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Calculating best GWS for LWS=32; max. 250ms single kernel invocation.
gws:      1024     448788 c/s   448788000 rounds/s    2.281ms per crypt_all()!
gws:      2048     889865 c/s   889865000 rounds/s    2.301ms per crypt_all()+
gws:      4096    1697005 c/s  1697005000 rounds/s    2.413ms per crypt_all()+
gws:      8192    2740048 c/s  2740048000 rounds/s    2.989ms per crypt_all()+
gws:     16384    3257163 c/s  3257163000 rounds/s    5.030ms per crypt_all()+
gws:     32768    2308052 c/s  2308052000 rounds/s   14.197ms per crypt_all()
gws:     65536    2837811 c/s  2837811000 rounds/s   23.093ms per crypt_all()
Calculating best LWS for GWS=16384
Testing LWS=32 GWS=16384 ... 17.168ms+
Testing LWS=64 GWS=16384 ... 17.135ms
Testing LWS=96 GWS=16320 ... 17.234ms
Testing LWS=128 GWS=16384 ... 17.334ms
Testing LWS=160 GWS=16320 ... 18.372ms
Testing LWS=192 GWS=16320 ... 17.198ms
Testing LWS=224 GWS=16352 ... 18.444ms
Testing LWS=256 GWS=16384 ... 17.224ms
Testing LWS=512 GWS=16384 ... 19.622ms
Testing LWS=1024 GWS=16384 ... 17.458ms
Calculating best GWS for LWS=32; max. 500ms single kernel invocation.
gws:       768     375657 c/s   375657000 rounds/s    2.044ms per crypt_all()!
gws:      1536     751314 c/s   751314000 rounds/s    2.044ms per crypt_all()+
gws:      3072    1467082 c/s  1467082000 rounds/s    2.093ms per crypt_all()+
gws:      6144    2604060 c/s  2604060000 rounds/s    2.359ms per crypt_all()+
gws:     12288    3460581 c/s  3460581000 rounds/s    3.550ms per crypt_all()+
gws:     24576    2880428 c/s  2880428000 rounds/s    8.532ms per crypt_all()
gws:     49152    2916298 c/s  2916298000 rounds/s   16.854ms per crypt_all()
gws:     98304    2922221 c/s  2922221000 rounds/s   33.640ms per crypt_all()
gws:    196608    2950827 c/s  2950827000 rounds/s   66.628ms per crypt_all()
gws:    393216    2988516 c/s  2988516000 rounds/s  131.575ms per crypt_all()
gws:    786432    3010473 c/s  3010473000 rounds/s  261.232ms per crypt_all()
Local worksize (LWS) 32, global worksize (GWS) 12288
DONE
Raw: 1228K c/s real, 1228K c/s virtual

Notice how it was up to 3.46M during auto-tuning, but only 1.2M when
benchmarking the tuned settings. And:

[solar@...er run]$ LWS=32 GWS=12288 ./john -test -form=md5crypt-opencl -dev=4
Device 4: GeForce GTX TITAN X
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw: 1228K c/s real, 1228K c/s virtual

[solar@...er run]$ LWS=32 GWS=98304 ./john -test -form=md5crypt-opencl -dev=4
Device 4: GeForce GTX TITAN X
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw: 3276K c/s real, 2457K c/s virtual

It's unclear why GWS=12288 appears to be fast during auto-tuning. Are we
using a different mix of password lengths there or something?

Actual cracking at length 8:

LWS=32 GWS=12288:

0g 0:00:00:09 0g/s 0p/s 4824Kc/s 4824KC/s GPU:45°C util:98% fan:22% aaaaaaaa..pesaaaaa

LWS=32 GWS=98304:

0g 0:00:00:35 0g/s 0p/s 4386Kc/s 4386KC/s GPU:62°C util:99% fan:22% aaaaaaaa..xkpfaaaa

So 12288 is in fact faster for fixed length 8.

BTW, what if one of the GWS figures in the first pass (before LWS
tuning) turned out to be higher than all of those in the second pass?
Would we revert to it (and the original LWS)? Perhaps we should, since
otherwise we're not making full use of the benchmarks made.
I expect this to rarely be the case, yet it might happen.

Alexander
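
A minimal sketch of the fallback asked about above (reverting to a
first-pass result if it beats everything in the second pass). It is
written in Python for brevity rather than in John's C; `Sample` and
`best_overall` are hypothetical names, not John the Ripper code, and the
real auto-tuner may apply gain thresholds or time limits that this
ignores. The c/s figures are the ones reported in the -v=4 log in this
message.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    lws: int
    gws: int
    cps: int  # c/s as printed by the auto-tune log


# First GWS pass (before LWS tuning), LWS=32.
first_pass = [
    Sample(32, 1024, 448788),
    Sample(32, 2048, 889865),
    Sample(32, 4096, 1697005),
    Sample(32, 8192, 2740048),
    Sample(32, 16384, 3257163),
    Sample(32, 32768, 2308052),
    Sample(32, 65536, 2837811),
]

# Second GWS pass (after LWS tuning), LWS=32.
second_pass = [
    Sample(32, 768, 375657),
    Sample(32, 1536, 751314),
    Sample(32, 3072, 1467082),
    Sample(32, 6144, 2604060),
    Sample(32, 12288, 3460581),
    Sample(32, 24576, 2880428),
    Sample(32, 49152, 2916298),
    Sample(32, 98304, 2922221),
    Sample(32, 196608, 2950827),
    Sample(32, 393216, 2988516),
    Sample(32, 786432, 3010473),
]


def best_overall(*passes):
    """Return the fastest sample seen in any pass.

    max() keeps the first maximal element, so on a tie the earlier
    pass (and its original LWS) wins.
    """
    return max((s for p in passes for s in p), key=lambda s: s.cps)


best = best_overall(first_pass, second_pass)
print(f"LWS={best.lws} GWS={best.gws} ({best.cps} c/s)")
```

With the figures from this run the second pass still wins (GWS=12288 at
3460581 c/s versus GWS=16384 at 3257163 c/s in the first pass), so no
revert would happen here.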