Date: Tue, 25 Aug 2015 14:42:27 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

magnum, all -

On Tue, Aug 25, 2015 at 09:06:55AM +0300, Solar Designer wrote:
> We ought to do something about the auto-tuning.  Here are some ideas:
>
> Maybe have a table of per card type likely optimal LWS (or multipliers
> for powers of 2).

Actually, this info can typically be queried, and we already had code to
do that - but it appeared mostly (or totally?) unused.  Specifically,
there are opencl_find_best_workgroup() and opencl_find_best_lws()
functions in common-opencl.c.

The attached patch #if 0's opencl_find_best_workgroup() (perhaps we need
to drop it completely, and remove it from common-opencl.h too), and
revises and makes use of opencl_find_best_lws().

The new logic is, when neither GWS nor LWS env vars are specified:
pre-tune GWS (with a lower than usual maximum), tune LWS, and finally
tune GWS with the tuned LWS and considering the queried number of
compute units.  Obviously, this is far from perfect - we're trying to
find a maximum of a function of two variables, but are adjusting only
one at a time.  Yet it appears to work much better than the current
approach of tuning GWS only.

When either LWS or GWS is specified, only the other is auto-tuned
(once).  When both are specified, nothing is auto-tuned.

For example, with md5crypt-opencl on GTX TITAN, where the previous
approach worked poorly:

[solar@...er run]$ ./john -test -form=md5crypt-opencl -dev=5
Device 5: GeForce GTX TITAN
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw:    1984K c/s real, 1984K c/s virtual

[solar@...er run]$ time ./john -test -form=md5crypt-opencl -dev=5 -v=4
Device 5: GeForce GTX TITAN
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]...
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws:      1024      160637 c/s   160637000 rounds/s   6.374ms per crypt_all()!
gws:      2048      306444 c/s   306444000 rounds/s   6.683ms per crypt_all()+
gws:      4096      572829 c/s   572829000 rounds/s   7.150ms per crypt_all()+
gws:      8192      957582 c/s   957582000 rounds/s   8.554ms per crypt_all()+
gws:     16384      989299 c/s   989299000 rounds/s  16.561ms per crypt_all()+
gws:     32768     1225015 c/s  1225015000 rounds/s  26.749ms per crypt_all()+
gws:     65536     1402179 c/s  1402179000 rounds/s  46.738ms per crypt_all()+
Calculating best local worksize (LWS)
Testing GWS=65536 LWS=32 ... 190469952ns
Testing GWS=65536 LWS=64 ... 107994464ns
Testing GWS=65472 LWS=96 ... 93050272ns
Testing GWS=65536 LWS=128 ... 92955840ns
Testing GWS=65440 LWS=160 ... 94382368ns
Testing GWS=65472 LWS=192 ... 93250048ns
Testing GWS=65408 LWS=224 ... 95941952ns
Testing GWS=65536 LWS=256 ... 93266272ns
Testing GWS=65536 LWS=512 ... 93425312ns
Testing GWS=65536 LWS=1024 ... 106644352ns
Calculating best global worksize (GWS); max. 500ms single kernel invocation.
gws:      1344      247774 c/s   247774000 rounds/s   5.424ms per crypt_all()!
gws:      2688      465121 c/s   465121000 rounds/s   5.779ms per crypt_all()+
gws:      5376      811578 c/s   811578000 rounds/s   6.624ms per crypt_all()+
gws:     10752     1335447 c/s  1335447000 rounds/s   8.051ms per crypt_all()+
gws:     21504     1963838 c/s  1963838000 rounds/s  10.949ms per crypt_all()+
gws:     43008     1978725 c/s  1978725000 rounds/s  21.735ms per crypt_all()
gws:     86016     1985954 c/s  1985954000 rounds/s  43.312ms per crypt_all()+
gws:    172032     1993503 c/s  1993503000 rounds/s  86.296ms per crypt_all()
gws:    344064     1996328 c/s  1996328000 rounds/s 172.348ms per crypt_all()
gws:    688128     2002809 c/s  2002809000 rounds/s 343.581ms per crypt_all()
Local worksize (LWS) 96, global worksize (GWS) 86016
DONE
Raw:    1978K c/s real, 1978K c/s virtual

real    0m5.642s
user    0m3.445s
sys     0m2.111s

Some other formats show speedups as well.  I didn't test all, though.
There might be regressions.

One known issue is that the LWS tuning probably needs a time limit, in
case the device supports a very high maximum LWS.  This may be
implemented similarly to the GWS tuning's time limit.

Also, this code needs a cleanup.  My patch is a hack on top of other
hacks.  Many formats provide their own idea of their desired LWS and
GWS; maybe we should drop most of this, as I suspect these are often
less optimal than the new auto-tuning.  Even md5crypt-opencl benchmarked
above has a boilerplate get_default_workgroup() in it, and the new
auto-tuning actually respects this initially (for the initial GWS
tuning).  Maybe we should instead start right with a device query to
determine the initial LWS from that.  Those get_default_workgroup()
copies in multiple format files look ridiculous.

Alexander

View attachment "john-opencl-auto2.diff" of type "text/plain" (9306 bytes)
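
The three-step order described above (pre-tune GWS, tune LWS, re-tune GWS) can be sketched as a small Python model.  This is not the code from the patch.  The names round_down(), lws_candidates(), tune_gws(), tune_lws(), auto_tune() and the measure callback are hypothetical.  The 1% min_gain threshold, the 65536 pre-tune cap, the LWS candidate list, and starting the final pass at LWS times the number of compute units are guesses that merely match the verbose output above.  The time limit is also simplified to one duration per measurement.

```python
def round_down(gws, lws):
    """Largest multiple of lws not exceeding gws, but at least lws."""
    return max(lws, gws // lws * lws)


def lws_candidates(warp, max_lws):
    """Multiples of warp up to 8*warp, then doublings up to max_lws (assumed)."""
    cands = [warp * i for i in range(1, 9) if warp * i <= max_lws]
    lws = warp * 16
    while lws <= max_lws:
        cands.append(lws)
        lws *= 2
    return cands


def tune_gws(measure, lws, start, max_ms, max_gws=None, min_gain=1.01):
    """Double GWS from start; adopt a size only if it beats the best by min_gain.

    measure(gws, lws) is a hypothetical callback returning (c/s, milliseconds).
    """
    gws = best_gws = round_down(start, lws)
    best_speed = 0.0
    while max_gws is None or gws <= max_gws:
        speed, ms = measure(gws, lws)
        if ms > max_ms:          # too slow for one invocation: stop growing
            break
        if speed > best_speed * min_gain:
            best_gws, best_speed = gws, speed
        gws *= 2
    return best_gws


def tune_lws(measure, gws, warp, max_lws):
    """Pick the candidate LWS with the highest speed near the given GWS."""
    return max(lws_candidates(warp, max_lws),
               key=lambda lws: measure(round_down(gws, lws), lws)[0])


def auto_tune(measure, compute_units, warp, max_lws, default_lws):
    """One variable at a time: pre-tune GWS, tune LWS, then re-tune GWS."""
    gws = tune_gws(measure, default_lws, 1024, max_ms=250, max_gws=65536)
    lws = tune_lws(measure, gws, warp, max_lws)
    gws = tune_gws(measure, lws, lws * compute_units, max_ms=500)
    return lws, gws


# Final GWS pass copied from the verbose output above (LWS=96): gws -> (c/s, ms).
FINAL_PASS = {
    1344: (247774, 5.424), 2688: (465121, 5.779), 5376: (811578, 6.624),
    10752: (1335447, 8.051), 21504: (1963838, 10.949), 43008: (1978725, 21.735),
    86016: (1985954, 43.312), 172032: (1993503, 86.296),
    344064: (1996328, 172.348), 688128: (2002809, 343.581),
}


def replay(gws, lws):
    """Stand-in for a real benchmark: look the result up in FINAL_PASS."""
    return FINAL_PASS.get(gws, (0, float("inf")))


print(tune_gws(replay, lws=96, start=96 * 14, max_ms=500))  # 86016
```

Replaying the final pass through tune_gws() gives 86016, the GWS reported above, and the assumed 1% threshold agrees with the "+" markers in that pass.  The LWS step is where the model clearly differs: the run above settled on LWS=96 although LWS=128 timed marginally faster, so the patch evidently uses a selection rule that this sketch does not reproduce.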