john-dev - Re: LWS and GWS auto-tuning

Message-ID: <20150825114227.GA31265@openwall.com>
Date: Tue, 25 Aug 2015 14:42:27 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

magnum, all -

On Tue, Aug 25, 2015 at 09:06:55AM +0300, Solar Designer wrote:
> We ought to do something about the auto-tuning.  Here are some ideas:
> 
> Maybe have a table of per card type likely optimal LWS (or multipliers
> for powers of 2).

Actually, this info can typically be queried, and we already had code to
do that - but it appeared mostly (or totally?) unused.

Specifically, there are opencl_find_best_workgroup() and
opencl_find_best_lws() functions in common-opencl.c.  The attached patch
#if 0's opencl_find_best_workgroup() (perhaps we need to drop it
completely, and remove it from common-opencl.h too), and revises and
makes use of opencl_find_best_lws().

The new logic is, when neither GWS nor LWS env vars are specified:
pre-tune GWS (with a lower than usual maximum), tune LWS, and finally
tune GWS with the tuned LWS and considering the queried number of
compute units.  Obviously, this is far from perfect - we're trying to
find a maximum of a function of two variables, but are adjusting only
one at a time.  Yet it appears to work much better than the current
approach of tuning GWS only.
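
As a rough illustration of the one-variable-at-a-time search described
above, here is a minimal sketch in C.  It is not the code from the
patch: speed_fn, best_lws(), best_gws(), tune_both() and demo_speed()
are all hypothetical names, the speed model is a toy, and starting the
final GWS pass at LWS times the number of compute units is only one
possible reading of "considering the queried number of compute units".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: measured speed (say, c/s) for one LWS/GWS pair. */
typedef double (*speed_fn)(size_t lws, size_t gws);

/* Fixed GWS: try each candidate LWS and keep the fastest. */
static size_t best_lws(speed_fn speed, size_t gws,
                       const size_t *cand, size_t n)
{
    assert(n > 0);
    size_t best = cand[0];
    double best_speed = speed(best, gws);
    for (size_t i = 1; i < n; i++) {
        double s = speed(cand[i], gws);
        if (s > best_speed) {
            best_speed = s;
            best = cand[i];
        }
    }
    return best;
}

/* Fixed LWS: keep doubling GWS while it stays <= max, keep the fastest. */
static size_t best_gws(speed_fn speed, size_t lws, size_t start, size_t max)
{
    assert(start > 0 && start <= max);
    size_t best = start;
    double best_speed = speed(lws, start);
    for (size_t gws = start; gws <= max / 2; ) {
        gws *= 2;
        double s = speed(lws, gws);
        if (s > best_speed) {
            best_speed = s;
            best = gws;
        }
    }
    return best;
}

/* Three passes: pre-tune GWS (lower cap), tune LWS, re-tune GWS. */
static void tune_both(speed_fn speed, const size_t *cand, size_t n,
                      size_t initial_lws, size_t pre_max, size_t full_max,
                      size_t compute_units, size_t *lws_out, size_t *gws_out)
{
    size_t gws = best_gws(speed, initial_lws, initial_lws, pre_max);
    size_t lws = best_lws(speed, gws, cand, n);
    /* Assumption: the final pass starts at LWS * compute units. */
    *gws_out = best_gws(speed, lws, lws * compute_units, full_max);
    *lws_out = lws;
}

/* Toy model (made up): LWS=128 is best, speed saturates as GWS grows. */
static double demo_speed(size_t lws, size_t gws)
{
    double occupancy = (double)gws / ((double)gws + 10000.0);
    return (lws == 128 ? 1.0 : 0.9) * occupancy;
}
```

Calling tune_both(demo_speed, ...) with candidates 32, 64, 128 and 256
settles on LWS=128 for this toy model.  A real tuner would also want a
minimum-gain threshold before accepting a larger GWS, which this sketch
leaves out.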

When only one of LWS and GWS is specified, only the other is auto-tuned
(once).  When both are specified, nothing is auto-tuned.
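
The four cases above can be written down as a small decision function.
This is only a sketch: enum tune_plan and choose_tune_plan() are
hypothetical names, and treating 0 as "not specified by the user" is an
assumption made for the example, not something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; not taken from common-opencl.c. */
enum tune_plan {
    TUNE_NOTHING,   /* both given: use them as-is */
    TUNE_GWS_ONLY,  /* LWS given: tune GWS once */
    TUNE_LWS_ONLY,  /* GWS given: tune LWS once */
    TUNE_BOTH       /* neither given: pre-tune GWS, tune LWS, re-tune GWS */
};

/* Assumption: a value of 0 means the user did not specify it. */
static enum tune_plan choose_tune_plan(size_t user_lws, size_t user_gws)
{
    if (user_lws && user_gws)
        return TUNE_NOTHING;
    if (user_lws)
        return TUNE_GWS_ONLY;
    if (user_gws)
        return TUNE_LWS_ONLY;
    return TUNE_BOTH;
}
```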

For example, with md5crypt-opencl on GTX TITAN, where the previous
approach worked poorly:

[solar@...er run]$ ./john -test -form=md5crypt-opencl -dev=5
Device 5: GeForce GTX TITAN
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw:    1984K c/s real, 1984K c/s virtual

[solar@...er run]$ time ./john -test -form=md5crypt-opencl -dev=5 -v=4
Device 5: GeForce GTX TITAN
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws:      1024      160637 c/s   160637000 rounds/s   6.374ms per crypt_all()!
gws:      2048      306444 c/s   306444000 rounds/s   6.683ms per crypt_all()+
gws:      4096      572829 c/s   572829000 rounds/s   7.150ms per crypt_all()+
gws:      8192      957582 c/s   957582000 rounds/s   8.554ms per crypt_all()+
gws:     16384      989299 c/s   989299000 rounds/s  16.561ms per crypt_all()+
gws:     32768     1225015 c/s  1225015000 rounds/s  26.749ms per crypt_all()+
gws:     65536     1402179 c/s  1402179000 rounds/s  46.738ms per crypt_all()+
Calculating best local worksize (LWS)
Testing GWS=65536 LWS=32 ... 190469952ns
Testing GWS=65536 LWS=64 ... 107994464ns
Testing GWS=65472 LWS=96 ... 93050272ns
Testing GWS=65536 LWS=128 ... 92955840ns
Testing GWS=65440 LWS=160 ... 94382368ns
Testing GWS=65472 LWS=192 ... 93250048ns
Testing GWS=65408 LWS=224 ... 95941952ns
Testing GWS=65536 LWS=256 ... 93266272ns
Testing GWS=65536 LWS=512 ... 93425312ns
Testing GWS=65536 LWS=1024 ... 106644352ns
Calculating best global worksize (GWS); max. 500ms single kernel invocation.
gws:      1344      247774 c/s   247774000 rounds/s   5.424ms per crypt_all()!
gws:      2688      465121 c/s   465121000 rounds/s   5.779ms per crypt_all()+
gws:      5376      811578 c/s   811578000 rounds/s   6.624ms per crypt_all()+
gws:     10752     1335447 c/s  1335447000 rounds/s   8.051ms per crypt_all()+
gws:     21504     1963838 c/s  1963838000 rounds/s  10.949ms per crypt_all()+
gws:     43008     1978725 c/s  1978725000 rounds/s  21.735ms per crypt_all()
gws:     86016     1985954 c/s  1985954000 rounds/s  43.312ms per crypt_all()+
gws:    172032     1993503 c/s  1993503000 rounds/s  86.296ms per crypt_all()
gws:    344064     1996328 c/s  1996328000 rounds/s 172.348ms per crypt_all()
gws:    688128     2002809 c/s  2002809000 rounds/s 343.581ms per crypt_all()
Local worksize (LWS) 96, global worksize (GWS) 86016
DONE
Raw:    1978K c/s real, 1978K c/s virtual


real    0m5.642s
user    0m3.445s
sys     0m2.111s
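
The GWS values in the LWS tuning output above (65472 for LWS=96, 65440
for LWS=160, 65408 for LWS=224) are what rounding 65536 down to a
multiple of LWS gives, which fits the OpenCL 1.x requirement that the
global work size be evenly divisible by the local work size.  The final
GWS pass starts at 1344 = 96 * 14, which would match LWS times the
compute unit count if the device reports 14 compute units (an
assumption; the log does not print that number).  A minimal sketch of
this arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Largest multiple of lws that does not exceed gws. */
static size_t round_down_to_lws(size_t gws, size_t lws)
{
    assert(lws > 0);
    return gws / lws * lws;
}

/*
 * Assumed starting point for the final GWS pass: one work-group per
 * compute unit.  This is inferred from the log, not read from the patch.
 */
static size_t initial_gws(size_t lws, size_t compute_units)
{
    return lws * compute_units;
}
```

With these, round_down_to_lws(65536, 96) is 65472, and doubling
initial_gws(96, 14) six times gives the 86016 that was finally chosen.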

Some other formats show speedups as well.  I didn't test all, though.
There might be regressions.

One known issue is that the LWS tuning probably needs a time limit, in
case the device supports a very high maximum LWS.  This could be
implemented the same way as the existing time limit for GWS tuning.
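
One possible shape for such a limit is a total time budget for the LWS
pass: stop trying further candidates once the accumulated kernel time
reaches the budget.  The sketch below is an assumption about how this
could look, not a description of the GWS limit in the tree (which caps
a single kernel invocation).  run_ns_fn, tune_lws_budget() and
demo_run_ns() are hypothetical names, and the demo timings are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: run the kernel once at this LWS, return ns. */
typedef uint64_t (*run_ns_fn)(size_t lws);

/*
 * Try candidates in order and return the fastest one seen.  The first
 * candidate is always measured; later ones are skipped once the time
 * already spent reaches budget_ns.
 */
static size_t tune_lws_budget(run_ns_fn run_ns, const size_t *cand,
                              size_t n, uint64_t budget_ns)
{
    assert(n > 0);
    size_t best = cand[0];
    uint64_t best_ns = run_ns(cand[0]);
    uint64_t spent = best_ns;
    for (size_t i = 1; i < n && spent < budget_ns; i++) {
        uint64_t ns = run_ns(cand[i]);
        spent += ns;
        if (ns < best_ns) {
            best_ns = ns;
            best = cand[i];
        }
    }
    return best;
}

/* Made-up timings: 90 ms at LWS=128, 100 ms at any other LWS. */
static uint64_t demo_run_ns(size_t lws)
{
    return lws == 128 ? 90000000u : 100000000u;
}
```

With candidates 32, 64, 128, 256 and 512, a 350 ms budget still reaches
LWS=128 and picks it, while a 150 ms budget stops after LWS=64 and
stays with 32.  The budget trades tuning time against the chance of
missing a better, larger LWS.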

Also, this code needs a cleanup.  My patch is a hack on top of other hacks.

Many formats provide their own idea of their desired LWS and GWS; maybe
we should drop most of this, as I suspect these values are often worse
than what the new auto-tuning finds.  Even md5crypt-opencl benchmarked
above has a boilerplate get_default_workgroup() in it, and the new
auto-tuning actually respects this initially (for the initial GWS
tuning).  Maybe we should instead start with a device query and
determine the initial LWS from that.  The copies of
get_default_workgroup() pasted into multiple format files look
ridiculous.

Alexander

View attachment "john-opencl-auto2.diff" of type "text/plain" (9306 bytes)