john-dev - Re: LWS and GWS auto-tuning

Message-ID: <20150826224108.GA12389@openwall.com>
Date: Thu, 27 Aug 2015 01:41:08 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

On Wed, Aug 26, 2015 at 10:21:31PM +0200, magnum wrote:
> On 2015-08-26 21:37, Solar Designer wrote:
> >Unfortunately, LWS auto-tuning tries unreasonably high values (like
> >8192) and sometimes fails totally (results in an error from OpenCL and
> >program abort) for some formats when tested with one or the other OpenCL
> >SDK on "well".  Can you look into this, and perhaps commit a fix?
> 
> That's odd, can you name a format?

For example:

$ ./john -test -form=phpass-opencl -dev=0 -v=4
Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: phpass-opencl ($P$9 lengths 0 to 15) [MD5 OpenCL]... Options used: -I ./kernels -cl-mad-enable -D__CPU__ -DDEVICE_INFO=33 -DDEV_VER_MAJOR=1 -DDEV_VER_MINOR=2 -D_OPENCL_COMPILER 
Build log: Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <phpass> was not vectorized
Done.
Calculating best global worksize (GWS); max. 100ms single kernel invocation.
gws:       256       24569 c/s       24569 rounds/s  10.419ms per crypt_all()!
gws:       512       24150 c/s       24150 rounds/s  21.200ms per crypt_all()
gws:      1024       26315 c/s       26315 rounds/s  38.912ms per crypt_all()+
gws:      2048       26323 c/s       26323 rounds/s  77.800ms per crypt_all()
Calculating best local worksize (LWS)
Testing LWS=128 GWS=1024 ... 151.439ms+
Testing LWS=256 GWS=1024 ... 302.382ms
Testing LWS=512 GWS=1024 ... 604.730ms
Testing LWS=1024 GWS=1024 ... 1.209s
Testing LWS=2048 GWS=2048 ...Segmentation fault

and via the other SDK:

$ ./john -test -form=phpass-opencl -dev=4 -v=4
Device 4: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: phpass-opencl ($P$9 lengths 0 to 15) [MD5 OpenCL]... Options used: -I ./kernels -cl-mad-enable -D__CPU__ -DDEVICE_INFO=33 -DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER 
Calculating best global worksize (GWS); max. 100ms single kernel invocation.
gws:       256        3982 c/s        3982 rounds/s  64.282ms per crypt_all()!
Calculating best local worksize (LWS)
Testing LWS=1 GWS=256 ... 12.615ms+
Testing LWS=2 GWS=256 ... 16.678ms
Testing LWS=3 GWS=255 ... 22.180ms
Testing LWS=4 GWS=256 ... 25.108ms
Testing LWS=5 GWS=255 ... 31.315ms
Testing LWS=6 GWS=252 ... 37.359ms
Testing LWS=7 GWS=252 ... 43.621ms
Testing LWS=8 GWS=256 ... 43.024ms
Testing LWS=16 GWS=256 ... 85.121ms
Testing LWS=32 GWS=256 ... 169.962ms
Testing LWS=64 GWS=256 ... 339.913ms
Testing LWS=128 GWS=256 ... 679.668ms
Testing LWS=256 GWS=256 ... 1.359s
Testing LWS=512 GWS=512 ... 2.718s
Testing LWS=1024 GWS=1024 ... 3.091s
Calculating best global worksize (GWS); max. 200ms single kernel invocation.
*** glibc detected *** ./john: corrupted double-linked list: 0x0000000003783740 ***

Also:

solar@...l:~/j/bleeding-jumbo/run$ ./john -test -form=md5crypt-opencl -dev=0 -v=4 
Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... Options used: -I ./kernels -cl-mad-enable -D__CPU__ -DDEVICE_INFO=33 -DDEV_VER_MAJOR=1 -DDEV_VER_MINOR=2 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Build log: Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <cryptmd5> was successfully vectorized (8)
Done.
Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws:      1024      176233 c/s   176233000 rounds/s   5.810ms per crypt_all()!
gws:      2048      195544 c/s   195544000 rounds/s  10.473ms per crypt_all()+
gws:      4096      215922 c/s   215922000 rounds/s  18.969ms per crypt_all()+
gws:      8192      216427 c/s   216427000 rounds/s  37.850ms per crypt_all()
gws:     16384      216549 c/s   216549000 rounds/s  75.659ms per crypt_all()
gws:     32768      216770 c/s   216770000 rounds/s 151.164ms per crypt_all()
Calculating best local worksize (LWS)
Testing LWS=128 GWS=4096 ... 68.128ms+
Testing LWS=256 GWS=4096 ... 68.043ms
Testing LWS=512 GWS=4096 ... 68.030ms
Testing LWS=1024 GWS=4096 ... 134.773ms
Testing LWS=2048 GWS=4096 ... 206.746ms
Testing LWS=4096 GWS=4096 ... 400.484ms
Testing LWS=8192 GWS=8192 ...OpenCL error (CL_INVALID_VALUE) in file (opencl_cryptmd5_fmt_plug.c) at line (381) - (Copy data back)

but after a few runs failing like the above, I got one that ran to
completion despite the weird LWS having been tested:

$ ./john -test -form=md5crypt-opencl -dev=0 -v=4
Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... Options used: -I ./kernels -cl-mad-enable -D__CPU__ -DDEVICE_INFO=33 -DDEV_VER_MAJOR=1 -DDEV_VER_MINOR=2 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Build log: Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <cryptmd5> was successfully vectorized (8)
Done.
Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws:      1024      183765 c/s   183765000 rounds/s   5.572ms per crypt_all()!
gws:      2048      174702 c/s   174702000 rounds/s  11.722ms per crypt_all()
gws:      4096      201126 c/s   201126000 rounds/s  20.365ms per crypt_all()+
gws:      8192      208587 c/s   208587000 rounds/s  39.273ms per crypt_all()+
gws:     16384      216536 c/s   216536000 rounds/s  75.663ms per crypt_all()+
gws:     32768      216629 c/s   216629000 rounds/s 151.262ms per crypt_all()
Calculating best local worksize (LWS)
Testing LWS=128 GWS=16384 ... 163.140ms+
Testing LWS=256 GWS=16384 ... 169.023ms
Testing LWS=512 GWS=16384 ... 163.110ms
Testing LWS=1024 GWS=16384 ... 163.121ms
Testing LWS=2048 GWS=16384 ... 163.149ms
Testing LWS=4096 GWS=16384 ... 321.910ms
Testing LWS=8192 GWS=16384 ... 483.195ms
Calculating best global worksize (GWS); max. 500ms single kernel invocation.
gws:      1024      162439 c/s   162439000 rounds/s   6.303ms per crypt_all()!
gws:      2048      178187 c/s   178187000 rounds/s  11.493ms per crypt_all()+
gws:      4096      216142 c/s   216142000 rounds/s  18.950ms per crypt_all()+
gws:      8192      216313 c/s   216313000 rounds/s  37.870ms per crypt_all()
gws:     16384      205236 c/s   205236000 rounds/s  79.829ms per crypt_all()
gws:     32768      216487 c/s   216487000 rounds/s 151.362ms per crypt_all()
gws:     65536      216630 c/s   216630000 rounds/s 302.524ms per crypt_all()
Local worksize (LWS) 128, global worksize (GWS) 4096
DONE
Raw:    217088 c/s real, 27102 c/s virtual

I think the difference here is that it had tuned a higher GWS, so LWS of
8192 was no longer causing an increase in GWS.
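
To make that concrete, here is a minimal sketch of the GWS adjustment as it
appears from the logs above.  It is inferred from the printed LWS/GWS pairs
only, not copied from the actual auto-tune code, and the function name
gws_for_lws is made up for illustration.  The assumption is that GWS is
rounded down to a multiple of the LWS candidate, and raised to LWS when the
candidate is larger than the GWS found by the first pass:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not taken from the John the Ripper source.
 * Given the GWS picked by the first tuning pass and an LWS candidate,
 * return the GWS that the LWS test would end up using:
 *   - if the candidate exceeds GWS, GWS is raised to the candidate
 *     (a single work-group);
 *   - otherwise GWS is rounded down to a multiple of the candidate.
 */
static size_t gws_for_lws(size_t gws, size_t lws)
{
	assert(lws > 0);	/* an LWS of zero would divide by zero below */

	if (lws > gws)
		return lws;

	return gws / lws * lws;
}
```

With the numbers from the two md5crypt runs, gws_for_lws(4096, 8192) gives
8192 (GWS grows, and that run failed), while gws_for_lws(16384, 8192) gives
16384 (GWS unchanged, and that run completed).  The odd values from the
other SDK fit the same rule, e.g. gws_for_lws(1024, 7) is 1022.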

It even produced a respectable speed for the CPU (even though with AVX2
intrinsics we can be twice as fast).  AMD's SDK can't do that, failing to
vectorize - it only gives 55k c/s, after testing moderately weird LWS:

solar@...l:~/j/bleeding-jumbo/run$ ./john -test -form=md5crypt-opencl -dev=4 -v=4
Device 4: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... Options used: -I ./kernels -cl-mad-enable -D__CPU__ -DDEVICE_INFO=33 -DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws:      1024       55747 c/s    55747000 rounds/s  18.368ms per crypt_all()!
gws:      2048       55809 c/s    55809000 rounds/s  36.695ms per crypt_all()
gws:      4096       55825 c/s    55825000 rounds/s  73.371ms per crypt_all()
gws:      8192       55852 c/s    55852000 rounds/s 146.672ms per crypt_all()
Calculating best local worksize (LWS)
Testing LWS=1 GWS=1024 ... 91.343ms+
Testing LWS=2 GWS=1024 ... 91.265ms
Testing LWS=3 GWS=1023 ... 91.995ms
Testing LWS=4 GWS=1024 ... 91.217ms
Testing LWS=5 GWS=1020 ... 91.762ms
Testing LWS=6 GWS=1020 ... 93.013ms
Testing LWS=7 GWS=1022 ... 93.539ms
Testing LWS=8 GWS=1024 ... 91.054ms+
Testing LWS=9 GWS=1017 ... 95.011ms
Testing LWS=16 GWS=1024 ... 91.239ms
Testing LWS=32 GWS=1024 ... 91.115ms
Testing LWS=64 GWS=1024 ... 92.196ms
Testing LWS=128 GWS=1024 ... 91.160ms
Testing LWS=256 GWS=1024 ... 161.670ms
Testing LWS=512 GWS=1024 ... 315.758ms
Testing LWS=1024 GWS=1024 ... 614.760ms
Calculating best global worksize (GWS); max. 500ms single kernel invocation.
gws:        64       55350 c/s    55350000 rounds/s   1.156ms per crypt_all()!
gws:       128       55287 c/s    55287000 rounds/s   2.315ms per crypt_all()
gws:       256       55582 c/s    55582000 rounds/s   4.605ms per crypt_all()
gws:       512       55707 c/s    55707000 rounds/s   9.190ms per crypt_all()
gws:      1024       55751 c/s    55751000 rounds/s  18.367ms per crypt_all()
gws:      2048       55801 c/s    55801000 rounds/s  36.701ms per crypt_all()
gws:      4096       55848 c/s    55848000 rounds/s  73.341ms per crypt_all()
gws:      8192       55838 c/s    55838000 rounds/s 146.707ms per crypt_all()
gws:     16384       55846 c/s    55846000 rounds/s 293.374ms per crypt_all()
Local worksize (LWS) 8, global worksize (GWS) 64
DONE
Raw:    55040 c/s real, 6967 c/s virtual

Alexander