Date: Thu, 27 Aug 2015 01:01:27 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

On 2015-08-27 00:41, Solar Designer wrote:
> On Wed, Aug 26, 2015 at 10:21:31PM +0200, magnum wrote:
>> On 2015-08-26 21:37, Solar Designer wrote:
>>> Unfortunately, LWS auto-tuning tries unreasonably high values (like
>>> 8192) and sometimes fails totally (results in an error from OpenCL and
>>> program abort) for some formats when tested with one or the other OpenCL
>>> SDK on "well". Can you look into this, and perhaps commit a fix?
>>
>> That's odd, can you name a format?
>
> For example:
>
> $ ./john -test -form=phpass-opencl -dev=0 -v=4
> Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
> Benchmarking: phpass-opencl ($P$9 lengths 0 to 15) [MD5 OpenCL]... Options used: -I ./kernels -cl-mad-enable -D__CPU__ -DDEVICE_INFO=33 -DDEV_VER_MAJOR=1 -DDEV_VER_MINOR=2 -D_OPENCL_COMPILER
> Build log: Compilation started
> Compilation done
> Linking started
> Linking done
> Device build started
> Device build done
> Kernel <phpass> was not vectorized
> Done.
> Calculating best global worksize (GWS); max. 100ms single kernel invocation.
> gws: 256 24569 c/s 24569 rounds/s 10.419ms per crypt_all()!
> gws: 512 24150 c/s 24150 rounds/s 21.200ms per crypt_all()
> gws: 1024 26315 c/s 26315 rounds/s 38.912ms per crypt_all()+
> gws: 2048 26323 c/s 26323 rounds/s 77.800ms per crypt_all()
> Calculating best local worksize (LWS)
> Testing LWS=128 GWS=1024 ... 151.439ms+
> Testing LWS=256 GWS=1024 ... 302.382ms
> Testing LWS=512 GWS=1024 ... 604.730ms
> Testing LWS=1024 GWS=1024 ... 1.209s
> Testing LWS=2048 GWS=2048 ...Segmentation fault

The device actually supports 8192 per the queries, and that's why it is tried.
This is also seen in our list output:

Platform version: OpenCL 1.2
    Device #0 (0) name:     Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
    Device vendor:          Intel(R) Corporation
    Device type:            CPU (LE)
    Device version:         OpenCL 1.2 (Build 9756)
    Driver version:         126.96.36.19956
    Native vector widths:   char 32, short 16, int 8, long 4
    Preferred vector width: char 1, short 1, int 1, long 1
    Global Memory:          31.0 GB
    Global Memory Cache:    256.2 KB
    Local Memory:           32.0 KB (Global)
    Max memory alloc. size: 7.0 GB
    Max clock (MHz):        3500
    Profiling timer res.:   1 ns
    Max Work Group Size:    8192    <---- here!
    Parallel compute cores: 8

I do not think the de facto limit of 1024 we've been used to is an actual
maximum per any specification. Also, when I tried this it ran just fine
through the tests up to 8192 but picked a lower number as best. If it wasn't
actually supported, we should get a CL_INVALID_WORK_GROUP_SIZE error and it
would have been caught and handled properly. I presume your segfault was
unrelated to the work size.

magnum
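
For illustration only, here is a rough sketch of the kind of LWS loop described above: try each work-group size up to the maximum the device reports, treat CL_INVALID_WORK_GROUP_SIZE as "stop here" instead of aborting, and keep the fastest size that ran. This is not the actual john code. The names tune_lws, run_kernel_fn, fake_device and fake_run are hypothetical, the callback stands in for clEnqueueNDRangeKernel plus timing, and the fake device's timings are invented so the sketch builds and runs without an OpenCL SDK. Only the two error-code values are taken from CL/cl.h.

```c
#include <stddef.h>

/* Values as defined in CL/cl.h, repeated here so no SDK is needed. */
#define SKETCH_CL_SUCCESS                 0
#define SKETCH_CL_INVALID_WORK_GROUP_SIZE (-54)

/* Hypothetical stand-in for "enqueue the kernel with this LWS/GWS and
 * time it". Returns an OpenCL-style status and writes the run time. */
typedef int (*run_kernel_fn)(size_t lws, size_t gws, double *ms, void *ctx);

/* Try LWS = start_lws, 2*start_lws, ... up to max_lws (the value the
 * device query reported). Returns the fastest accepted LWS, or 0 if
 * none was accepted. */
size_t tune_lws(size_t start_lws, size_t max_lws, size_t gws,
                run_kernel_fn run, void *ctx)
{
    size_t best = 0;
    double best_ms = 0.0;

    for (size_t lws = start_lws; lws != 0 && lws <= max_lws; lws *= 2) {
        /* GWS must be a multiple of LWS, so round it up if needed. */
        size_t g = ((gws + lws - 1) / lws) * lws;
        double ms = 0.0;
        int err = run(lws, g, &ms, ctx);

        if (err == SKETCH_CL_INVALID_WORK_GROUP_SIZE)
            break;          /* size not usable for this kernel: stop */
        if (err != SKETCH_CL_SUCCESS)
            break;          /* any other error: keep what we have */
        if (best == 0 || ms < best_ms) {
            best = lws;
            best_ms = ms;
        }
    }
    return best;
}

/* Fake device for demonstration: rejects sizes above real_limit and
 * reports made-up timings that are fastest at LWS=512. */
struct fake_device { size_t real_limit; };

int fake_run(size_t lws, size_t gws, double *ms, void *ctx)
{
    const struct fake_device *dev = ctx;
    (void)gws;
    if (lws > dev->real_limit)
        return SKETCH_CL_INVALID_WORK_GROUP_SIZE;
    *ms = lws < 512 ? 512.0 / (double)lws : (double)lws / 512.0;
    return SKETCH_CL_SUCCESS;
}
```

With the fake device accepting everything up to 8192, tune_lws(128, 8192, 1024, fake_run, &dev) walks all the way up and still settles on a lower size, which is the behaviour I described. A segfault inside the driver would of course never reach the error check at all.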