john-dev - Re: LWS and GWS auto-tuning

Message-ID: <7aa346d81ba44cf50f18cc986e8ce34b@smtp.hushmail.com>
Date: Fri, 28 Aug 2015 21:53:10 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

On 2015-08-28 11:18, Solar Designer wrote:
> Returning to the topic of auto-tuning, there's now some weirdness seen
> for md5crypt-opencl on Titan X:
(...)
> Notice how it was up to 3.46M during auto-tuning, but only 1.2M when
> benchmarked the tuned settings.  And:

I made some changes to how self-test and autotune set their keys, and
it seems more stable now. One problem was that the autotune key fuzzing
would spuriously produce shorter keys (an xor resulting in 0, which
ends the key early), so formats that are sensitive to varying key
lengths would be "hurt". While that doesn't fully explain what you saw,
the results look quite good now:

$ ../run/john -stress-test -form:md5crypt-opencl -dev=4
Device 4: GeForce GTX TITAN X
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw:	3601K c/s real, 3601K c/s virtual

Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw:	3613K c/s real, 3649K c/s virtual

Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw:	3649K c/s real, 3613K c/s virtual

Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw:	3465K c/s real, 3430K c/s virtual

Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw:	3624K c/s real, 3589K c/s virtual

Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw:	3391K c/s real, 3357K c/s virtual

Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw:	3576K c/s real, 3612K c/s virtual
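
The key-fuzzing problem mentioned above can be illustrated with a small
sketch. This is not the actual john code: `fuzz_key`, `fuzz_key_safe` and
the mask values are hypothetical, and it assumes the fuzzing XORs each
key byte with some mask byte. Whenever the mask byte happens to equal the
key byte, the result is 0 and the key, read as a C string, ends there.

```python
def c_strlen(buf: bytes) -> int:
    """Length as C code sees it: number of bytes before the first NUL."""
    idx = buf.find(b"\0")
    return len(buf) if idx < 0 else idx


def fuzz_key(key: bytes, mask: bytes) -> bytes:
    """Naive fuzzing (hypothetical): XOR every key byte with a mask byte."""
    return bytes(k ^ m for k, m in zip(key, mask))


def fuzz_key_safe(key: bytes, mask: bytes) -> bytes:
    """Same, but keep the original byte whenever the XOR would yield 0."""
    return bytes((k ^ m) or k for k, m in zip(key, mask))


key = b"password"
# The 4th mask byte equals key[3] ('s'), so the naive XOR produces a NUL there.
mask = bytes([1, 2, 3, ord("s"), 5, 6, 7, 8])

naive = fuzz_key(key, mask)
safe = fuzz_key_safe(key, mask)
print(c_strlen(naive), c_strlen(safe))
```

With the naive variant the 8-character key is cut to 3 characters, while
the guarded variant keeps all 8, so every fuzzed key has the same length.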

$ ../run/john -test -form:md5crypt-opencl -dev=4 -v:4
Device 4: GeForce GTX TITAN X
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... Options used: -I ../run/kernels -cl-mad-enable -DSM_MAJOR=5 -DSM_MINOR=2 -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=262162 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Calculating best GWS for LWS=32; max. 250ms single kernel invocation.
gws:      1024	    449198 c/s   449198000 rounds/s   2.279ms per crypt_all()!
gws:      2048	    892620 c/s   892620000 rounds/s   2.294ms per crypt_all()+
gws:      4096	   1686918 c/s  1686918000 rounds/s   2.428ms per crypt_all()+
gws:      8192	   2730841 c/s  2730841000 rounds/s   2.999ms per crypt_all()+
gws:     16384	   3258489 c/s  3258489000 rounds/s   5.028ms per crypt_all()+
gws:     32768	   2326987 c/s  2326987000 rounds/s  14.081ms per crypt_all()
gws:     65536	   2835324 c/s  2835324000 rounds/s  23.114ms per crypt_all()
Calculating best LWS for GWS=16384
Testing LWS=32 GWS=16384 ... 17.250ms+
Testing LWS=64 GWS=16384 ... 17.270ms
Testing LWS=96 GWS=16320 ... 17.203ms
Testing LWS=128 GWS=16384 ... 17.276ms
Testing LWS=160 GWS=16320 ... 18.320ms
Testing LWS=192 GWS=16320 ... 17.198ms+
Testing LWS=224 GWS=16352 ... 18.416ms
Testing LWS=256 GWS=16384 ... 17.128ms+
Testing LWS=288 GWS=16128 ... 18.256ms
Testing LWS=512 GWS=16384 ... 17.887ms
Testing LWS=1024 GWS=16384 ... 17.464ms
Calculating best GWS for LWS=256; max. 500ms single kernel invocation.
gws:      6144	   2579293 c/s  2579293000 rounds/s   2.382ms per crypt_all()!
gws:     12288	   3464203 c/s  3464203000 rounds/s   3.547ms per crypt_all()+
gws:     24576	   2889977 c/s  2889977000 rounds/s   8.503ms per crypt_all()
gws:     49152	   2170788 c/s  2170788000 rounds/s  22.642ms per crypt_all()
gws:     98304	   2560407 c/s  2560407000 rounds/s  38.393ms per crypt_all()
gws:    196608	   2462964 c/s  2462964000 rounds/s  79.825ms per crypt_all()
gws:    393216	   2556987 c/s  2556987000 rounds/s 153.780ms per crypt_all()
gws:    786432	   2562092 c/s  2562092000 rounds/s 306.949ms per crypt_all()
Local worksize (LWS) 256, global worksize (GWS) 12288
DONE
Raw:	3589K c/s real, 3624K c/s virtual

Speed matches autotune's (actually we get even better speed than predicted).
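
For reading the verbose log, a short sketch of the arithmetic may help.
It is inferred from the numbers shown, not taken from the autotune code,
and the function names are hypothetical. The logged c/s agrees closely
with GWS divided by the crypt_all() time, the rounds/s column is c/s
times md5crypt's 1000 iterations, and the GWS values in the LWS pass
match 16384 rounded down to a multiple of the LWS being tested.

```python
# Second-pass GWS table copied from the log: (gws, logged c/s, ms per crypt_all()).
GWS_TABLE = [
    (6144, 2579293, 2.382),
    (12288, 3464203, 3.547),
    (24576, 2889977, 8.503),
    (49152, 2170788, 22.642),
    (98304, 2560407, 38.393),
    (196608, 2462964, 79.825),
    (393216, 2556987, 153.780),
    (786432, 2562092, 306.949),
]


def candidates_per_second(gws: int, ms_per_crypt_all: float) -> float:
    """Throughput if one crypt_all() call processes `gws` candidates."""
    return gws / (ms_per_crypt_all / 1000.0)


def round_gws_to_lws(gws: int, lws: int) -> int:
    """Round GWS down to a multiple of LWS (assumed rule, matches the log)."""
    return (gws // lws) * lws


def best_gws(table):
    """Pick the GWS with the highest logged c/s (ignores any duration cap)."""
    return max(table, key=lambda row: row[1])[0]


for gws, logged_cps, ms in GWS_TABLE:
    estimate = candidates_per_second(gws, ms)
    print(f"gws={gws:7d}  logged={logged_cps} c/s  gws/time={estimate:.0f} c/s")

print("best GWS:", best_gws(GWS_TABLE))
print([round_gws_to_lws(16384, lws) for lws in (96, 160, 224, 288)])
```

The real autotuner also enforces the maximum kernel duration printed in
the log and may apply further thresholds that this sketch leaves out.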

A similar problem in self-test was that the indexes between tested keys
(indexes that core just skips) were filled with max-length plaintexts.
They are now filled with plaintexts of the same length as the first
test vector's plaintext. Max length would slow things down considerably
for formats like RAR or 7z, and to some extent md5crypt too. But that
shouldn't affect benchmark results, only the time spent self-testing.
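
The self-test filler change can be sketched as follows. This is only an
illustration under assumptions: `fill_key_buffer`, the index-to-plaintext
dictionary and the "abc" plaintext are hypothetical, and only the maximum
length of 15 comes from the -DPLAINTEXT_LENGTH=15 option in the log.

```python
PLAINTEXT_LENGTH = 15  # max key length, from -DPLAINTEXT_LENGTH=15 in the log


def fill_key_buffer(tested, count, first_plaintext, use_max_length=False):
    """Return `count` keys: test vectors at their indexes, filler elsewhere.

    `tested` maps index -> plaintext under test (hypothetical layout).
    With use_max_length=True the filler mimics the old behaviour.
    """
    filler_len = PLAINTEXT_LENGTH if use_max_length else len(first_plaintext)
    filler = "A" * filler_len
    return [tested.get(i, filler) for i in range(count)]


tested = {0: "abc", 5: "abc"}  # illustrative plaintext, not a real test vector
old = fill_key_buffer(tested, 8, "abc", use_max_length=True)
new = fill_key_buffer(tested, 8, "abc")
print(sum(len(k) for k in old), sum(len(k) for k in new))
```

With the old filler, six of the eight slots carry 15-character keys.
With the new one, every slot has the same length as the test vector, so
length-sensitive formats do no extra work on the skipped indexes.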

I'm currently experimenting with sorting keys by length for RAR-opencl,
and these issues added a lot of noise to the measurements.
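
As a rough idea of what sorting keys by length involves, here is a
minimal sketch. It is not the RAR-opencl code: the function names are
hypothetical, and the motivation (keys of similar length ending up next
to each other so they take similar amounts of work) is an assumption.
The point it shows is that the sort has to remember the original indexes
so that results can be mapped back.

```python
def sort_keys_by_length(keys):
    """Return (keys sorted by length, original index of each sorted key)."""
    order = sorted(range(len(keys)), key=lambda i: len(keys[i]))
    return [keys[i] for i in order], order


def unsort_results(results, order):
    """Map per-key results from sorted order back to the original order."""
    original = [None] * len(results)
    for pos, idx in enumerate(order):
        original[idx] = results[pos]
    return original


keys = ["longerkey", "ab", "abcd"]
sorted_keys, order = sort_keys_by_length(keys)
# Stand-in for per-key results computed in sorted order.
results = [len(k) for k in sorted_keys]
print(sorted_keys, order, unsort_results(results, order))
```

Python's sort is stable, so keys of equal length keep their relative
order here.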

magnum