john-dev - Re: LWS and GWS auto-tuning

Message-ID: <20150826065957.GB4229@openwall.com>
Date: Wed, 26 Aug 2015 09:59:57 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

On Tue, Aug 25, 2015 at 08:36:44PM +0200, magnum wrote:
> On 2015-08-25 18:00, magnum wrote:
> >The boost from this is better than I thought. The difference is
> >sometimes 2x and more!
> >
> >Here's my laptop top/bottom 10:
> >(...)
> >The worst ones I'm pretty sure are false/coincidental. I manually
> >re-tested some of the best ones and it's true: They really auto-tune to
> >2.5x faster speed than before.
> >
> >I'm currently testing Tahiti/Titan on super.
> 
> Worst/best 10 for Tahiti (oldoffice failing):

Thanks!  What code version are these benchmarks for?  I ask because some
of the cleanups you made after committing my patch are not no-ops.

Specifically, commit 244d113dce38fcd1ead0f6abf0557863844313b2 with
comment "OpenCL autotune: Drop obsolete functions get_task_max_size()
and get_default_workgroup()." appears to change what LWS is used during
the first GWS auto-tuning run, which also changes the GWS soft-limit for
that run due to how I am calculating it:

		soft_limit = local_work_size * core_count * 128;
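
To illustrate how the tentative LWS feeds into this limit, here is a
minimal sketch.  Only the soft_limit formula comes from the patch; the
doubling loop, the starting GWS of 1024, the stop condition, and the
sample inputs (LWS 32 vs. 128, 14 cores) are assumptions made up for
illustration, and both function names are hypothetical.

```c
#include <stddef.h>

/* Soft limit, using the formula quoted above. */
size_t gws_soft_limit(size_t local_work_size, size_t core_count)
{
	return local_work_size * core_count * 128;
}

/*
 * Hypothetical first pass: double GWS starting from 1024 and stop at the
 * first value that reaches the soft limit.  The kernel run time limit,
 * which would also end the pass, is ignored here.
 */
size_t first_pass_last_gws(size_t local_work_size, size_t core_count)
{
	size_t limit = gws_soft_limit(local_work_size, core_count);
	size_t gws = 1024;

	while (gws < limit)
		gws *= 2;
	return gws;
}
```

With these made-up inputs, a 4x larger tentative LWS moves the stopping
point of the sketch from GWS 65536 to GWS 262144.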

For example, when auto-tuning md5crypt-opencl on super's TITAN, my code
ran until GWS 65536 on the first pass:

Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws:      1024      160787 c/s   160787000 rounds/s   6.368ms per crypt_all()!
gws:      2048      305993 c/s   305993000 rounds/s   6.692ms per crypt_all()+
gws:      4096      572932 c/s   572932000 rounds/s   7.149ms per crypt_all()+
gws:      8192      956755 c/s   956755000 rounds/s   8.562ms per crypt_all()+
gws:     16384      987580 c/s   987580000 rounds/s  16.590ms per crypt_all()+
gws:     32768     1228546 c/s  1228546000 rounds/s  26.672ms per crypt_all()+
gws:     65536     1417537 c/s  1417537000 rounds/s  46.232ms per crypt_all()+
Calculating best local worksize (LWS)

Yours runs until GWS 262144:

Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws:      1024      152045 c/s   152045000 rounds/s   6.734ms per crypt_all()!
gws:      2048      303818 c/s   303818000 rounds/s   6.740ms per crypt_all()+
gws:      4096      537474 c/s   537474000 rounds/s   7.620ms per crypt_all()+
gws:      8192      939022 c/s   939022000 rounds/s   8.723ms per crypt_all()+
gws:     16384     1563478 c/s  1563478000 rounds/s  10.479ms per crypt_all()+
gws:     32768     1523566 c/s  1523566000 rounds/s  21.507ms per crypt_all()
gws:     65536     1579223 c/s  1579223000 rounds/s  41.498ms per crypt_all()+
gws:    131072     1785862 c/s  1785862000 rounds/s  73.394ms per crypt_all()+
gws:    262144     1898972 c/s  1898972000 rounds/s 138.045ms per crypt_all()+
Calculating best local worksize (LWS)

This makes the auto-tuning unnecessarily(?) slower (5.7 vs. 7.0 seconds
for "./john -test -form=md5crypt-opencl -dev=5 -v=4"), and it might also
result in different auto-tuned LWS and GWS.  Luckily, for
md5crypt-opencl it's the same, but I recall that until I introduced the
0.997 threshold it would have made things worse even for this one
format.  (BTW, 0.999 worked as well, but didn't have a safety margin.)
And for other formats, it might be making things worse despite this
threshold.  In general, I found that going to high GWS values with a
suboptimal LWS partially hides that suboptimal choice of LWS, and thus
works against proper LWS tuning.  It appeared to make a higher than
otherwise optimal LWS look very slightly faster (like 0.1%), and the
end result may be a similar c/s rate at a much higher GWS (as the higher
GWS then continues to be used to compensate for the suboptimal LWS),
which as we know makes JtR inconvenient to use.  I found that the
resulting difference in optimal GWS can be as high as 10x (under 100k
vs. almost a million for md5crypt).
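
As a rough sketch of what such a threshold does: this message does not
show where or how 0.997 is applied in the code, so the comparison below
and the name candidate_wins() are assumptions.  The idea shown is only
that a candidate setting must beat the current best by more than the
noise margin before it is accepted.

```c
/*
 * Hypothetical acceptance test: a candidate setting replaces the current
 * best only if its speed, scaled down by the threshold, still exceeds the
 * best speed seen so far.  With 0.997 this requires a gain of more than
 * roughly 0.3%.
 */
int candidate_wins(double speed, double best_speed)
{
	const double threshold = 0.997;

	return speed * threshold > best_speed;
}
```

Under this assumed comparison, a candidate that is only 0.1% faster is
rejected, while one that is 1% faster is accepted.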

BTW, on this line:

			max_run_time1 = (max_run_time + 1) / 2;

I also tried 1/10 (so +9 and /10) instead of 1/2, and it worked even
more reliably for md5crypt-opencl (and obviously quicker too), but
unfortunately turned out to be insufficient e.g. for phpass (where 1/10
of 200ms is 20ms, and surprisingly that's more than the very first
kernel run's duration).  I think there's room for improvement here.
Maybe with a few per-format issues fixed, we could use some value
in between 1/2 and 1/10 here - e.g., perhaps 1/4 would work well for
almost all formats (meaning 50ms for phpass currently, so still unsafe
for it).
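
The "+1 and /2" and "+9 and /10" forms are both integer division
rounded up.  A minimal sketch of the general form follows; the function
name first_pass_budget() is hypothetical, and the values are treated as
milliseconds purely for illustration (the unit used by the real
variable is not stated here).

```c
/*
 * Divide a run time budget by an integer divisor, rounding up.
 * divisor = 2 gives (max_run_time + 1) / 2, as in the line quoted above;
 * divisor = 10 gives (max_run_time + 9) / 10.
 */
int first_pass_budget(int max_run_time, int divisor)
{
	return (max_run_time + divisor - 1) / divisor;
}
```

For a budget of 200 this yields 100, 50, and 20 for divisors 2, 4, and
10, matching the phpass figures mentioned above.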

What LWS is being used for the first pass at GWS tuning now?  Where is
it set in code now?

I think we should add reporting of the tentative LWS to the "Calculating
best global worksize (GWS)" lines.

If the tuned LWS happens to be the same as what was used during the
first pass at GWS tuning, then the second pass should start where the
first pass left off (only test higher GWS than the first pass reached).
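
A minimal sketch of that idea, assuming GWS is probed by doubling and
that the first pass records the last GWS it tested.  All names here are
hypothetical and not taken from the current code.

```c
#include <stddef.h>

/*
 * Pick the GWS at which the second tuning pass should start.
 * If LWS tuning confirmed the tentative LWS, the first pass results are
 * still valid, so resume just above the last GWS it tested.  Otherwise
 * the timings were taken with a different LWS and the pass restarts
 * from min_gws.
 */
size_t second_pass_start_gws(size_t tentative_lws, size_t tuned_lws,
                             size_t first_pass_last_gws, size_t min_gws)
{
	if (tuned_lws == tentative_lws)
		return first_pass_last_gws * 2;
	return min_gws;
}
```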

Maybe the initial LWS should be based on
ocl_device_list[sequence_nr].cores_per_MP unless a given format requests
otherwise - e.g., bcrypt-opencl would, for specific GPUs.
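
Sketched in code, with the per-format hint being a hypothetical
parameter (no such interface is claimed to exist in the tree) and 0
taken to mean "no preference":

```c
#include <stddef.h>

/*
 * Choose the tentative LWS for the first GWS tuning pass.
 * A non-zero format_hint overrides the device-derived default; otherwise
 * fall back to the device's cores per multiprocessor, or 1 if unknown.
 */
size_t initial_lws(size_t cores_per_mp, size_t format_hint)
{
	if (format_hint != 0)
		return format_hint;
	return cores_per_mp != 0 ? cores_per_mp : 1;
}
```

A caller would pass ocl_device_list[sequence_nr].cores_per_MP as the
first argument.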

BTW, bcrypt-opencl now bypasses auto-tuning.  Maybe it shouldn't (except
for exact GPUs it's fully aware of), but should instead provide hints.
The same probably applies to many other formats.

How do other password crackers approach this issue?  For example, I
don't recall hearing of oclHashcat doing any auto-tuning.  In cryptocoin
miners, there's an "intensity" setting, which I guess adjusts GWS.
IIRC, oclHashcat has something like it too.  But I think these programs
use some nearly-optimal settings even when the user hasn't increased the
default intensity - so how do they manage?

Alexander