Date: Wed, 26 Aug 2015 09:59:57 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

On Tue, Aug 25, 2015 at 08:36:44PM +0200, magnum wrote:
> On 2015-08-25 18:00, magnum wrote:
> >The boost from this is better than I thought. The difference is
> >sometimes 2x and more!
> >
> >Here's my laptop top/bottom 10:
> >(...)
> >The worst ones I'm pretty sure are false/coincidental. I manually
> >re-tested some of the best ones and it's true: They really auto-tune to
> >2.5x faster speed than before.
> >
> >I'm currently testing Tahiti/Titan on super.
>
> Worst/best 10 for Tahiti (oldoffice failing):

Thanks!  What code version are these benchmarks for?

I ask because some of the cleanups you made after committing my patch
are not no-ops.  Specifically, commit
244d113dce38fcd1ead0f6abf0557863844313b2 with comment "OpenCL autotune:
Drop obsolete functions get_task_max_size() and get_default_workgroup()."
appears to change what LWS is used during the first GWS auto-tuning run,
which also changes the GWS soft-limit for that run due to how I am
calculating it:

	soft_limit = local_work_size * core_count * 128;

For example, when auto-tuning md5crypt-opencl on super's TITAN, my code
ran until GWS 65536 on the first pass:

Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws:      1024     160787 c/s   160787000 rounds/s   6.368ms per crypt_all()!
gws:      2048     305993 c/s   305993000 rounds/s   6.692ms per crypt_all()+
gws:      4096     572932 c/s   572932000 rounds/s   7.149ms per crypt_all()+
gws:      8192     956755 c/s   956755000 rounds/s   8.562ms per crypt_all()+
gws:     16384     987580 c/s   987580000 rounds/s  16.590ms per crypt_all()+
gws:     32768    1228546 c/s  1228546000 rounds/s  26.672ms per crypt_all()+
gws:     65536    1417537 c/s  1417537000 rounds/s  46.232ms per crypt_all()+
Calculating best local worksize (LWS)

Yours runs until GWS 262144:

Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws:      1024     152045 c/s   152045000 rounds/s   6.734ms per crypt_all()!
gws:      2048     303818 c/s   303818000 rounds/s   6.740ms per crypt_all()+
gws:      4096     537474 c/s   537474000 rounds/s   7.620ms per crypt_all()+
gws:      8192     939022 c/s   939022000 rounds/s   8.723ms per crypt_all()+
gws:     16384    1563478 c/s  1563478000 rounds/s  10.479ms per crypt_all()+
gws:     32768    1523566 c/s  1523566000 rounds/s  21.507ms per crypt_all()
gws:     65536    1579223 c/s  1579223000 rounds/s  41.498ms per crypt_all()+
gws:    131072    1785862 c/s  1785862000 rounds/s  73.394ms per crypt_all()+
gws:    262144    1898972 c/s  1898972000 rounds/s 138.045ms per crypt_all()+
Calculating best local worksize (LWS)

This makes the auto-tuning unnecessarily(?) slower (5.7 vs. 7.0 seconds
for "./john -test -form=md5crypt-opencl -dev=5 -v=4"), and it might also
result in different auto-tuned LWS and GWS.  Luckily, for
md5crypt-opencl it's the same, but I recall that until I introduced the
0.997 threshold it would have made things worse even for this one
format.  (BTW, 0.999 worked as well, but didn't have a safety margin.)
And for other formats, it might be making things worse despite this
threshold.

In general, I found that going to high GWS values with a suboptimal LWS
partially hides this suboptimal choice of LWS, and thus goes against
proper LWS tuning.  It appeared to result in a higher than otherwise
optimal LWS appearing very slightly faster (like 0.1%), and the end
result may be a similar c/s rate at much higher GWS (as higher GWS then
continues to be used to compensate for the suboptimal LWS), which as we
know makes JtR inconvenient to use.  I found that the resulting
difference in optimal GWS can be as high as 10x (under 100k vs. almost a
million for md5crypt).

BTW, on this line:

	max_run_time1 = (max_run_time + 1) / 2;

I also tried 1/10 (so +9 and /10) instead of 1/2, and it worked even
more reliably for md5crypt-opencl (and obviously quicker too), but
unfortunately turned out to be insufficient e.g. for phpass (where 1/10
of 200ms is 20ms, and surprisingly that's more than the very first
kernel run's duration).  I think there's room for improvement here.
Maybe with a few per-format issues fixed, we could use some value in
between 1/2 and 1/10 here - e.g., perhaps 1/4 would work well for almost
all formats (meaning 50ms for phpass currently, so still unsafe for it).

What LWS is being used for the first pass at GWS tuning now?  Where is
it set in code now?

I think we should add reporting of the tentative LWS to the "Calculating
best global worksize (GWS)" lines.

If the tuned LWS happens to be the same as what was used during the
first pass at GWS tuning, then the second pass should start where the
first pass left off (only test higher GWS than the first pass reached).

Maybe the initial LWS should be based on
ocl_device_list[sequence_nr].cores_per_MP unless a given format requests
otherwise - e.g., bcrypt-opencl would, for specific GPUs.

BTW, bcrypt-opencl now bypasses auto-tuning.  Maybe it shouldn't (except
for exact GPUs it's fully aware of), but should instead provide hints.
The same probably applies to many other formats.

How do other password crackers approach this issue?  For example, I
don't recall hearing of oclHashcat doing any auto-tuning.  In cryptocoin
miners, there's an "intensity" setting, which I guess adjusts GWS.  IIRC,
oclHashcat has something like it too.  But I think these programs use
some nearly-optimal settings even when the user hasn't increased the
default intensity - so how do they manage?

Alexander
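
For reference, the two first-pass limits quoted in the message can be sketched in C roughly as follows.  Only the two formulas come from the message itself.  The function names, the divisor parameter, the doubling helper and any sample inputs are assumptions made for illustration, not taken from the JtR source or from the runs shown.

```c
#include <stddef.h>

/* GWS soft limit for the first tuning pass, as quoted in the message:
 * soft_limit = local_work_size * core_count * 128; */
static size_t gws_soft_limit(size_t local_work_size, size_t core_count)
{
	return local_work_size * core_count * 128;
}

/* Time budget for the first pass.  Generalizes the quoted
 * (max_run_time + 1) / 2 and the "+9 and /10" variant: integer
 * division by `divisor`, rounded up.  The `divisor` parameter is
 * hypothetical; the message only discusses 2, 10 and 4. */
static unsigned int first_pass_run_time(unsigned int max_run_time,
    unsigned int divisor)
{
	return (max_run_time + divisor - 1) / divisor;
}

/* Hypothetical helper: largest GWS reached by doubling from `start`
 * without exceeding `soft_limit`.  Assumes the tuner doubles GWS each
 * step and stops at the limit; returns `start` unchanged if even the
 * first doubling would exceed the limit. */
static size_t last_gws_within(size_t start, size_t soft_limit)
{
	size_t gws = start;

	while (gws * 2 <= soft_limit)
		gws *= 2;
	return gws;
}
```

With a 200ms budget, first_pass_run_time() gives 100ms, 50ms and 20ms for divisors 2, 4 and 10, matching the figures mentioned for phpass.  Because the soft limit scales linearly with LWS, a larger tentative LWS directly lets the first pass climb to a higher GWS.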