Date: Tue, 27 Mar 2012 13:05:41 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: CUDA & OpenCL status

On Sun, Mar 25, 2012 at 11:08:41PM +0200, Lukas Odzioba wrote:
> 2012/3/25 Solar Designer <solar@...nwall.com>:
> > Lukas - meanwhile, I got phpass to 710k c/s on my GTX-570 1600 MHz (up
> > from 633k that I reported before) by moving from sm_10 to sm_20 or sm_21
> > for the generated code (the latter is apparently not valid for my card,
> > but happens to work - I guess the same code was generated as for sm_20)
> > and by increasing BLOCKS from 126*3 to 160*3. I guess 126 was tuned for
> > GTX-560, right? This new speed is slightly higher than the published
> > speed for hashcat on an equivalent graphics card.
>
> Nice results, good to see them.
> 126 was tuned for gtx 460, sm_10 is by default because of backward
> compatibility.

BTW, we need to document this somewhere - mention that for best
performance the arch and/or BLOCKS may need to be adjusted. Maybe we
also need to include a way to pick these by graphics card name (for
those we've tested or have specific expectations about) or/and have them
auto-tuned.

Do you know how hashcat handles this? Maybe it has different kernels
pre-built for different cards?

> Our phpass implementation is quite fine.

Yes, thank you for it! Your CUDA phpass code is great. The OpenCL one
is not that good, though - it fails for me on Nvidia and it can probably
be made twice faster on ATI.

> It will give similar results
> to hashcat depending on card you use (on 9800gt john is faster, on
> gtx460 hashcat is faster, as you stated on gtx570 john is faster).

On GTX-570, it is only faster after tuning. BTW, I got it further to
720k c/s by increasing BLOCKS to 1000*3 - the X desktop almost freezes
at this setting, though, so it may be too much for someone who actually
uses the graphics card as such. Oh, and --test takes a long time with
BLOCKS set this high.
Perhaps actual cracking would be non-interactive as well - I mean that
the status line will appear many seconds after a keypress.

Another thing I've tried is common subexpression elimination in H():
(b + c) vs. next round's (c + d). I did this by creating an HH2() where
I put c and d as first two arguments to H(), and then I interleaved HH()
and HH2(). Somehow this didn't result in any speedup for the CUDA code
on GTX-570. I did not verify if the code actually changed or not. In
my testing of this trick on CPUs, it sometimes improved and sometimes
hurt performance (different CPU archs). I think we may/should keep
revisiting this - even if it did not help on one occasion, it may help
on another.

Alexander
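
The H() trick described above can be sketched in plain C as follows. This is only an illustration and not the actual patch: it assumes that H() is the MD5 round 3 function (a three-way XOR, so the "+" in "(b + c)" is read as XOR here), and the macro bodies, rotl32() and the two round3_* helper functions are hypothetical names made up for the example. The step constants and message word indices are those of the first four steps of standard MD5 round 3.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by s bits (0 < s < 32). */
static inline uint32_t rotl32(uint32_t v, unsigned s)
{
    return (v << s) | (v >> (32 - s));
}

/* MD5 round 3 function. XOR is associative, so the grouping is ours
 * to choose; this macro always groups its first two arguments. */
#define H(x, y, z) (((x) ^ (y)) ^ (z))

/* Usual step: H is evaluated as (b ^ c) ^ d. */
#define HH(a, b, c, d, x, s, k) \
    (a) = (b) + rotl32((a) + H((b), (c), (d)) + (x) + (k), (s))

/* Hypothetical variant: c and d go first, so H is evaluated as
 * (c ^ d) ^ b. The result is the same; only the grouping differs. */
#define HH2(a, b, c, d, x, s, k) \
    (a) = (b) + rotl32((a) + H((c), (d), (b)) + (x) + (k), (s))

/* Four consecutive round 3 steps, all written with HH(). */
void round3_plain(uint32_t st[4], const uint32_t x[16])
{
    uint32_t a = st[0], b = st[1], c = st[2], d = st[3];
    HH(a, b, c, d, x[5],   4, 0xfffa3942u);
    HH(d, a, b, c, x[8],  11, 0x8771f681u);
    HH(c, d, a, b, x[11], 16, 0x6d9d6122u);
    HH(b, c, d, a, x[14], 23, 0xfde5380cu);
    st[0] = a; st[1] = b; st[2] = c; st[3] = d;
}

/* Same steps with HH() and HH2() interleaved. Steps 1 and 2 both
 * contain the subexpression (b ^ c), steps 3 and 4 both contain
 * (d ^ a), so a compiler has the chance to compute each only once. */
void round3_interleaved(uint32_t st[4], const uint32_t x[16])
{
    uint32_t a = st[0], b = st[1], c = st[2], d = st[3];
    HH (a, b, c, d, x[5],   4, 0xfffa3942u);  /* (b ^ c) ^ d */
    HH2(d, a, b, c, x[8],  11, 0x8771f681u);  /* (b ^ c) ^ a */
    HH (c, d, a, b, x[11], 16, 0x6d9d6122u);  /* (d ^ a) ^ b */
    HH2(b, c, d, a, x[14], 23, 0xfde5380cu);  /* (d ^ a) ^ c */
    st[0] = a; st[1] = b; st[2] = c; st[3] = d;
}
```

Both functions compute the same state, so the variant is safe to swap in. Whether it changes the generated code at all has to be checked per compiler and target; a compiler may already reassociate the XORs on its own, which would be one possible (unconfirmed) explanation for seeing no difference.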