Date: Tue, 27 Mar 2012 13:05:41 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: CUDA & OpenCL status

On Sun, Mar 25, 2012 at 11:08:41PM +0200, Lukas Odzioba wrote:
> 2012/3/25 Solar Designer <solar@...nwall.com>:
> > Lukas - meanwhile, I got phpass to 710k c/s on my GTX-570 1600 MHz (up
> > from 633k that I reported before) by moving from sm_10 to sm_20 or sm_21
> > for the generated code (the latter is apparently not valid for my card,
> > but happens to work - I guess the same code was generated as for sm_20)
> > and by increasing BLOCKS from 126*3 to 160*3.  I guess 126 was tuned for
> > GTX-560, right?  This new speed is slightly higher than the published
> > speed for hashcat on an equivalent graphics card.
> Nice results, good to see them.
> 126 was tuned for gtx 460, sm_10 is by default because of backward
> compatibility.

BTW, we need to document this somewhere - mention that for best
performance the arch and/or BLOCKS may need to be adjusted.  Maybe we
also need to include a way to pick these by graphics card name (for
those we've tested or have specific expectations about) and/or have
them auto-tuned.
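
A minimal sketch of the pick-by-card-name idea, in plain C so that it
runs without a GPU.  The struct, table and function names are
hypothetical; only the GTX 460 and GTX 570 values come from this thread
(sm_10 for the GTX 460 is simply the current default, not a tuned
choice), and the substring match assumes the driver reports names such
as "GeForce GTX 570":

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-card tuning entry; not part of the existing code. */
struct gpu_tuning {
    const char *name_substr; /* substring expected in the device name */
    const char *arch;        /* target arch for the generated code */
    int blocks;              /* value to use for BLOCKS */
};

/* Current defaults: sm_10 for backward compatibility, BLOCKS = 126*3. */
static const struct gpu_tuning default_tuning = { "", "sm_10", 126 * 3 };

/* Only the cards discussed in this thread are listed. */
static const struct gpu_tuning tuning_table[] = {
    { "GTX 460", "sm_10", 126 * 3 },
    { "GTX 570", "sm_20", 160 * 3 },
};

/* Return the first entry whose substring occurs in device_name,
 * or the defaults when the card is not in the table. */
const struct gpu_tuning *lookup_tuning(const char *device_name)
{
    size_t i;

    for (i = 0; i < sizeof(tuning_table) / sizeof(tuning_table[0]); i++)
        if (strstr(device_name, tuning_table[i].name_substr))
            return &tuning_table[i];
    return &default_tuning;
}
```

Cards that are not listed fall back to the current defaults, so an
untested card would behave as it does today.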

Do you know how hashcat handles this?  Maybe it has different kernels
pre-built for different cards?

> Our phpass implementation is quite fine.

Yes, thank you for it!  Your CUDA phpass code is great.

The OpenCL one is not that good, though - it fails for me on Nvidia and
it can probably be made twice as fast on ATI.

> It will give similar results
> to hashcat depending on card you use (on 9800gt john is faster, on
> gtx460 hashcat is faster, as you stated on gtx570 john is faster).

On GTX-570, it is only faster after tuning.  BTW, I got it further to
720k c/s by increasing BLOCKS to 1000*3 - the X desktop almost freezes
at this setting, though, so it may be too much for someone who actually
uses the graphics card as such.  Oh, and --test takes a long time with
BLOCKS set this high.  Perhaps actual cracking would be non-interactive
as well - I mean that the status line will appear many seconds after a
keypress.
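
The trade-off above suggests one possible auto-tuning rule: benchmark a
few BLOCKS values and keep the smallest one whose speed is within some
tolerance of the best.  This is only a sketch under that assumption; the
struct and function names and the tolerance parameter are hypothetical,
and the benchmarking itself is left out:

```c
#include <stddef.h>

/* Hypothetical benchmark record: a BLOCKS value and its measured c/s. */
struct bench_result {
    int blocks;
    double cps;
};

/* Return the smallest BLOCKS value whose speed is at least
 * (1 - tolerance) times the best measured speed; 0 if n is 0. */
int pick_blocks(const struct bench_result *r, size_t n, double tolerance)
{
    double best = 0.0;
    int choice = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (r[i].cps > best)
            best = r[i].cps;

    for (i = 0; i < n; i++)
        if (r[i].cps >= best * (1.0 - tolerance) &&
            (choice == 0 || r[i].blocks < choice))
            choice = r[i].blocks;

    return choice;
}
```

With the two GTX-570 measurements quoted in this thread (710k c/s at
160*3 and 720k c/s at 1000*3), a 2% tolerance would select 160*3, giving
up about 10k c/s in exchange for a responsive desktop.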

Another thing I've tried is common subexpression elimination in H():
(b + c) vs. next round's (c + d).  I did this by creating an HH2() where
I put c and d as first two arguments to H(), and then I interleaved HH()
and HH2().  Somehow this didn't result in any speedup for the CUDA code
on GTX-570.  I did not verify whether the generated code actually
changed.  In my testing of this trick on CPUs, it sometimes improved
and sometimes hurt performance (different CPU archs).  I think we
should keep revisiting this - even if it did not help on one occasion,
it may help on another.
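
For reference, a sketch of the HH()/HH2() interleaving in plain C,
assuming the standard MD5 definition H(x, y, z) = x ^ y ^ z (so the
shared term is an XOR).  Only the first four steps of MD5 round 3 are
shown; the macro and function names are illustrative and not taken from
the actual CUDA source:

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t v, int s)
{
    return (v << s) | (v >> (32 - s));
}

/* Grouped so that the first two arguments form one subexpression. */
#define H(x, y, z) (((x) ^ (y)) ^ (z))

/* Usual step: H(b, c, d) computes (b ^ c) first. */
#define HH(a, b, c, d, x, s, k) \
    do { (a) += H((b), (c), (d)) + (x) + (k); \
         (a) = rotl32((a), (s)) + (b); } while (0)

/* Same result, but H(c, d, b) computes (c ^ d) first, which is the
 * (b ^ c) of the preceding step once the registers have rotated. */
#define HH2(a, b, c, d, x, s, k) \
    do { (a) += H((c), (d), (b)) + (x) + (k); \
         (a) = rotl32((a), (s)) + (b); } while (0)

/* First four steps of MD5 round 3, written the usual way. */
void md5_r3_plain(uint32_t st[4], const uint32_t x[16])
{
    uint32_t a = st[0], b = st[1], c = st[2], d = st[3];

    HH(a, b, c, d, x[5], 4, 0xfffa3942u);
    HH(d, a, b, c, x[8], 11, 0x8771f681u);
    HH(c, d, a, b, x[11], 16, 0x6d9d6122u);
    HH(b, c, d, a, x[14], 23, 0xfde5380cu);
    st[0] = a; st[1] = b; st[2] = c; st[3] = d;
}

/* Same steps with HH2() interleaved, so that each HH/HH2 pair exposes
 * a common subexpression the compiler may reuse. */
void md5_r3_interleaved(uint32_t st[4], const uint32_t x[16])
{
    uint32_t a = st[0], b = st[1], c = st[2], d = st[3];

    HH(a, b, c, d, x[5], 4, 0xfffa3942u);    /* (b ^ c) ^ d */
    HH2(d, a, b, c, x[8], 11, 0x8771f681u);  /* (b ^ c) ^ a */
    HH(c, d, a, b, x[11], 16, 0x6d9d6122u);  /* (d ^ a) ^ b */
    HH2(b, c, d, a, x[14], 23, 0xfde5380cu); /* (d ^ a) ^ c */
    st[0] = a; st[1] = b; st[2] = c; st[3] = d;
}
```

Both functions compute the same state.  Whether the second form is any
faster depends on whether the compiler reuses the shared XOR, which can
only be confirmed by inspecting the generated code.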

Alexander
