john-dev - Re: CUDA & OpenCL status


Message-ID: <20120327090541.GB14695@openwall.com>
Date: Tue, 27 Mar 2012 13:05:41 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: CUDA & OpenCL status

On Sun, Mar 25, 2012 at 11:08:41PM +0200, Lukas Odzioba wrote:
> 2012/3/25 Solar Designer <solar@...nwall.com>:
> > Lukas - meanwhile, I got phpass to 710k c/s on my GTX-570 1600 MHz (up
> > from 633k that I reported before) by moving from sm_10 to sm_20 or sm_21
> > for the generated code (the latter is apparently not valid for my card,
> > but happens to work - I guess the same code was generated as for sm_20)
> > and by increasing BLOCKS from 126*3 to 160*3.  I guess 126 was tuned for
> > GTX-560, right?  This new speed is slightly higher than the published
> > speed for hashcat on an equivalent graphics card.
> Nice results, good to see them.
> 126 was tuned for gtx 460, sm_10 is by default because of backward
> compatibility.

BTW, we need to document this somewhere - mention that for best
performance the arch and/or BLOCKS may need to be adjusted.  Maybe we
also need to include a way to pick these by graphics card name (for
the cards we've tested or have specific expectations about) and/or have
them auto-tuned.
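
A minimal sketch of the "pick by graphics card name" idea, in C.  The
struct, table and function names here are hypothetical and not part of
John the Ripper.  The two entries only restate the settings mentioned in
this thread (sm_20 with BLOCKS 160*3 on GTX-570, the default sm_10 with
BLOCKS 126*3 tuned on GTX 460), and matching on a substring of the device
name is an assumption about how the driver reports the card.  Since the
arch is chosen when the kernel is compiled, a real implementation would
also need a kernel built for each arch listed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-card tuning record; names are illustrative only. */
struct gpu_tuning {
    const char *name_substr; /* substring expected in the device name */
    const char *arch;        /* target arch for the generated code */
    int blocks;              /* value to use for BLOCKS */
};

/* Entries restate only the settings reported in this thread. */
static const struct gpu_tuning tuning_table[] = {
    { "GTX 570", "sm_20", 160 * 3 },
    { "GTX 460", "sm_10", 126 * 3 },
};

/* Fallback: the current defaults (sm_10 for backward compatibility). */
static const struct gpu_tuning default_tuning = { "", "sm_10", 126 * 3 };

/* Return the first table entry whose substring occurs in device_name,
 * or the defaults when the card is not in the table. */
static const struct gpu_tuning *pick_tuning(const char *device_name)
{
    size_t i;

    assert(device_name != NULL);
    for (i = 0; i < sizeof(tuning_table) / sizeof(tuning_table[0]); i++)
        if (strstr(device_name, tuning_table[i].name_substr))
            return &tuning_table[i];
    return &default_tuning;
}
```

An unknown card falls back to the defaults, so a lookup like this could
be combined with auto-tuning for cards that are not in the table.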

Do you know how hashcat handles this?  Maybe it has different kernels
pre-built for different cards?

> Our phpass implementation is quite fine.

Yes, thank you for it!  Your CUDA phpass code is great.

The OpenCL one is not that good, though - it fails for me on Nvidia and
it can probably be made twice as fast on ATI.

> It will give similar results
> to hashcat depending on card you use (on 9800gt john is faster, on
> gtx460 hashcat is faster, as you stated on gtx570 john is faster).

On GTX-570, it is only faster after tuning.  BTW, I got it further to
720k c/s by increasing BLOCKS to 1000*3 - the X desktop almost freezes
at this setting, though, so it may be too much for someone who actually
uses the graphics card as such.  Oh, and --test takes a long time with
BLOCKS set this high.  Perhaps actual cracking would be non-interactive
as well - I mean that the status line will appear many seconds after a
keypress.

Another thing I've tried is common subexpression elimination in H():
(b ^ c) vs. the next step's (c ^ d).  I did this by creating an HH2()
where I put c and d as the first two arguments to H(), and then I
interleaved HH() and HH2().  Somehow this didn't result in any speedup
for the CUDA code on GTX-570.  I did not verify whether the generated
code actually changed.  In
my testing of this trick on CPUs, it sometimes improved and sometimes
hurt performance (different CPU archs).  I think we may/should keep
revisiting this - even if it did not help on one occasion, it may help
on another.
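
To illustrate the trick, here is a minimal C sketch of four consecutive
MD5 round 3 steps, written once in the plain form and once with the
shared XOR computed explicitly.  It is not the HH()/HH2() macro pair
described above, just the same idea spelled out: in round 3, H(x, y, z)
is x ^ y ^ z, so one step's (b ^ c) is reused by the following step.
The function names and the test inputs are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit left rotate; s must be in 1..31. */
static uint32_t rotl32(uint32_t x, unsigned s)
{
    assert(s > 0 && s < 32);
    return (x << s) | (x >> (32 - s));
}

/* Four round 3 steps in the plain form, H(x, y, z) = x ^ y ^ z. */
static void round3_plain(uint32_t v[4], const uint32_t x[4],
                         const uint32_t k[4])
{
    uint32_t a = v[0], b = v[1], c = v[2], d = v[3];

    a = b + rotl32(a + (b ^ c ^ d) + x[0] + k[0], 4);
    d = a + rotl32(d + (a ^ b ^ c) + x[1] + k[1], 11);
    c = d + rotl32(c + (d ^ a ^ b) + x[2] + k[2], 16);
    b = c + rotl32(b + (c ^ d ^ a) + x[3] + k[3], 23);
    v[0] = a; v[1] = b; v[2] = c; v[3] = d;
}

/* Same four steps, with the XOR shared by each pair computed once. */
static void round3_shared(uint32_t v[4], const uint32_t x[4],
                          const uint32_t k[4])
{
    uint32_t a = v[0], b = v[1], c = v[2], d = v[3], t;

    t = b ^ c; /* used by steps 1 and 2 */
    a = b + rotl32(a + (t ^ d) + x[0] + k[0], 4);
    d = a + rotl32(d + (a ^ t) + x[1] + k[1], 11);
    t = d ^ a; /* used by steps 3 and 4 */
    c = d + rotl32(c + (t ^ b) + x[2] + k[2], 16);
    b = c + rotl32(b + (c ^ t) + x[3] + k[3], 23);
    v[0] = a; v[1] = b; v[2] = c; v[3] = d;
}

/* Return 1 if both variants agree on one arbitrary input. */
static int round3_variants_agree(void)
{
    const uint32_t x[4] = { 1, 2, 3, 4 };           /* arbitrary words */
    const uint32_t k[4] = { 0xfffa3942u, 0x8771f681u,
                            0x6d9d6122u, 0xfde5380cu };
    uint32_t p[4] = { 0x67452301u, 0xefcdab89u,
                      0x98badcfeu, 0x10325476u };
    uint32_t q[4] = { 0x67452301u, 0xefcdab89u,
                      0x98badcfeu, 0x10325476u };

    round3_plain(p, x, k);
    round3_shared(q, x, k);
    return p[0] == q[0] && p[1] == q[1] && p[2] == q[2] && p[3] == q[3];
}
```

A compiler may already perform this elimination on the plain form, which
would be one possible reason for seeing no speedup; comparing the
generated code for both variants is the way to tell.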

Alexander
