john-dev - Re: CUDA & OpenCL status


Message-ID: <20120325132337.GA10318@openwall.com>
Date: Sun, 25 Mar 2012 17:23:37 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: CUDA & OpenCL status

On Sun, Mar 25, 2012 at 06:07:44AM +0400, Solar Designer wrote:
> On Thu, Mar 22, 2012 at 11:10:15AM +0200, Milen Rangelov wrote:
> > Problem is that NVidia does not have the rotate. AMD has BITALIGN_INT that
> > does it. With NVidia, rotate() would actually do (a<<s)|(a>>(32-s)). With
> > Fermi you can have the SHL+ADD thing (which I guess is just a 32-bit MAD in
> > fact), but rotate() still does not do the trick. What I do is something
> > like:
> > 
> > #define ROTATE(a, s) (((a) << (s)) + ((a) >> (32 - (s))))
> 
> Oh, ADD instead of OR, and them having a 32-bit integer MAD.  This makes
> sense.  I did not realize they had it (I thought it was FP only).

I've just tried this (for phpass), and it did not result in MADs being
generated.  I did get ADDs in place of the ORs, but the only advantage
was that these ADDs were sometimes reordered with the other ADDs that we
genuinely have in MD5.  (BTW, this optimization may thus be helpful on
CPUs as well, by giving the compiler more instruction scheduling
freedom.)  I was compiling with "-arch=sm_21", and there were a couple
of other (unrelated) 32-bit signed int MADs in the resulting code (if I
read it correctly).  I even tried deliberately using a signed int there
(as opposed to unsigned), but this did not help.  Any idea why I wasn't
getting MADs for this, or how I can tell the compiler to use a MAD more
explicitly?
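
To make the comparison concrete, here is a minimal sketch in plain C (not
the actual phpass kernel code) of the three spellings of a 32-bit rotate
discussed above, plus one MD5 round-1 step written both ways.  All
function names here are made up for illustration.  The OR and ADD forms
give identical results because the two shifted halves never share a set
bit, so the addition cannot carry.  The multiply-add spelling relies on
a << s being equal to a * 2^s modulo 2^32; whether a given compiler
turns it into an actual MAD instruction is not something this sketch can
promise.

```c
#include <assert.h>
#include <stdint.h>

/* Conventional rotate-left: the two halves are combined with OR.
 * Valid for 1 <= s <= 31. */
static uint32_t rotl_or(uint32_t a, unsigned s)
{
    return (a << s) | (a >> (32 - s));
}

/* Same rotate combined with ADD.  The halves occupy disjoint bit
 * positions, so no carry occurs and the result equals the OR form. */
static uint32_t rotl_add(uint32_t a, unsigned s)
{
    return (a << s) + (a >> (32 - s));
}

/* Multiply-add spelling: a << s equals a * 2^s modulo 2^32. */
static uint32_t rotl_mad(uint32_t a, unsigned s)
{
    return a * ((uint32_t)1 << s) + (a >> (32 - s));
}

/* MD5 round-1 function F. */
static uint32_t md5_f(uint32_t b, uint32_t c, uint32_t d)
{
    return (b & c) | (~b & d);
}

/* One MD5 round-1 step in the usual form: b + ROTL(a + F + x + t, s). */
static uint32_t md5_step_or(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                            uint32_t x, uint32_t t, unsigned s)
{
    uint32_t sum = a + md5_f(b, c, d) + x + t;
    return b + rotl_or(sum, s);
}

/* Same step with the ADD rotate.  The result is now a plain three-term
 * sum, so the additions may be evaluated in any order; here b is added
 * to the left-shifted half first. */
static uint32_t md5_step_add(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                             uint32_t x, uint32_t t, unsigned s)
{
    uint32_t sum = a + md5_f(b, c, d) + x + t;
    return (b + (sum << s)) + (sum >> (32 - s));
}

/* Checks that all three rotate spellings agree on a few sample values
 * for every shift count from 1 to 31. */
static void check_rotate_forms(void)
{
    static const uint32_t samples[] = {
        0x00000000u, 0x00000001u, 0x80000000u,
        0xffffffffu, 0x67452301u, 0xd76aa478u
    };
    unsigned i, s;

    for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        for (s = 1; s < 32; s++) {
            assert(rotl_add(samples[i], s) == rotl_or(samples[i], s));
            assert(rotl_mad(samples[i], s) == rotl_or(samples[i], s));
        }
    }
}
```

The md5_step_add() form is where the extra scheduling freedom comes
from: the rotate's final operation is no longer a separate OR that must
complete before the addition of b, but just one more term in a sum.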

Lukas - meanwhile, I got phpass to 710k c/s on my GTX-570 1600 MHz (up
from 633k that I reported before) by moving from sm_10 to sm_20 or sm_21
for the generated code (the latter is apparently not valid for my card,
but happens to work - I guess the same code was generated as for sm_20)
and by increasing BLOCKS from 126*3 to 160*3.  I guess 126 was tuned for
GTX-560, right?  This new speed is slightly higher than the published
speed for hashcat on an equivalent graphics card.

The sm_10 to sm_20 change made cryptmd5-cuda slightly slower, though.

Alexander
