john-dev - Re: optimized mscash2-opencl

Message-ID: <20120708085915.GD29336@openwall.com>
Date: Sun, 8 Jul 2012 12:59:15 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: optimized mscash2-opencl

On Sat, Jul 07, 2012 at 05:14:52PM +0530, Sayantan Datta wrote:
> Guess I didn't have enough insight into the code, which prevented me
> from moving the two SHA1 out of the 10K loop.

BTW, after I was done with my optimizations yesterday, I took a look at
Lukas' CUDA code and found out that its 10k loop is very similar to what
I came up with.  So you could have gotten this idea from there.

> BTW I was expecting much more
> performance, nearly double on 7970, because the two SHA1 represented almost
> half of the total computation.

As I wrote earlier, this didn't make a difference on its own because the
optimizer was presumably good enough to move those two SHA-1's out of
the loop anyway.  However, when I had tried changing S30() to use
rotate() (changed one source code line only), I got almost exactly half
the speed (below 50k c/s), which was a hint for me about these
SHA-1s.  Apparently, the optimizer is this good only when it is dealing
with pure bitwise ops, ADDs, and shifts, but not with rotate(), which it
probably treats as opaque.
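
To illustrate the two ways of writing S30(), here is a minimal sketch in
plain host C rather than OpenCL.  The names s30_shift() and rotl32() are
hypothetical; rotl32() only stands in for OpenCL's rotate() built-in,
which host C does not have.  Both forms compute the same value; the
observation above is only about how well the optimizer sees through them.

```c
#include <stdint.h>

/* Rotate left by 30 written with plain shifts and OR: only "pure"
 * bitwise ops, which the optimizer apparently handles well. */
static inline uint32_t s30_shift(uint32_t x)
{
    return (x << 30) | (x >> 2);
}

/* Generic 32-bit rotate left (hypothetical stand-in for OpenCL's
 * rotate(x, n)).  The n == 0 case avoids an undefined shift by 32. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}
```

In the kernel, the change amounts to switching one macro between these
two spellings.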

The actual speedup with my patch is mostly due to avoiding the
endianness conversions in the loop.  This also explains why there's
greater speedup on NVIDIA (no bit rotate instruction) than on AMD.
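
The idea of keeping endianness conversions out of the loop can be
sketched as follows.  This is a toy in host C, not the code from the
patch: step() is a hypothetical stand-in for the per-iteration SHA-1
work on big-endian words, and both loop functions are made up for
illustration.

```c
#include <stdint.h>

/* Byte-swap a 32-bit word using shifts and masks only. */
static inline uint32_t bswap32(uint32_t x)
{
    return (x << 24) | ((x & 0x0000FF00u) << 8) |
           ((x >> 8) & 0x0000FF00u) | (x >> 24);
}

/* Hypothetical stand-in for the real per-iteration hashing work. */
static inline uint32_t step(uint32_t w)
{
    return ((w << 5) | (w >> 27)) + 0x5A827999u;
}

/* Converts to and from SHA-1's word order on every iteration. */
static uint32_t loop_swapping(uint32_t le, unsigned iterations)
{
    for (unsigned i = 0; i < iterations; i++)
        le = bswap32(step(bswap32(le)));
    return le;
}

/* Converts once before the loop and once after it. */
static uint32_t loop_hoisted(uint32_t le, unsigned iterations)
{
    uint32_t w = bswap32(le);
    for (unsigned i = 0; i < iterations; i++)
        w = step(w);
    return bswap32(w);
}
```

The two functions return the same value, but the second one does two
swaps in total instead of two per iteration.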

BTW, we still rely on the optimizer to substitute constants for some W[]
elements.  SHA1_digest() could be optimized further, reducing our
reliance on the optimizer, which may be a good thing to do (such as for
different OpenCL SDKs).

> This could mean we are using a lot of global memory

Hardly.  The speed is pretty good.  At 99k c/s, we're at over 2 billion
SHA-1s per second, whereas hashcat does "2198.8M c/s" at raw SHA-1 on
a 7970, and this almost certainly includes some step reversals.
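
The arithmetic behind the "over 2 billion" figure can be checked with a
small sketch.  It assumes the 10k loop is mscash2's 10240 PBKDF2
iterations and that each iteration costs two SHA-1 block computations
(inner and outer HMAC), ignoring the setup work outside the loop.  The
function and macro names are hypothetical.

```c
#include <stdint.h>

#define MSCASH2_ITERATIONS 10240u /* assumed PBKDF2 iteration count */
#define SHA1_PER_ITERATION 2u     /* assumed: inner + outer HMAC block */

/* Approximate SHA-1 block computations per second for a given
 * candidate rate in c/s. */
static uint64_t sha1_per_second(uint64_t candidates_per_second)
{
    return candidates_per_second * MSCASH2_ITERATIONS * SHA1_PER_ITERATION;
}
```

Under these assumptions, sha1_per_second(99000) gives 2027520000, which
is about 92% of hashcat's quoted 2198.8M.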

> With this patch we are on par with
> hashcat on 570 but still lagging behind on 7970.

Yes.  Your correction to bitselect() brought us to over 90% of hashcat's
speed on 7970, though.

Alexander
