john-dev - Re: md5crypt-opencl

Message-ID: <20150902170447.GA25579@openwall.com>
Date: Wed, 2 Sep 2015 20:04:47 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: md5crypt-opencl

magnum, all -

On Mon, Aug 24, 2015 at 12:38:03PM +0300, Solar Designer wrote:
> I've experimented with optimizing our md5crypt-opencl in various ways -
> patch attached.

I looked into this further.  As I suspected from benchmarks on GCN
at different core clock rates (which had surprisingly little effect on
performance), the kernel ends up using global memory in the inner loop.
I've just confirmed this by reviewing the GCN ISA code.  Specifically,
the entire ctx[8] array is in global memory.  That's 512 bytes, so I
think it could as well fit in VGPRs, but somehow the compiler is
unwilling to do that.  Reducing this array to 448 bytes (14 words per
buffer, which can be done since the last two words are the length and
zero) doesn't help (it actually slows things down a little bit).

Do you have an idea how to force use of private memory aka VGPRs for
arrays like this?  I think the limit is 256 VGPRs per work-item, and
we're still sufficiently(?) below that (it'd be 1024 bytes, although
this obviously includes other variables and scratch registers).
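
To make the register budget above concrete, here is a minimal sketch in
plain C (not OpenCL C).  It assumes each VGPR holds one 32-bit word per
work-item and that each of the 8 buffers is one 64-byte MD5 block.  The
constant and function names are made up for illustration and do not
come from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values follow the figures quoted above. */
enum {
    NUM_BUFFERS   = 8,   /* ctx[8] */
    WORDS_FULL    = 16,  /* one 64-byte MD5 block per buffer */
    WORDS_TRIMMED = 14,  /* without the trailing length and zero words */
    VGPR_LIMIT    = 256  /* assumed per-work-item register limit */
};

/* Number of 32-bit registers needed to hold all buffers. */
static unsigned ctx_vgprs(unsigned words_per_buffer)
{
    assert(words_per_buffer <= WORDS_FULL);
    return NUM_BUFFERS * words_per_buffer;
}

/* Size of the whole ctx array in bytes. */
static unsigned ctx_bytes(unsigned words_per_buffer)
{
    return ctx_vgprs(words_per_buffer) * (unsigned)sizeof(uint32_t);
}

/* Registers left over for other variables and scratch use. */
static unsigned vgprs_left(unsigned words_per_buffer)
{
    return VGPR_LIMIT - ctx_vgprs(words_per_buffer);
}
```

Under these assumptions ctx_bytes(WORDS_FULL) is 512 and
ctx_bytes(WORDS_TRIMMED) is 448, while vgprs_left(WORDS_FULL) is 128,
i.e. half of the assumed 256-register limit would remain for
everything else.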

Specifying __private explicitly makes no difference, as expected.  As an
experiment, I tried moving this to __local without proper allocation for
the entire workgroup and without proper address calculation (so all
work-items were sharing the same ctx, resulting in incorrect
computation).  This resulted in a 1.5x speedup (so ~4.5M c/s on Tahiti),
but of course it fails self-test.  So that's the speed potential.  If we
actually allocate enough local memory, things will be slower because
this will limit concurrency (we actually have less local memory than
private memory), yet it might be faster than what we have now.  Ideally,
though, we'd just force the compiler to use VGPRs here... or is there
some fundamental reason why this won't work?  Such as lacking addressing
modes, but per my earlier reading of GCN arch manual, indexed access to
VGPRs is supported.
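
A properly allocated __local version would need one ctx per work-item
plus an index calculation based on the local ID.  The sketch below
models only that index arithmetic in plain C.  The interleaved layout,
the names, and the work-group size used in the note after it are
assumptions for illustration, not what the kernel currently does.

```c
#include <assert.h>

enum { NUM_BUFFERS = 8, WORDS_PER_BUFFER = 16 };

/* Bytes of local memory one work-group of lws work-items would need
 * if every work-item gets its own full set of buffers. */
static unsigned local_ctx_bytes(unsigned lws)
{
    return lws * NUM_BUFFERS * WORDS_PER_BUFFER * 4u;
}

/* Word index into one shared array for work-item lid.  The layout is
 * interleaved (hypothetical choice): the same logical word of
 * neighbouring work-items sits in neighbouring slots. */
static unsigned local_ctx_index(unsigned lws, unsigned lid,
                                unsigned buf, unsigned word)
{
    assert(lid < lws && buf < NUM_BUFFERS && word < WORDS_PER_BUFFER);
    return (buf * WORDS_PER_BUFFER + word) * lws + lid;
}
```

For example, local_ctx_bytes(64) is 32768, so a single group of 64
work-items would already need 32 KB of local memory, which illustrates
why this approach limits concurrency.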

Oh, is it possibly because the array is two-dimensional?  Like some
heuristic: "put all arrays with greater than one dimension in global
memory".  It is probably worth trying to turn the array into a
single-dimensional one to see whether that helps.
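
The flattening itself is mechanical: ctx[buf][word] becomes
ctx[buf * 16 + word].  A minimal plain C sketch, assuming 16 words per
buffer; the type, macro, and helper names are hypothetical.  Whether
this changes where the compiler places the array is exactly the open
question.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_BUFFERS = 8, WORDS_PER_BUFFER = 16 };

/* Flat replacement for "uint32_t ctx[NUM_BUFFERS][WORDS_PER_BUFFER]". */
typedef uint32_t flat_ctx[NUM_BUFFERS * WORDS_PER_BUFFER];

/* Map a (buffer, word) pair to its position in the flat array. */
static unsigned ctx_index(unsigned buf, unsigned word)
{
    assert(buf < NUM_BUFFERS && word < WORDS_PER_BUFFER);
    return buf * WORDS_PER_BUFFER + word;
}

/* Accessor so existing ctx[buf][word] uses can be rewritten as
 * CTX(ctx, buf, word) with minimal changes. */
#define CTX(ctx, buf, word) ((ctx)[ctx_index((buf), (word))])
```

The total size is unchanged (sizeof(flat_ctx) is still 512 bytes); only
the declared shape of the array differs.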

I think we have a good chance of beating oclHashcat's speed if we solve
this.  I think oclHashcat doesn't use multiple buffers because the
private or local memory use by 21 or 42 buffers would be too large, but
in our kernel Lukas came up with a more compact representation needing
only 8 buffers.  We just need to get them out of global memory...

And then we could apply the same to SHA-crypt.

Alexander
