Date: Fri, 19 Oct 2012 00:31:26 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice DES on GPU

On Thu, Oct 18, 2012 at 10:21:11PM +0530, Sayantan Datta wrote:
> On Thu, Oct 18, 2012 at 11:47 AM, Solar Designer <solar@...nwall.com> wrote:
> 
> > It would imply something other than bitslicing for K as well.  And the
> > S-boxes would be represented differently, too.  But I think you should
> > stop thinking in this direction.  I don't expect there's another useful
> > representation inbetween bitslicing and straightforward table lookups.
> 
> The primary reason for thinking in this direction is to decrease the
> load on registers and local memory so that there can be more in-flight
> wavefronts.

Sure, but this is a dead end.

> Also decreasing the size of B[] to 16 ints would almost ensure
> that it really stays in private register space. Right now I doubt the B[]
> arrays are stored in register address space

I don't expect them to.  They are not in registers on CPUs either, and
that's fine.

Well, GPUs' local memory presumably has much higher latency than CPUs'
L1 data cache, and maybe some portion of B, K, and/or E elements can be
in registers - just not all at once.

> because each VGPR is only 256-bit wide.

Aren't they 512-bit wide (on GCN)?

> However that depends entirely on how the VGPRs are used. For
> SIMD execution I think it is more logical to assume that each VGPR is
> loaded with data from 16 different kernels and not from only one kernel.

Isn't that the only possibility?

> Still if the array is not in register space we could be losing lots of
> performance.

Indeed if we could keep everything in registers that would be great.
It's just not realistic.  I'd expect optimal performance to be reached
when we're maximizing the number of in-flight wavefronts up to the point
where we're almost fully using both registers and LDS.  Staying only in
registers and thus having fewer wavefronts is likely slower.  You can
experiment with both of these settings, though - in fact, perhaps you
already have (even if not targeting them intentionally, but just tuning
keys_per_crypt).

> Also it could be possible to use 4 arrays of 16 ints to
> represent the B[] array of 64 ints.

I think you wouldn't gain anything from that.

>        Since the 96-element index array is almost constant, the indexing
> could be done prior to execution.  However, it might require some radical
> approach like pre-compiling the kernels manually.

Yes, or patching binary kernels at runtime.  This is something I had in
mind before we started.  I previously tested this on CPU, for a 7%
speedup over what we currently have in JtR.  On GPU, the speedup will
likely be much greater.  You may measure it by temporarily (in a copy of
the code) eliminating E[] and hard-coding the indices for a certain salt
value (it is easiest to do this for salt=0 - just take the indices out
of the LM hash code and use the test hash with ".." for the salt).

Alexander
