john-dev - Re: bitslice DES on GPU

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120813051935.GA6139@openwall.com>
Date: Mon, 13 Aug 2012 09:19:35 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice DES on GPU

On Mon, Aug 13, 2012 at 08:36:46AM +0530, Sayantan Datta wrote:
> What should be the value of DES_BS_EXPAND for GPU implentation ?

First, note that for LM hashes DES_BS_EXPAND is effectively always 0,
regardless of what it's set to (for other hash types).
DES_bs_crypt_LM() does not even include #if's for the other code
version.  This is because there's just one DES computation per LM hash,
so each expanded key bit is accessed exactly once, and thus there's no
gain from preloading those bits and writing them into another array
sequentially (which is what DES_BS_EXPAND does for other hash types).

Now, for DES-based crypt(3) hashes this is trickier.  Due to multiple
iterations of DES per hash computation, there are potential savings from
going through the pointer indirection just once and then loading those
expanded key bits sequentially on each iteration.  However, when the
word size as used in the bitslice DES implementation (e.g., 128-bit on
SSE2) is larger than pointer size (e.g., 32- or 64-bit), this increases
the required fast memory size that key material is being read from in
the loop.  With DES_BS_EXPAND = 0, we read 768 pointers and 56 data
words.  With DES_BS_EXPAND = 1, we read 768 data words.  Luckily, the
latter is still small enough for current CPUs (although it is getting
close to L1 data cache size on some).

On GPU, the available local memory per work-item is a scarce resource.
There might not be enough of it for the keys either way, so you might
end up choosing to use global memory for them - and then reading the 768
expanded key bits sequentially might work better (more data to read, but
those are sequential reads) - or it might not.  The partially
non-sequential reads of 56 adjacent data words would likely be pretty
fast as well, but then the question is where you store the 768-element
array of pointers or indices - and whether it is pointers (will have to
be separate for each work-item) or indices (may be shared and constant).

Note that 768 times 4 bytes (since you use 32-bit integers on GPU) is
already 3 KiB, and you will have some other data as well - so if you do
this expansion and you try to keep the expanded keys in local memory,
the occupancy would be similar to what you're getting with bf-opencl.
This is likely not the way to go.

Thus, it appears that if you want to keep keys in local memory,
DES_BS_EXPAND must be 0.  And then the question is how to implement the
indirection and whether you can have the 768-element array shared
between your work-items.  To have it shared, it won't be an array of
pointers, but rather an array of indices into the 56-bit array.  And
then its elements can be 8-bit, and it will be constant.  This is
probably the way to go _if_ you're able to keep the keys themselves in
local memory without this being too much of a constraint on the number
of in-flight wavefronts.

To summarize, you'd need to try three approaches (and their variations):

1. Keys in global memory, expanded.

2. Keys in global memory, not expanded.

3. Keys in local memory, not expanded.

For LM hashes, it's just #2 or #3 above.  Maybe start with #3?

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.