Openwall GNU/*/Linux - a small security-enhanced Linux distro for servers
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Mon, 13 Aug 2012 19:54:47 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice DES on GPU

On Mon, Aug 13, 2012 at 10:49 AM, Solar Designer <solar@...nwall.com> wrote:

> On Mon, Aug 13, 2012 at 08:36:46AM +0530, Sayantan Datta wrote:
> > What should be the value of DES_BS_EXPAND for GPU implentation ?
>
> First, note that for LM hashes DES_BS_EXPAND is effectively always 0,
> regardless of what it's set to (for other hash types).
> DES_bs_crypt_LM() does not even include #if's for the other code
> version.  This is because there's just one DES computation per LM hash,
> so each expanded key bit is accessed exactly once, and thus there's no
> gain from preloading those bits and writing them into another array
> sequentially (which is what DES_BS_EXPAND does for other hash types).
>
> Now, for DES-based crypt(3) hashes this is trickier.  Due to multiple
> iterations of DES per hash computation, there are potential savings from
> going through the pointer indirection just once and then loading those
> expanded key bits sequentially on each iteration.  However, when the
> word size as used in the bitslice DES implementation (e.g., 128-bit on
> SSE2) is larger than pointer size (e.g., 32- or 64-bit), this increases
> the required fast memory size that key material is being read from in
> the loop.  With DES_BS_EXPAND = 0, we read 768 pointers and 56 data
> words.  With DES_BS_EXPAND = 1, we read 768 data words.  Luckily, the
> latter is still small enough for current CPUs (although it is getting
> close to L1 data cache size on some).
>
> On GPU, the available local memory per work-item is a scarce resource.
> There might not be enough of it for the keys either way, so you might
> end up choosing to use global memory for them - and then reading the 768
> expanded key bits sequentially might work better (more data to read, but
> those are sequential reads) - or it might not.  The partially
> non-sequential reads of 56 adjacent data words would likely be pretty
> fast as well, but then the question is where you store the 768-element
> array of pointers or indices - and whether it is pointers (will have to
> be separate for each work-item) or indices (may be shared and constant).
>
> Note that 768 times 4 bytes (since you use 32-bit integers on GPU) is
> already 3 KiB, and you will have some other data as well - so if you do
> this expansion and you try to keep the expanded keys in local memory,
> the occupancy would be similar to what you're getting with bf-opencl.
> This is likely not the way to go.
>


> Thus, it appears that if you want to keep keys in local memory,
> DES_BS_EXPAND must be 0.  And then the question is how to implement the
> indirection and whether you can have the 768-element array shared
> between your work-items.  To have it shared, it won't be an array of
> pointers, but rather an array of indices into the 56-bit array.  And
> then its elements can be 8-bit, and it will be constant.  This is
> probably the way to go _if_ you're able to keep the keys themselves in
> local memory without this being too much of a constraint on the number
> of in-flight wavefronts.
>
> To summarize, you'd need to try three approaches (and their variations):
>
> 1. Keys in global memory, expanded.
>
> 2. Keys in global memory, not expanded.
>
> 3. Keys in local memory, not expanded.
>
> For LM hashes, it's just #2 or #3 above.  Maybe start with #3?
>
> Alexander
>

I'm currently trying to understand what you just wrote, correct me if I am
wrong.

For DES_BS_EXPAND==0
We have 768 indices for a 56 element array.  The 56 element array is
different for each hash. However the 768 indices are same for all hashes
(Right?). So why can't  we put the 768 indices in local memory? This would
limit the number of in-flight  wavefronts to 64KB/(3KB+other data) =16 or
less which I think should be plenty(considering 7970).

For DES_BS_EXPAND==1
We expand the 56 element array to a 768 element array(Right?). This time
768 element array is different for all hashes(Right?).  So we can't put
them in local memory.

Did I get the problem correct or am I totally wrong?

Regards,
Sayantan

[ CONTENT OF TYPE text/html SKIPPED ]

Powered by blists - more mailing lists

Your e-mail address:

Powered by Openwall GNU/*/Linux - Powered by OpenVZ