john-dev - Re: bitslice DES on GPU

Message-ID: <20120826195143.GC767@openwall.com>
Date: Sun, 26 Aug 2012 23:51:43 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice DES on GPU

On Mon, Aug 27, 2012 at 12:44:27AM +0530, Sayantan Datta wrote:
> So you are saying that I should do this portion of the code on cpu.

No.  I wrote "you could do the expansion on CPU side, but the reason why
you won't ..."  This means that you "could", but you "should not".

> I think
> I could do it in the DES_bs_init() and get rid of KSp all together(Right?).

No.  The expansion may only be performed for a specific set of keys,
whereas DES_bs_init() is called just once per program invocation.

However, you may fill an array of pointers (for use on CPU) or indices
(for use on GPU) just once, in DES_bs_init() or even in a standalone
program (then use a static initializer for a const array in your OpenCL
kernel).  In fact, the latter approach might be best, since it makes the
const nature of the indices array very explicit.
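
To illustrate the pointers-versus-indices point, here is a minimal C sketch of the "standalone program" route. It is not the actual John the Ripper code: vtype, pointers_to_indices() and emit_initializer() are hypothetical names, and the use of "__constant ushort" for the emitted OpenCL array is an assumption about how you would declare it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef unsigned int vtype;	/* stand-in for the bitslice vector type */

/*
 * Turn a table of host pointers into base-relative indices.  The indices
 * stay valid wherever the base array ends up (e.g. in a GPU buffer).
 */
static void pointers_to_indices(const vtype *const *ptrs, const vtype *base,
    size_t base_len, unsigned short *idx, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		ptrdiff_t off = ptrs[i] - base;
		assert(off >= 0 && (size_t)off < base_len);
		idx[i] = (unsigned short)off;
	}
}

/* Print the indices as a static initializer to paste into a kernel source. */
static void emit_initializer(FILE *out, const char *name,
    const unsigned short *idx, size_t n)
{
	fprintf(out, "__constant ushort %s[%zu] = {", name, n);
	for (size_t i = 0; i < n; i++)
		fprintf(out, "%s%u", i ? ", " : "", (unsigned)idx[i]);
	fprintf(out, "};\n");
}
```

The kernel would then read something like K[idx[i]] instead of dereferencing a stored pointer, so moving the key array to the device does not invalidate the table.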

> There is another bigger problem than the previous one. I see that address
> stored in E.E[] array  is aliased with the B[] array. Transferring data to
> gpu would change the location of B[] array and result in same kind of
> problem as above.

As with the array above, you will likely want E[] to be shared between
work-items (to save local memory), which means that here too you'll need
to fill it with indices rather than with pointers.  However, it won't be
fully constant: it'll change between crypt_all() calls.  Perhaps it'd be
best to compute this E[] filled with indices (not pointers) on CPU and
transfer it to GPU on every crypt_all() call.
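
A rough sketch of what computing an index-based E[] on the CPU could look like is below. It assumes the traditional crypt(3) rule that a set salt bit i swaps expansion outputs i and i+24, uses the standard DES expansion table in 0-based form, and produces indices relative to the 32-bit half being expanded (a real kernel would add whatever offset into B[] applies). The name build_E_indices() and the salt bit ordering are assumptions for illustration, not the existing set_salt() code.

```c
#include <assert.h>
#include <stdint.h>

/* Standard DES expansion table, 0-based (bit numbers within the 32-bit half) */
static const uint8_t DES_E_base[48] = {
	31,  0,  1,  2,  3,  4,
	 3,  4,  5,  6,  7,  8,
	 7,  8,  9, 10, 11, 12,
	11, 12, 13, 14, 15, 16,
	15, 16, 17, 18, 19, 20,
	19, 20, 21, 22, 23, 24,
	23, 24, 25, 26, 27, 28,
	27, 28, 29, 30, 31,  0
};

/*
 * Fill E_idx[] with indices (not pointers) for the given salt.
 * Assumption: salt bit i set means outputs i and i+24 are swapped.
 */
static void build_E_indices(uint32_t salt, uint8_t E_idx[48])
{
	for (int i = 0; i < 24; i++) {
		int lo = i, hi = i + 24;
		if ((salt >> i) & 1) {
			lo = i + 24;
			hi = i;
		}
		E_idx[i] = DES_E_base[lo];
		E_idx[i + 24] = DES_E_base[hi];
	}
}
```

The resulting 48-byte array is what would be copied to the GPU; since it holds no addresses, it does not matter where B[] lives on the device.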

> The most complex part being the initialization of E.E[]
> array in the set_salt() function. I could port both DES_bs_init() and
> set_salt() to GPU(as different kernel) but it would then result in same
> problem as I would need to transfer data to and from GPU in order
> to synchronize it with the CPU in between the kernel calls. Whenever I
> transfer data from CPU to GPU the address of B[] will change.

I don't understand you here.  You seem to be imagining some problems
that you won't even run into for other reasons anyway.

In the current OpenMP code, E[] is separate for each thread slot, both
because it stores pointers and because this is friendlier to CPU caches.
On GPU, you'll want to store indices rather than pointers and you'll want
E[] to be shared between the work-items - or at least that's what I'd
expect to run faster.

(A next step would be to customize OpenCL kernels for each of the 4096
salts - maybe by patching binary code.  This will let us get rid of E[].
On CPU, an equivalent change would result in a 7% speedup per my
experiments, but on GPU it might make a lot more of a difference.)

> One more thing I would like to know that immediately after set_salt()
> function which function is called?

Usually crypt_all() is called next, but there may be some special cases.
For example, when there's just one salt value seen among the loaded
hashes, then set_salt() may be called just once - followed by multiple
calls of crypt_all(), set_key() ..., then crypt_all() again.

You could in fact embed set_salt() inside crypt_all() - only doing the
salt setting if the salt has changed since it was last set.  However, if
you're storing indices rather than pointers in E[] anyway, this will
result in duplicate work (the salt setting) being done by each work-item.
If you choose to store pointers in E[], then this merging of the
functions is the way to go.
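
For the index-based variant, the "only if the salt has changed" check can live on the host instead, so the work is done once rather than per work-item. The sketch below shows only that bookkeeping. All names are hypothetical, and upload_E_for_salt() is a stand-in that merely counts calls where a real implementation would rebuild E[] and transfer it to the GPU.

```c
#include <assert.h>
#include <stdint.h>

/* No salt set yet; assumes UINT32_MAX is never a valid salt value */
static uint32_t current_salt = UINT32_MAX;
static unsigned int uploads;	/* how many times E[] was (re)sent */

/* Hypothetical stand-in for "rebuild E[] indices and copy them to the GPU" */
static void upload_E_for_salt(uint32_t salt)
{
	(void)salt;
	uploads++;
}

/* Do the salt setting only when the salt differs from the last one set */
static void set_salt_if_changed(uint32_t salt)
{
	if (salt == current_salt)
		return;
	upload_E_for_salt(salt);
	current_salt = salt;
}

/* Host-side crypt_all() outline: refresh E[] if needed, then run the kernel */
static void crypt_all_sketch(uint32_t salt)
{
	set_salt_if_changed(salt);
	/* ... enqueue the DES kernel here ... */
}
```

With a single salt among the loaded hashes, repeated crypt_all_sketch() calls would then trigger just one transfer of E[].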

Alexander