john-dev - Re: bitslice DES on GPU

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120819145654.GA2188@openwall.com>
Date: Sun, 19 Aug 2012 18:56:54 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice DES on GPU

Sayantan -

On Sun, Aug 19, 2012 at 07:52:19PM +0530, Sayantan Datta wrote:
> I've made some progress with DES BS using the above mentioned
> implementation. The cpu simulation with KPC set to multiples 32 runs fine.

Sounds good.

> Now its time to port things to openCL. I have a few more questions:
> 
> 1.Is it necessary to port DES_bs_set_salt() to gpu?

I think this question shouldn't be asked on its own.  You should also
decide on what you do with E.  Do you make it an array of pointers or
integers?  If integers, then you may do the set_salt() just once and
then reuse that array for all work-items.  Then it makes sense to do
set_salt() on CPU.  However, if you use pointers (or integers with
per-work-item base offsets pre-added to them), then I guess you'd
benefit from having those slightly different E arrays computed by each
work-item - thus, on GPU.

> 2. The DES_bs_crypt_25 function has too much branching in it. Is it
> possible to reduce the number of branching in DES_bs_crypt_25() function?

I think you're wrong about it having "too much" branching.  The
branching that it has is not data-dependent, and there's just one taken
branch per loop iteration.  So this should be GPU-friendly as written.

That said, in general, the amount of branching may be reduced by
unrolling the loop further - like I mentioned when we were discussing
the key schedule.  If you unroll all 16 DES rounds (as opposed to just
2, which the code currently does), you'll have fewer branches per crypt
and you'll also avoid the need for having the 768-element array (the
indices into the 56-element key array will be right in the code).
However, you might exceed instruction cache size, which would be a huge
performance hit.  On typical CPUs, the 2x unroll fits in L1 instruction
cache, but a 16x unroll would not.

This reminded me of the following, although it's not relevant to your
early experiments at all:

Another trick, which I haven't mentioned yet, is to wrap each S-box
invocation in its own tiny loop - over wider virtual vectors.  JtR
currently does this when built for PA-RISC (see pa-risc.h), based on my
testing on my old 712/80 many years ago.  That's because PA-RISC systems
tend to have large yet off-chip L1 caches, combined with larger than
typical for other CPUs instruction prefetch buffers (to compensate for
the off-chipness of L1 cache).  Thus, those tiny loops fit in the 1 KB
instruction prefetch buffer on the 712/80 (PA-7100LC CPU), resulting in
approx. 30% speedup over the code version that we use on other systems.

Now I don't expect that the exact same approach would make sense on
GPUs, but this trick could be used to allow for 16x unrolling the entire
DES loop without caring that much about it fitting in cache or not,
since the individual inner loops will fit anyway.  Those loops don't
have to be per-S-box; they can as well be per several S-box lookups, per
DES round, or per several DES rounds.  Yet you'd be able to have the
larger loop with hard-coded 768 indices around those smaller loops.

A problem with this is that those wider than 32-bit virtual vectors will
need more memory, and there might not be enough local memory.  (This was
not an issue on the PA-RISC system because it did not have any on-chip
data cache anyway, and the off-chip L1 data cache was relatively large.)

As you can see, there are a lot of different approaches you could try
later - and see what works best.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.