john-dev - Re: bitslice DES on GPU

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20120817050711.GC23259@openwall.com>
Date: Fri, 17 Aug 2012 09:07:11 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice DES on GPU

Sayantan -

On Mon, Aug 13, 2012 at 07:54:47PM +0530, Sayantan Datta wrote:
> I'm currently trying to understand what you just wrote, correct me if I am
> wrong.
> 
> For DES_BS_EXPAND==0
> We have 768 indices for a 56 element array.  The 56 element array is
> different for each hash.

Yes.  (In fact, it holds many keys at once in its different bit layers,
for computation of that many different hashes.)

> However the 768 indices are same for all hashes
> (Right?).

Correct.  They're part of the specification of DES.

> So why can't  we put the 768 indices in local memory?

We can.

A potential tradeoff here is between indices and pointers.  Indices can
be shared, but they may need to be scaled to byte offsets (multiply by 4)
and a per work-item base address for the 56-element K array needs to be
added to them.  Likely this is free or cheap.  However, if not then you
could instead have a per work-item 768-element array of pointers, which
would provide K elements directly (no additional base address needs to
be added).  The latter is almost certainly slower on current GPUs, where
the local memory size would become the limiting factor.  (On CPUs, the
latter is faster.)

The scaling of indices to byte offsets you may perform out of the loop
(in fact, they will be constants).  56*4 still fits in 8 bits.  This may
or may not make any difference (depending on available addressing modes
on a particular architecture and costs or lack thereof of making use of
those addressing modes).

BTW, this assumes gather addressing, even though we don't really need it
(the accesses by different work-items executing on a SIMD unit will be
to adjacent addresses).  I wonder if writing the vectorized code more
explicitly would result in savings.  Then the scaling of indices to byte
offsets would be by more than *4, and maybe there would more likely be a
benefit from precomputing it - although then you need larger than 8-bit
elements in the 768-element array to store them.

> For DES_BS_EXPAND==1
> We expand the 56 element array to a 768 element array(Right?). This time
> 768 element array is different for all hashes(Right?).  So we can't put
> them in local memory.

Yes.

> Did I get the problem correct or am I totally wrong?

You mostly got it right.

Alexander

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.