Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Sun, 5 Jun 2011 03:14:46 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: alternative approach

David, Yuri -

I made some rather naive comments on DES on GPUs last month.  I've since
discussed the matter with some people more familiar with GPUs, so here's
an update.

First, the best benchmark result I am currently aware of is for MTY, a
tripcode finder, doing 250 Mhashes/sec on ATI 5970.  This translates to
6.25 billion of DES block encryptions per second.  So it is still 3+
times slower than the 22 billion on Virtex-6, but it's by far not as bad
as the numbers for GPUs that we had before.  Our new DES S-box
expressions will improve this number (for the GPU) some further.

http://en.wikipedia.org/wiki/Tripcode
http://sourceforge.jp/projects/naniya/
http://harry.lu/mty/

On Thu, May 12, 2011 at 01:51:54PM +0400, Solar Designer wrote:
> Disclaimer: I am not into GPU programming (yet).  This is why I use
> words such as "apparently" below.
> 
> Yes, DES is somewhat unfriendly to GPUs so far, but this is somewhat
> likely to change.  Bitslice implementations of DES with the S-box
> expressions published so far need up to approx. 20 registers, which is
> apparently more than what typical GPUs have available per processing
> element.  Also, apparently accessing a temporary variable spilled to
> memory is relatively slow (compared to accessing the L1 cache in a CPU).

I've since read up on this a bit, mostly with Nvidia focus (even though
AMD/ATI cards are apparently more suitable for the task).  The number
of available registers per stream processor is actually pretty high
(thousands), but instruction latencies are so large that you have to run
a large number of simultaneous threads in order to hide those latencies.
This is why the number of registers actually available per thread turns
out to be pretty low (tens).  Memory accesses are indeed slow, although
that differs by memory type a lot.  On Nvidia, there's so-called shared
memory, which is nearly as fast as registers, but there's very little of
it (its size per stream processor may even be less than the combined
size of the stream processor's registers).  Other types of memory have
huge latencies (hundreds of cycles).

With an architecture like this, not only the temporary values (such as
those computed inside DES S-box expressions), but also DES blocks and
keys need to be stored in registers (or in Nvidia's shared memory, but
there's not more of it).  This gives us something like 64+56+20 = 140
registers needed per bitslice DES instance.  But there are not this many
on Nvidia cards for the otherwise optimal number of threads to run.

On high-end ATI cards, according to the author of MTY, slowdown starts
to occur when a thread needs more than 49 registers, but MTY exceeds
this number anyway (no better choice).  Yet, as stated above, its
overall performance is nevertheless very good (comparable to 8 billion
hashes/second reported for MD5, which is much more GPU-friendly).

> Another change that may or may not happen is GPUs getting more registers
> or/and faster access to temporary variables.  This sounds quite
> realistic to me, but I am not an expert in this area.  Based on what I
> read (about the implementation difficulty for DES on GPUs), I wouldn't
> be surprised if a mere 2x increase in register count results in a 10x
> speedup for DES on GPUs.

I think I was wrong about the above.  If not having enough registers
results in a 10x performance hit and having 2x more registers would
avoid that, then perhaps the performance hit could be reduced from 10x
to 2x by simply running twice fewer threads.  But I might be wrong about
that as well...

Alexander

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.