Date: Thu, 20 Sep 2012 08:50:15 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice DES on GPU

Sayantan -

On Wed, Sep 19, 2012 at 05:42:40PM +0530, Sayantan Datta wrote:
> On Fri, Sep 14, 2012 at 2:02 AM, Solar Designer <solar@...nwall.com> wrote:
> 
> > So that's about 41M c/s at 25 iterations if you somehow manage to remove
> > the overhead.  In other words, the overhead still corresponds to about
> > 50% of total running time.
> 
> I'm trying to figure out where the overhead lies. I thought it might be the
> cmp_all() function. So I tried using openmp in cmp_all. However there was
> no improvement at all.

That test doesn't rule out cmp_all() as a significant portion of the
overhead, though.  OpenMP does not always speed things up, and there are
specific reasons why it tends to perform poorly inside the bitslice DES
cmp_all() (I tried this before).

Also note that cmp_all() normally tests only a few elements of B[] -
e.g., around 5 of them when dealing with 32-bit vector elements - yet
you're transferring the entire 64-element B[] from GPU.  So you'd
probably avoid more overhead by transferring a portion of B[] only
until/unless more of it is actually needed, than by speeding up
cmp_all() itself.
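
To illustrate the point about testing only a few elements of B[], here
is a minimal sketch of a bitslice comparison with early rejection.  It
is not the actual cmp_all() code: the names match_mask() and
cmp_all_prefix(), the 64-bit target value, and the mapping of target
bit i to B[i] are assumptions made for the example.  With 32 candidates
per vector, each tested element rejects about half of the remaining
non-matching candidates on average, which is presumably why around 5
elements are usually enough.

```c
#include <stdint.h>
#include <stddef.h>

#define HASH_BITS 64    /* B[] has 64 elements, one per hash bit */

/*
 * B[i] holds bit i of the computed hash for 32 candidates at once
 * (assumed 32-bit vector elements).  Returns a mask of the candidates
 * whose first nbits hash bits all equal the target's bits.
 */
static uint32_t match_mask(const uint32_t *B, uint64_t target, size_t nbits)
{
    uint32_t mismatch = 0;

    for (size_t i = 0; i < nbits; i++) {
        /* All-ones if target bit i is set, all-zeros otherwise */
        uint32_t want = (uint32_t)0 - (uint32_t)((target >> i) & 1);

        mismatch |= B[i] ^ want;
        if (mismatch == UINT32_MAX)
            return 0;   /* every candidate is already rejected */
    }
    return ~mismatch;
}

/*
 * Hypothetical two-stage check: test a short prefix of B[] first and
 * look at all 64 elements only if some candidate survives the prefix.
 */
static int cmp_all_prefix(const uint32_t *B, uint64_t target, size_t prefix)
{
    if (!match_mask(B, target, prefix))
        return 0;       /* the rest of B[] is never needed */
    return match_mask(B, target, HASH_BITS) != 0;
}
```

In a GPU setting, only the first `prefix` elements would need to be
transferred for the first stage, with the remainder fetched only when
that stage reports a possible match.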

Anyhow, a next step would be to do the comparisons on the GPU side -
using the interfaces and approaches myrice experimented with during the
summer.

> This makes me think that the overhead lies primarily
> in set_key which is called maximum number of times. Can you provide any
> suggestions?

set_key() definitely corresponds to a large portion of the overhead.
But a few other things do as well.

Note that LM hashes use the same kind of set_key(), yet they achieve
speeds of 50M to 110M c/s on CPU (on one core).  So the slowdown from
41M (on GPU alone) to <20M (on GPU with CPU side's overhead) can't be
explained by set_key() alone.
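
The arithmetic behind this can be spelled out per candidate key.  The
sketch below only converts the rates quoted above into nanoseconds; the
function names are made up for the example, and 20M is used as an upper
bound for the "<20M" figure.

```c
/* Nanoseconds spent per candidate at a rate given in millions of c/s */
static double ns_per_candidate(double mcps)
{
    return 1000.0 / mcps;
}

/*
 * CPU-side overhead per candidate: total time at the observed rate
 * minus the GPU-only time.  Because the observed rate is below 20M,
 * the real overhead is larger than this value.
 */
static double overhead_ns(double gpu_only_mcps, double observed_mcps)
{
    return ns_per_candidate(observed_mcps) - ns_per_candidate(gpu_only_mcps);
}

/*
 * Upper bound on the cost of set_key() per candidate, taken from the
 * slowest LM figure: at 50M c/s the whole per-candidate cost on one
 * core, set_key() and hashing together, is 20 ns.
 */
static double lm_set_key_bound_ns(void)
{
    return ns_per_candidate(50.0);
}
```

overhead_ns(41.0, 20.0) comes to about 25.6 ns, which already exceeds
the 20 ns that LM spends per candidate in total at its slowest quoted
speed.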

> How about doing the set key in GPU using another kernel?

Note that our current set_key() doesn't do anything other than copying
the data bytes in the right places.  Even if you do this on GPU, you'd
have an equivalent amount of work on CPU side anyway, to get the keys
saved into a buffer and transferred to GPU at once.  So that's not a
solution at all.

The solution is to generate keys on the GPU, not just "set" them on the
GPU.  This is also something myrice experimented with during the summer,
and we'll need to do it for descrypt on GPU eventually.
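
As a rough illustration of generating rather than setting keys, a
candidate can be derived from nothing more than its index.  The sketch
below is plain C and is not what myrice implemented: index_to_key() and
its parameters are hypothetical, and it covers only fixed-length keys
over a single charset.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical mapping from a candidate index to a key of length len:
 * idx is read as a number in base nchars, least significant position
 * first.  out must have room for len + 1 bytes.
 */
static void index_to_key(uint64_t idx, const char *charset, size_t nchars,
                         size_t len, char *out)
{
    for (size_t i = 0; i < len; i++) {
        out[i] = charset[idx % nchars];
        idx /= nchars;
    }
    out[len] = '\0';
}
```

On a GPU, each work-item could call something like this with a base
offset plus its own global ID, so the host would send only the base
offset and the charset instead of a buffer full of keys.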

At this time, though, I suggest that you try to improve upon the 41M
without-overhead speed.  It is not good enough anyway (roughly half of
hashcat's actual/usable speed).  I think you need to try
DES_BS_EXPAND=0.  The target without-overhead speed is ~300M.

Thanks,

Alexander
