john-dev - Re: descrypt speed (was: "Failed copy data to gpu" when using fork with descrypt-opencl)

Message-ID: <20141031125556.GB7088@openwall.com>
Date: Fri, 31 Oct 2014 15:55:56 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: descrypt speed (was: "Failed copy data to gpu" when using fork with descrypt-opencl)

On Fri, Oct 31, 2014 at 03:31:12AM +0100, magnum wrote:
> On 2014-10-30 16:49, Royce Williams wrote:
> >>Using -fork=4 on a quadcore+HT and GTX980 I got over 82 Mc/s.
> >
> >On my 8-core AMD and GTX970, using fork=2 gets me 52 Mc/s, which is
> >much better than no fork (~35 Mc/s).  fork=3 settles in around 54
> >Mc/s.  Forking more than 3 doesn't materially increase the c/s rate.
> 
> Solar, Sayantan, all,
> 
> Why is this? This is bordering candidate generation bottleneck but 
> that's not quite the problem, is it? So what is the bottleneck? Could we 
> do something to make it faster without forking or *is* it just candidate 
> generation?

Might be bandwidth - I just brought this up in another message.

Another idea: we could introduce explicit buffering in global memory,
plus asynchronous processing.

> Also, as far as I understand just from googling, Atom has yet to 
> implement bitslicing. Yet his descrypt exceeds 100M c/s on a single 
> Tahiti (according to 
> https://twitter.com/hashcat/status/160488271267364864). How is that 
> possible? Should we not beat him silly with our bitslicing version?

No, since oclHashcat generates candidate passwords on the GPU and we do
it on the host.
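
To illustrate why on-GPU generation removes the transfer, here is a minimal sketch of one way a candidate can be derived from nothing but its index in the keyspace. This is not taken from oclHashcat or John the Ripper; the function name, the charset, and the fixed-length brute-force ordering are all assumptions made for illustration. With a scheme like this, each GPU work-item could compute its own candidate from its global ID plus a base offset, so the host would send only a pair of integers rather than every password.

```python
def candidate_from_index(index, charset, length):
    """Map a keyspace index to a fixed-length candidate password.

    Hypothetical helper: treats the index as a mixed-radix number in
    base len(charset), with the first character varying fastest.
    """
    keyspace = len(charset) ** length
    if not 0 <= index < keyspace:
        raise ValueError("index outside keyspace")
    chars = []
    for _ in range(length):
        # Peel off one base-len(charset) digit per password position.
        index, digit = divmod(index, len(charset))
        chars.append(charset[digit])
    return "".join(chars)
```

For example, `candidate_from_index(1, "abc", 2)` yields `"ba"`, and the indices 0 through 8 cover all nine two-character candidates exactly once.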

Also, as discussed before, bitslice DES works great on AMD GCN GPUs for
single DES cracking (and for LM hash cracking), but not so great for
descrypt.  I think this has to do with L1 instruction cache hit rate:
for single DES, most fetches from L1i are reused by multiple wavefronts
executing code near the same addresses, whereas once iterations are
added the wavefronts drift too far out of sync after ~10 iterations or
so (per Sayantan's benchmarks, when I asked him to try a hacked descrypt
with lower iteration counts).  descrypt uses 25 iterations.  Perhaps a
workaround is possible: e.g. compute partial hashes for 5 iterations,
and do that 5 times - or is there some barrier that forces wavefronts to
sync, which we could easily put into the existing OpenCL code?  Anyway,
we need candidate password generation on GPU before we can see much of
an effect from such changes.
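
The "5 iterations, 5 times" idea can be sketched as follows. This is only a host-side model of the control flow, under stated assumptions: `toy_round` is a placeholder mixing function and NOT DES, and each pass of the outer loop stands in for a separate kernel launch, with the intermediate states assumed to be kept in global memory between launches. It shows that chunking leaves the final result unchanged; whether it actually improves the L1i hit rate on GCN would have to be measured.

```python
MASK64 = (1 << 64) - 1


def toy_round(state, salt):
    """Placeholder for one descrypt iteration (NOT DES, illustration only)."""
    return (state * 6364136223846793005 + salt + 1442695040888963407) & MASK64


def iterate(state, salt, count):
    """Apply the round function `count` times to a single state."""
    for _ in range(count):
        state = toy_round(state, salt)
    return state


def hash_chunked(blocks, salt, iterations=25, chunk=5):
    """Run all iterations in passes of at most `chunk` iterations each.

    Each pass models one kernel launch: every state advances by the same
    number of iterations before any state starts the next pass.
    """
    states = list(blocks)
    done = 0
    while done < iterations:
        step = min(chunk, iterations - done)
        states = [iterate(s, salt, step) for s in states]
        done += step
    return states
```

Calling `hash_chunked(blocks, salt)` returns the same values as running `iterate(block, salt, 25)` on each block in one go, so the split changes only the scheduling, not the computed hashes.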

Alexander
