john-dev - Re: bitslice DES on GPU


Message-ID: <20120817063311.GA23664@openwall.com>
Date: Fri, 17 Aug 2012 10:33:11 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice DES on GPU

On Fri, Aug 17, 2012 at 11:44:10AM +0530, Sayantan Datta wrote:
> This is what I'm trying to do.  In fact, I'm trying to simulate this
> condition on CPU (before porting to OpenCL)

Makes sense.

> but I'm kind of stuck with that
> too. I have declared an array of DES_bs_all[] and also made some of the
> necessary adjustments but certainly not all. I guess each instance of
> DES_bs_combined is used for 32 hashes given that I set DES_BS_DEPTH=32. I
> have also set the MAX_KEYS_PER_CRYPT to a multiple of 32.

This sounds correct to me.
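
To make the indexing explicit, here is a minimal sketch of how a candidate
index would map onto an array of 32-deep instances.  The value of
MAX_KEYS_PER_CRYPT and the helper names key_instance() and key_bit() are
made up for illustration; they are not taken from the JtR source.

```c
#include <assert.h>
#include <stddef.h>

#define DES_BS_DEPTH 32           /* candidates per bitslice instance */
#define MAX_KEYS_PER_CRYPT 1024   /* assumed value; a multiple of DES_BS_DEPTH */
#define N_INSTANCES (MAX_KEYS_PER_CRYPT / DES_BS_DEPTH)

/* Hypothetical helper: which array element holds candidate "index". */
static size_t key_instance(size_t index)
{
    assert(index < MAX_KEYS_PER_CRYPT);
    return index / DES_BS_DEPTH;
}

/* Hypothetical helper: which bit of each bitslice word belongs to it. */
static unsigned key_bit(size_t index)
{
    assert(index < MAX_KEYS_PER_CRYPT);
    return (unsigned)(index % DES_BS_DEPTH);
}
```

With these settings, candidate 33 would land in instance 1 at bit 1, and
any per-candidate global indexed by key number has to be split the same way.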

> Now what are the
> other global parameters that must be changed for such an implementation?
> Or is it going to be too complex if I proceed this way?

It may be fine.  Whatever is the easiest way for you to learn this stuff.

> >  It will have some
> > similarity to cpt, as well as to DES_bs_nt.  On CPU, we need both of
> > these (and the latter has the former factored into it).  When
> > interfacing to GPU, you will only need one - and only if you choose to
> > keep DES_BS_DEPTH low, which you don't have to.
> 
> So, is it a better idea to use a bigger vector and then process the parts
> of the vector in parallel?

I think you'll need to try both approaches, then decide based on speed
(this means after optimizations and tuning) and source code complexity.
It might even turn out that a hybrid approach works best (DES_BS_DEPTH
between 32 and the total number of hashes computed, and a certain number
of structs to reach the total number).  The difference is primarily in
memory layout on the CPU side, and thus in cache usage, ability to use
SIMD on CPU for hash comparisons, and how the data transfers to/from GPU
are structured (one large chunk vs. many scattered chunks).

Maybe you'll prefer to implement data structures and code for the
hybrid approach, and then simply tune its settings - including up to the
two edge cases.
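
As a rough sketch of what "tunable" could look like on the CPU side, the
code below locates a candidate given a depth setting and a struct count.
It assumes 32-bit words within each bitslice vector; struct bs_layout,
struct bs_pos, bs_make_layout() and bs_locate() are hypothetical names,
not existing JtR code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tunable layout: depth candidates per struct. */
struct bs_layout {
    size_t depth;      /* multiple of 32, from 32 up to the total */
    size_t n_structs;  /* total / depth */
};

/* Where one candidate lives: struct, 32-bit word in the vector, bit. */
struct bs_pos {
    size_t s;
    size_t word;
    unsigned bit;
};

static struct bs_layout bs_make_layout(size_t total, size_t depth)
{
    struct bs_layout l;
    assert(depth >= 32 && depth % 32 == 0 && total % depth == 0);
    l.depth = depth;
    l.n_structs = total / depth;
    return l;
}

static struct bs_pos bs_locate(const struct bs_layout *l, size_t index)
{
    struct bs_pos p;
    size_t within = index % l->depth;  /* position inside its struct */
    assert(index < l->depth * l->n_structs);
    p.s = index / l->depth;
    p.word = within / 32;
    p.bit = (unsigned)(within % 32);
    return p;
}
```

Setting depth to 32 gives the many-small-structs edge case (the word index
is always 0), and setting depth equal to the total gives the single large
vector (the struct index is always 0).  Anything in between is the hybrid.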

A next step may be to have candidate password generation and/or hash
comparisons on GPU, though, which would make some or all of this
irrelevant - although for some cracking modes I think we'll want to
continue supporting candidate password generation on CPU as well.

Alexander
