Date: Tue, 21 May 2013 22:51:14 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: cmp_all on gpu for descrypt-opencl

On Tue, May 21, 2013 at 10:55:04PM +0530, Sayantan Datta wrote:
> I have been able to run cmp_all() on GPU but the results aren't that good.
> In fact we have little performance penalty with cmp_all() on gpu.

I guess you mean "a little", not "little".

> I think the overhead involved is too much and the kernel has lot of
> branching(that depends on user data).

Yes, in cmp_all() the branching in fact depends on user data, unlike in
the current crypt_all().  To reduce the impact of this, the loop of up
to 64 iterations may be split in two, as I mentioned in a comment here:

http://www.reddit.com/r/crypto/comments/162ufx/research_project_opencl_bitslice_des_bruteforce/

In my testing, the optimal numbers were 20 iterations in the first loop
and 44 in the second.  This was not in our descrypt-opencl code, but it
was in another bitslice DES implementation on GPU, which is similar in
this respect.  I think the same or similar numbers will be optimal for
us as well.  From the reddit comment:

"[...] split this loop in two, only proceeding with the second half if
the first indicates there still might be a match.  These specific speed
numbers are for 20 iterations in the first loop (and 44 in the second).
The first loop's iteration count needs to be high enough that a branch
will actually be made most of the time, letting all SIMDs in a CU fully
skip the second loop."
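To illustrate the idea (this is only a sketch, not our actual descrypt-opencl
code - the array names and the 32-bit lane width are assumptions for the
example), a bitslice comparison accumulates a per-lane mismatch mask over the
hash bits; once every lane has mismatched somewhere, the mask is all ones and
the remaining bits need not be checked:

```c
#include <stdint.h>

/* Sketch of a split bitslice comparison.  B[i] holds output bit i for
 * 32 candidates (one per bit lane); hash_bits[i] is bit i of the
 * target hash.  A lane can still match while its bit in "mismatch"
 * is 0.  Returns nonzero if at least one lane matched all 64 bits. */
static int cmp_all_split(const uint32_t B[64], const unsigned char hash_bits[64])
{
	uint32_t mismatch = 0;
	int i;

	/* First loop: 20 iterations - enough that the branch below is
	 * usually taken, letting the whole second loop be skipped. */
	for (i = 0; i < 20; i++)
		mismatch |= B[i] ^ (hash_bits[i] ? ~0U : 0U);
	if (mismatch == ~0U)
		return 0; /* every lane already mismatches */

	/* Second loop: the remaining 44 bits, reached only when some
	 * lane might still be a match. */
	for (; i < 64; i++)
		mismatch |= B[i] ^ (hash_bits[i] ? ~0U : 0U);

	return mismatch != ~0U;
}
```

On a GPU the point of the split is that when all work-items in a wavefront
take the early-return branch, the second loop is skipped entirely.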

Also relevant is the speed number obtained at the time - ~6840 Mkeys/s
when computing all 16 rounds of DES and checking the result as above.
This gives us a speed of at least 6840 / 25 = 273M c/s at descrypt, and
more than that if we consider that we'd only do the comparisons once per
25 DES encryptions rather than after each DES encryption.  This is
assuming that the kernel is specialized for a given salt value, but
that's what you're already doing.

Thus, there's lots of room for optimization in your current
descrypt-opencl, but of course these speeds will require generating
candidate passwords on GPU and comparing the hashes on GPU.

> Also the amount of work is very less
> compared to overhead. One possible way to improve performance might be  to
> load all hashes instead of just one per cmp_all() call. But that doesn't
> seems possible with current format.

It is possible with the current (new) format: you implement the
comparisons in crypt_all() instead of in cmp_all().  This is why I added
the "struct db_salt *salt" argument to crypt_all().  That way, you also
avoid a second kernel invocation.
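Roughly, it looks like this (a sketch only - the struct layouts are minimal
stand-ins for the real definitions in loader.h, and run_crypt_kernel(),
computed[] and match_mask are hypothetical names for the example):

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal stand-ins for JtR's loader structures; the real definitions
 * are in loader.h.  Names and layout here are illustrative only. */
struct db_password {
	uint32_t binary;          /* partial hash to compare against */
	struct db_password *next;
};
struct db_salt {
	struct db_password *list; /* hashes loaded for this salt */
};

#define MAX_KEYS 32
static uint32_t computed[MAX_KEYS]; /* stands in for the GPU output buffer */
static int match_mask;              /* bit i set if candidate i may match */

/* Stand-in for the real kernel: deterministic dummy "hashes". */
static void run_crypt_kernel(int count)
{
	int i;
	for (i = 0; i < count; i++)
		computed[i] = (uint32_t)i * 0x9e3779b9U;
}

/* crypt_all() in the new format interface: the "struct db_salt *salt"
 * argument lets us compare against all loaded hashes right here,
 * instead of in a separate cmp_all() call (and kernel invocation). */
static int crypt_all(int *pcount, struct db_salt *salt)
{
	int count = *pcount;

	run_crypt_kernel(count);

	match_mask = 0;
	if (salt) {
		struct db_password *pw = salt->list;
		do {
			int i;
			for (i = 0; i < count; i++)
				if (computed[i] == pw->binary)
					match_mask |= 1 << i;
		} while ((pw = pw->next));
	}
	return count;
}
```

In the real thing the inner comparison would of course happen on the GPU
against all of the salt's hashes at once, with only candidate matches
reported back to the host.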

I posted an example of this, for LM hashes on CPU, to john-dev last summer:

http://www.openwall.com/lists/john-dev/2012/07/21/1

You need to look at the changes for DES_bs_crypt_LM() - you may search
that posting for "This integer division may be slow - need to revise" to
find the right place.  (BTW, this comment gives the reason why I haven't
merged this portion of the changes into the core tree... but I think
magnum did merge it into bleeding anyway.)

Alexander
