john-dev - Re: fast hashes on GPU

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120331064656.GA32718@openwall.com>
Date: Sat, 31 Mar 2012 10:46:56 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: fast hashes on GPU

On Sat, Mar 31, 2012 at 01:23:24PM +0800, myrice wrote:
> On Mar 31, 2012 5:18 AM, "magnum" <john.magnum@...hmail.com> wrote:
> 
> > Are you saying you want to change the salt interface so you can loop
> > over all loaded salts without leaving GPU?
> 
> Yes, but I am not sure how much performance it will give.

You can do it without changing the formats interface and benchmark code.
Here's how:

The first time set_salt() and crypt_all() calls are made for a given
salt, you process that salt separately (do the actual hashing), but you
also record the salt value.  For your purpose, you'd want to record it
from your GPU code, so that it can access the salt directly next time.
You also record a copy of it from your CPU code.  The second time the
set_salt() and crypt_all() calls are made (for previously seen salts
since no new hashes and thus no new salts may be loaded during
cracking), you do the bulk hashing (for all previously seen salts) on
the very first call to crypt_all().  Yes, you must do it as soon as you
see the same salt value again - including for salts that have not yet
been seen for a second time.  You can't delay processing of the current
salt because cmp_*() and get_hash*() must return correct values right
after the crypt_all() call, and you don't want to delay processing of
further salts because you want the potential speedup from bulk hashing.

Further calls to set_salt() and crypt_all() until a salt is seen for a
third time do not need to invoke any GPU code.  Instead, set_salt()
should set a global variable (on CPU) that will tell further cmp_*() and
get_hash*() calls to use the right set of result buffers, and it may
mark salts (also on CPU) as actually seen again.  Some salts might be
no longer seen if all of their corresponding hashes have been cracked.

When a salt is finally seen for a third time and thus it's time for you
to invoke the GPU code again, you may also use this opportunity to
notify your GPU code of the now-missing salts (so it will learn of that
with a delay of one full set of candidate passwords, but that's OK).

Then you process the third full set of candidate passwords and on in a
similar fashion.

A similar trick may work for recording of hashes to compare against in
cmp_all() and then moving the comparisons onto the GPU starting with the
second set of candidates.  Then you only need to transfer one bit
(literally) per GPU code invocation back to the CPU most of the time
(that bit will mean "no matches" or "there might be matches, please
request detail").  Once this is implemented, we might even want to
adjust the thresholds for use of hash tables such that those are not
used (as that would result in get_hash*() calls being made instead of
cmp_all(), thereby not letting you offload the comparisons onto the
GPU).  That's a separate task, though.

As you can see, with some tricks a lot can be done for speeding up fast
yet salted hashes even within the current formats interface.  To my
knowledge, these have not been attempted yet.  I did not suggest them
until now.

(For fast and saltless hashes - or with very few salts - set_key() will
be a bottleneck, and this is not addressed with these tricks.  A formats
interface change will actually be needed to deal with that.)

For benchmarking, we might have to increase the benchmark duration
(e.g., use --test=10 or more) in order to have many sets of candidate
passwords processed (not just the first set, which has to be treated
specially and is presumably slower to process).

With very large numbers of loaded hashes (such as millions), you may
have to use a different approach or maybe lower max_keys_per_crypt.
Otherwise the programs may become too non-interactive (and you might
even be hitting timeouts imposed by drivers or by CUDA or OpenCL
libraries), or the salts or hashes (if you do the cmp_all() recording
trick as well) might not even fit in GPU memory.  You don't need to
bother about that for your initial implementation, though.

It'd be very nice if one or more of the GSoC candidates attempt this and
demonstrate some results now.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.