john-dev - Re: fast hashes on GPU

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120405124401.GB24480@openwall.com>
Date: Thu, 5 Apr 2012 16:44:01 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: fast hashes on GPU

myrice -

On Thu, Apr 05, 2012 at 04:19:49PM +0800, myrice wrote:
> Yes, I using a much larger max_keys_per_crypt. ( 128*KEYS_PER_CRYPT,
> THREADS: 512, BLOCK: 32, sm_13)
> On GTX580, Before:
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [CUDA]... DONE
> Many salts:     70254K c/s real, 70964K c/s virtual
> Only one salt:  27262K c/s real, 27262K c/s virtual
> 
> After:
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [CUDA]... DONE
> Many salts: 72673K c/s real, 72673K c/s virtual
> Only one salt: 27439K c/s real, 27698K c/s virtual

The only change was an increase of max_keys_per_crypt by a factor of
128, correct?

> Enlarge either MAX_KEYS_PER_CRYPT or BLOCK size will reduce the
> performance.

Are you saying that increasing these values _further_ (beyond 128x)
reduces performance?

If so, the reason is probably that the keys no longer fit into the same
cache level on the CPU.  This is also why the salts trick discussed
before does have potential for a little bit of speedup (with a lower
max_keys_per_crypt then, to fit in a smaller and faster CPU cache)
compared to an increased max_keys_per_crypt.  Similarly, the reduced
work set size on GPU (N passwords + M salts instead of N*M passwords
for comparable total processing time per call) may allow for the use of
a faster memory type on GPU.  So the salts trick is worth revisiting
once you feel you're done with easier optimizations.

> I think it is a lot of data transfer that takes time.

Of course transferring more data takes more time, but this is irrelevant
to the slowdown with larger (too large) max_keys_per_crypt because the
relative data transfer sizes per password hashed remain the same.  I've
provided my guess as to the actual cause of the slowdown above.

What are you planning to do next, for XSHA512 or/and for other aspects
of the fast hashes project?  Here are some possible sub-tasks:

1. Work with magnum to get your code written so far into magnum-jumbo.
(I strongly recommend that you take care of this first.)

2. Try to optimize XSHA512 further by optimizing the SHA-512 stuff in it
(e.g., get rid of the init/update/final functions).

3. Try to optimize XSHA512 further by storing/caching the salts in GPU
as discussed.

4. Try to optimize XSHA512 further by storing/caching the loaded hashes
in GPU, so that you can merge cmp_all()'s on-GPU code into crypt_all()'s
and avoid the call to GPU in cmp_all().  With very large
max_keys_per_crypt this won't matter (the calls are infrequent anyway),
but this change may result in the optimal max_keys_per_crypt value
becoming a little bit lower (you'll need to re-tune it), which may
improve CPU cache usage (resulting in a little bit of speedup).

5. Try to support longer passwords without a significant slowdown when
the actual passwords being tested happen to be short.  Right now, we
have to decrease PLAINTEXT_LENGTH to achieve decent speed for the
single-salt case, but this is a nasty limitation for actual use of such
a build of John (another build needs to be used when testing of longer
passwords is desired).  There are ways to keep the data transfers to GPU
smaller most of the time while allowing for longer passwords in the rare
cases when those are present.

6. For correctness, implement computation of bytes 9-64 of the hash
value in cmp_exact().  You'll keep BINARY_SIZE at 8, but in the rare
cases when the 8 bytes match and cmp_exact() is called, you re-decode
the ASCII string, compute the full 512-bit hash value, and compare these
two 512-bit values.  You may perform this computation on CPU (just use
OpenSSL calls).

7. Proceed to implement XSHA (similar to XSHA512, but based on SHA-1).
Since these hashes are a lot faster to compute on GPU, they will
highlight the various bottlenecks much better (and will make the need
for dealing with those more pressing).  (By the way, XSHA should be
added to your proposed GSoC timeline.)

8. Implement XSHA512 (and then XSHA) in OpenCL as well.  Try it on AMD
GPUs.  This may also highlight the bottlenecks better (if run on a card
like 7970, which should be about 3x faster than GTX 580 at SHA-1).
(Yes, it is desirable that you get a fast AMD card as well.)

9. Since XSHA512 and XSHA use only a single instance of SHA-512 or
SHA-1, it should be possible to reverse the last few rounds of the
computation (perform the reverse in binary() on CPU when loading the
hashes).  Then you have fewer rounds to compute at runtime (on GPU).
(BTW, this optimization will apply to CPU-only implementations as well,
so we'll need it there as well, but we'll have to move away from
OpenSSL's to our own code for SHA-512 and SHA-1 first.  If you show how
many rounds may be reversed and how in your GPU code, someone else on
our team may do it for CPU, or vice versa.)

Of course, there will be many more sub-tasks (for the summer if we
proceed to work under a summer program), but the above are some that I
think you can work on now.

How does this sound to you?

Thanks,

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.