Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 6 Apr 2012 00:00:46 +0800
From: myrice <>
Subject: Re: fast hashes on GPU

On Thu, Apr 5, 2012 at 8:44 PM, Solar Designer <> wrote:

> myrice -
> On Thu, Apr 05, 2012 at 04:19:49PM +0800, myrice wrote:
> The only change was an increase of max_keys_per_crypt by a factor of
> 128, correct?
> I used a large max_keys_per_crypt to simulate many salts. That is what you
said in your previous post. For example, with max_keys_per_crypt sets to
128*KEYS_PER_CRYPT, it means we do 128 iterations in one GPU call. The
number of hashes generated in the setting(128*KEYS_PER_CRYPT,THREADS: 512,
BLOCK: 32) is equal to (THREADS: 512, BLOCK: 32*512, MAX_KEYS_PER_CRYPT =
KEYS_PER_CRYPT). So for many salts, it will be very similar. We do many
iterations in one GPU call.

> Enlarge either MAX_KEYS_PER_CRYPT or BLOCK size will reduce the
> > performance.
> Are you saying that increasing these values _further_ (beyond 128x)
> reduces performance?
> Yes, I tried so many combinations. 32, 64, 128 ... 1024 BLOCKS, 512
THREADS, MAX_KEYS_PER_CRYPT = 2, 16, 32..512 * KEYS_PER_CRYPT. Only find
this setting is the best.

If so, the reason is probably that the keys no longer fit into the same
> cache level on the CPU.  This is also why the salts trick discussed
> before does have potential for a little bit of speedup (with a lower
> max_keys_per_crypt then, to fit in a smaller and faster CPU cache)
> compared to an increased max_keys_per_crypt.  Similarly, the reduced
> work set size on GPU (N passwords + M salts instead of N*M passwords
> for comparable total processing time per call) may allow for the use of
> a faster memory type on GPU.  So the salts trick is worth revisiting
> once you feel you're done with easier optimizations.
> Hmm, so we have to further reduce work set size? I will make
an analysis later.

> What are you planning to do next, for XSHA512 or/and for other aspects
> of the fast hashes project?  Here are some possible sub-tasks:
> 1. Work with magnum to get your code written so far into magnum-jumbo.
> (I strongly recommend that you take care of this first.)
> Yes! I have already done this. Waiting for magnum to merge my code.

> 2. Try to optimize XSHA512 further by optimizing the SHA-512 stuff in it
> (e.g., get rid of the init/update/final functions).
> Yes, it should be done. I have tried __forceinline__ and didn't find any
difference on 9600M GS. I will try on GTX 580 later.

3. Try to optimize XSHA512 further by storing/caching the salts in GPU
> as discussed.
Yes. But as my experiment above, I have to analysis if it really improves
performance first.  With many salts per one GPU call, we reduce number of
GPU calls. And we have to reduce work size to make sure all
data(~salts*sizeof(salts)*keys*sizeof(keys)) can put in global memory. The
final result of performance should be examined.

> 4. Try to optimize XSHA512 further by storing/caching the loaded hashes
> in GPU, so that you can merge cmp_all()'s on-GPU code into crypt_all()'s
> and avoid the call to GPU in cmp_all().  With very large
> max_keys_per_crypt this won't matter (the calls are infrequent anyway),
> but this change may result in the optimal max_keys_per_crypt value
> becoming a little bit lower (you'll need to re-tune it), which may
> improve CPU cache usage (resulting in a little bit of speedup).
 Good one. I will try it( and 5, 6 you proposed)

> 7. Proceed to implement XSHA (similar to XSHA512, but based on SHA-1).
> Since these hashes are a lot faster to compute on GPU, they will
> highlight the various bottlenecks much better (and will make the need
> for dealing with those more pressing).  (By the way, XSHA should be
> added to your proposed GSoC timeline.)

Okay. I will add it to my GSoC timeline and implement it later.

8. Implement XSHA512 (and then XSHA) in OpenCL as well.  Try it on AMD
> GPUs.  This may also highlight the bottlenecks better (if run on a card
> like 7970, which should be about 3x faster than GTX 580 at SHA-1).
> (Yes, it is desirable that you get a fast AMD card as well.)

Here are my questions regarding the hardware. I will go abroad for my
graduate study in Sept. I will buy a new laptop. However, it is probably
with a NVidia Card. I also can buy a computer with AMD card. But I don't
know which one is enough for my future development. 7970 is the best but
expensive. I cannot afford for it now(So much money since I have to buy a
new computer. If Google pays me, I think I can :) ). I will look for any
server with a AMD card I could use. And I also want to know whether or not
you could provide me access to a fast card?

> 9. Since XSHA512 and XSHA use only a single instance of SHA-512 or
> SHA-1, it should be possible to reverse the last few rounds of the
> computation (perform the reverse in binary() on CPU when loading the
> hashes).  Then you have fewer rounds to compute at runtime (on GPU).
> (BTW, this optimization will apply to CPU-only implementations as well,
> so we'll need it there as well, but we'll have to move away from
> OpenSSL's to our own code for SHA-512 and SHA-1 first.  If you show how
> many rounds may be reversed and how in your GPU code, someone else on
> our team may do it for CPU, or vice versa.)
It is on my to-do list. :)

> Of course, there will be many more sub-tasks (for the summer if we
> proceed to work under a summer program), but the above are some that I
> think you can work on now.
> How does this sound to you?

It is perfect! I will soon start with some of them. I will update my
timeline on melange with what you said. Many thanks for you help!

Dongdong Li

Content of type "text/html" skipped

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.