Date: Fri, 6 Apr 2012 00:00:46 +0800 From: myrice <qqlddg@...il.com> To: john-dev@...ts.openwall.com Subject: Re: fast hashes on GPU On Thu, Apr 5, 2012 at 8:44 PM, Solar Designer <solar@...nwall.com> wrote: > myrice - > > On Thu, Apr 05, 2012 at 04:19:49PM +0800, myrice wrote: > > The only change was an increase of max_keys_per_crypt by a factor of > 128, correct? > > I used a large max_keys_per_crypt to simulate many salts. That is what you said in your previous post. For example, with max_keys_per_crypt sets to 128*KEYS_PER_CRYPT, it means we do 128 iterations in one GPU call. The number of hashes generated in the setting(128*KEYS_PER_CRYPT,THREADS: 512, BLOCK: 32) is equal to (THREADS: 512, BLOCK: 32*512, MAX_KEYS_PER_CRYPT = KEYS_PER_CRYPT). So for many salts, it will be very similar. We do many iterations in one GPU call. > Enlarge either MAX_KEYS_PER_CRYPT or BLOCK size will reduce the > > performance. > > Are you saying that increasing these values _further_ (beyond 128x) > reduces performance? > > Yes, I tried so many combinations. 32, 64, 128 ... 1024 BLOCKS, 512 THREADS, MAX_KEYS_PER_CRYPT = 2, 16, 32..512 * KEYS_PER_CRYPT. Only find this setting is the best. If so, the reason is probably that the keys no longer fit into the same > cache level on the CPU. This is also why the salts trick discussed > before does have potential for a little bit of speedup (with a lower > max_keys_per_crypt then, to fit in a smaller and faster CPU cache) > compared to an increased max_keys_per_crypt. Similarly, the reduced > work set size on GPU (N passwords + M salts instead of N*M passwords > for comparable total processing time per call) may allow for the use of > a faster memory type on GPU. So the salts trick is worth revisiting > once you feel you're done with easier optimizations. > > Hmm, so we have to further reduce work set size? I will make an analysis later. > What are you planning to do next, for XSHA512 or/and for other aspects > of the fast hashes project? Here are some possible sub-tasks: > > 1. Work with magnum to get your code written so far into magnum-jumbo. > (I strongly recommend that you take care of this first.) > > Yes! I have already done this. Waiting for magnum to merge my code. > 2. Try to optimize XSHA512 further by optimizing the SHA-512 stuff in it > (e.g., get rid of the init/update/final functions). > > Yes, it should be done. I have tried __forceinline__ and didn't find any difference on 9600M GS. I will try on GTX 580 later. 3. Try to optimize XSHA512 further by storing/caching the salts in GPU > as discussed. > Yes. But as my experiment above, I have to analysis if it really improves performance first. With many salts per one GPU call, we reduce number of GPU calls. And we have to reduce work size to make sure all data(~salts*sizeof(salts)*keys*sizeof(keys)) can put in global memory. The final result of performance should be examined. > 4. Try to optimize XSHA512 further by storing/caching the loaded hashes > in GPU, so that you can merge cmp_all()'s on-GPU code into crypt_all()'s > and avoid the call to GPU in cmp_all(). With very large > max_keys_per_crypt this won't matter (the calls are infrequent anyway), > but this change may result in the optimal max_keys_per_crypt value > becoming a little bit lower (you'll need to re-tune it), which may > improve CPU cache usage (resulting in a little bit of speedup). > Good one. I will try it( and 5, 6 you proposed) > 7. Proceed to implement XSHA (similar to XSHA512, but based on SHA-1). > Since these hashes are a lot faster to compute on GPU, they will > highlight the various bottlenecks much better (and will make the need > for dealing with those more pressing). (By the way, XSHA should be > added to your proposed GSoC timeline.) Okay. I will add it to my GSoC timeline and implement it later. 8. Implement XSHA512 (and then XSHA) in OpenCL as well. Try it on AMD > GPUs. This may also highlight the bottlenecks better (if run on a card > like 7970, which should be about 3x faster than GTX 580 at SHA-1). > (Yes, it is desirable that you get a fast AMD card as well.) Here are my questions regarding the hardware. I will go abroad for my graduate study in Sept. I will buy a new laptop. However, it is probably with a NVidia Card. I also can buy a computer with AMD card. But I don't know which one is enough for my future development. 7970 is the best but expensive. I cannot afford for it now(So much money since I have to buy a new computer. If Google pays me, I think I can :) ). I will look for any server with a AMD card I could use. And I also want to know whether or not you could provide me access to a fast card? > 9. Since XSHA512 and XSHA use only a single instance of SHA-512 or > SHA-1, it should be possible to reverse the last few rounds of the > computation (perform the reverse in binary() on CPU when loading the > hashes). Then you have fewer rounds to compute at runtime (on GPU). > (BTW, this optimization will apply to CPU-only implementations as well, > so we'll need it there as well, but we'll have to move away from > OpenSSL's to our own code for SHA-512 and SHA-1 first. If you show how > many rounds may be reversed and how in your GPU code, someone else on > our team may do it for CPU, or vice versa.) > It is on my to-do list. :) > > Of course, there will be many more sub-tasks (for the summer if we > proceed to work under a summer program), but the above are some that I > think you can work on now. > > How does this sound to you? It is perfect! I will soon start with some of them. I will update my timeline on melange with what you said. Many thanks for you help! Thanks! Dongdong Li Content of type "text/html" skipped
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.