Date: Sun, 19 Aug 2012 18:56:54 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: bitslice DES on GPU Sayantan - On Sun, Aug 19, 2012 at 07:52:19PM +0530, Sayantan Datta wrote: > I've made some progress with DES BS using the above mentioned > implementation. The cpu simulation with KPC set to multiples 32 runs fine. Sounds good. > Now its time to port things to openCL. I have a few more questions: > > 1.Is it necessary to port DES_bs_set_salt() to gpu? I think this question shouldn't be asked on its own. You should also decide on what you do with E. Do you make it an array of pointers or integers? If integers, then you may do the set_salt() just once and then reuse that array for all work-items. Then it makes sense to do set_salt() on CPU. However, if you use pointers (or integers with per-work-item base offsets pre-added to them), then I guess you'd benefit from having those slightly different E arrays computed by each work-item - thus, on GPU. > 2. The DES_bs_crypt_25 function has too much branching in it. Is it > possible to reduce the number of branching in DES_bs_crypt_25() function? I think you're wrong about it having "too much" branching. The branching that it has is not data-dependent, and there's just one taken branch per loop iteration. So this should be GPU-friendly as written. That said, in general, the amount of branching may be reduced by unrolling the loop further - like I mentioned when we were discussing the key schedule. If you unroll all 16 DES rounds (as opposed to just 2, which the code currently does), you'll have fewer branches per crypt and you'll also avoid the need for having the 768-element array (the indices into the 56-element key array will be right in the code). However, you might exceed instruction cache size, which would be a huge performance hit. On typical CPUs, the 2x unroll fits in L1 instruction cache, but a 16x unroll would not. This reminded me of the following, although it's not relevant to your early experiments at all: Another trick, which I haven't mentioned yet, is to wrap each S-box invocation in its own tiny loop - over wider virtual vectors. JtR currently does this when built for PA-RISC (see pa-risc.h), based on my testing on my old 712/80 many years ago. That's because PA-RISC systems tend to have large yet off-chip L1 caches, combined with larger than typical for other CPUs instruction prefetch buffers (to compensate for the off-chipness of L1 cache). Thus, those tiny loops fit in the 1 KB instruction prefetch buffer on the 712/80 (PA-7100LC CPU), resulting in approx. 30% speedup over the code version that we use on other systems. Now I don't expect that the exact same approach would make sense on GPUs, but this trick could be used to allow for 16x unrolling the entire DES loop without caring that much about it fitting in cache or not, since the individual inner loops will fit anyway. Those loops don't have to be per-S-box; they can as well be per several S-box lookups, per DES round, or per several DES rounds. Yet you'd be able to have the larger loop with hard-coded 768 indices around those smaller loops. A problem with this is that those wider than 32-bit virtual vectors will need more memory, and there might not be enough local memory. (This was not an issue on the PA-RISC system because it did not have any on-chip data cache anyway, and the off-chip L1 data cache was relatively large.) As you can see, there are a lot of different approaches you could try later - and see what works best. Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.