Date: Fri, 19 Oct 2012 00:31:26 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: bitslice DES on GPU On Thu, Oct 18, 2012 at 10:21:11PM +0530, Sayantan Datta wrote: > On Thu, Oct 18, 2012 at 11:47 AM, Solar Designer <solar@...nwall.com> wrote: > > > It would imply something other than bitslicing for K as well. And the > > S-boxes would be represented differently, too. But I think you should > > stop thinking in this direction. I don't expect there's another useful > > representation inbetween bitslicing and straightforward table lookups. > > The primary reason for thinking in direction is to decrease the load on > registers and local memory so that there are more number of inflight > wavefront. Sure, but this is a dead end. > Also decreasing the size of B to 16 ints would almost ensure > that it really stays in private register space. Right now I doubt the B > arrays are stored in register address space I don't expect them to. They are not in registers on CPUs as well, and that's fine. Well, GPUs' local memory presumably has much higher latency than CPUs' L1 data cache, and maybe some portion of B, K, or/and E elements can be in registers - just not all at once. > because each VGPR is only 256bit wide. Aren't they 512-bit wide (on GCN)? > However that depends entirely on how the VGPRs are used. For > SIMD execution I think it is more logical to assume that each VGPR is > loaded with data from 16 different kernels and not from only one kernel. Isn't that the only possibility? > Still if the array is not in register space we could be loosing lots of > performance. Indeed if we could keep everything in registers that would be great. It's just not realistic. I'd expect optimal performance to be reached when we're maximizing the number of in-flight wavefronts up to the point where we're almost fully using both registers and LDS. Staying only in registers and thus having fewer wavefronts is likely slower. You can experiment with both of these settings, though - in fact, perhaps you already have (even if not targeting them intentionally, but just tuning keys_per_crypt). > Also it could be possible to use 4 array of 16 ints to > represent the B array of 64 ints. I think you wouldn't gain anything from that. > Since the 96 index array is almost constant the indexing could be > done prior to execution. However it might require some radical approach > like pre compiling the kernels manually. Yes, or patching binary kernels at runtime. This is something I had in mind before we started. I previously tested this on CPU, for a 7% speedup over what we currently have in JtR. On GPU, the speedup will likely be much greater. You may measure it by temporarily (in a copy of the code) eliminating E and hard-coding the indices for a certain salt value (it is easiest to do this for salt=0 - just take the indices out of the LM hash code and use the test hash with ".." for the salt). Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.