Date: Thu, 27 Jun 2013 18:31:37 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: Re: latest version of descrypt project on reddit On Thu, Jun 27, 2013 at 07:23:44PM +0530, Sayantan Datta wrote: > I profiled the kernel using codeXL. > > When we have all 16 rounds unrolled put under single iteration , cache > hit is +99%. This is supported by the evidence that there is 0% memory > unit stalls and very little fetch from video memory.This corresponds to > the first case(4694Mkeys/s). > > Next when we put the 16 rounds of des in a 25 iter loop the cache hit > suddenly drops to 1%. Now the memory unit is stalled 23% of the time > and video memory fetches are increased by nearly 100x. This would be > the second case(117 Mkeys/s). This makes me wonder: what if instead of the 25 iter loop, you unroll the entire thing - that is, repeat the 16 rounds 25 times. I think 16 rounds already exceed I-cache size, yet with non-looping kernel like that the hardware somehow manages to avoid most cache misses - perhaps by reusing each portion of fetched code many times (due to the high GWS). Perhaps we can make use of this feature for the entire descrypt just as easily? > We can increase the cache hit back to +90% with almost 0 memory unit > stalls if we can somehow unroll only 4 rounds and put it under > appropriate iteration count. That's tough. We can easily do it for 8 rounds (in fact, that's what we already do in descrypt-opencl), but when we reduce to 4, we have to use non-constant indices into B. BTW, can you check the I-cache hit rate for the current descrypt-opencl? > With 4 rounds unrolled and put under a 100 iteration loop. > > Run time: 0.112433 s > Rate: 298.438078 Mkeys/s > Time to search keyspace: 2794.549328 days Yeah, so that's roughly our target speed for descrypt-opencl. > Also I did some testing with non-hardcoded unroll of 4 rounds where the > bitsliced keys are indirectedly accessed by GPRs. The result was similar > to what we are currently getting in our des implementation. This is my "we have to use non-constant indices into B" above, right? Can you check the I-cache hit rate for this version of the code as well? > Now under actual scenario we can't hardcode 4 rounds alone. Is there > anything else we can do to increase the cache hit and reduce memory unit > stalls? That crazy idea with full 16*25 = 400 rounds unroll, along with very high GWS (perhaps higher than what we use in descrypt-opencl now). Give it a try. It may totally kill performance on other GPUs, but based on hints so far I expect that GCN might perform very well. Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.