Openwall GNU/*/Linux - a small security-enhanced Linux distro for servers
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Thu, 27 Jun 2013 18:31:37 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Re: latest version of descrypt project on reddit

On Thu, Jun 27, 2013 at 07:23:44PM +0530, Sayantan Datta wrote:
> I profiled the kernel using codeXL.
> 
> When we have all 16 rounds unrolled put under single iteration , cache 
> hit is +99%. This is supported by the evidence that there is 0% memory 
> unit stalls and very little fetch from video memory.This corresponds to 
> the first case(4694Mkeys/s).
> 
> Next when we put the 16 rounds of des in a 25 iter loop the cache hit 
> suddenly drops to 1%.  Now the memory unit is stalled 23% of the time 
> and video memory fetches are increased by nearly 100x.  This would be 
> the second case(117 Mkeys/s).

This makes me wonder: what if instead of the 25 iter loop, you unroll
the entire thing - that is, repeat the 16 rounds 25 times.

I think 16 rounds already exceed I-cache size, yet with non-looping
kernel like that the hardware somehow manages to avoid most cache misses -
perhaps by reusing each portion of fetched code many times (due to the
high GWS).  Perhaps we can make use of this feature for the entire
descrypt just as easily?

> We can increase the cache hit back to +90% with almost 0 memory unit 
> stalls if we can somehow unroll only 4 rounds and put it under 
> appropriate iteration count.

That's tough.  We can easily do it for 8 rounds (in fact, that's what we
already do in descrypt-opencl), but when we reduce to 4, we have to use
non-constant indices into B[].

BTW, can you check the I-cache hit rate for the current descrypt-opencl?

> With 4 rounds unrolled and put under a 100 iteration loop.
> 
> Run time: 0.112433 s
> Rate: 298.438078 Mkeys/s
> Time to search keyspace: 2794.549328 days

Yeah, so that's roughly our target speed for descrypt-opencl.

> Also I did some testing with non-hardcoded unroll of 4 rounds where the 
> bitsliced keys are indirectedly accessed by GPRs. The result was similar 
> to what we are currently getting in our des implementation.

This is my "we have to use non-constant indices into B[]" above, right?

Can you check the I-cache hit rate for this version of the code as well?

> Now under actual scenario we can't hardcode 4 rounds alone. Is there 
> anything else we can do to increase the cache hit and reduce memory unit 
> stalls?

That crazy idea with full 16*25 = 400 rounds unroll, along with very
high GWS (perhaps higher than what we use in descrypt-opencl now).  Give
it a try.  It may totally kill performance on other GPUs, but based on
hints so far I expect that GCN might perform very well.

Alexander

Powered by blists - more mailing lists

Your e-mail address:

Powered by Openwall GNU/*/Linux - Powered by OpenVZ