john-dev - Re: Re: latest version of descrypt project on reddit

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20130627143137.GA22821@openwall.com>
Date: Thu, 27 Jun 2013 18:31:37 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Re: latest version of descrypt project on reddit

On Thu, Jun 27, 2013 at 07:23:44PM +0530, Sayantan Datta wrote:
> I profiled the kernel using codeXL.
> 
> When we have all 16 rounds unrolled put under single iteration , cache 
> hit is +99%. This is supported by the evidence that there is 0% memory 
> unit stalls and very little fetch from video memory.This corresponds to 
> the first case(4694Mkeys/s).
> 
> Next when we put the 16 rounds of des in a 25 iter loop the cache hit 
> suddenly drops to 1%.  Now the memory unit is stalled 23% of the time 
> and video memory fetches are increased by nearly 100x.  This would be 
> the second case(117 Mkeys/s).

This makes me wonder: what if instead of the 25 iter loop, you unroll
the entire thing - that is, repeat the 16 rounds 25 times.

I think 16 rounds already exceed I-cache size, yet with non-looping
kernel like that the hardware somehow manages to avoid most cache misses -
perhaps by reusing each portion of fetched code many times (due to the
high GWS).  Perhaps we can make use of this feature for the entire
descrypt just as easily?

> We can increase the cache hit back to +90% with almost 0 memory unit 
> stalls if we can somehow unroll only 4 rounds and put it under 
> appropriate iteration count.

That's tough.  We can easily do it for 8 rounds (in fact, that's what we
already do in descrypt-opencl), but when we reduce to 4, we have to use
non-constant indices into B[].

BTW, can you check the I-cache hit rate for the current descrypt-opencl?

> With 4 rounds unrolled and put under a 100 iteration loop.
> 
> Run time: 0.112433 s
> Rate: 298.438078 Mkeys/s
> Time to search keyspace: 2794.549328 days

Yeah, so that's roughly our target speed for descrypt-opencl.

> Also I did some testing with non-hardcoded unroll of 4 rounds where the 
> bitsliced keys are indirectedly accessed by GPRs. The result was similar 
> to what we are currently getting in our des implementation.

This is my "we have to use non-constant indices into B[]" above, right?

Can you check the I-cache hit rate for this version of the code as well?

> Now under actual scenario we can't hardcode 4 rounds alone. Is there 
> anything else we can do to increase the cache hit and reduce memory unit 
> stalls?

That crazy idea with full 16*25 = 400 rounds unroll, along with very
high GWS (perhaps higher than what we use in descrypt-opencl now).  Give
it a try.  It may totally kill performance on other GPUs, but based on
hints so far I expect that GCN might perform very well.

Alexander

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.