john-dev - Re: Re: latest version of descrypt project on reddit



Message-ID: <51CC882C.9040707@gmail.com>
Date: Fri, 28 Jun 2013 00:15:00 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Re: latest version of descrypt project on reddit

On Thursday 27 June 2013 08:01 PM, Solar Designer wrote:
> On Thu, Jun 27, 2013 at 07:23:44PM +0530, Sayantan Datta wrote:
>> I profiled the kernel using CodeXL.
>>
>> When we have all 16 rounds unrolled and put under a single iteration, the
>> cache hit rate is above 99%. This is supported by the evidence that there
>> are 0% memory unit stalls and very little fetching from video memory. This
>> corresponds to the first case (4694 Mkeys/s).
>>
>> Next, when we put the 16 rounds of DES in a 25-iteration loop, the cache
>> hit rate suddenly drops to 1%. Now the memory unit is stalled 23% of the
>> time and video memory fetches increase by nearly 100x. This would be
>> the second case (117 Mkeys/s).
> This makes me wonder: what if instead of the 25 iter loop, you unroll
> the entire thing - that is, repeat the 16 rounds 25 times.
>
> I think 16 rounds already exceed I-cache size, yet with non-looping
> kernel like that the hardware somehow manages to avoid most cache misses -
> perhaps by reusing each portion of fetched code many times (due to the
> high GWS).  Perhaps we can make use of this feature for the entire
> descrypt just as easily?

400 round unroll with much higher GWS:
Run time: 6.897884 s
Rate: 77.831245 Mkeys/s
Time to search keyspace: 10715.490027 days
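
As a sanity check on the figures above, the sketch below recomputes the
rate and the keyspace time from the run time. The key count of 2^29 per
run and the keyspace of 2^56 are assumptions, inferred only because they
reproduce the printed numbers; the benchmark output does not state them.
The function names are hypothetical.

```python
# Assumed values, inferred from the printed figures (not stated in the output).
KEYS_TESTED = 2 ** 29      # candidate keys processed in the timed run (assumption)
KEYSPACE = 2 ** 56         # 56-bit descrypt key space (assumption)
RUN_TIME_S = 6.897884      # "Run time" reported above


def rate_mkeys(keys, seconds):
    """Throughput in millions of keys per second."""
    return keys / seconds / 1e6


def days_to_search(keyspace, mkeys_per_s):
    """Days needed to cover the whole keyspace at the given rate."""
    return keyspace / (mkeys_per_s * 1e6) / 86400


rate = rate_mkeys(KEYS_TESTED, RUN_TIME_S)
days = days_to_search(KEYSPACE, rate)
print(f"Rate: {rate:.3f} Mkeys/s")
print(f"Time to search keyspace: {days:.1f} days")
```

Under these assumptions the result comes out close to the reported rate
and keyspace time.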

Stats: cache hit 92%, FetchSize: 891 MB, 0% memory stall.
This is a little harder to explain. On one hand we have a very good cache
hit rate, but on the other hand the fetch size is extremely large. In the
best-case scenario we had a cache hit rate above 90% and a fetch size of
around 50 KB (for a 4-round unroll under 100 iterations).
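
One possible way to reconcile a high hit rate with a large fetch size: if
FetchSize counts only the traffic caused by cache misses (an assumption
about how the profiler reports it, not something confirmed here), then the
total traffic requested through the cache is roughly
FetchSize / (1 - hit rate). A minimal sketch, with a hypothetical helper
name and the rounded percentage quoted above:

```python
def requested_traffic_mb(fetch_mb, hit_rate):
    """Estimate total traffic through the cache from miss traffic.

    Assumes fetch_mb is miss traffic only and hit_rate is a fraction
    in the range [0, 1). Both assumptions are illustrative.
    """
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return fetch_mb / (1.0 - hit_rate)


# 400-round unroll: 92% hits, 891 MB fetched.
total_mb = requested_traffic_mb(891, 0.92)
print(f"Estimated requested traffic: {total_mb / 1000:.1f} GB")
```

Under that assumption the 400-round kernel would be requesting on the
order of 11 GB in total, so an 8% miss rate can still mean hundreds of
megabytes fetched.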

>> We can increase the cache hit back to +90% with almost 0 memory unit
>> stalls if we can somehow unroll only 4 rounds and put it under
>> appropriate iteration count.
> That's tough.  We can easily do it for 8 rounds (in fact, that's what we
> already do in descrypt-opencl), but when we reduce to 4, we have to use
> non-constant indices into B[].
>
> BTW, can you check the I-cache hit rate for the current descrypt-opencl?

The current kernel has a cache hit rate of 26% for the fully hardcoded
kernel, but this includes all cache hits and not just the I-cache. However,
I am fairly sure it is mainly I-cache, because there is no or very little
data being fetched from global memory. Also, the amount fetched is very
high, nearly 13 GB.

For a hardcoded but not fully unrolled kernel the cache hit rate is just 2%,
but the amount fetched is also low, around 238 MB.
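
One textbook effect that can drive the hit rate of looping code towards
zero is LRU thrashing: when a loop body is even slightly larger than the
cache, every line is evicted just before it is needed again. The toy model
below illustrates only that effect. Whether the GPU's instruction cache
uses LRU replacement is an assumption, and the cache size and loop sizes
are made-up values, not measurements.

```python
from collections import OrderedDict


def lru_hit_rate(trace, capacity):
    """Replay a trace of cache-line ids through an LRU cache; return hit fraction."""
    cache = OrderedDict()
    hits = 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return hits / len(trace)


CAPACITY = 64                        # hypothetical cache size, in lines
big_loop = list(range(65)) * 100     # loop body one line larger than the cache
small_loop = list(range(16)) * 100   # loop body that fits comfortably

print(lru_hit_rate(big_loop, CAPACITY))    # 0.0
print(lru_hit_rate(small_loop, CAPACITY))  # 0.99
```

In this model the oversized loop never hits, while the small loop misses
only on its first pass.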

Note that these kernels also include the comparison and password generation code.

Regards,
Sayantan
