Date: Mon, 17 Jun 2013 21:22:21 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Mask mode for GPU

On Mon, Jun 17, 2013 at 10:10:35AM +0530, Sayantan Datta wrote:
> In a little experiment I simply put the kernel inside a 10 iter loop.
> Nothing else is changed but somehow I was only able to get 20M c/s
> (with updated *pcount) whereas I can get around 80M c/s with just one
> loop.  I think it may be a problem with the opencl compiler trying to
> unroll the 10 loops, causing i-cache overrun.

Oh.  Can you investigate that and try to get around it perhaps with
different source code layout?

BTW, out of different GPUs, AMD GCN GPUs appeared to tolerate the code
exceeding I-cache size pretty well - although perhaps not in all cases.
I think the DES cracker referenced from that Reddit thread I mentioned
did exceed I-cache size, but it ran pretty fast on GCN GPUs (and only on
those...), I guess due to each fetch into the I-cache being made use of
multiple times (by multiple wavefronts).

> In my approach I was only going to generate 32 keys per work item and
> avoid any loop inside the kernel for descrypt.  Whereas in your
> approach we would be generating 26*26 passwords per kernel, requiring
> (26*26)/32 iterations per kernel invocation.

Oh, I overlooked the fact that with bitslice DES you obviously need to
process a multiple of 32 candidate passwords per work-item.

A drawback of your approach is that you'll probably end up doing integer
division and modulo on get_global_id(), which may be slow on GPUs.  Also,
you may have to
compute all of the GPU-generated characters (such as 2 or 3) in each
work-item, whereas with my suggested approach you'd only update one
character in most loop iterations.

A hybrid approach might work best - leaving the somewhat slow computation
mentioned above outside of a loop iterating over one or two characters.
Yes, you'd need to deal with the 80M to 20M c/s slowdown you identified
above somehow... in fact, since 80M c/s is already more than 3x below
our target speed, you'd need to figure out why this code runs slower
than it should and try to fix it either way.

Alexander
