john-dev - Re: Idea to increase plaintext length for GPU based hashes



Message-ID: <5147C4C8.70201@openbwall.com>
Date: Mon, 18 Mar 2013 21:52:08 -0400
From: Brian Wallace <bwall@...nbwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Idea to increase plaintext length for GPU based hashes


On 03/18/2013 09:21 PM, magnum wrote:
> It's due to transfer latency, not the memory use itself. That was 
> probably what you meant, but anyway it matters for my ideas below. 
Yeah, what I meant is the time it takes to transfer it.
> I've had similar thoughts. For some reason I rejected the idea so far because I did not think it will fly IRL. I'm not sure how I came to that conclusion, we should definitely try it. The greatest obstacle might be set_key() performance - this function is a bottle-neck already. Maybe it won't get much slower though?
I'm not sure it would be slower.  If we maintain a pointer to the 
position of the last character added to a (max keys * max size) buffer, 
as well as an integer defining the offset, we can do the writes with 
little computation.  Since the size of each PointerObject would be 
static, a single buffer could be used, with the key data placed after 
the point where the last PointerObject would end.
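
As a rough illustration of the layout described above, here is a minimal 
host-side sketch in plain C.  Everything in it is an assumption made for 
illustration: the names pointer_object, packed_keys, packed_set_key() and 
packed_transfer_size(), and the tiny MAX_KEYS / MAX_KEY_LEN limits, are 
hypothetical and not taken from the John the Ripper source.  It only shows 
that a set_key()-style write reduces to one memcpy plus two integer stores, 
and that only the bytes actually used would need to be transferred.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYS    4   /* hypothetical, kept tiny for illustration */
#define MAX_KEY_LEN 64  /* hypothetical upper bound per key */

/* Hypothetical fixed-size descriptor, one per key (the "PointerObject"). */
typedef struct {
    uint32_t offset;  /* where the key starts inside the data region */
    uint32_t length;  /* key length in bytes */
} pointer_object;

/* One contiguous buffer: descriptors first, packed key bytes after them. */
typedef struct {
    pointer_object index[MAX_KEYS];
    uint8_t data[MAX_KEYS * MAX_KEY_LEN];
    uint32_t write_pos;  /* offset just past the last key byte written */
} packed_keys;

static void packed_reset(packed_keys *pk)
{
    assert(pk != NULL);
    pk->write_pos = 0;
}

/* Append one key; returns 0 on success, -1 if it does not fit. */
static int packed_set_key(packed_keys *pk, int idx, const char *key)
{
    size_t len = strlen(key);

    if (idx < 0 || idx >= MAX_KEYS ||
        len > sizeof(pk->data) - pk->write_pos)
        return -1;

    pk->index[idx].offset = pk->write_pos;
    pk->index[idx].length = (uint32_t)len;
    memcpy(pk->data + pk->write_pos, key, len);
    pk->write_pos += (uint32_t)len;
    return 0;
}

/* Bytes that would need to go to the device: descriptors plus used data. */
static size_t packed_transfer_size(const packed_keys *pk)
{
    return sizeof(pk->index) + pk->write_pos;
}
```

With short keys the transfer size stays close to the descriptor table 
plus the sum of the key lengths, rather than max keys * max size.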
> Another approach (not necessarily mutually exclusive with yours) would be to split the transfer. Let's say we have a work size of 1M. At, say, the 256K'th call to set_key(), it could initiate a transfer of this first fourth of keys to GPU. This transfer will not stall the host side, it will take place while we continue with the next 256K keys. And so on. If we can balance this properly we should get rid of much of the transfer delay. Maybe we should split it in 8 or 16, maybe less.
>
> Also, most devices are capable of transferring data in one or two directions while running a kernel. So instead of this in crypt_all():
>
> 	clEnqueueNDRangeKernel( 1M keys );
> 	clEnqueueReadBuffer( 1M results );
>
> We could possibly do this:
>
> 	clEnqueueNDRangeKernel( 256K keys );
> 	clEnqueueReadBuffer( 256K results );
> 	clEnqueueNDRangeKernel( 256K keys );
> 	clEnqueueReadBuffer( 256K results );
> 	clEnqueueNDRangeKernel( 256K keys );
> 	clEnqueueReadBuffer( 256K results );
> 	clEnqueueNDRangeKernel( 256K keys );
> 	clEnqueueReadBuffer( 256K results );
We can combine these ideas: split the buffers, and fill each one with 
the PointerObjects at the beginning and the keys in the second half.  
That keeps the memory transfer calls to a single relevant one, but also 
lets us take advantage of the transfer/computation concurrency that is 
available.  Not to mention, we can use both these techniques to not 
only compute keys faster, but allow bigger keys. :)

Theoretically, of course.  I might have to carve out some time in my 
schedule to test these.
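
To make the chunk bookkeeping from the quoted 1M / 256K example concrete, 
here is a small sketch in plain C.  It deliberately contains no OpenCL 
calls, and the names chunk_range, chunk_for() and chunk_completed_by() are 
hypothetical.  The assumption is that a set_key()-style function would 
check after each key whether a chunk has just been filled and, if so, start 
a non-blocking transfer of that chunk (for example clEnqueueWriteBuffer 
with blocking_write set to CL_FALSE) while the host keeps filling the next 
one.

```c
#include <assert.h>
#include <stddef.h>

#define WORK_SIZE  (1u << 20)  /* 1M keys, as in the quoted example */
#define NUM_CHUNKS 4           /* 4 chunks of 256K keys; 8 or 16 also possible */
#define CHUNK_KEYS (WORK_SIZE / NUM_CHUNKS)

/* Range of key indices covered by one chunk. */
typedef struct {
    size_t first_key;
    size_t key_count;
} chunk_range;

/* Key range for a given chunk number. */
static chunk_range chunk_for(size_t chunk)
{
    chunk_range r;

    assert(chunk < NUM_CHUNKS);
    r.first_key = chunk * CHUNK_KEYS;
    r.key_count = CHUNK_KEYS;
    return r;
}

/*
 * Called after key number idx has been stored.  Returns the number of
 * the chunk that this key completed, or -1 if the current chunk is not
 * full yet.  A caller would start the asynchronous transfer of the
 * returned chunk at this point (assumption, not shown here).
 */
static int chunk_completed_by(size_t idx)
{
    if ((idx + 1) % CHUNK_KEYS != 0)
        return -1;
    return (int)(idx / CHUNK_KEYS);
}
```

The same ranges could drive the per-chunk kernel launches and result 
reads in crypt_all(), so that each chunk's transfer overlaps with work on 
another chunk.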
-Brian Wallace
