Date: Tue, 19 Mar 2013 02:21:29 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Idea to increase plaintext length for GPU based hashes

On 18 Mar, 2013, at 19:26 , Brian Wallace <bwall@...nbwall.com> wrote:
> One issue I am finding with many of the GPU-based formats Jumbo supports is that they have quite small maximum plaintext lengths. I think the reason is that, in general, there is a big performance hit when scaling up the max password length because of the additional memory required.

It's due to transfer latency, not the memory use itself. That is probably what you meant, but the distinction matters for my ideas below.

> For the most part, we use a single buffer to store the plaintexts, generally a statically sized char array plus a value describing the length. This proposal adds a second buffer and changes the layout of the first.
> 
> The first will be filled with fixed-size objects that contain the length of the password and the offset of the password in the other buffer (the pointer buffer):
> struct PointerObject{
>    unsigned int length;
>    unsigned int offset;
> };
> 
> The second will have the passwords packed back to back (no need for a terminating null) and will not have a static size; it differs for each executed kernel (the data buffer).

I've had similar thoughts. So far I had rejected the idea because I did not think it would fly in practice; I'm not sure how I came to that conclusion, so we should definitely try it. The greatest obstacle might be set_key() performance - this function is a bottleneck already. Maybe it won't get much slower, though?
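
On the host side it could look something like this (just a sketch; pointer_object follows your struct, and the other names and sizes are made up, not existing format code):

	/* Sketch only - sizes and names are illustrative, not actual format
	   code. Keys are packed back to back in data_buffer; pointer_buffer
	   holds (length, offset) pairs as in the proposed struct. */
	#include <string.h>

	#define MAX_KEYS	(1 << 20)
	#define MAX_DATA	(MAX_KEYS * 32)	/* worst case, tune per format */

	typedef struct {
		unsigned int length;
		unsigned int offset;
	} pointer_object;

	static pointer_object pointer_buffer[MAX_KEYS];
	static char data_buffer[MAX_DATA];
	static unsigned int data_used;	/* bytes packed so far; reset per batch */

	static void set_key(char *key, int index)
	{
		unsigned int len = (unsigned int)strlen(key);

		pointer_buffer[index].length = len;
		pointer_buffer[index].offset = data_used;
		memcpy(data_buffer + data_used, key, len);	/* no NUL needed */
		data_used += len;
	}

In crypt_all() we would then transfer the pointer buffer plus only the data_used bytes of the data buffer, which is what should keep the transfer small even with a long maximum plaintext length.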

Another approach (not necessarily mutually exclusive with yours) would be to split the transfer. Let's say we have a work size of 1M. At, say, the 256K'th call to set_key(), we could initiate a transfer of this first quarter of the keys to the GPU. This transfer would not stall the host side; it would take place while we continue with the next 256K keys, and so on. If we can balance this properly we should get rid of much of the transfer delay. Maybe we should split it in 8 or 16, maybe fewer.
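
Very roughly, something like this (again only a sketch; the chunk size, slot size and the names queue, key_buffer and host_keys are illustrative, standing in for whatever the format already has):

	/* Sketch of the split transfer - every 256K calls, kick off a
	   non-blocking write of the chunk we just finished, so the copy
	   overlaps with packing the next chunk on the host. */
	#include <string.h>
	#include <CL/cl.h>

	#define CHUNK		(256 * 1024)
	#define KEY_SLOT	64	/* bytes reserved per key on the GPU */

	static cl_command_queue queue;	/* the format's existing queue */
	static cl_mem key_buffer;	/* device-side key buffer */
	static char *host_keys;		/* host-side staging area */

	static void set_key(char *key, int index)
	{
		memcpy(host_keys + (size_t)index * KEY_SLOT, key, strlen(key) + 1);

		if ((index + 1) % CHUNK == 0) {
			size_t start = (size_t)(index + 1 - CHUNK) * KEY_SLOT;

			/* CL_FALSE = non-blocking; we never touch this chunk
			   again, so filling the next one meanwhile is safe. */
			clEnqueueWriteBuffer(queue, key_buffer, CL_FALSE, start,
			                     (size_t)CHUNK * KEY_SLOT,
			                     host_keys + start, 0, NULL, NULL);
			clFlush(queue);	/* get the DMA going right away */
		}
	}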

Also, most devices are capable of transferring data in one or two directions while running a kernel. So instead of this in crypt_all():

	clEnqueueNDRangeKernel( 1M keys );
	clEnqueueReadBuffer( 1M results );

We could possibly do this:

	clEnqueueNDRangeKernel( 256K keys );
	clEnqueueReadBuffer( 256K results );
	clEnqueueNDRangeKernel( 256K keys );
	clEnqueueReadBuffer( 256K results );
	clEnqueueNDRangeKernel( 256K keys );
	clEnqueueReadBuffer( 256K results );
	clEnqueueNDRangeKernel( 256K keys );
	clEnqueueReadBuffer( 256K results );
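
Fleshed out a little, it might look like the sketch below (error checking omitted; queue, read_queue, crypt_kernel, result_buffer, host_results and RESULT_SIZE are illustrative names). Note that with a single in-order queue the interleaved commands would still serialize, so the reads probably need their own queue, gated on events:

	/* Sketch only: a chunked crypt_all() with kernels on one queue and
	   non-blocking reads on a second queue, so the read for chunk i can
	   overlap the kernel for chunk i+1. */
	#include <CL/cl.h>

	#define RESULT_SIZE	16	/* bytes of output per key, per format */

	static cl_command_queue queue, read_queue;	/* two in-order queues */
	static cl_kernel crypt_kernel;
	static cl_mem result_buffer;
	static char *host_results;

	static void crypt_all_chunked(size_t total_keys)
	{
		size_t chunk = total_keys / 4;
		cl_event done[4];
		size_t i;

		for (i = 0; i < 4; i++) {
			size_t offset = i * chunk;

			/* Global work offset per chunk (needs OpenCL 1.1+). */
			clEnqueueNDRangeKernel(queue, crypt_kernel, 1, &offset,
			                       &chunk, NULL, 0, NULL, &done[i]);
			clFlush(queue);	/* start this chunk now */

			/* CL_FALSE = non-blocking, gated on the kernel's event,
			   so it can run on the copy engine while the next kernel
			   computes. */
			clEnqueueReadBuffer(read_queue, result_buffer, CL_FALSE,
			                    offset * RESULT_SIZE,
			                    chunk * RESULT_SIZE,
			                    host_results + offset * RESULT_SIZE,
			                    1, &done[i], NULL);
		}
		clFinish(read_queue);	/* all results in place before cmp_all() */
		for (i = 0; i < 4; i++)
			clReleaseEvent(done[i]);
	}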

I haven't tried any of this yet. I did try juggling two sets of input and output buffers, but that did not make for much improvement. Maybe the above would do better.

magnum
