Date: Mon, 18 Mar 2013 21:52:08 -0400
From: Brian Wallace <bwall@...nbwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Idea to increase plaintext length for GPU based hashes

On 03/18/2013 09:21 PM, magnum wrote:
> It's due to transfer latency, not the memory use itself. That was
> probably what you meant, but anyway it matters for my ideas below.

Yeah, what I meant is the time it takes to transfer it.

> I've had similar thoughts. For some reason I rejected the idea so far
> because I did not think it would fly IRL. I'm not sure how I came to
> that conclusion; we should definitely try it. The greatest obstacle
> might be set_key() performance - this function is a bottleneck
> already. Maybe it won't get much slower, though?

I'm not sure it would be slower. If we maintain a pointer to the
position of the last character added to a (max keys * max size)
buffer, as well as an integer holding the current offset, we can do
the writes with little computation. Since the size of a PointerObject
is static, a single buffer can be used, with the key data placed after
the space the last PointerObject would fill.

> Another approach (not necessarily mutually exclusive with yours)
> would be to split the transfer. Let's say we have a work size of 1M.
> At, say, the 256K'th call to set_key(), it could initiate a transfer
> of this first fourth of the keys to the GPU. This transfer will not
> stall the host side; it will take place while we continue with the
> next 256K keys. And so on. If we can balance this properly, we should
> get rid of much of the transfer delay. Maybe we should split it in 8
> or 16, maybe less.
>
> Also, most devices are capable of transferring data in one or two
> directions while running a kernel.
> So instead of this in crypt_all():
>
> clEnqueueNDRangeKernel( 1M keys );
> clEnqueueReadBuffer( 1M results );
>
> We could possibly do this:
>
> clEnqueueNDRangeKernel( 256K keys );
> clEnqueueReadBuffer( 256K results );
> clEnqueueNDRangeKernel( 256K keys );
> clEnqueueReadBuffer( 256K results );
> clEnqueueNDRangeKernel( 256K keys );
> clEnqueueReadBuffer( 256K results );
> clEnqueueNDRangeKernel( 256K keys );
> clEnqueueReadBuffer( 256K results );

We can combine these ideas: split the buffers, but fill each one with
the PointerObjects at the beginning and the key data in the second
half. That keeps the memory transfers down to a single relevant call
per chunk, while still taking advantage of the transfer/computation
concurrency. Not to mention, we can use both of these techniques not
only to compute keys faster, but also to allow bigger keys. :)
Theoretically, of course. I might have to carve out some time in my
schedule to test these.

-Brian Wallace