Date: Sun, 24 Jun 2012 01:26:11 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: async key transfers to GPU (was: Weekly report 1)

myrice, Samuele -

On Tue, Apr 17, 2012 at 08:34:45PM +0400, Solar Designer wrote:
> On Tue, Apr 17, 2012 at 03:10:07PM +0800, myrice wrote:
> > 4. Tried async cpy on GPU but no performance gains, still keep tuning.
> 
> IIRC, what you tried was not supposed to result in any speedup because
> your GPU code was invoked and required the data to be already available
> right after you started the async copy - so you had it waiting for data
> right at that point anyway.
> 
> Lukas' code was different: IIRC, he split the buffered candidate
> passwords in three smaller chunks, where two of the three may in fact be
> transferred to the GPU asynchronously while the previous chunk is being
> processed.  You may implement that too, and I suggest that you make the
> number of chunks to use configurable and try values larger than 3 (e.g.,
> 10 might be reasonable - letting you hide the latency for 9 out of 10
> transfers while hopefully not exceeding the size of a CPU data cache).

While thinking of formats interface enhancements to make this more
efficient, I realized that full efficiency may already be achieved by
splitting the set of keys into only two chunks and starting the transfer
of the first chunk when set_key() is called for index ==
max_keys_per_crypt / 2 - 1.  Do it right from that set_key() call.
Then crypt_all() will start by initiating transfer of the second chunk
and hashing of the first chunk (which may be already in GPU by that
time), and then proceed to hash the second chunk (its transfer to GPU
may complete while the first chunk is being hashed).  (You'll need to
handle the special case where fewer than max_keys_per_crypt or even fewer
than max_keys_per_crypt / 2 keys are tried per crypt_all() call.  Don't
optimize for this case; just make sure it works properly as well, even
though no real async transfers happen then.  This is not difficult.)
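
To make this concrete, here is a rough standalone CUDA sketch of the
two-chunk idea.  It is not the actual format code - names such as
keys_host, keys_dev, hash_kernel, KEYS, and KEY_LEN are made up for
illustration, the key layout is simplified to fixed-size slots, and
error checking is omitted:

/*
 * Minimal standalone sketch of the two-chunk scheme described above.
 * All identifiers are illustrative; a real format's set_key() and
 * crypt_all() would drive the same pattern.
 */
#include <cuda_runtime.h>
#include <string.h>

#define KEYS     1024          /* stands in for max_keys_per_crypt */
#define KEY_LEN  32            /* fixed-size key slots for simplicity */
#define HALF     (KEYS / 2)

static char *keys_host;        /* pinned host memory - required for truly async copies */
static char *keys_dev;
static cudaStream_t stream[2]; /* one stream per chunk so copy and kernel may overlap */

__global__ void hash_kernel(const char *keys, int count)
{
	/* placeholder for the real hashing kernel */
}

static void set_key(const char *key, int index)
{
	strncpy(keys_host + index * KEY_LEN, key, KEY_LEN);

	/* When the first half of the buffer is full, start its transfer;
	   it proceeds while set_key() keeps filling the second half. */
	if (index == HALF - 1)
		cudaMemcpyAsync(keys_dev, keys_host, HALF * KEY_LEN,
		    cudaMemcpyHostToDevice, stream[0]);
}

static void crypt_all(int count)
{
	int first = count < HALF ? count : HALF;
	int second = count - first;

	if (count < HALF)       /* special case: the trigger above never fired */
		cudaMemcpyAsync(keys_dev, keys_host, first * KEY_LEN,
		    cudaMemcpyHostToDevice, stream[0]);

	/* Start the second chunk's transfer in its own stream... */
	if (second > 0)
		cudaMemcpyAsync(keys_dev + HALF * KEY_LEN,
		    keys_host + HALF * KEY_LEN, second * KEY_LEN,
		    cudaMemcpyHostToDevice, stream[1]);

	/* ...and hash the first chunk, which may already be on the GPU;
	   the second chunk's copy overlaps with this kernel. */
	hash_kernel<<<(first + 255) / 256, 256, 0, stream[0]>>>(keys_dev, first);

	if (second > 0)
		hash_kernel<<<(second + 255) / 256, 256, 0, stream[1]>>>(
		    keys_dev + HALF * KEY_LEN, second);

	cudaStreamSynchronize(stream[0]);
	cudaStreamSynchronize(stream[1]);
}

int main(void)
{
	cudaHostAlloc((void **)&keys_host, KEYS * KEY_LEN, cudaHostAllocDefault);
	cudaMalloc((void **)&keys_dev, KEYS * KEY_LEN);
	cudaStreamCreate(&stream[0]);
	cudaStreamCreate(&stream[1]);

	for (int i = 0; i < KEYS; i++)
		set_key("password", i);
	crypt_all(KEYS);

	cudaStreamDestroy(stream[0]);
	cudaStreamDestroy(stream[1]);
	cudaFree(keys_dev);
	cudaFreeHost(keys_host);
	return 0;
}

Note that putting the two chunks on separate streams is what lets the
second chunk's copy overlap with the first chunk's kernel; with a single
stream the copy would simply queue behind the kernel launch order.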

I will likely make this more explicit in the formats interface, which is
needed to support trickier things such as overlapping CPU/GPU
computation for WPA-PSK and MSCash2, but for now the hack suggested
above should just work to fully hide the latency of transfers to GPU as
long as we're able to generate and transfer the candidate passwords fast
enough at all.  Can you please try it out?

Thanks,

Alexander
