Date: Wed, 19 Aug 2015 19:51:35 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

Agnieszka,

On Wed, Aug 19, 2015 at 07:40:57AM +0300, Solar Designer wrote:
> On Wed, Aug 19, 2015 at 07:10:42AM +0300, Solar Designer wrote:
> > I think the modulo division operations are causing a lot of latency:
> > 
> > [solar@...er opencl]$ fgrep % argon2*.cl
> > argon2d_kernel.cl:              reference_block_offset = (phi % r);
> > argon2i_kernel.cl:              uint reference_block_index = addresses[0] % r;
> > argon2i_kernel.cl:              uint reference_block_index = addresses[i] % r;
> 
> You might also achieve speedup by moving these operations up in code, to
> be performed as soon as their input data is available.  Maybe the
> compiler already does it for you, or maybe not.
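The hoisting idea above can be sketched in plain C (the function and value names here are hypothetical stand-ins, not the actual kernel code): issue the long-latency modulo as soon as its input is available, so the compiler or hardware can overlap it with independent work.

```c
#include <stdint.h>

/* Hypothetical stand-in for independent work (e.g. BLAKE2 rounds)
 * that doesn't depend on the modulo result. */
static uint32_t do_independent_work(uint32_t x)
{
	return x * 2654435761u;
}

uint32_t hoisted(uint32_t phi, uint32_t r, uint32_t state)
{
	/* Start the long-latency modulo as soon as phi is known... */
	uint32_t reference_block_offset = phi % r;

	/* ...so its latency can be hidden behind independent work
	 * scheduled after it. */
	uint32_t s = do_independent_work(state);

	return s ^ reference_block_offset;
}
```
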

Moreover, you may also prefetch the data pointed to by the index from
global memory sooner.  You have limited local or private memory to
prefetch to, but you probably do have it allocated for one block anyway,
and you can start fetching it sooner.  Or you can prefetch() into global
memory cache:

https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/prefetch.html

If you continue to use ulong2, then in 2d you may prefetch after 9 out
of 16 BLAKE2's.  With ulong8, you may prefetch after 12 out of 16.  With
ulong16, you can't... yet it might be optimal for other reasons (twice
the parallelism of ulong8, yet with the same total concurrent
instance count).
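The overlap pattern amounts to software pipelining: start fetching the next referenced block while the current one is still being processed.  Here is the pattern in plain C with double-buffering (block size and helper names are illustrative; on GPU the memcpy would instead be an early global-to-private load or an OpenCL prefetch() hint):

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS 128	/* a 1 KiB Argon2 block = 128 x 64-bit words */

/* Hypothetical compute step standing in for ComputeBlock. */
static uint64_t process(const uint64_t *blk)
{
	uint64_t acc = 0;
	for (int i = 0; i < BLOCK_WORDS; i++)
		acc += blk[i];
	return acc;
}

/* Double-buffered loop: copy ("prefetch") the next referenced block
 * into a local staging buffer while the current one is processed. */
uint64_t pipelined(const uint64_t *memory, const uint32_t *idx, int n)
{
	uint64_t buf[2][BLOCK_WORDS];
	uint64_t acc = 0;

	memcpy(buf[0], memory + (size_t)idx[0] * BLOCK_WORDS,
	    sizeof buf[0]);
	for (int i = 0; i < n; i++) {
		if (i + 1 < n)	/* start fetching the next block early */
			memcpy(buf[(i + 1) & 1],
			    memory + (size_t)idx[i + 1] * BLOCK_WORDS,
			    sizeof buf[0]);
		acc += process(buf[i & 1]);
	}
	return acc;
}
```

The earlier the copy of block i+1 is issued relative to the work on block i, the more of its memory latency is hidden.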

In 2i, you may prefetch whenever you like (that is, at whatever point
you determine to be optimal, so that the prefetched data arrives in
time for its use yet isn't evicted from the global memory cache before
then), regardless of how much parallelism in ComputeBlock you exploit.

> For 2i, there's no way those 256 modulo operations would be run
> concurrently from one work-item.  And besides, to run them concurrently
> you'd need to provide storage for the results (you already have that
> addresses[] array) and then it's no better than copying this data e.g.
> from a larger array (holding all precomputed indices) in global memory.

You could probably pack more instances of 2i per GPU by reducing the
size of this addresses[] array, fetching indices from global memory in
smaller groups than are currently computed at a time.
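A sketch of that packing idea in plain C (the sizes and the use_index() helper are hypothetical): instead of holding all 256 precomputed indices per instance, stream them from global memory in smaller groups, shrinking the per-instance footprint.

```c
#include <stdint.h>

#define TOTAL_INDICES 256	/* indices precomputed per 2i segment */
#define GROUP 32		/* hypothetical smaller fetch granularity */

/* Placeholder for consuming one reference index in Argon2i. */
static uint64_t use_index(uint32_t ref)
{
	return ref * 2u + 1u;
}

/* Fetch indices from global memory in groups of GROUP, so the local
 * addresses[] staging buffer shrinks from TOTAL_INDICES entries to
 * GROUP entries, freeing memory for more concurrent instances. */
uint64_t chunked(const uint32_t *global_addresses)
{
	uint32_t addresses[GROUP];	/* smaller staging buffer */
	uint64_t acc = 0;

	for (int base = 0; base < TOTAL_INDICES; base += GROUP) {
		for (int i = 0; i < GROUP; i++)	/* fetch one group */
			addresses[i] = global_addresses[base + i];
		for (int i = 0; i < GROUP; i++)	/* then consume it */
			acc += use_index(addresses[i]);
	}
	return acc;
}
```
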

Alexander
