Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 19 Aug 2015 07:10:43 +0300
From: Solar Designer <>
Subject: Re: PHC: Argon2 on GPU

On Wed, Aug 19, 2015 at 04:47:32AM +0200, Agnieszka Bielec wrote:
> I have argon from github from
> branch master. should I work now on 'enhcance' branch?

You'll need to switch at some point, but since you already did so much
work on the original Argon2, I think it makes sense for you to stick
with it for a while longer.

Regarding optimizations, especially for GPU:

I think the modulo division operations are causing a lot of latency:

[ opencl]$ fgrep % argon2*.cl              reference_block_offset = (phi % r);              uint reference_block_index = addresses[0] % r;              uint reference_block_index = addresses[i] % r;

On GPU, this means having local and private memory tied up to code that
doesn't actually touch that memory but is instead doing this division
operation for many cycles in a row.  This is extremely wasteful.  I think
this might explain the unexpectedly poor performance on AMD GCN.  (Maybe
NVIDIA has relatively low latency integer division hardware.)

For Argon2i, you should be able to easily optimize this overhead out,
since all the indices are known in advance (they are the same each
time, by design, as required to avoid cache timing leaks).  You should
also be able to optimize out the hashing that produces those indices
(before the modulo division), but that's relatively minor (yet by all
means make this optimization as well if you do precompute the indices).

This means you will need some memory to store those indices in (1536 of
them for our current benchmarks? meaning something like 3 KB?), but this
memory can be shared between different concurrent hash computations.

For Argon2d, optimizing this is not easy, and the speedup potential is
lower.  Yet there could be ways:

Fast Division Algorithm with a Small Lookup Table

"This paper presents a new division algorithm, which requires two
multiplication operations and a single lookup in a small table.  The
division algorithm takes two steps.  The table lookup and the first
multiplication are processed concurrently in the first step, and the
second multiplication is executed in the next step.  This divider uses a
single multiplier and a lookup table with 2/sup m/(2m+1) bits to produce
2 m-bit results that are guaranteed correct to one ulp.  By using a
multiplier and a 12.5 KB lookup table, the basic algorithm generates a
24-bit result in two cycles."

It might also be practical to adapt methods normally used for
floating-point to producing precise integer results (there might need to
be some trial and error to confirm or come up with precise results, and
you'll need to implement this in code too):

Someone with a GPU working on a much simpler subset of the problem:

(just to illustrate the problem of slow integer division on GPUs).

Before you spend a lot of time on this, I suggest that you replace this
modulo operation with something simpler (and wrong), yet in some ways
similar, e.g.:

static inline uint32_t wrap(uint64_t x, uint32_t n)
	uint64_t a = (x + n) & (n - 1);
	uint64_t b = x & n;
	uint64_t c = (x << 1) & n;
	return ((a << 1) + b + c) >> 2;

(and its OpenCL equivalent, with proper data types).  Of course, this
revision of Argon2 won't match Argon2's normal test vectors, but you
should be able to see roughly what performance you could get if you
later optimize the division.


Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.