Date: Wed, 19 Aug 2015 19:39:09 +0300
From: Solar Designer <>
Subject: Re: PHC: Argon2 on GPU


As it has just been mentioned on the PHC list, you need to try
exploiting the parallelism inside ComputeBlock.  There are two groups of
8 BLAKE2 rounds.  In each of the groups, the 8 rounds may be computed in
parallel.  When your kernel is working on ulong2, I think it won't fully
exploit this parallelism, except that the parallelism may allow for
better pipelining within those ulong2 lanes (not stalling further
instructions since their input data is separate and thus is readily
available).

I think you may try working on ulong16 or ulong8 instead.  I expect
ulong8 to match the current GPU hardware best, but OTOH ulong16 makes
more parallelism apparent to the OpenCL compiler and allocates it to one
work-item.  So please try both and see which works best.
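
To illustrate what "working on wider vectors" means here, below is a
rough sketch in plain C rather than OpenCL.  It is only a model under
stated assumptions: wide_t is a hypothetical stand-in for an OpenCL
ulong8, the mixing step is the standard BLAKE2b G function with the
message words omitted (the round function in the actual Argon2 code may
differ), and all names are made up for this example.  Each lane carries
an independent G computation, so every statement can be applied to all
lanes at once, which is what element-wise vector operations would do.

```c
#include <stdint.h>

#define LANES 8  /* 8 models a ulong8-like vector; hypothetical name */

/* Plain-C stand-in for an OpenCL ulong8: one 64-bit word per lane,
 * each lane belonging to an independent G computation. */
typedef struct { uint64_t s[LANES]; } wide_t;

static uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}

/* Standard BLAKE2b G step, message words omitted (assumption). */
static void g_scalar(uint64_t *a, uint64_t *b, uint64_t *c, uint64_t *d)
{
    *a += *b; *d = rotr64(*d ^ *a, 32);
    *c += *d; *b = rotr64(*b ^ *c, 24);
    *a += *b; *d = rotr64(*d ^ *a, 16);
    *c += *d; *b = rotr64(*b ^ *c, 63);
}

/* The same G step, but each statement is applied to all lanes before
 * moving on, the way element-wise vector instructions would do it. */
static void g_wide(wide_t *a, wide_t *b, wide_t *c, wide_t *d)
{
    int i;
    for (i = 0; i < LANES; i++) a->s[i] += b->s[i];
    for (i = 0; i < LANES; i++) d->s[i] = rotr64(d->s[i] ^ a->s[i], 32);
    for (i = 0; i < LANES; i++) c->s[i] += d->s[i];
    for (i = 0; i < LANES; i++) b->s[i] = rotr64(b->s[i] ^ c->s[i], 24);
    for (i = 0; i < LANES; i++) a->s[i] += b->s[i];
    for (i = 0; i < LANES; i++) d->s[i] = rotr64(d->s[i] ^ a->s[i], 16);
    for (i = 0; i < LANES; i++) c->s[i] += d->s[i];
    for (i = 0; i < LANES; i++) b->s[i] = rotr64(b->s[i] ^ c->s[i], 63);
}

/* Returns 1 if the lock-step version gives the same result as running
 * the scalar G on each lane separately, 0 otherwise. */
int wide_matches_scalar(void)
{
    wide_t a, b, c, d, ra, rb, rc, rd;
    uint64_t x = 0x0123456789abcdefULL;
    int i;

    /* Arbitrary test data from a simple 64-bit LCG. */
    for (i = 0; i < LANES; i++) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL; a.s[i] = x;
        x = x * 6364136223846793005ULL + 1442695040888963407ULL; b.s[i] = x;
        x = x * 6364136223846793005ULL + 1442695040888963407ULL; c.s[i] = x;
        x = x * 6364136223846793005ULL + 1442695040888963407ULL; d.s[i] = x;
    }
    ra = a; rb = b; rc = c; rd = d;

    for (i = 0; i < LANES; i++)
        g_scalar(&ra.s[i], &rb.s[i], &rc.s[i], &rd.s[i]);
    g_wide(&a, &b, &c, &d);

    for (i = 0; i < LANES; i++)
        if (a.s[i] != ra.s[i] || b.s[i] != rb.s[i] ||
            c.s[i] != rc.s[i] || d.s[i] != rd.s[i])
            return 0;
    return 1;
}
```

In an actual OpenCL kernel the per-lane loops would disappear: with a,
b, c, d declared as ulong8 (or ulong16), "a += b" and the XORs are
already element-wise, and the rotates would have to be written for
whatever the target GPU handles best.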

With this, you'd launch groups of 8 or 4 BLAKE2 rounds on those wider
vectors, and then between the two groups of 8 in ComputeBlock you'd need
to shuffle vector elements (moving them between two vectors of ulong8 if
you use that type) instead of shuffling state[] elements like you do now
(and like the original Argon2 code did).
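
The shuffle between the two groups can be modelled as a transposition.
The sketch below is plain C, not OpenCL, and assumes the block is viewed
as an 8x8 matrix of 16-byte registers, with the first group of rounds
working on rows and the second on columns.  reg128 is a hypothetical
stand-in for ulong2, and the function names are made up for this
example.

```c
#include <stdint.h>

#define N 8  /* assumed: block viewed as an 8x8 matrix of 16-byte registers */

/* Plain-C stand-in for an OpenCL ulong2 (one 16-byte register). */
typedef struct { uint64_t lo, hi; } reg128;

/* Swap rows and columns in place, so that the second group of rounds
 * can run the same row-wise code on what used to be the columns. */
void transpose_regs(reg128 m[N][N])
{
    for (int r = 0; r < N; r++)
        for (int c = r + 1; c < N; c++) {
            reg128 t = m[r][c];
            m[r][c] = m[c][r];
            m[c][r] = t;
        }
}

/* Returns 1 if element (r, c) ends up at (c, r) and a second
 * transposition restores the original layout, 0 otherwise. */
int transpose_check(void)
{
    reg128 m[N][N];

    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            m[r][c].lo = (uint64_t)(r * N + c);
            m[r][c].hi = ~m[r][c].lo;
        }

    transpose_regs(m);
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            if (m[r][c].lo != (uint64_t)(c * N + r) ||
                m[r][c].hi != ~(uint64_t)(c * N + r))
                return 0;

    transpose_regs(m);
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            if (m[r][c].lo != (uint64_t)(r * N + c))
                return 0;
    return 1;
}
```

With real vector types the same data movement would be expressed as
element moves between vectors (for example with component swizzles or
OpenCL's shuffle2() built-in) rather than as loads and stores on an
array, and the exact element order depends on how the rounds are
assigned to lanes.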

The expectation is that a single kernel invocation will then make use of
more SIMD width (2x512-bit with ulong16 or 512-bit with ulong8, instead
of the current 128-bit), yet only the same amount of local and private
memory as it does now.  So you'd pack as many of these kernel instances
per GPU as you do now, but they will run faster (up to 8x or 4x faster,
respectively) since they'd process 8 or 4 BLAKE2 rounds in parallel
rather than sequentially.

Of course, once you've sped this up, other parts of code may become the
new bottlenecks.  In particular, the modulo operation may become even
more important to optimize.  You can, and should, quickly test
whether or not it is a bottleneck for a given kernel on a given GPU by
replacing it with that wrap() function I posted.
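
The wrap() function referred to above is not reproduced in this message.
Purely as a hypothetical stand-in for this kind of experiment, the plain
C sketch below puts a modulo-based index reduction next to a
multiply-and-shift one that needs no division.  Both names are made up
for this example, and the second function is a swapped-in technique, not
the wrap() that was posted.  It maps a 32-bit value into [0, n) but
yields different indices than "%", so the computed hashes would be
wrong; it would only be useful for comparing speeds.

```c
#include <stdint.h>

/* Baseline: reduce a 32-bit value to an index in [0, n) with modulo.
 * n must be non-zero. */
uint32_t index_mod(uint32_t x, uint32_t n)
{
    return x % n;
}

/* Hypothetical stand-in (NOT the wrap() from the earlier posting):
 * multiply-and-shift reduction.  The result is floor(x * n / 2^32),
 * which is always in [0, n), but differs from x % n in general. */
uint32_t index_mulhi(uint32_t x, uint32_t n)
{
    return (uint32_t)(((uint64_t)x * n) >> 32);
}
```

If swapping the modulo for something like this makes no measurable
difference in speed for a given kernel on a given GPU, then the modulo
is probably not the bottleneck there.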


