Date: Sat, 29 Aug 2015 08:29:53 +0300
From: Solar Designer <>
Subject: Re: PHC: Argon2 on GPU


On Sat, Aug 29, 2015 at 07:47:07AM +0300, Solar Designer wrote:
> However, you're right - we might be bumping into the memory bus in the
> current code anyway.  Maybe the coalescing isn't good enough.  Or maybe
> we get too many spills to global memory - we could review the generated
> code to see where they are and where they go to, or just get rid of them
> without such analysis if we can.
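
(On locating the spills: the compilers can report them directly.  Assuming
NVIDIA's toolchain and an illustrative file name:

```shell
# CUDA path: ptxas statistics list register use and spill stores/loads
nvcc -Xptxas -v -c kernel.cu

# OpenCL on NVIDIA: pass -cl-nv-verbose to clBuildProgram() and read
# the build log, which carries the same per-kernel ptxas statistics.
```

AMD's tooling exposes the equivalent numbers through its kernel analyzer.)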

Another related problem, and something to correct as well, is that this
kernel is huge and we're getting too many L1i cache misses.  The kernel
as a whole is ~100k PTX instructions, which (depending on target ISA)
may be ~800 KB.  This obviously exceeds the caches (even L2) by far;
however, it is unclear how large our most performance-critical loop is.
You could identify that loop's code size (including any functions it
calls, if not inlined), and/or try to reduce it (e.g., cut down on the
unrolling and inlining overall, or do it selectively).
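
To make "selectively" concrete, here's an OpenCL C sketch (illustrative
names and sizes, not our actual kernel) of capping the unroll factor on a
hot loop while keeping a helper out of line:

```c
#define SEGMENT_LENGTH 256   /* illustrative; not Argon2's real value */

/* Keep the helper as a real call so its body exists once, not at every
 * call site.  noinline is honored by Clang-based OpenCL compilers. */
__attribute__((noinline))
void mix_block(uint *state)
{
	for (int j = 0; j < 16; j++)
		state[j] = rotate(state[j] ^ state[(j + 1) & 15], 13u);
}

__kernel void hot_loop(__global uint *out)
{
	uint state[16];
	for (int j = 0; j < 16; j++)
		state[j] = (uint)j + (uint)get_global_id(0);

	/* Cap unrolling at 2x instead of letting the compiler unroll
	 * fully (NVIDIA's OpenCL honors the factor; AMD's accepts the
	 * pragma). */
	#pragma unroll 2
	for (int i = 0; i < SEGMENT_LENGTH; i++)
		mix_block(state);

	for (int j = 0; j < 16; j++)
		out[get_global_id(0) * 16 + j] = state[j];
}
```

The point isn't these particular attributes, but that unrolling and
inlining decisions can be made per loop and per function rather than
kernel-wide.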

In fact, even if the most performance critical loop fits in cache, or if
we make it fit eventually, the size of the full kernel also matters.

For comparison, the size of our md5crypt kernel is under 8k PTX
instructions total, and even at that size inlining of md5_digest() or
partially unrolling the main 1000-iteration loop isn't always optimal.
In my recent experiments, I ended up not inlining md5_digest(), but
unrolling the loop 2x on AMD and 4x on NVIDIA.  Greater unrolling slowed
things down on our HD 7990's GPUs, so large kernel size might be a
reason why your Argon2 kernels perform worse on the AMD GPUs.
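
As a generic illustration of the 2x unrolling meant here (plain C with a
made-up step function, not the actual md5crypt code): the unrolled form
halves the loop-control overhead but roughly doubles the loop's code size,
which is exactly the trade-off that turns against us once we overflow L1i.

```c
#include <stdint.h>

/* Illustrative stand-in for one iteration's work (not real md5crypt) */
static uint32_t step(uint32_t h, uint32_t i)
{
	return (h ^ i) * 2654435761u;
}

uint32_t loop_plain(uint32_t h)
{
	for (uint32_t i = 0; i < 1000; i++)
		h = step(h, i);
	return h;
}

/* Same computation, body unrolled 2x.  1000 is even, so no remainder
 * handling is needed. */
uint32_t loop_unrolled2(uint32_t h)
{
	for (uint32_t i = 0; i < 1000; i += 2) {
		h = step(h, i);
		h = step(h, i + 1);
	}
	return h;
}
```

Both functions compute the same result; only the code size and per-iteration
branch count differ.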

