Date: Sat, 29 Aug 2015 08:29:53 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

Agnieszka,

On Sat, Aug 29, 2015 at 07:47:07AM +0300, Solar Designer wrote:
> However, you're right - we might be bumping into the memory bus in the
> current code anyway. Maybe the coalescing isn't good enough. Or maybe
> we get too many spills to global memory - we could review the generated
> code to see where they are and where they go to, or just get rid of them
> without such analysis if we can.

Another related problem, and something to correct as well, is that this
kernel is huge and we're getting too many L1i cache misses. The kernel
as a whole is ~100k PTX instructions, which (depending on target ISA)
may be ~800 KB. This obviously exceeds the caches (even L2) by far;
however, it is unclear how large our most performance-critical loop is.
You could identify that loop's code size (including any functions it
calls, if not inlined), and/or try to reduce it (e.g., cut down on the
unrolling and inlining overall, or do it selectively).

In fact, even if the most performance-critical loop fits in cache, or
if we make it fit eventually, the size of the full kernel also matters.
For comparison, the size of our md5crypt kernel is under 8k PTX
instructions total, and even at that size, inlining md5_digest() or
partially unrolling the main 1000-iteration loop isn't always optimal.
In my recent experiments, I ended up not inlining md5_digest(), but
unrolling the loop 2x on AMD and 4x on NVIDIA. Greater unrolling slowed
things down on our HD 7990's GPUs, so large kernel size might be a
reason why your Argon2 kernels perform worse on the AMD GPUs.

Alexander
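For illustration only, selective inlining and partial unrolling of the
kind described above might look like the sketch below in CUDA C (the
function and kernel names are hypothetical, the digest body is elided,
and the actual kernels under discussion are OpenCL - this just shows
the compiler directives involved):

```cuda
#include <cuda_runtime.h>

/* __noinline__ keeps the compression function as a real call, so its
 * code appears once in the binary instead of once per call site /
 * unrolled iteration - trading a little call overhead for a kernel
 * small enough to stay in the instruction cache. */
__device__ __noinline__ void md5_digest(unsigned int *state,
                                        const unsigned int *block)
{
        /* ... MD5 compression function body (elided) ... */
}

__global__ void md5crypt_kernel(unsigned int *state)
{
        /* Partial unroll only: "#pragma unroll 4" asks the compiler to
         * unroll 4 iterations per pass (the 4x that worked on NVIDIA
         * in the experiments above; 2x was better on AMD).  Fully
         * unrolling all 1000 iterations would multiply the loop body
         * 1000x and thrash L1i. */
        #pragma unroll 4
        for (int i = 0; i < 1000; i++)
                md5_digest(state, state /* next block (elided) */);
}
```

Whether 2x, 4x, or no unrolling wins is target-specific, so in practice
the factor would be benchmarked per device rather than hard-coded.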