john-dev - Re: PHC: Argon2 on GPU

Message-ID: <20150829064848.GA30978@openwall.com>
Date: Sat, 29 Aug 2015 09:48:49 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

Agnieszka,

On Sat, Aug 29, 2015 at 08:29:53AM +0300, Solar Designer wrote:
> You could identify that loop's code size (including any functions it
> calls, if not inlined), and/or try to reduce it (e.g., cut down on the
> unrolling and inlining overall, or do it selectively).
> 
> In fact, even if the most performance critical loop fits in cache, or if
> we make it fit eventually, the size of the full kernel also matters.
> 
> For comparison, the size of our md5crypt kernel is under 8k PTX
> instructions total, and even at that size inlining of md5_digest() or
> partially unrolling the main 1000 iterations loop isn't always optimal.
> In my recent experiments, I ended up not inlining md5_digest(), but
> unrolling the loop 2x on AMD and 4x on NVIDIA.  Greater unrolling slowed
> things down on our HD 7990's GPUs, so large kernel size might be a
> reason why your Argon2 kernels perform worse on the AMD GPUs.

Per this recent discussion, AMD OpenCL currently does not support
leaving functions non-inlined (that is, it always inlines them):

https://community.amd.com/thread/170309

So I am puzzled as to why I appeared to see any performance difference
from including or omitting the "inline" keyword on md5_digest().  I'll
need to re-test this, preferably reviewing the generated code.  When
targeting NVIDIA, I am indeed getting the exact same PTX code regardless
of whether I include the inline keyword or not.

"realhet", who commented in that thread, wrote a GCN ISA assembler, so
he would know.  It's one of the tools we have listed at:

http://openwall.info/wiki/john/development/GPU-low-level

And it seems I was wrong about the 8k PTX instructions - that might have
been for another kernel or something.  Our md5crypt kernel is at around
4k PTX instructions currently.

However, function calls in OpenCL do seem to be supported on NVIDIA, as
seen from reviewing the PTX code for your Argon2 kernels.  You don't
have your functions explicitly marked "inline", but most are inlined
anyway - yet a few are not:

$ fgrep .func kernel.out
.func Initialize
.func blake2b_update(
.func blake2b_final(
.func blake2b(
.func Initialize(

$ fgrep -A1 call.uni kernel.out | head -8
        call.uni 
        blake2b_update, 
--
        call.uni 
        blake2b_update, 
--
        call.uni 
        blake2b_update, 
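
The two fgrep runs above can be combined into one pass.  Here is a
minimal Python sketch (not something we have in the tree) that lists the
functions left as real .func bodies and tallies how often each one is a
call target.  The regexes are my assumption about how compiler-generated
PTX lays out ".func" and "call" lines, and summarize_calls() is a
hypothetical helper name:

```python
import re
from collections import Counter

# Assumption: ".func" starts a line (optionally after a linkage directive)
# and may be followed by a parenthesized return parameter before the name.
FUNC_RE = re.compile(
    r"^[ \t]*(?:\.(?:visible|weak|extern)[ \t]+)*\.func\b"
    r"(?:\s*\([^)]*\))?\s*([A-Za-z_$][\w$]*)",
    re.M,
)

# Assumption: the call target follows "call" or "call.uni", possibly on the
# next line and possibly after a parenthesized return value list.
CALL_RE = re.compile(
    r"\bcall(?:\.uni)?\s+(?:\([^)]*\)\s*,\s*)?([A-Za-z_$][\w$]*)"
)


def summarize_calls(ptx):
    """Return (non-inlined function names, Counter of call targets)."""
    # A forward declaration and the definition both match, so deduplicate
    # while keeping first-seen order.
    declared = list(dict.fromkeys(FUNC_RE.findall(ptx)))
    calls = Counter(CALL_RE.findall(ptx))
    return declared, calls
```

To use it, read the PTX dump into a string, for example
summarize_calls(open("kernel.out").read()), and print both results.
Functions that show up as declared but with few call sites are the ones
the compiler chose not to inline.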

You may want to look into ways to keep more of the infrequently called
functions as actual calls rather than having them inlined.  Ideally,
there would be a keyword to prevent inlining, but I am not aware of one.
Maybe there's a compiler switch, and then explicit "inline" would start
to matter.  Please look into this.

As to loop unrolling, there's "#pragma unroll N", and I think that
specifying N=1, as in "#pragma unroll 1", prevents unrolling.  As an
experiment, I tried adding "#pragma unroll 1" before all loops in
argon2d_kernel.cl, and the PTX instruction count went down, but not by
a lot.  With the uses of the BLAKE2_ROUND_NO_MSG_V macro also put into
loops:

#pragma unroll 1
        for (i = 0; i < 64; i += 8) {
                BLAKE2_ROUND_NO_MSG_V(state[i], state[i+1],
                    state[i+2], state[i+3],
                    state[i+4], state[i+5],
                    state[i+6], state[i+7]);
        }

#pragma unroll 1
        for (i = 0; i < 8; i++) {
                BLAKE2_ROUND_NO_MSG_V(state[i], state[i+8],
                    state[i+16], state[i+24],
                    state[i+32], state[i+40],
                    state[i+48], state[i+56]);
        }

I got the PTX instruction count down from ~100k to ~80k.  No speedup,
though.  (But not much slowdown either.)

We need to figure out why it doesn't get lower.  ~80k is still a lot.
Are there many inlined functions and unrolled loops in the .h files?

Maybe some pre- and/or post-processing should be kept on host to make
the kernel simpler and smaller.  This is bad in terms of Amdahl's law,
but it might help us figure things out initially.

BTW, it would be helpful to have some Perl scripts or such to analyze
the PTX code.  Even counting the instructions is a bit tricky since many
of the lines are not instructions.  "sort -u ... | wc -l" gives an
estimate, and this is what I have been using.  It works because a new
virtual register number is allocated each time, so even if the same
instruction is used multiple times, each use appears as a different
line, which is what we want for counting.
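
As a starting point for such a script, here is a rough sketch in Python
rather than Perl.  Unlike the "sort -u" estimate, it tries to count every
instruction statement directly.  It is a heuristic only: I am assuming
the usual compiler-generated layout, where directives start with ".",
labels and block braces sit on their own lines, and each instruction
ends with ";".  The function names are hypothetical:

```python
import re
from collections import Counter

# Assumption: a label such as "BB0_1:" occupies a line by itself.
LABEL_RE = re.compile(r"^[$\w]+:$")
# Lines holding only block or parameter-list punctuation carry no opcode.
PUNCT_ONLY = {"{", "}", "(", ")"}


def ptx_instructions(ptx):
    """Yield instruction statements, skipping directives, labels, comments."""
    kept = []
    for line in ptx.splitlines():
        line = line.split("//", 1)[0].strip()  # drop trailing comments
        if (not line or line.startswith(".") or line in PUNCT_ONLY
                or LABEL_RE.match(line)):
            continue
        kept.append(line)
    # Multi-line statements (such as call sequences) are rejoined here,
    # then split at the terminating semicolons.
    for stmt in " ".join(kept).split(";"):
        stmt = stmt.strip()
        if stmt and (stmt[0].isalpha() or stmt[0] == "@"):
            yield stmt


def count_instructions(ptx):
    """Return (total instruction count, Counter keyed by opcode)."""
    opcodes = Counter()
    for stmt in ptx_instructions(ptx):
        tokens = stmt.split()
        # Skip a guard predicate such as "@%p1" to reach the opcode.
        if tokens[0].startswith("@") and len(tokens) > 1:
            tokens = tokens[1:]
        opcodes[tokens[0]] += 1
    return sum(opcodes.values()), opcodes
```

The per-opcode counter is there so that comparing two builds of the same
kernel (say, with and without "#pragma unroll 1") shows which kinds of
instructions account for the difference, not just the totals.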

Alexander