john-dev - Re: PHC: Argon2 on GPU

Message-ID: <20150823073900.GA16465@openwall.com>
Date: Sun, 23 Aug 2015 10:39:00 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Sun, Aug 23, 2015 at 10:21:53AM +0300, Solar Designer wrote:
> For bit rotates, we appear to be getting things like this:
> 
>         {
>         .reg .b64 %lhs;
>         .reg .b64 %rhs;
>         shl.b64         %lhs, %rd12449, 1;
>         shr.b64         %rhs, %rd12449, 63;
>         add.u64         %rd12450, %lhs, %rhs;
>         }
> 
> This probably translates to at least 6 native instructions.  There ought
> to be more efficient ways, such as involving bfe or/and bfi instructions:
> 
> http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#integer-arithmetic-instructions-bfe

Actually, what we need for newer NVIDIA GPUs (ours are just new enough)
is funnel shift:

http://stackoverflow.com/questions/12767113/funnel-shift-what-is-it
http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf

AMD has had the same thing, under the name bitalign, for ages, and we
can simply use rotate() there:

https://community.amd.com/thread/158497

> For starters, we should try OpenCL's rotate() and see if it translates
> to decent PTX code these days.  Right now, opencl_blake2.h: rotr64()
> does not yet use rotate(), while opencl_blake2-round-no-msg.h does.  We
> should switch both to use the same approach, at least to make reviewing
> the generated PTX code easier.

Actually, now that I specifically grep the Argon2 PTX code for shf, I
see that rotate() gives it to us:

        shf.r.wrap.b32  %r287, %r286, %r285, 24;
        shf.r.wrap.b32  %r288, %r285, %r286, 24;

and so on for other BLAKE2 rotate counts.
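
To illustrate why a pair of 32-bit shf instructions can cover one 64-bit
rotate, here is a rough model in plain C (not OpenCL or PTX). The names
funnel_shr32, rotr64_ref and rotr64_via_funnel are made up for this
sketch. It assumes that shf.r.wrap.b32 d, a, b, c yields the low 32 bits
of the 64-bit value b:a shifted right by c mod 32, which is how I read
the PTX documentation, and it only handles rotate counts 1 to 31 (counts
of 32 and up would also need the halves swapped):

```c
#include <stdint.h>

/* Model of PTX shf.r.wrap.b32: low 32 bits of (hi:lo) >> (n mod 32). */
static uint32_t funnel_shr32(uint32_t lo, uint32_t hi, unsigned n)
{
    uint64_t pair = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(pair >> (n & 31));
}

/* Reference 64-bit rotate right, valid for 0 < n < 64. */
static uint64_t rotr64_ref(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}

/* 64-bit rotate right built from two funnel shifts, valid for 0 < n < 32. */
static uint64_t rotr64_via_funnel(uint64_t x, unsigned n)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t new_lo = funnel_shr32(lo, hi, n); /* bits n..n+31 of x */
    uint32_t new_hi = funnel_shr32(hi, lo, n); /* same, halves swapped */
    return ((uint64_t)new_hi << 32) | new_lo;
}
```

With x split as hi:lo, the two calls differ only in operand order, which
matches the swapped %r285/%r286 operands in the PTX above.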

So those shl/shr/add sequences must have come from opencl_blake2.h:
rotr64().  This is what I mean by "make reviewing the generated PTX code
easier" - if we used rotate() everywhere, I wouldn't have been misled
into thinking that suboptimal code was generated.
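
For reference, the shl/shr/add sequence quoted above is itself just a
rotate: left by 1, which is the same as right by 63 (one of BLAKE2b's
rotate counts).  A small plain C sketch of that equivalence, with
function names invented here and not taken from opencl_blake2.h:

```c
#include <stdint.h>

/* Mirrors the quoted PTX: shl by 1, shr by 63, then add.
 * The two partial results share no set bits, so add equals or. */
static uint64_t rot_shift_add(uint64_t x)
{
    uint64_t lhs = x << 1;
    uint64_t rhs = x >> 63;
    return lhs + rhs;
}

/* The same operation written as a rotate right by 63. */
static uint64_t rotr64_by_63(uint64_t x)
{
    return (x >> 63) | (x << 1);
}
```

So the result is correct either way; the difference is only in how many
instructions the compiler ends up emitting for it.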

Alexander
