Date: Sun, 23 Aug 2015 10:39:00 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Sun, Aug 23, 2015 at 10:21:53AM +0300, Solar Designer wrote:
> For bit rotates, we appear to be getting things like this:
> 
>         {
>         .reg .b64 %lhs;
>         .reg .b64 %rhs;
>         shl.b64         %lhs, %rd12449, 1;
>         shr.b64         %rhs, %rd12449, 63;
>         add.u64         %rd12450, %lhs, %rhs;
>         }
> 
> This probably translates to at least 6 native instructions.  There ought
> to be more efficient ways, such as involving bfe or/and bfi instructions:
> 
> http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#integer-arithmetic-instructions-bfe

Actually, what we need for newer NVIDIA GPUs (ours are just new enough)
is the funnel shift instruction (shf):

http://stackoverflow.com/questions/12767113/funnel-shift-what-is-it
http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf

AMD has had the same thing under the name bitalign for ages, and we can
simply use rotate() there:

https://community.amd.com/thread/158497

> For starters, we should try OpenCL's rotate() and see if it translates
> to decent PTX code these days.  Right now, opencl_blake2.h: rotr64()
> does not yet use rotate(), while opencl_blake2-round-no-msg.h does.  We
> should switch both to use the same approach, at least to make reviewing
> the generated PTX code easier.

Actually, now that I specifically grep the Argon2 PTX code for shf, I
see that rotate() gives it to us:

        shf.r.wrap.b32  %r287, %r286, %r285, 24;
        shf.r.wrap.b32  %r288, %r285, %r286, 24;

and so on for other BLAKE2 rotate counts.

So those shl/shr/add sequences must have come from opencl_blake2.h:
rotr64().  This is what I meant by "make reviewing the generated PTX code
easier": if we used rotate() everywhere, I wouldn't have been misled into
thinking that suboptimal code was being generated.
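
To illustrate why a 64-bit rotate shows up as a pair of shf.r.wrap.b32
instructions with swapped source operands, here is a minimal host-side C
sketch.  It is not the code from opencl_blake2.h; the function names
(funnel_shift_right_wrap, rotr64_shift_or, rotr64_funnel) are hypothetical,
and the emulation of shf.r.wrap.b32 follows my reading of the PTX
documentation linked above (operand a is the low word, operand b the high
word, and the shift count is taken modulo 32).

```c
#include <assert.h>
#include <stdint.h>

/* Emulates PTX "shf.r.wrap.b32 d, lo, hi, n": shift the 64-bit value
 * hi:lo right by (n mod 32) and keep the low 32 bits of the result. */
static uint32_t funnel_shift_right_wrap(uint32_t lo, uint32_t hi, uint32_t n)
{
    uint64_t combined = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(combined >> (n & 31));
}

/* Generic rotate right built from two shifts and an OR, roughly what the
 * shl/shr/add PTX sequence computes (add and OR agree here because the
 * two shifted parts have no overlapping bits). */
static uint64_t rotr64_shift_or(uint64_t x, unsigned n)
{
    n &= 63;
    return (x >> n) | (x << ((64 - n) & 63));
}

/* Rotate right by n < 32 using two 32-bit funnel shifts, one per half of
 * the result.  The second call swaps the source words, just like the two
 * shf instructions in the PTX output. */
static uint64_t rotr64_funnel(uint64_t x, unsigned n)
{
    assert(n < 32); /* larger counts need a word swap or shf.l instead */
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t new_lo = funnel_shift_right_wrap(lo, hi, n);
    uint32_t new_hi = funnel_shift_right_wrap(hi, lo, n);
    return ((uint64_t)new_hi << 32) | new_lo;
}
```

For example, rotr64_funnel(0x0123456789abcdefULL, 24) and
rotr64_shift_or(0x0123456789abcdefULL, 24) both yield 0xabcdef0123456789.
The sketch only covers counts below 32; a count of 32 is just a swap of
the two halves, and a count of 63 would presumably be handled as a rotate
left by 1.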

Alexander
