Date: Sun, 23 Aug 2015 10:39:00 +0300
From: Solar Designer
Subject: Re: PHC: Argon2 on GPU

On Sun, Aug 23, 2015 at 10:21:53AM +0300, Solar Designer wrote:
> For bit rotates, we appear to be getting things like this:
>         {
>         .reg .b64 %lhs;
>         .reg .b64 %rhs;
>         shl.b64         %lhs, %rd12449, 1;
>         shr.b64         %rhs, %rd12449, 63;
>         add.u64         %rd12450, %lhs, %rhs;
>         }
> This probably translates to at least 6 native instructions.  There ought
> to be more efficient ways, such as involving bfe or/and bfi instructions:

Actually, what we need for newer NVIDIAs (ours are just new enough) is
funnel shift:

AMD had the same thing under the name of bitalign for ages, and we can
simply use rotate() there:

> For starters, we should try OpenCL's rotate() and see if it translates
> to decent PTX code these days.  Right now, opencl_blake2.h: rotr64()
> does not yet use rotate(), while opencl_blake2-round-no-msg.h does.  We
> should switch both to use the same approach, at least to make reviewing
> the generated PTX code easier.
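
To illustrate the switch suggested in the quoted text, here is a rough sketch in plain C, not the actual opencl_blake2.h code.  OpenCL's rotate() rotates left, so a right rotate by n can be written as a left rotate by 64 - n.  The names rotl64_model(), rotr64_shifts(), rotr64_rotate() and check_rotr64() are made up for this example; rotl64_model() only stands in for rotate() on ulong.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for OpenCL rotate() on ulong: rotate LEFT by n mod 64. */
static uint64_t rotl64_model(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Shift-based right rotate, the shape that shows up as shl/shr/add in PTX.
 * Only valid for 0 < n < 64. */
static uint64_t rotr64_shifts(uint64_t x, unsigned n)
{
    return (x >> n) + (x << (64 - n));
}

/* Right rotate expressed through the left-rotate primitive. */
static uint64_t rotr64_rotate(uint64_t x, unsigned n)
{
    return rotl64_model(x, 64 - n);
}

/* Both forms should agree for the BLAKE2b rotate counts. */
void check_rotr64(void)
{
    static const unsigned counts[] = { 32, 24, 16, 63 };
    uint64_t x = 0x0123456789ABCDEFULL;
    unsigned i;

    for (i = 0; i < sizeof(counts) / sizeof(counts[0]); i++)
        assert(rotr64_shifts(x, counts[i]) == rotr64_rotate(x, counts[i]));
}
```

In an OpenCL kernel the equivalent would presumably be something like rotate(x, (ulong)(64 - n)), leaving it to the compiler to pick the native instruction.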

Actually, now that I specifically grep the Argon2 PTX code for shf, I
see that rotate() gives it to us:

        shf.r.wrap.b32  %r287, %r286, %r285, 24;
        shf.r.wrap.b32  %r288, %r285, %r286, 24;

and so on for other BLAKE2 rotate counts.
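
For reference, a plain C sketch of why two 32-bit funnel shifts add up to one 64-bit rotate.  It models shf.r.wrap.b32 as I read the PTX ISA description (a is the low word, b is the high word, and the result is the low 32 bits of the 64-bit value shifted right by c mod 32).  The function names, and the assumption that the two registers above hold the two halves of one 64-bit word, are mine.

```c
#include <assert.h>
#include <stdint.h>

/* Model of PTX "shf.r.wrap.b32 d, a, b, c" (my reading of the PTX ISA docs):
 * d = low 32 bits of ((b:a) >> (c & 31)), with b as the high word. */
static uint32_t shf_r_wrap_b32(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t n = c & 31;
    return n ? (b << (32 - n)) | (a >> n) : a;
}

/* 64-bit rotate right built from two funnel shifts with swapped operands.
 * This sketch only covers 0 <= n < 32; counts of 32 and above would also
 * need the halves swapped first. */
static uint64_t rotr64_via_shf(uint64_t x, uint32_t n)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t new_lo = shf_r_wrap_b32(lo, hi, n); /* like the first shf line */
    uint32_t new_hi = shf_r_wrap_b32(hi, lo, n); /* like the second shf line */

    return ((uint64_t)new_hi << 32) | new_lo;
}

/* Sanity check for the rotate count 24 seen in the PTX above. */
void check_rotr64_via_shf(void)
{
    assert(rotr64_via_shf(0x0123456789ABCDEFULL, 24) == 0xABCDEF0123456789ULL);
}
```

So each 64-bit rotate costs two shf instructions, with the same two source registers in opposite order, which matches the pair of lines above.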

So those shl/shr/add sequences must have come from opencl_blake2.h:
rotr64().  This is what I mean by "make reviewing the generated PTX code
easier" - if we used rotate() everywhere, I wouldn't be misled into
thinking that suboptimal code was generated.

