Date: Sun, 23 Aug 2015 11:02:24 +0300
From: Solar Designer <>
Subject: Re: PHC: Argon2 on GPU

On Sun, Aug 23, 2015 at 10:21:53AM +0300, Solar Designer wrote:
> For starters, we should try OpenCL's rotate() and see if it translates
> to decent PTX code these days.  Right now, opencl_blake2.h: rotr64()
> does not yet use rotate(), while opencl_blake2-round-no-msg.h does.  We
> should switch both to use the same approach, at least to make reviewing
> the generated PTX code easier.

OpenCL's rotate() rotates left, so the body of rotr64() should be:

	return rotate(w, (ulong)(64 - c));

(or we can turn it into a macro, so as not to rely on inlining).
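
As a host-side sanity check of that identity, here is a minimal C sketch.
It assumes rotate() on ulong is a left rotation with the count reduced
modulo 64.  rotl64_model(), ROTR64() and rotr64_ref() are hypothetical
names for illustration only, not identifiers from opencl_blake2.h.

```c
#include <stdint.h>

/* Hypothetical host-side model of OpenCL rotate() on ulong:
 * a left rotation with the count reduced modulo 64 (assumption). */
static inline uint64_t rotl64_model(uint64_t w, uint64_t c)
{
	c &= 63;
	return (w << c) | (w >> ((64 - c) & 63));
}

/* Macro form of rotr64(): rotating right by c is rotating left by 64 - c.
 * In OpenCL C this would expand straight to the built-in rotate(), so
 * nothing depends on a rotr64() function being inlined. */
#define ROTR64(w, c) rotl64_model((w), (uint64_t)(64 - (c)))

/* Reference right rotation written with plain shifts, for comparison. */
static inline uint64_t rotr64_ref(uint64_t w, unsigned int c)
{
	c &= 63;
	return (w >> c) | (w << ((64 - c) & 63));
}
```

With this model, ROTR64(w, c) and rotr64_ref(w, c) should agree for the
BLAKE2b rotation counts 32, 24, 16 and 63.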

Unfortunately, when we're dealing with 64-bit types, the generated PTX
code includes extra mov's:

        .reg .b32 %dummy;
        mov.b64         {%r15,%dummy}, %rd82;
        .reg .b32 %dummy;
        mov.b64         {%dummy,%r16}, %rd82;
        shf.r.wrap.b32  %r17, %r16, %r15, 24;
        shf.r.wrap.b32  %r18, %r15, %r16, 24;

These merely extract the 32-bit halves that the shf instructions need.
In the final ISA code, the mov's should be gone, with the proper
registers substituted right into the shf instructions.  However, I am
not sure this is what actually happens (it depends on how good the
translator from PTX to native ISA is).
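
To make the PTX above easier to follow, here is a rough host-side C model
of it.  It assumes the documented shf.r.wrap.b32 semantics (operand b is
the high word, operand a the low word, shift count taken modulo 32).  The
excerpt does not show how the two results are recombined, so the sketch
assumes %r17 ends up as the high half and %r18 as the low half, which is
what a rotate right by 24 requires.  The function names are hypothetical.

```c
#include <stdint.h>

/* Model of PTX "shf.r.wrap.b32 d, a, b, c": shift the 64-bit value (b:a),
 * with b as the high word, right by c mod 32 and keep the low 32 bits. */
static inline uint32_t shf_r_wrap(uint32_t a, uint32_t b, uint32_t c)
{
	uint64_t ba = ((uint64_t)b << 32) | a;
	return (uint32_t)(ba >> (c & 31));
}

/* Mirrors the PTX sequence for a 64-bit rotate right by 24. */
static inline uint64_t rotr64_24_via_shf(uint64_t rd82)
{
	uint32_t r15 = (uint32_t)rd82;           /* mov.b64 {%r15,%dummy}, %rd82 */
	uint32_t r16 = (uint32_t)(rd82 >> 32);   /* mov.b64 {%dummy,%r16}, %rd82 */
	uint32_t r17 = shf_r_wrap(r16, r15, 24); /* assumed high half of result */
	uint32_t r18 = shf_r_wrap(r15, r16, 24); /* assumed low half of result */
	return ((uint64_t)r17 << 32) | r18;
}
```

In this model the two mov's cost nothing, since they only name the halves
of %rd82.  Whether the real PTX to ISA translator treats them the same
way is exactly the open question.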

I think this also serves to illustrate why working with 32-bit values or
vector elements at the OpenCL source level is a safer bet... although
then we'd need to find and use the right intrinsics for funnel shift in
OpenCL.  AMD has it as amd_bitalign(), but I don't know if NVIDIA has an
equivalent now (maybe the same funnel shift intrinsic names as they use
in CUDA?)
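
For illustration, a 64-bit rotate right on separate 32-bit halves could
look roughly like the C sketch below.  bitalign_model() is a hypothetical
stand-in for a funnel shift intrinsic, assumed to return the low 32 bits
of (hi:lo) >> (s mod 32).  I believe that matches amd_bitalign(), but
treat it as an assumption.  The u32x2 type and the function names are
made up for this example.

```c
#include <stdint.h>

/* A 64-bit word kept as two 32-bit halves, in the spirit of an OpenCL uint2. */
typedef struct { uint32_t lo, hi; } u32x2;

/* Hypothetical model of a bitalign-style funnel shift (assumed semantics:
 * low 32 bits of (hi:lo) >> (s mod 32)). */
static inline uint32_t bitalign_model(uint32_t hi, uint32_t lo, uint32_t s)
{
	return (uint32_t)((((uint64_t)hi << 32) | lo) >> (s & 31));
}

/* Rotate right by c (0..63) using one funnel shift per output half. */
static inline u32x2 rotr64_halves(u32x2 w, unsigned int c)
{
	u32x2 r;

	if (c & 32) { /* the "32" part of the count is just a swap of halves */
		uint32_t t = w.lo;
		w.lo = w.hi;
		w.hi = t;
	}
	r.lo = bitalign_model(w.hi, w.lo, c); /* helper masks c to c mod 32 */
	r.hi = bitalign_model(w.lo, w.hi, c);
	return r;
}
```

For the BLAKE2b counts this gives a plain swap for 32, one funnel shift
per half for 24 and 16, and a swap plus a shift by 31 for 63, with no
64-bit register ever being split by the compiler.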

