john-dev - Re: PHC: yescrypt on GPU

Message-ID: <CAKGDhHXN__sipQuMY9TZ86dOH2utVEFtX71SVFDsh2ZofOfO+A@mail.gmail.com>
Date: Sat, 25 Jul 2015 18:17:06 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: yescrypt on GPU

2015-07-25 14:22 GMT+02:00 Solar Designer <solar@...nwall.com>:
> On Sat, Jul 25, 2015 at 12:08:29AM +0200, Agnieszka Bielec wrote:
>> I had in my code
>>
>> for()
>> {
>>      copy to private
>>      some operations on private
>>      copy to global
>> }
>>
>> I changed this code to
>>
>> memset(this private array,0,size of private array)//because I noticed,
>> while working on parallel, that a kernel can slow down after using
>> an uninitialized array
>> for()
>> {
>>      some operations on private
>> }
>>
>> and ran it with --skip-self-test, and the speed was the same, even
>> without this memset.
>
> OK.  Why would you incur any accesses to an uninitialized array, though?
>
> yescrypt fully initializes its S-boxes with non-zero data before the
> very first invocation of pwxform, which uses them.

This was only an experiment; I wanted to know the speed without copying
data from global memory.

>
>> This is a big 8 KB array, but in another place I copy 64 B, and this
>> also decreases speed even when the 8 KB copy is turned off.
>
> Moving a 64 bytes array from global to private decreases speed?  That's
> surprising if so.  Is this 64 bytes array frequently accessed?  Which
> one is it?  The current sub-block buffer in pwxform?  You should keep it
> in private, I think.

It is in pwxform; those 64 bytes are used only once.

>
> The S-boxes should likely be in local on AMD and in private on NVIDIA,
> although you do in fact need to test with them in global as well - in
> fact, ideally you'd have this tri-state choice auto-tuned at runtime,
> since the optimal one will likely vary across GPUs (even similar ones).
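
A minimal sketch of the tri-state choice described above, in plain C on
the host side. It assumes the kernel has already been timed once per
S-box placement (global, local, private) and only picks the fastest.
The enum, the function names and the -D macro names are hypothetical
and are not taken from the yescrypt or John the Ripper sources.

```c
/* Hypothetical names for the three candidate S-box placements. */
enum sbox_mem { SBOX_GLOBAL, SBOX_LOCAL, SBOX_PRIVATE, SBOX_MEM_COUNT };

/*
 * seconds[i] is the measured kernel time with the S-boxes in placement i
 * (how the kernel is timed is left to the host code).
 * Returns the placement with the lowest time; ties go to the earlier entry.
 */
enum sbox_mem pick_sbox_placement(const double seconds[SBOX_MEM_COUNT])
{
    enum sbox_mem best = SBOX_GLOBAL;
    for (int m = 1; m < SBOX_MEM_COUNT; m++)
        if (seconds[m] < seconds[best])
            best = (enum sbox_mem)m;
    return best;
}

/*
 * Hypothetical build option passed to the kernel compiler so that the
 * kernel source can select the address space of the S-boxes with #ifdef.
 */
const char *sbox_build_option(enum sbox_mem m)
{
    switch (m) {
    case SBOX_LOCAL:   return "-DSBOX_MEM_LOCAL";
    case SBOX_PRIVATE: return "-DSBOX_MEM_PRIVATE";
    default:           return "-DSBOX_MEM_GLOBAL";
    }
}
```

The host would build and time the kernel once per option at startup,
pass the three timings to pick_sbox_placement(), and then rebuild with
the winning option for the real run.
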
>
> yescrypt pwxform S-boxes are similar to bcrypt's, but are twice as large
> (8 KB rather than bcrypt's 4 KB), use wider lookups (128-bit rather than
> bcrypt's 32-bit), and there are only 2 of them (bcrypt has 4), which
> reduces parallelism, but OTOH 4 such pwxform lanes are computed in
> parallel, which increases parallelism.  This is with yescrypt's current
> default pwxform settings.  We previously found that it's more optimal to
> keep bcrypt's S-boxes in local or private (depending on GPU) rather than
> in global, but the differences of pwxform (with particular settings) vs.
> bcrypt might change this on some GPUs.  Also, yescrypt's other uses of
> global memory (for its large V array) might make use of global memory for
> the S-boxes as well more optimal, since those other accesses to global
> memory might limit the overall latency reduction possible with moving the
> S-boxes to local or private memory, thereby skewing the balance towards
> keeping them in global.
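
The size figures quoted above can be checked with a little arithmetic.
This is only a sketch: the helper name is hypothetical, bcrypt's 256
entries per S-box is the standard Blowfish layout, and the 256 entries
per yescrypt S-box is derived here from the quoted numbers (8 KB total,
2 S-boxes, 128-bit lookups) rather than stated in the message.

```c
#include <stddef.h>

/* Total bytes used by `count` S-boxes with `entries` entries each,
 * where every lookup returns `lookup_bits` bits. */
size_t sbox_bytes(size_t count, size_t entries, size_t lookup_bits)
{
    return count * entries * (lookup_bits / 8);
}

/* bcrypt:   sbox_bytes(4, 256, 32)  = 4 * 256 * 4  = 4096 bytes (4 KB)
 * yescrypt: sbox_bytes(2, 256, 128) = 2 * 256 * 16 = 8192 bytes (8 KB) */
```

So the per-instance footprint that has to fit in local or private
memory doubles relative to bcrypt, which is one reason the best
placement may differ between the two on a given GPU.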

Today I was removing smaller and smaller parts of the code to track
down the slow part, and it is indeed in pwxform. When I have
            x0 += p0[0];
            x0 ^= p1[0];
[...]
            x1 += p0[1];
            x1 ^= p1[1];

commented out, the speed is the same with and without the copying
(but I now have another version of pwxform that uses vectors).