Date: Sat, 25 Jul 2015 18:17:06 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: yescrypt on GPU

2015-07-25 14:22 GMT+02:00 Solar Designer <solar@...nwall.com>:
> On Sat, Jul 25, 2015 at 12:08:29AM +0200, Agnieszka Bielec wrote:
>> I had in my code
>>
>> for()
>> {
>>      copy to private
>>      some operations on private
>>      copy to global
>> }
>>
>> I changed this code to
>>
>> memset(this private array, 0, size of private array) // because I noticed,
>> when I was working on parallel, that a kernel can slow down after using
>> an uninitialized array
>> for()
>> {
>>      some operations on private
>> }
>>
>> and ran it with --skip-self-test, and the speed was the same, even without
>> this memset.
>
> OK.  Why would you incur any accesses to an uninitialized array, though?
>
> yescrypt fully initializes its S-boxes with non-zero data before the
> very first invocation of pwxform, which uses them.

This was only an experiment: I wanted to measure the speed without copying
data from global memory.
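
For reference, the two loop shapes being compared can be sketched in plain
host-side C as below. This is only an illustration of the pattern, not the
actual kernel: transform(), loop_with_copy(), loop_without_copy() and
PRIV_WORDS are hypothetical names, and the "+= 1" body is a placeholder for
the real work. The second variant produces different results in general,
which is why it is only useful for timing with --skip-self-test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PRIV_WORDS 2048 /* 8 KB of 32-bit words, the size discussed here */

/* Hypothetical stand-in for "some operations on private". */
static void transform(uint32_t *priv)
{
    for (size_t i = 0; i < PRIV_WORDS; i++)
        priv[i] += 1;
}

/* Original shape: copy in, work, copy out on every iteration. */
static uint32_t loop_with_copy(uint32_t *global_buf, int iterations)
{
    uint32_t priv[PRIV_WORDS];

    assert(global_buf != NULL && iterations > 0);
    for (int i = 0; i < iterations; i++) {
        memcpy(priv, global_buf, sizeof(priv));  /* copy to private */
        transform(priv);                         /* work on private */
        memcpy(global_buf, priv, sizeof(priv));  /* copy to global */
    }
    return priv[0];
}

/* Experimental shape: no copies, private array zeroed once up front. */
static uint32_t loop_without_copy(int iterations)
{
    uint32_t priv[PRIV_WORDS];

    assert(iterations > 0);
    memset(priv, 0, sizeof(priv));
    for (int i = 0; i < iterations; i++)
        transform(priv);
    return priv[0]; /* returned so the work is not optimized away */
}
```

Timing both variants over the same iteration count isolates the cost of the
copies from the cost of the work itself.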

>
>> This is a big 8 KB array, but in another place I copy 64 B, and this
>> also decreases speed even when the 8 KB copy is turned off.
>
> Moving a 64 bytes array from global to private decreases speed?  That's
> surprising if so.  Is this 64 bytes array frequently accessed?  Which
> one is it?  The current sub-block buffer in pwxform?  You should keep it
> in private, I think.

It is in pwxform; those 64 bytes are used only once.

>
> The S-boxes should likely be in local on AMD and in private on NVIDIA,
> although you do in fact need to test with them in global as well - in
> fact, ideally you'd have this tri-state choice auto-tuned at runtime,
> since the optimal one will likely vary across GPUs (even similar ones).
>
> yescrypt pwxform S-boxes are similar to bcrypt's, but are twice as large
> (8 KB rather than bcrypt's 4 KB), use wider lookups (128-bit rather than
> bcrypt's 32-bit), and there are only 2 of them (bcrypt has 4), which
> reduces parallelism, but OTOH 4 such pwxform lanes are computed in
> parallel, which increases parallelism.  This is with yescrypt's current
> default pwxform settings.  We previously found that it's more optimal to
> keep bcrypt's S-boxes in local or private (depending on GPU) rather than
> in global, but the differences of pwxform (with particular settings) vs.
> bcrypt might change this on some GPUs.  Also, yescrypt's other uses of
> global memory (for its large V array) might make use of global memory for
> the S-boxes as well more optimal, since those other accesses to global
> memory might limit the overall latency reduction possible with moving the
> S-boxes to local or private memory, thereby skewing the balance towards
> keeping them in global.
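
The tri-state choice described above could be picked on the host roughly as
follows. This is a minimal sketch under stated assumptions: the enum, the
function name pick_sbox_placement() and the convention that a negative time
means "this variant could not run on this device" are all hypothetical, and
the timings are assumed to come from running each kernel variant on the
same test workload.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for where the kernel keeps the pwxform S-boxes. */
enum sbox_placement {
    SBOX_GLOBAL = 0,
    SBOX_LOCAL = 1,
    SBOX_PRIVATE = 2,
    SBOX_NPLACEMENTS = 3
};

/*
 * seconds[p] is the measured run time of the kernel variant for placement p,
 * or a negative value if that variant could not be built or run.
 * Returns the fastest usable placement; falls back to SBOX_GLOBAL if none
 * of the variants produced a valid timing.
 */
static enum sbox_placement
pick_sbox_placement(const double seconds[SBOX_NPLACEMENTS])
{
    enum sbox_placement best = SBOX_GLOBAL;
    double best_time = -1.0;

    assert(seconds != NULL);
    for (int p = 0; p < SBOX_NPLACEMENTS; p++) {
        if (seconds[p] < 0.0)
            continue; /* variant unavailable on this device */
        if (best_time < 0.0 || seconds[p] < best_time) {
            best_time = seconds[p];
            best = (enum sbox_placement)p;
        }
    }
    return best;
}
```

The selected value could then be handed to the kernel build as a
preprocessor define, so that each placement is a compile-time variant of
the same source.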

Today I kept removing smaller and smaller parts of the code to track down
the slow part. It is indeed in pwxform: when I have
            x0 += p0[0];
            x0 ^= p1[0];
[...]
            x1 += p0[1];
            x1 ^= p1[1];

commented out, the speed is the same with and without the copying
(but I now have another version of pwxform that uses vectors).
