Date: Sat, 25 Jul 2015 14:22:04 +0200
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: yescrypt on GPU

On Sat, Jul 25, 2015 at 12:08:29AM +0200, Agnieszka Bielec wrote:
> I had in my code
> 
> for()
> {
>      copy to private
>      some operations on private
>      copy to global
> }
> 
> I changed this code to
> 
> memset(this private array,0,size of private array)//because I noticed
> when I was working on parallel that kernel can slow down after using
> uninitialized array
> for()
> {
>      some operations on private
> }
> 
> and ran it with --skip-self-test, and the speed was the same, even
> without this memset.

OK.  Why would you incur any accesses to an uninitialized array, though?

yescrypt fully initializes its S-boxes with non-zero data before the
very first invocation of pwxform, which uses them.

> this is a big array (8 KB), but in another place I copy 64 B, and this
> also decreases speed even when the 8 KB copy is turned off

Moving a 64-byte array from global to private decreases speed?  That's
surprising if so.  Is this 64-byte array frequently accessed?  Which
one is it?  The current sub-block buffer in pwxform?  You should keep it
in private, I think.

The S-boxes should likely be in local on AMD and in private on NVIDIA,
although you do need to test with them in global as well.  Ideally,
you'd have this tri-state choice auto-tuned at runtime, since the
optimal one will likely vary across GPUs (even similar ones).
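
To illustrate the tri-state auto-tuning idea, here's a minimal host-side
sketch in C.  All names in it (sbox_mem, bench_fn, autotune_sbox_mem,
bench_from_table) are hypothetical and not part of JtR or yescrypt.  It
assumes the caller supplies a benchmark callback that builds and times
the kernel for a given S-box placement, and it only shows the selection
logic, not the OpenCL build or timing code:

```c
#include <stddef.h>

/* Hypothetical names: the three candidate placements for the S-boxes. */
enum sbox_mem {
    SBOX_GLOBAL = 0,
    SBOX_LOCAL = 1,
    SBOX_PRIVATE = 2,
    SBOX_MEM_COUNT = 3
};

/*
 * Caller-supplied benchmark (an assumption of this sketch): returns the
 * measured speed, e.g. in hashes per second, for a kernel built with the
 * given placement, or a value <= 0 if that variant could not be built or
 * run (for example, not enough local memory on this device).
 */
typedef double (*bench_fn)(enum sbox_mem placement, void *ctx);

/*
 * Try all three placements and return the fastest one.  Falls back to
 * SBOX_GLOBAL if every benchmark reports failure.
 */
enum sbox_mem autotune_sbox_mem(bench_fn bench, void *ctx)
{
    enum sbox_mem best = SBOX_GLOBAL;
    double best_speed = 0.0;
    int m;

    for (m = 0; m < SBOX_MEM_COUNT; m++) {
        double speed = bench((enum sbox_mem)m, ctx);
        if (speed > best_speed) {
            best_speed = speed;
            best = (enum sbox_mem)m;
        }
    }
    return best;
}

/*
 * Stand-in benchmark for trying out the selection logic: reads the speed
 * from a caller-provided table of SBOX_MEM_COUNT doubles.  A real
 * callback would build and time the kernel instead.
 */
double bench_from_table(enum sbox_mem placement, void *ctx)
{
    const double *speeds = (const double *)ctx;
    return speeds[placement];
}
```

The point of running this per device at startup rather than hard-coding
a choice per vendor is that the winner is then picked by measurement on
the actual GPU.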

yescrypt's pwxform S-boxes are similar to bcrypt's, but they are twice
as large (8 KB rather than bcrypt's 4 KB), use wider lookups (128-bit
rather than bcrypt's 32-bit), and there are only 2 of them (bcrypt has
4), which reduces parallelism.  OTOH, 4 such pwxform lanes are computed
in parallel, which increases parallelism.  This is with yescrypt's
current default pwxform settings.  We previously found that it's better
to keep bcrypt's S-boxes in local or private (depending on GPU) rather
than in global, but the differences of pwxform (with particular
settings) vs. bcrypt might change this on some GPUs.  Also, yescrypt's
other uses of global memory (for its large V array) might favor keeping
the S-boxes in global as well: those other global memory accesses might
limit the overall latency reduction achievable by moving the S-boxes to
local or private memory, thereby skewing the balance towards keeping
them in global.
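
For reference, the S-box sizes compared above work out as follows.
This is just the arithmetic as a small C sketch with hypothetical names
(sbox_layout, sbox_total_bytes).  bcrypt's layout is 4 S-boxes of 256
32-bit entries each.  For pwxform, the 256 entries per S-box is
inferred here from the 8 KB total, the count of 2, and the 128-bit
lookup width, so treat it as an assumption tied to the default settings
mentioned above:

```c
#include <stddef.h>

/* Hypothetical helper type describing one algorithm's S-box layout. */
struct sbox_layout {
    unsigned count;        /* number of S-boxes */
    unsigned entries;      /* entries per S-box */
    unsigned entry_bytes;  /* bytes fetched per lookup */
};

/* Total S-box footprint per hash computation, in bytes. */
size_t sbox_total_bytes(struct sbox_layout l)
{
    return (size_t)l.count * l.entries * l.entry_bytes;
}

/* bcrypt: 4 S-boxes x 256 entries x 4 bytes (32-bit lookups) = 4 KB */
const struct sbox_layout BCRYPT_SBOXES = { 4, 256, 4 };

/*
 * pwxform with the default settings discussed here:
 * 2 S-boxes x 256 entries x 16 bytes (128-bit lookups) = 8 KB.
 * The entry count is inferred from the totals, not quoted from the spec.
 */
const struct sbox_layout PWXFORM_SBOXES = { 2, 256, 16 };
```

The total per work-item footprint is what decides whether the S-boxes
fit in local or private memory at a given occupancy.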

Alexander
