john-dev - Re: PHC: yescrypt on GPU

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20151006001609.GA4754@openwall.com>
Date: Tue, 6 Oct 2015 03:16:09 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: yescrypt on GPU

Agnieszka,

It looks like your work on optimizing yescrypt on GPU ended with:

On Sat, Jul 25, 2015 at 06:17:06PM +0200, Agnieszka Bielec wrote:
> 2015-07-25 14:22 GMT+02:00 Solar Designer <solar@...nwall.com>:
> >> this is big array 8KB but I have in another place copying
> >> 64 B and this also decreases speed even when copying 8KB is turned off
> >
> > Moving a 64 bytes array from global to private decreases speed?  That's
> > surprising if so.  Is this 64 bytes array frequently accessed?  Which
> > one is it?  The current sub-block buffer in pwxform?  You should keep it
> > in private, I think.
> 
> in pwxform, 64 bytes - only once they are used
> 
> > The S-boxes should likely be in local on AMD and in private on NVIDIA,
> > although you do in fact need to test with them in global as well - in
> > fact, ideally you'd have this tri-state choice auto-tuned at runtime,
> > since the optimal one will likely vary across GPUs (even similar ones).
> >
> > yescrypt pwxform S-boxes are similar to bcrypt's, but are twice larger
> > (8 KB rather than bcrypt's 4 KB), use wider lookups (128-bit rather than
> > bcrypt's 32-bit), and there are only 2 of them (bcrypt has 4), which
> > reduces parallelism, but OTOH 4 such pwxform lanes are computed in
> > parallel, which increases parallelism.  This is with yescrypt's current
> > default pwxform settings.  We previously found that it's more optimal to
> > keep bcrypt's S-boxes in local or private (depending on GPU) rather than
> > in global, but the differences of pwxform (with particular settings) vs.
> > bcrypt might change this on some GPUs.  Also, yescrypt's other uses of
> > global memory (for its large V array) might make use of global memory for
> > the S-boxes as well more optimal, since those other accesses to global
> > memory might limit the overall latency reduction possible with moving the
> > S-boxes to local or private memory, thereby skewing the balance towards
> > keeping them in global.
> 
> today I was removing getting smaller and smaller parts of the code to
> track down the slow part
> and this is indeed in pwxform, and when I have
>             x0 += p0[0];
>             x0 ^= p1[0];
> [...]
>             x1 += p0[1];
>             x1 ^= p1[1];
> 
> commented out speed is the same with copying and without
> (but I have another version of pwxform using vectors now)

Looking at your code now, I see that you indeed store the 64-byte X
array in pwxform in global memory.  This is weird.

Then you have several versions of the functions, with one of two sets
chosen with "#define SP_COPY" (which is commented out by default).
With SP_COPY, your pwxform works on S-boxes in private memory.  Without
SP_COPY, it works on S-boxes in global memory.

At least for AMD, we need to try a version with S-boxes in local memory,
which you don't appear to have ever tried.  (For bcrypt's 4 KB S-boxes,
we got 2x+ better speeds on GCN with them in local than in global memory.
They're too large for private on GCN, where it's 1 KB per work-item max.)

There may also be correlation between where you choose to keep the
S-boxes and where it's optimal to keep other buffers.  With the S-boxes
in global memory anyway, it is more likely optimal to keep other things
in global memory as well, to maximize the occupation rather than
minimize latency of every one computation (which you can't do much about
when the S-box lookups incur high latency anyway).  So you need to try
multiple combinations to properly test the different approaches.  You
can't test S-boxes in global vs. local memory on its own, without making
changes to where other stuff is stored, and conclusively state that one
is faster than the other - you need to perform more tests, with other
changes, in order to draw such conclusions.

As you're probably aware from the PHC list, I am working on finalizing
yescrypt now, and I need to choose between 5 (or maybe 6) pwxform rounds
with 12 KB S-boxes, and 4 (or maybe 5) pwxform rounds with 12 KB L1 and
128 KB (or maybe 64 KB) L2 S-boxes:

http://thread.gmane.org/gmane.comp.security.phc/3244/focus=3495

If all GPU attack implementations would happen to find it optimal to
keep the S-boxes in global memory anyway, then +1 round (e.g. 5 total)
with 12 KB is preferable over the 12 KB + 128 KB version with -1 round
(e.g. 4 total).  (The 12 KB + 128 KB version might work better against
some FPGAs and ASICs, though.  As well as future GPUs with more local
memory per computation power, if such appear.)  However, if even at
12 KB local memory is still more optimal than global for the S-boxes on
at least some GPUs, then that's a reason for me to go for the 12 KB +
128 KB option (even if it costs removing a round).  (In this paragraph,
I am using "local memory" to refer to any on-die memory.)

A secondary question is how often to write to the S-boxes.  I can do it
just once per sub-block, or after every pwxform round (except for the
last round, which is followed by a write anyway), or twice per sub-block
if the two-level S-boxes are used.

So having better results, for better optimized code, would be helpful
right now.  And you'll need to revise it to include the 12 KB version,
instead of the current 8 KB.  I may provide it to you soon if you're
ready to work on this.  Please let me know.

Then there's also this weird trick I just posted about to the PHC list:

http://thread.gmane.org/gmane.comp.security.phc/2938/focus=3496

where a BSTY miner implementation author chose to split the S-box
lookups across multiple work-items.  He chose 16, but I think 4 would be
optimal (or maybe 8, if additionally splitting loads of the two S-boxes
across two sets of 4 work-items).  This might in fact speed them up, so
might be worth trying (as an extra option, on top of 3 main ones).
He reported 372 h/s at 2 MB (N=2048 r=8) on HD 7750.  Scaling to 7970,
this could be up to 372*2048/512*1000/800 = 1860, but probably a lot
less than that in practice (7750's narrower memory bus might be a better
fit).  Your reported best result is 914 for 1.5 MB (r=6), so seemingly
much slower than his:

http://www.openwall.com/lists/john-dev/2015/07/27/6

We have a 7750 (a version with DDR3 memory, though) in "well", so you
may try your code on it and compare against the 372 figure directly.
And like I wrote, his byte-granular loads are likely not optimal, with
uint likely more optimal.

Thanks,

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.