Date: Sun, 23 Aug 2015 08:37:22 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Fri, Aug 21, 2015 at 05:40:42PM +0200, Agnieszka Bielec wrote:
> 2015-08-20 22:34 GMT+02:00 Solar Designer <solar@...nwall.com>:
> > You could start by experimenting with a kernel much simpler than
> > Argon2 yet similar in some ways: implement some trivial operation
> > like XOR on different vector widths and see whether/how this changes
> > the assembly.
> > Then make it slightly less trivial (just enough to prevent the compiler
> > from optimizing things out) and add uses of private or local memory,
> > and see if you can make it run faster by using wider vectors per the
> > same private or local memory usage.
> 
> I tested (only on 960M):
> - copying memory from __private to __private
> - copying memory from __global to __private
> - XORing __private tables with __private tables
> 
> using ulong, ulong2, ulong4 and ulong8 (I was getting an empty kernel
> with ulong16)
> 
> in the generated PTX code, ulong4 and ulong8 were changed to ulong2

I've just read up on this.  It turns out that vectors wider than 128
bits are not supported in PTX:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#vectors

"Vectors cannot exceed 128-bits in length"

So it's no surprise that ulong4 and ulong8 resulted in PTX code that
used ulong2-like instructions.
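
For reference, here's a rough sketch of the kind of test kernel in
question (untested; the names and sizes are made up).  Building it with
VEC set to ulong4 or ulong8 and dumping the PTX should show the wider
types split into 128-bit (ulong2-sized) loads, stores, and XORs:

/* Hypothetical test kernel: copy from __global to __private, XOR two
 * __private arrays, and write the result back so that nothing gets
 * optimized out.  VEC controls the vector width under test. */
typedef ulong2 VEC;             /* try ulong, ulong2, ulong4, ulong8 */
#define N (256 / sizeof(VEC))   /* 256 bytes of private memory */

__kernel void xor_test(__global const VEC *in, __global VEC *out)
{
	uint gid = get_global_id(0);
	VEC a[N], b[N];
	uint i;

	for (i = 0; i < N; i++) {       /* __global -> __private */
		a[i] = in[gid * 2 * N + i];
		b[i] = in[gid * 2 * N + N + i];
	}

	for (i = 0; i < N; i++)         /* __private XOR __private */
		a[i] ^= b[i];

	for (i = 0; i < N; i++)         /* write back the result */
		out[gid * N + i] = a[i];
}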

Maybe we could bypass this limitation by programming at GPU ISA level,
such as with MaxAs, or maybe not.  We'd need to become familiar with a
given GPU ISA (and maybe more) to answer this.

When we're programming in OpenCL (or even in PTX or IL), we appear to be
stuck with the SIMT rather than SIMD model.  We want to access wider
SIMD from one instance of Argon2, but we're only given SIMT instead.
What we can possibly do is accept this model, and work within it to
achieve our goal.  We can have a few threads (work-items) running in
lock-step anyway (because they're physically being run on wider SIMD
than we can access) communicate via shared (local) memory frequently,
during each block computation.  If we're confident enough of them
running in lock-step, we could even forgo explicit synchronization, even
though this is considered risky (not future-proof):

http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#warp-synchronous
https://devtalk.nvidia.com/default/topic/632471/is-syncthreads-required-within-a-warp-/

In OpenCL, I think we'd use barrier(CLK_LOCAL_MEM_FENCE).  We already
use it in some kernels in JtR.
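
To make the idea concrete, here's a rough sketch (untested, and not
Argon2 code; THREADS, the block layout, and the mixing are made up) of
a few work-items in one work-group cooperating on a block through
__local memory, with a barrier between the read and write phases of
each step:

/* Hypothetical pattern: THREADS work-items jointly process one block
 * held in __local memory, synchronizing between computation steps.
 * The "mixing" is a placeholder; only the communication pattern and
 * the barrier placement matter here. */
#define THREADS 4

__kernel void block_step(__global ulong *blocks, uint nsteps)
{
	__local ulong block[THREADS * 32];  /* shared working block */
	uint lid = get_local_id(0);         /* 0 .. THREADS - 1 */
	uint base = get_group_id(0) * THREADS * 32;
	uint peer = (lid + 1) % THREADS;
	ulong tmp[32];
	uint i, s;

	for (i = 0; i < 32; i++)    /* each work-item loads its slice */
		block[lid * 32 + i] = blocks[base + lid * 32 + i];
	barrier(CLK_LOCAL_MEM_FENCE);

	for (s = 0; s < nsteps; s++) {
		/* read the neighbor's slice from the previous step */
		for (i = 0; i < 32; i++)
			tmp[i] = block[peer * 32 + i];
		barrier(CLK_LOCAL_MEM_FENCE);   /* reads done before writes */

		/* update our own slice using the neighbor's data */
		for (i = 0; i < 32; i++)
			block[lid * 32 + i] ^= tmp[i] * 0x9E3779B97F4A7C15UL;
		barrier(CLK_LOCAL_MEM_FENCE);   /* writes done before next reads */
	}

	for (i = 0; i < 32; i++)
		blocks[base + lid * 32 + i] = block[lid * 32 + i];
}

The two barrier() calls per step are the frequent synchronization in
question.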

If we do use explicit synchronization, then I wonder if it'd kill the
performance (since we'd do it very frequently) or if it possibly would
be just fine (since there's not much to synchronize anyway).

An explicit local memory barrier might be needed (even when the
underlying hardware is simply a wider SIMD unit) because the compiler
might otherwise reorder some store instructions to after the loads
that are meant to read data already modified by those stores.  In
other words, even if a
memory barrier isn't needed at hardware level, it might still be needed
at compiler level.
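
If we did go the risky warp-synchronous route and dropped barrier()
from the sketch above, my guess is that the compiler-level ordering
could be expressed with mem_fence(CLK_LOCAL_MEM_FENCE), which orders a
work-item's own local memory accesses but does not synchronize
work-items, so it would only be safe if the lock-step assumption
actually holds:

	/* Risky replacement for the inner loop body above: assumes the
	 * THREADS work-items really do execute in hardware lock-step,
	 * so no barrier() is issued.  mem_fence() only keeps this
	 * work-item's local loads/stores from being reordered across
	 * it; it does NOT synchronize work-items. */
	for (i = 0; i < 32; i++)
		tmp[i] = block[peer * 32 + i];
	mem_fence(CLK_LOCAL_MEM_FENCE);     /* reads before the stores below */

	for (i = 0; i < 32; i++)
		block[lid * 32 + i] ^= tmp[i] * 0x9E3779B97F4A7C15UL;
	mem_fence(CLK_LOCAL_MEM_FENCE);     /* stores before the next step's reads */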

Alexander
