Date: Sun, 23 Aug 2015 09:15:11 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

Agnieszka,

Your current Argon2 kernels use global and private memory only.  They
don't use local memory.

While private memory might be larger and faster on specific devices, I
think that not making any use of local memory is wasteful.  By using
both private and local memory at once, we should be able to optimally
pack more concurrent Argon2 instances per GPU and thereby hide more of
the various latencies.

You should try moving some of the arrays from private to local memory.
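
For example, a minimal sketch of the idea (the identifiers and the
work-group size below are made up just to illustrate, not taken from
your kernels):

/* Illustrative only: one 1 KiB Argon2 block (128 x 64-bit words) kept
 * in local memory per work-item, assuming e.g. 8 work-items per
 * work-group, i.e. 8 KiB of local memory per group. */
#define ARGON2_QWORDS_IN_BLOCK 128
#define WORK_ITEMS_PER_GROUP 8

__kernel void argon2_sketch(__global ulong *memory)
{
	__local ulong scratch[WORK_ITEMS_PER_GROUP][ARGON2_QWORDS_IN_BLOCK];
	__local ulong *blk = scratch[get_local_id(0)];
	size_t base = get_global_id(0) * ARGON2_QWORDS_IN_BLOCK;
	uint i;

	/* Load one block from global to local memory */
	for (i = 0; i < ARGON2_QWORDS_IN_BLOCK; i++)
		blk[i] = memory[base + i];

	/* ... compute on blk[] here instead of on a private array ... */

	/* Store the result back to global memory */
	for (i = 0; i < ARGON2_QWORDS_IN_BLOCK; i++)
		memory[base + i] = blk[i];
}

Exactly which arrays to move, and how to split local memory between
work-items, is something to experiment with - the point is to start
using local memory at all.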

Here's a related finding:

http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#shared-memory-bandwidth

"[...] shared memory bandwidth in SMX is twice that of Fermi's SM.  This
bandwidth increase is exposed to the application through a configurable
new 8-byte shared memory bank mode.  When this mode is enabled, 64-bit
(8-byte) shared memory accesses (such as loading a double-precision
floating point number from shared memory) achieve twice the effective
bandwidth of 32-bit (4-byte) accesses.  Applications that are sensitive
to shared memory bandwidth can benefit from enabling this mode as long
as their kernels' accesses to shared memory are for 8-byte entities
wherever possible."

Argon2's accesses are wider than 8 bytes and are a multiple of 8 bytes
in size, so I think we need to enable this mode.  Please try to find out
how to enable it, and whether it possibly gets enabled automatically,
e.g. when the kernel uses specific data types.
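
For reference, in CUDA this mode is selected via the runtime API (this
is just the CUDA way; whether NVIDIA's OpenCL exposes anything
equivalent is exactly what needs to be found out):

	/* Device-wide, for subsequent kernel launches */
	cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
	/* or per kernel ("my_kernel" is a placeholder name) */
	cudaFuncSetSharedMemConfig(my_kernel, cudaSharedMemBankSizeEightByte);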

I think it could benefit many more of our kernels.  So this is important
to figure out and learn to use regardless of Argon2.

Yet another relevant finding is that, per the tuning guides, Kepler and
Maxwell do not use L1 caches for global memory (they only use L2), but
there's a compiler option to change this behavior (enable use of both L1
and L2 caches for global memory).  We could give this a try (if we find
how to do this for OpenCL) and see if it improves or hurts performance,
especially if we end up not using local memory anyway (for whatever
reason) and have no or few register spills (where L1 cache for local
memory could have been helpful).  I don't expect this to be of much
help, though - most likely the default is actually optimal for us,
unless we don't use local memory at all (not even implicitly via spills).
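
For reference, with nvcc the relevant knob is the PTX assembler's load
cache modifier, e.g.:

	nvcc -Xptxas -dlcm=ca ...	(cache global loads in L1 and L2)
	nvcc -Xptxas -dlcm=cg ...	(cache in L2 only, the Kepler default)

Whether something like this can be passed to NVIDIA's OpenCL compiler
through clBuildProgram() build options is what we'd have to check.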

Alexander
