Date: Sun, 23 Aug 2015 08:08:06 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: interleaving on GPUs (was: PHC: Parallel in OpenCL)

Back in May, I posted:

On Tue, May 26, 2015 at 12:30:18AM +0300, Solar Designer wrote:
> On Mon, May 25, 2015 at 11:08:51PM +0200, Agnieszka Bielec wrote:
> > Does interleaving make sense on gpu ?
> 
> Not at OpenCL source level.  The compiler and hardware take care of
> interleaving for you, as long as LWS and GWS are high enough.  So please
> don't bother doing it manually.
> 
> BTW, Intel's OpenCL when targeting CPU also manages to interleave things
> well, at least in simple cases.  In fact, for a PHP mt_rand() seed
> cracker competing with my php_mt_seed, it did that so well that until I
> increased the interleave factor from 4x to 8x in php_mt_seed (in C), the
> other cracker (in OpenCL) ran faster on a Core i5 CPU (lacking HT, so
> needing that higher than 4x interleave factor to hide the SIMD integer
> multiply latency).
> 
> So I currently see no need for interleaving at OpenCL source level.
> 
> Interleaving is something we use when writing in C (including with
> intrinsics) or in assembly.

I just read this about NVIDIA's Kepler (such as the old GTX TITAN that
we have in super):

http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#device-utilization-and-occupancy

"Also note that Kepler GPUs can utilize ILP in place of
thread/warp-level parallelism (TLP) more readily than Fermi GPUs can.
Furthermore, some degree of ILP in conjunction with TLP is required by
Kepler GPUs in order to approach peak single-precision performance,
since SMX's warp scheduler issues one or two independent instructions
from each of four warps per clock.  ILP can be increased by means of, for
example, processing several data items concurrently per thread or
unrolling loops in the device code, though note that either of these
approaches may also increase register pressure."

Note that they explicitly mention "processing several data items
concurrently per thread".  So it appears that when targeting Kepler, up
to 2x interleaving at OpenCL kernel source level could make sense.

(For the original Argon2, this also means that its abundant
instruction-level parallelism within one instance is actually helpful
for Kepler, with no need for us to do any interleaving of multiple
instances of Argon2.)
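
To make the Kepler point concrete, here's a toy kernel sketch
(hypothetical, not from JtR; toy_2x and its arguments are made up) of
"processing several data items concurrently per thread": each work-item
carries two independent items, so the compiler can interleave the two
dependency chains and the warp schedulers always have a second
independent instruction available to dual-issue:

__kernel void toy_2x(__global const uint *in, __global uint *out,
    uint iters)
{
    size_t gid = get_global_id(0);

    /* Two independent data items per work-item */
    uint a = in[2 * gid];
    uint b = in[2 * gid + 1];

    for (uint i = 0; i < iters; i++) {
        /* The two chains don't depend on each other, so their
         * instructions can be issued back to back */
        a = (a ^ (a << 13)) + 0x9e3779b9U;
        b = (b ^ (b << 13)) + 0x9e3779b9U;
    }

    out[2 * gid] = a;
    out[2 * gid + 1] = b;
}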

On Maxwell (such as the newer GTX Titan X), interleaving shouldn't be
needed anymore:

http://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html#smm-occupancy

"The power-of-two number of CUDA Cores per partition simplifies
scheduling, as each of SMM's warp schedulers issue to a dedicated set of
CUDA Cores equal to the warp width.  Each warp scheduler still has the
flexibility to dual-issue (such as issuing a math operation to a CUDA
Core in the same cycle as a memory operation to a load/store unit), but
single-issue is now sufficient to fully utilize all CUDA Cores."

I interpret the above as meaning that 2x interleaving can still benefit
Maxwell at low occupancy, but we can simply increase the occupancy and
achieve the same effect instead.
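
In other words, on Maxwell we'd just make sure GWS is large enough.  A
rough host-side sketch of what "large enough" could mean (hypothetical
helper, not our actual auto-tuning code; warps_per_cu is a made-up
tunable):

#include <CL/cl.h>

/* Pick a GWS that keeps several warps resident per compute unit, so
 * each scheduler always has an independent warp to issue from even
 * with single-issue. */
static size_t pick_gws(cl_device_id dev, size_t lws,
    unsigned int warps_per_cu)
{
    cl_uint cus = 1;

    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
        sizeof(cus), &cus, NULL);

    size_t gws = (size_t)cus * warps_per_cu * 32; /* 32 = warp width */
    return (gws + lws - 1) / lws * lws; /* round up to a multiple of LWS */
}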

Summary: we may want to experiment with 2x interleaving in OpenCL
kernels when targeting Kepler, but not when that would also double the
kernel's private and local memory usage (so not for the PHC finalists,
but rather for memory-less things like MD5).
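
For example (hypothetical sketch, not one of our real MD5 kernels; the
kernel name and buffer layout are made up), 2x interleaving of MD5 only
costs extra registers, with no local memory involved:

#define F(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
#define STEP(f, a, b, c, d, x, t, s) do { \
        (a) += f((b), (c), (d)) + (x) + (t); \
        (a) = rotate((a), (uint)(s)); \
        (a) += (b); \
    } while (0)

__kernel void md5_2x_sketch(__global const uint *block0,
    __global const uint *block1, __global uint *out)
{
    size_t gid = get_global_id(0);

    /* Two independent MD5 states, both held in registers */
    uint a0 = 0x67452301U, b0 = 0xefcdab89U;
    uint c0 = 0x98badcfeU, d0 = 0x10325476U;
    uint a1 = a0, b1 = b0, c1 = c0, d1 = d0;

    /* First step of round 1, issued back to back for the two
     * candidates; the remaining 63 steps are omitted in this sketch */
    STEP(F, a0, b0, c0, d0, block0[gid * 16], 0xd76aa478U, 7);
    STEP(F, a1, b1, c1, d1, block1[gid * 16], 0xd76aa478U, 7);

    out[2 * gid] = a0;
    out[2 * gid + 1] = a1;
}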

Alexander
