Date: Sun, 23 Aug 2015 08:08:06 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: interleaving on GPUs (was: PHC: Parallel in OpenCL)

Back in May, I posted:

On Tue, May 26, 2015 at 12:30:18AM +0300, Solar Designer wrote:
> On Mon, May 25, 2015 at 11:08:51PM +0200, Agnieszka Bielec wrote:
> > Does interleaving make sense on gpu ?
>
> Not at OpenCL source level.  The compiler and hardware take care of
> interleaving for you, as long as LWS and GWS are high enough.  So please
> don't bother doing it manually.
>
> BTW, Intel's OpenCL when targeting CPU also manages to interleave things
> well, at least in simple cases.  In fact, for a PHP mt_rand() seed
> cracker competing with my php_mt_seed, it did that so well that until I
> increased the interleave factor from 4x to 8x in php_mt_seed (in C), the
> other cracker (in OpenCL) ran faster on a Core i5 CPU (lacking HT, so
> needing that higher than 4x interleave factor to hide the SIMD integer
> multiply latency).
>
> So I currently see no need for interleaving at OpenCL source level.
>
> Interleaving is something we use when writing in C (including with
> intrinsics) or in assembly.

I just read this about NVIDIA's Kepler (such as the old GTX TITAN that
we have in super):

http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#device-utilization-and-occupancy

"Also note that Kepler GPUs can utilize ILP in place of
thread/warp-level parallelism (TLP) more readily than Fermi GPUs can.
Furthermore, some degree of ILP in conjunction with TLP is required by
Kepler GPUs in order to approach peak single-precision performance,
since SMX's warp scheduler issues one or two independent instructions
from each of four warps per clock.  ILP can be increased by means of,
for example, processing several data items concurrently per thread or
unrolling loops in the device code, though note that either of these
approaches may also increase register pressure."
Note that they explicitly mention "processing several data items
concurrently per thread".  So it appears that when targeting Kepler, up
to 2x interleaving at OpenCL kernel source level could make sense.

(For the original Argon2, this also means that its abundant
instruction-level parallelism within one instance is actually helpful
for Kepler, with no need for us to do any interleaving of multiple
instances of Argon2.)

On Maxwell (such as the newer GTX Titan X), interleaving shouldn't be
needed anymore:

http://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html#smm-occupancy

"The power-of-two number of CUDA Cores per partition simplifies
scheduling, as each of SMM's warp schedulers issue to a dedicated set of
CUDA Cores equal to the warp width.  Each warp scheduler still has the
flexibility to dual-issue (such as issuing a math operation to a CUDA
Core in the same cycle as a memory operation to a load/store unit), but
single-issue is now sufficient to fully utilize all CUDA Cores."

I interpret the above as meaning that 2x interleaving can still benefit
Maxwell at low occupancy, but we can simply increase the occupancy and
achieve the same effect instead.

Summary: we could want to experiment with 2x interleaving in OpenCL
kernels when targeting Kepler, but not when that would also double the
kernel's private and local memory usage (so not for the PHC finalists,
but rather for memory-less things like MD5).

Alexander
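To make the "2x interleaving" idea concrete, here is a minimal sketch in
plain C (the same transformation would apply inside an OpenCL kernel).
The round function and all names are hypothetical stand-ins for one step
of a memory-less primitive like MD5; the point is only the structure:
two independent dependency chains per thread, so the scheduler can issue
from one chain while the other's result is still in flight, at the cost
of roughly doubled register pressure.

```c
#include <stdint.h>

/* Hypothetical round function standing in for one step of a
 * memory-less primitive such as MD5 (illustrative only). */
static uint32_t round_step(uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

/* 1x: one data item per thread.  Each step depends on the previous
 * one, so execution stalls on the instruction's result latency. */
uint32_t process_1x(uint32_t a, int rounds)
{
    for (int i = 0; i < rounds; i++)
        a = round_step(a);
    return a;
}

/* 2x interleaved: two independent data items per thread.  The two
 * chains share no registers, exposing the ILP that Kepler's SMX can
 * exploit via dual-issue.  Private storage per thread doubles. */
void process_2x(uint32_t *a, uint32_t *b, int rounds)
{
    uint32_t x = *a, y = *b;
    for (int i = 0; i < rounds; i++) {
        x = round_step(x);
        y = round_step(y);
    }
    *a = x;
    *b = y;
}
```

The results are identical to running two 1x instances; only the
instruction scheduling opportunity changes, which is why this helps on
Kepler but is unnecessary on Maxwell at sufficient occupancy.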