Date: Sun, 23 Aug 2015 08:37:22 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Fri, Aug 21, 2015 at 05:40:42PM +0200, Agnieszka Bielec wrote:
> 2015-08-20 22:34 GMT+02:00 Solar Designer <solar@...nwall.com>:
> > You could start by experimenting with a much simpler than Argon2 yet in
> > some ways similar kernel: implement some trivial operation like XOR on
> > different vector widths and see whether/how this changes the assembly.
> > Then make it slightly less trivial (just enough to prevent the compiler
> > from optimizing things out) and add uses of private or local memory,
> > and see if you can make it run faster by using wider vectors per the
> > same private or local memory usage.
>
> I tested (only 960m)
> - copying memory from __private to __private
> - from __global to __private
> - xoring private tables with __private tables
>
> using ulong, ulong2, ulong4, ulong8 (I was getting empty kernel using ulong16)
>
> in generated PTX code ulong4 and ulong8 were changed to ulong2

I've just read up on this.  It turns out that vectors wider than 128-bit
are not supported in PTX:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#vectors

"Vectors cannot exceed 128-bits in length"

So it's no surprise that ulong4 and ulong8 resulted in PTX code that
used ulong2-like instructions.

Maybe we could bypass this limitation by programming at GPU ISA level,
such as with MaxAs, or maybe not.  We'd need to become familiar with a
given GPU ISA (and maybe more) to answer this.

When we're programming in OpenCL (or even in PTX or IL), we appear to be
stuck with the SIMT rather than SIMD model.  We want to access wider
SIMD from one instance of Argon2, but we're only given SIMT instead.
What we can possibly do is accept this model, and work within it to
achieve our goal.
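For reference, the kind of trivial test kernel described above might look
roughly like this (a sketch; the names and table size are mine, not from
Agnieszka's actual test code):

```c
/* Hypothetical sketch of the vector-width experiment: copy a __global
 * table into a __private one, XOR in a second table, and write the
 * result back (so the compiler can't optimize everything out).  Swap
 * ulong2 for ulong4 or ulong8 and inspect the generated PTX; with
 * NVIDIA's compiler the wider types get split into 128-bit
 * (ulong2-sized) operations, per the PTX limit quoted above. */
__kernel void xor_test(__global const ulong2 *in, __global ulong2 *out)
{
	uint gid = get_global_id(0);
	ulong2 priv[8];		/* __private table; size is arbitrary */

	for (uint i = 0; i < 8; i++)
		priv[i] = in[gid * 8 + i];	/* __global -> __private */

	for (uint i = 0; i < 8; i++)
		priv[i] ^= in[(gid ^ 1) * 8 + i];	/* XOR of tables */

	for (uint i = 0; i < 8; i++)
		out[gid * 8 + i] = priv[i];	/* keep results live */
}
```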
We can have a few threads (work-items) running in lock-step anyway
(because they're physically being run on wider SIMD than we can access)
communicate via shared (local) memory frequently, during each block
computation.  If we're confident enough of them running in lock-step, we
could even forgo explicit synchronization, even though this is
considered risky (not future-proof):

http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#warp-synchronous
https://devtalk.nvidia.com/default/topic/632471/is-syncthreads-required-within-a-warp-/

In OpenCL, I think we'd use barrier(CLK_LOCAL_MEM_FENCE).  We already
use it in some kernels in JtR.

If we do use explicit synchronization, then I wonder if it'd kill the
performance (since we'd do it very frequently) or if it possibly would
be just fine (since there's not much to synchronize anyway).

An explicit local memory barrier might be needed (even when the
underlying hardware is simply a wider SIMD) because the compiler might
otherwise reorder some store instructions to be after the loads that are
intended to load data already modified by those stores.  In other words,
even if a memory barrier isn't needed at hardware level, it might still
be needed at compiler level.

Alexander
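[Editorial illustration: the cooperation scheme described above might be
sketched in OpenCL roughly as follows.  All names, sizes, and the mixing
step are made up for illustration; this is not JtR's actual Argon2 code.]

```c
/* Hypothetical sketch: LANES work-items (ideally within one
 * warp/wavefront, so they run in lock-step) cooperate on each block
 * computation through __local memory, synchronizing with
 * barrier(CLK_LOCAL_MEM_FENCE) between the store and load phases. */
#define LANES 4			/* cooperating work-items per instance */
#define WORDS 32		/* ulongs per lane's slice (made up) */

__kernel void block_step(__global ulong *mem, __local ulong *scratch)
{
	uint lane = get_local_id(0) % LANES;
	uint base = (get_local_id(0) / LANES) * LANES * WORDS;

	/* Each lane publishes its slice of the block to local memory. */
	for (uint i = 0; i < WORDS; i++)
		scratch[base + lane * WORDS + i] =
		    mem[get_global_id(0) * WORDS + i];

	/* All lanes must see each other's stores before cross-lane
	 * mixing.  If we fully trusted lock-step execution we could
	 * omit this barrier, but that's the risky, not-future-proof
	 * option discussed above - and it's needed at compiler level
	 * anyway, to keep the stores ordered before the loads. */
	barrier(CLK_LOCAL_MEM_FENCE);

	/* Cross-lane mix: read a neighboring lane's slice. */
	for (uint i = 0; i < WORDS; i++)
		mem[get_global_id(0) * WORDS + i] ^=
		    scratch[base + ((lane + 1) % LANES) * WORDS + i];
}
```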