Date: Thu, 20 Aug 2015 23:34:03 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Thu, Aug 20, 2015 at 08:04:20PM +0200, Agnieszka Bielec wrote:
> 2015-08-19 18:39 GMT+02:00 Solar Designer <solar@...nwall.com>:
> > I think you may try working on ulong16 or ulong8 instead. I expect
> > ulong8 to match the current GPU hardware best, but OTOH ulong16 makes
> > more parallelism apparent to the OpenCL compiler and allocates it to one
> > work-item. So please try both and see which works best.
>
> I created something using ulong8, it's almost not noticeable better
> speed in my laptop but worse on super both cards, no idea if this is
> what you wanted ( I think that not ), you can take a look on branch
> vector8

This is a step towards what I meant, but you're not quite there yet.
You need to convert more of the processing to ulong8 (or ulong16). For
example, you still have "ulong2 ref_block;" in ComputeBlock_pgg(), but
it should become an array of ulong8 too. And so on. Yes, this means
that you either have to convert the callers to using this wider vector
type as well, or you have to convert between the vectors somewhere
(which likely results in performance loss). You should also use the
wider vector type for the global memory references and in the kernel
parameter list.

The only shuffling of ulong2's inside/between ulong8's should be between
the two groups of 8 BLAKE2 rounds. Right now, you also have conversion
from ulong2 to ulong8 before the first group of 8 BLAKE2 rounds - it
should go away when you optimize this code further as I suggested above.

Also, the shuffling can probably be optimized. Right now, you keep the
full block in state and you also have 8 ulong8's storing half a block at
a time. You may instead have 16 ulong8's storing the entire block.
Yes, the shuffling might require some temporary storage, but you don't
necessarily have to write the entire block to a temporary array of
ulong2's - perhaps there's a more efficient way for the specific kind of
shuffling that is being done.

Also, we're optimizing this blindfolded, and that's wrong. We should be
reviewing the generated code. You may patch common-opencl.c:
opencl_build_kernel_opt() to invoke opencl_build() like this:

    opencl_build(sequential_id, opts, 1, "kernel.out");

instead of the current:

    opencl_build(sequential_id, opts, 0, NULL);

Then when targeting NVIDIA cards it dumps PTX assembly to the filename
specified there. It looks something like this, just much larger:

http://arrayfire.com/demystifying-ptx-code/

You could start by experimenting with a kernel that is much simpler than
Argon2 yet similar in some ways: implement some trivial operation like
XOR on different vector widths and see whether/how this changes the
assembly. Then make it slightly less trivial (just enough to prevent
the compiler from optimizing things out) and add uses of private or
local memory, and see if you can make it run faster by using wider
vectors for the same private or local memory usage.

Alexander
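
For the XOR experiment suggested above, it may help to have a host-side
reference to check kernel output against. Below is a minimal sketch in
plain C (not OpenCL), assuming a 1024-byte block, which matches the
"16 ulong8's storing the entire block" arithmetic. The ulong2_model and
ulong8_model types and all function names are hypothetical stand-ins,
not anything from the JtR tree. The sketch only shows that the same XOR
expressed at two vector widths must give byte-identical results.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side stand-ins for OpenCL's ulong2 and ulong8. */
typedef struct { uint64_t s[2]; } ulong2_model;
typedef struct { uint64_t s[8]; } ulong8_model;

enum {
    BLOCK_WORDS  = 128,              /* assumed 1024-byte block as 64-bit words */
    BLOCK_ULONG2 = BLOCK_WORDS / 2,  /* 64 two-lane vectors per block */
    BLOCK_ULONG8 = BLOCK_WORDS / 8   /* 16 eight-lane vectors per block */
};

/* XOR src into dst, two 64-bit lanes per step. */
static void xor_block_ulong2(ulong2_model *dst, const ulong2_model *src)
{
    for (int i = 0; i < BLOCK_ULONG2; i++)
        for (int lane = 0; lane < 2; lane++)
            dst[i].s[lane] ^= src[i].s[lane];
}

/* XOR src into dst, eight 64-bit lanes per step. */
static void xor_block_ulong8(ulong8_model *dst, const ulong8_model *src)
{
    for (int i = 0; i < BLOCK_ULONG8; i++)
        for (int lane = 0; lane < 8; lane++)
            dst[i].s[lane] ^= src[i].s[lane];
}

/* Returns 1 if both widths produce byte-identical blocks, else 0. */
int xor_widths_agree(void)
{
    uint64_t a[BLOCK_WORDS], b[BLOCK_WORDS];
    ulong2_model a2[BLOCK_ULONG2], b2[BLOCK_ULONG2];
    ulong8_model a8[BLOCK_ULONG8], b8[BLOCK_ULONG8];

    /* Arbitrary, deterministic test data. */
    for (int i = 0; i < BLOCK_WORDS; i++) {
        a[i] = (uint64_t)(i + 1) * 0x9E3779B97F4A7C15ULL;
        b[i] = ~a[i] ^ (uint64_t)i;
    }

    /* Same bytes viewed at both widths. */
    memcpy(a2, a, sizeof a);
    memcpy(b2, b, sizeof b);
    memcpy(a8, a, sizeof a);
    memcpy(b8, b, sizeof b);

    xor_block_ulong2(a2, b2);
    xor_block_ulong8(a8, b8);

    return memcmp(a2, a8, sizeof a) == 0;
}
```

In the actual experiment the two loops would presumably become two
OpenCL kernels working on ulong2 and ulong8 data, and the interesting
part is comparing the PTX dumped for each, not the results, which
should be identical.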