Date: Wed, 2 Sep 2015 20:04:47 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: md5crypt-opencl

magnum, all -

On Mon, Aug 24, 2015 at 12:38:03PM +0300, Solar Designer wrote:
> I've experimented with optimizing our md5crypt-opencl in various ways -
> patch attached.

I looked into this some further. As I suspected from benchmarks on GCN at different core clock rates (which had surprisingly little effect on performance), the kernel ends up using global memory in the inner loop. I've just confirmed this by reviewing the GCN ISA code.

Specifically, the entire ctx array is in global memory. That's 512 bytes, so I think it could as well fit in VGPRs, but somehow the compiler is unwilling to do that. Reducing this array to 448 bytes (14 words per buffer, which can be done since the last two words are the length and zero) doesn't help (it actually slows things down a little bit).

Do you have an idea how to force use of private memory aka VGPRs for arrays like this? I think the limit is 256 VGPRs per work-item, and we're still sufficiently(?) below that (it'd be 1024 bytes, although this obviously includes other variables and scratch registers). Specifying __private explicitly obviously makes no difference.

As an experiment, I tried moving this to __local without proper allocation for the entire workgroup and without proper address calculation (so all work-items were sharing the same ctx, obviously resulting in incorrect computation). This resulted in a 1.5x speedup (so ~4.5M c/s on Tahiti), but obviously failing self-test. So that's the speed potential. If we actually allocate enough local memory, things will be slower because this will limit concurrency (we actually have less local memory than private memory), yet it might be faster than what we have now.

Ideally, though, we'd just force the compiler to use VGPRs here... or is there some fundamental reason why this won't work?
Such as lacking addressing modes, but per my earlier reading of GCN arch manual, indexed access to VGPRs is supported.

Oh, is it possibly because the array is two-dimensional? Like some heuristic: "put all arrays with greater than one dimension in global memory". It is probably worth trying to turn the array into single-dimensional and see.

I think we have good chances to beat oclHashcat's speed if we solve this. I think oclHashcat doesn't use multiple buffers because the private or local memory use by 21 or 42 buffers would be too large, but in our kernel Lukas came up with a more compact representation needing only 8 buffers. We just need to get them out of global memory... And then we could apply the same to SHA-crypt.

Alexander
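
For illustration only, the array sizes quoted in the message and the single-dimensional layout it proposes trying can be sketched in plain host-side C. This is not the kernel code or the attached patch: the names CTX_BUFFERS, CTX_WORDS, ctx_bytes, ctx_index and ctx_store are hypothetical, and the sketch assumes 8 buffers of 16 32-bit words each, which matches the 512-byte figure above. Whether a flat array actually persuades the compiler to keep ctx in VGPRs is not something this sketch can show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout (hypothetical names): 8 buffers of 16 32-bit words. */
#define CTX_BUFFERS 8
#define CTX_WORDS   16

/* Bytes occupied by the whole context for a given number of words per buffer. */
static size_t ctx_bytes(size_t words_per_buffer)
{
    return CTX_BUFFERS * words_per_buffer * sizeof(uint32_t);
}

/* Map a (buffer, word) pair onto an index into a single-dimensional array. */
static size_t ctx_index(size_t buffer, size_t word)
{
    assert(buffer < CTX_BUFFERS && word < CTX_WORDS);
    return buffer * CTX_WORDS + word;
}

/* Store through the flat view; same effect as ctx2d[buffer][word] = value. */
static void ctx_store(uint32_t *flat, size_t buffer, size_t word, uint32_t value)
{
    flat[ctx_index(buffer, word)] = value;
}
```

With these assumptions, ctx_bytes(16) gives the 512 bytes mentioned above and ctx_bytes(14) gives 448 bytes, both well under the 1024 bytes that 256 four-byte VGPRs would provide.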