Date: Thu, 20 Aug 2015 06:30:10 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Thu, Aug 20, 2015 at 04:53:55AM +0300, Solar Designer wrote:
> On Wed, Aug 19, 2015 at 07:41:02PM +0200, Agnieszka Bielec wrote:
> > ptxas info : Function properties for FillSegment
> > ptxas . 0 bytes stack frame, 17400 bytes spill stores,
> > 19352 bytes spill loads
> > ptxas info : Function properties for GenerateAddresses
> > ptxas . 0 bytes stack frame, 7780 bytes spill stores,
> > 11648 bytes spill loads
>
> The spills in FillSegment and GenerateAddresses are pretty bad. Where
> do they come from, and why so much? In FillSegment you use 1 KB per
> work-item for addresses, in GenerateAddresses you use 2 KB for two
> blocks. GenerateAddresses is called from FillSegment, so adds its
> private memory needs on top of FillSegment's.

There's also a 1 KB ref_block in ComputeBlock and in ComputeBlock_pgg.

On super's -dev=5, I was getting:

ptxas info : Function properties for FillSegment
ptxas . 8216 bytes stack frame, 9708 bytes spill stores, 7776 bytes spill loads
ptxas info : Function properties for GenerateAddresses
ptxas . 6104 bytes stack frame, 4056 bytes spill stores, 4124 bytes spill loads

I've optimized this to:

ptxas info : Function properties for FillSegment
ptxas . 4408 bytes stack frame, 5984 bytes spill stores, 4020 bytes spill loads
ptxas info : Function properties for GenerateAddresses
ptxas . 1304 bytes stack frame, 388 bytes spill stores, 400 bytes spill loads

with the attached patch. As it is, it provides no speedup for me (in
fact, there's a very slight slowdown), but it should illustrate to you
what to optimize. I expect that once you convert those uint operations
to work on ulong2 all the time, you'll see a slight speedup.

(The changes in performance seen from these code changes are relatively
minor because GenerateAddresses corresponds to a relatively small part
of the total running time.
There is a significant reduction in global memory usage, though, as seen
via nvidia-smi.)

In fact, those typecasts between ulong2 and uint pointers are probably
disallowed, as they violate strict aliasing rules. Also, your code
heavily depends on the architecture being little-endian (just like
Argon2's original code did, which is a known bug). You should try to
avoid that as you proceed to optimize your OpenCL kernels. You'll find
that avoiding endianness dependencies goes along with avoiding strict
aliasing violations and achieving better speed as well (since the kernel
would use its full allocated SIMD width all the time, rather than only
part of the time).

BTW, out_tmp in Initialize() appears to be twice as large as it needs to
be:

	ulong2 out_tmp[BLOCK_SIZE/8];

ulong2 is 16 bytes, but you divide by 8. Or is this on purpose? Why?

Alexander

View attachment "john-argon2i-opencl-opt1.diff" of type "text/plain" (3036 bytes)
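
A minimal sketch of the aliasing, endianness, and sizing points above, in
plain C rather than OpenCL. It is not taken from the attached patch. The
names ulong2_t, load64_le, lo32, hi32, and self_check are hypothetical,
and BLOCK_SIZE is assumed to be 1024 bytes, to match the 1 KB blocks
mentioned in the thread. The idea is to build and split 64-bit words with
shifts instead of casting between pointer types, so the result does not
depend on host byte order, and to size the array by the element size.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 1024 /* assumption: 1 KB block, as discussed above */

/* Stand-in for OpenCL's ulong2: two 64-bit lanes, 16 bytes in total. */
typedef struct { uint64_t x, y; } ulong2_t;

/* Assemble a 64-bit word from 8 little-endian bytes using shifts only.
 * No pointer cast is involved, so there is no aliasing violation and
 * the value is the same on little- and big-endian hosts. */
uint64_t load64_le(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Extract the 32-bit halves by value rather than through a uint pointer. */
uint32_t lo32(uint64_t v) { return (uint32_t)v; }
uint32_t hi32(uint64_t v) { return (uint32_t)(v >> 32); }

/* Usage example: checks the helpers and the element-count arithmetic. */
void self_check(void)
{
    const unsigned char bytes[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    uint64_t w = load64_le(bytes);

    assert(w == 0x0807060504030201ULL);
    assert(lo32(w) == 0x04030201u);
    assert(hi32(w) == 0x08070605u);

    /* One block holds BLOCK_SIZE / 16 two-lane elements, not BLOCK_SIZE / 8. */
    assert(BLOCK_SIZE / sizeof(ulong2_t) == 64);
}
```

Under these assumptions, BLOCK_SIZE / sizeof(ulong2_t) gives 64 elements
(1024 bytes), whereas BLOCK_SIZE/8 gives 128 elements (2048 bytes).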