Date: Tue, 13 Oct 2015 20:37:26 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: LOP3.LUT

On 2015-10-13 10:43, magnum wrote:
> Most formats now has LOP3.LUT alternatives and seem to work fine now.
> Some don't get any boost (just meaning the toolchain did a good job
> already) but I think md5crypt is the only one getting a definite
> performance regression (and still has it disabled). We should get to the
> bottom of that. BTW it would be very nice having CUDA 7.5 on super.

Comparison of md5crypt kernel compiled with bitselect vs. with explicit
LOP3.LUT for the function primitives:

Bitselect:
ptxas info    : 0 bytes gmem, 54 bytes cmem
ptxas info    : Compiling entry function 'cryptmd5' for 'sm_52'
ptxas info    : Function properties for cryptmd5
ptxas         .     592 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 38 registers, 344 bytes cmem, 268 bytes cmem

Explicit LOP3.LUT:
ptxas info    : 0 bytes gmem, 54 bytes cmem
ptxas info    : Compiling entry function 'cryptmd5' for 'sm_52'
ptxas info    : Function properties for cryptmd5
ptxas         .     592 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 37 registers, 344 bytes cmem, 260 bytes cmem

                  explicit  bitselect
PTX #lines            4293       4375
ISA #lines            4214       4177
DEPBAR                  56         62
LOP32I                  31         33
LOP3                   372        372
.reuse                 235        349
LOP3 w/ .reuse          95        103
IADD32                 420        400
IADD3                  381        383

Fewer DEPBAR instructions should be a good thing, but I think the much
lower ".reuse" count is not, and this may be the main problem. But we
can't specify which registers to use! Perhaps the register scheduling
when using inline PTX lop3 will improve over time.

After reading some forum posts about register slots I actually tried
using alternate lop3 immediates, shuffling x, y and z around. I could
only conclude it *does* sometimes matter... but the chance of actually
controlling the situation appears pretty slim to me.

LOP3 immediates used:
explicit:  0x39, 0x96, 0xca, 0xe4 (just the ones used in my functions).
bitselect: 0x4b, 0x96, 0xac, 0xb8, 0xca.

For reference, the natural truth table for just a bitselect is 0xd8, and
the alternatives when shuffling x, y and z around are 0xac, 0xb8, 0xca,
0xe2 and 0xe4. And 0x96 is (x ^ y ^ z) in any order. That leaves 0x4b to
investigate. Doing so, I think I located that section in PTX vs. ISA but
I don't get what is happening, and I gave up at that point.

On another note, I find it strange that the difference in 2-op adds
doesn't match the difference in 3-op adds at all.

magnum
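
The truth-table values quoted above can be reproduced with a few lines of Python. This is a minimal sketch, not code from the post: it assumes the usual lop3 convention in which the three inputs are represented by the constants 0xF0, 0xCC and 0xAA, and it assumes that 0x39 stands for the MD5 round function I, y ^ (x | ~z), which the post does not state. The helper names (lut, all_orders, md5_i and so on) are hypothetical.

```python
from itertools import permutations

# Conventional truth-table columns for the three lop3 inputs (assumption:
# the standard a=0xF0, b=0xCC, c=0xAA encoding).
TA, TB, TC = 0xF0, 0xCC, 0xAA


def lut(f):
    """Return the 8-bit lop3 immediate for a three-input bitwise function."""
    return f(TA, TB, TC) & 0xFF


def all_orders(f):
    """Immediates obtained by feeding the inputs to f in every order."""
    return sorted({f(*p) & 0xFF for p in permutations((TA, TB, TC))})


def bitselect(a, b, c):
    # OpenCL bitselect(a, b, c): bits of b where c is set, else bits of a.
    return (a & ~c) | (b & c)


def xor3(x, y, z):
    return x ^ y ^ z


def md5_i(x, y, z):
    # MD5 round function I; assumed (not confirmed by the post) to be the
    # primitive behind the 0x39 immediate.
    return y ^ (x | ~z)


if __name__ == "__main__":
    print("bitselect, natural order:", hex(lut(bitselect)))
    print("bitselect, all orders:   ", [hex(v) for v in all_orders(bitselect)])
    print("xor3, all orders:        ", [hex(v) for v in all_orders(xor3)])
    print("md5_i, natural order:    ", hex(lut(md5_i)))
    print("md5_i, all orders:       ", [hex(v) for v in all_orders(md5_i)])
```

Under these assumptions the bitselect reorderings come out as the six values listed in the post, xor3 gives 0x96 in every order, and 0x4b appears among the reorderings of md5_i (it equals x ^ (z | ~y)). Whether that is how the compiler actually arrived at 0x4b is not established by anything in the post.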