Message-ID: <b1b3b182303b7930a45c176c49eb5611@smtp.hushmail.com>
Date: Tue, 13 Oct 2015 20:37:26 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: LOP3.LUT
On 2015-10-13 10:43, magnum wrote:
> Most formats now have LOP3.LUT alternatives and seem to work fine now.
> Some don't get any boost (just meaning the toolchain did a good job
> already) but I think md5crypt is the only one getting a definite
> performance regression (and still has it disabled). We should get to the
> bottom of that. BTW it would be very nice having CUDA 7.5 on super.
Comparison of md5crypt kernel compiled with bitselect vs. with explicit
LOP3.LUT for the function primitives:
Bitselect:
ptxas info : 0 bytes gmem, 54 bytes cmem[3]
ptxas info : Compiling entry function 'cryptmd5' for 'sm_52'
ptxas info : Function properties for cryptmd5
ptxas . 592 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 38 registers, 344 bytes cmem[0], 268 bytes cmem[2]
Explicit LOP3.LUT:
ptxas info : 0 bytes gmem, 54 bytes cmem[3]
ptxas info : Compiling entry function 'cryptmd5' for 'sm_52'
ptxas info : Function properties for cryptmd5
ptxas . 592 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 37 registers, 344 bytes cmem[0], 260 bytes cmem[2]
                  explicit   bitselect
PTX #lines            4293        4375
ISA #lines            4214        4177
DEPBAR                  56          62
LOP32I                  31          33
LOP3                   372         372
.reuse                 235         349
LOP3 w/ .reuse          95         103
IADD32                 420         400
IADD3                  381         383
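For anyone wanting to reproduce this kind of tally, here is a rough Python sketch of counting lines per mnemonic in a disassembly dump. It is not the method used for the numbers above. The function name count_lines and the SAMPLE text are made up for illustration: the sample lines are hand-written to look loosely like SASS and are not real disassembler output. Word boundaries matter so that IADD3 does not also pick up IADD32I.

```python
import re

# Hand-written lines that only loosely resemble SASS (NOT real tool output).
SAMPLE = """\
LOP3.LUT R0, R1.reuse, R2, R3, 0x96, !PT ;
LOP3.LUT R4, R0, R5, R6, 0xca, !PT ;
IADD32I R7, R4, 0x1234 ;
IADD3 R8, R7.reuse, R0, R1.reuse ;
DEPBAR.LE SB5, 0x1 ;
"""

def count_lines(text, pattern, also=None):
    """Count lines matching `pattern` (and `also`, when given)."""
    first = re.compile(pattern)
    second = re.compile(also) if also else None
    total = 0
    for line in text.splitlines():
        if first.search(line) and (second is None or second.search(line)):
            total += 1
    return total

if __name__ == "__main__":
    print("LOP3          ", count_lines(SAMPLE, r"\bLOP3\b"))
    print(".reuse        ", count_lines(SAMPLE, r"\.reuse\b"))
    print("LOP3 w/ .reuse", count_lines(SAMPLE, r"\bLOP3\b", r"\.reuse\b"))
    print("IADD32        ", count_lines(SAMPLE, r"\bIADD32"))
    print("IADD3         ", count_lines(SAMPLE, r"\bIADD3\b"))
```

This counts lines, not occurrences, so a line with two ".reuse" operands is counted once.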
Fewer DEPBARs should be a good thing but I think the much lower ".reuse"
count is not, and this may be the main problem. But we can't specify
which registers to use! Perhaps the register scheduling when using
inline PTX lop3 will improve over time. After reading some forum posts
about register slots I actually tried using alternate lop3 immediates,
shuffling x, y and z around. I could only conclude it *does* sometimes
matter... but the chance of actually controlling the situation appears
pretty slim to me.
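To illustrate what "shuffling x, y and z around" means, here is a small Python model of the LOP3 lookup (a sketch, not kernel code). The helper names lop3, bitselect and equivalent_forms are mine. The model assumes the documented PTX convention that the immediate is indexed by the three operand bits, with the first operand as the most significant. Each of the six operand orders needs its own immediate to compute the same bitselect, which is why reordering changes the register slot each value lands in without changing the result.

```python
MASK32 = 0xFFFFFFFF

def lop3(a, b, c, imm):
    """Model of lop3.b32: per bit, look up imm at index (a_bit<<2 | b_bit<<1 | c_bit)."""
    result = 0
    for idx in range(8):
        if (imm >> idx) & 1:
            # Select the bit positions where (a, b, c) bits equal idx.
            ta = a if idx & 4 else ~a
            tb = b if idx & 2 else ~b
            tc = c if idx & 1 else ~c
            result |= ta & tb & tc
    return result & MASK32

def bitselect(x, y, z):
    """OpenCL-style bitselect: bits of y where z is set, else bits of x."""
    return ((x & ~z) | (y & z)) & MASK32

def equivalent_forms(x, y, z):
    """The same bitselect(x, y, z), with operands in all six slot orders."""
    return [
        lop3(x, y, z, 0xD8),
        lop3(y, x, z, 0xE4),
        lop3(x, z, y, 0xB8),
        lop3(y, z, x, 0xE2),
        lop3(z, x, y, 0xAC),
        lop3(z, y, x, 0xCA),
    ]

if __name__ == "__main__":
    x, y, z = 0x01234567, 0x89ABCDEF, 0xDEADBEEF
    print([hex(v) for v in equivalent_forms(x, y, z)], hex(bitselect(x, y, z)))
```

Which of these forms the scheduler handles best is exactly the part that cannot be controlled from inline PTX; the model only shows that they are logically interchangeable.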
LOP3 immediates used:
explicit: 0x39, 0x96, 0xca, 0xe4 (just the ones used in my functions).
bitselect: 0x4b, 0x96, 0xac, 0xb8, 0xca.
For reference, the natural truth table for just a bitselect is 0xd8 and
alternatives when shuffling x, y and z around are 0xac, 0xb8, 0xca, 0xe2
and 0xe4. And 0x96 is (x ^ y ^ z) in any order. That leaves 0x4b to
investigate. Doing so, I think I located that section in PTX vs. ISA but
I don't get what is happening. And I gave up at that point.
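The immediates above can be checked with a few lines of Python. This is a sketch under the documented PTX convention that the LUT byte is the function evaluated on the constants 0xF0, 0xCC and 0xAA. The names lut, permuted_luts and md5_i are mine, and md5_i is the MD5 "I" round function from RFC 1321, which I am only assuming is what 0x39 stands for in the explicit build. Under that convention, one operand order of I evaluates to 0x4b, namely a ^ (c | ~b). That is an observation about truth tables only. It does not explain why the compiler arrived at that form.

```python
from itertools import permutations

# Constants that define a LOP3 immediate: evaluate the function on these.
A, B, C = 0xF0, 0xCC, 0xAA

def lut(f):
    """LOP3 immediate for f(a, b, c) with operands in natural order."""
    return f(A, B, C) & 0xFF

def permuted_luts(f):
    """Sorted set of immediates over all six operand orders."""
    return sorted({f(*p) & 0xFF for p in permutations((A, B, C))})

def bitselect(x, y, z):
    """OpenCL-style bitselect: bits of y where z is set, else bits of x."""
    return (x & ~z) | (y & z)

def xor3(x, y, z):
    return x ^ y ^ z

def md5_i(x, y, z):
    """MD5 round function I from RFC 1321 (assumed source of 0x39)."""
    return y ^ (x | ~z)

if __name__ == "__main__":
    print("bitselect:", hex(lut(bitselect)), [hex(v) for v in permuted_luts(bitselect)])
    print("xor3:     ", [hex(v) for v in permuted_luts(xor3)])
    print("md5_i:    ", hex(lut(md5_i)), [hex(v) for v in permuted_luts(md5_i)])
```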
On another note I find it strange that the difference in 2-op adds
doesn't match the difference in 3-op adds at all.
magnum