Date: Wed, 7 Oct 2015 01:47:06 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: LOP3.LUT

On 2015-10-06 02:32, magnum wrote:
> I implemented a shared lop3_lut(a, b, c, imm) function in de6c7c6 but
> it's not enabled anywhere yet: I only tested md5crypt so far and it got
> about 5% performance loss. I also tried only using it for one function
> at a time but any of them results in performance loss - even F and G
> which are both pure bitselects otherwise. I was expecting no difference
> at all, at worst.

Here's a PTX diff with *only* F changed from bitselect() to inline asm
(I replaced all register numbers with <num> for a simpler diff):

@@ -190,142 +190,130 @@
  add.s32 %r<num>, %r<num>, -117830708;
  shf.l.wrap.b32 %r<num>, %r<num>, %r<num>, 12;
  add.s32 %r<num>, %r<num>, %r<num>;
- and.b32 %r<num>, %r<num>, %r<num>;
- not.b32 %r<num>, %r<num>;
- and.b32 %r<num>, %r<num>, -271733879;
- or.b32 %r<num>, %r<num>, %r<num>;
+ mov.u32 %r<num>, -271733879;
+ // inline asm
+ lop3.b32 %r<num>, %r<num>, %r<num>, %r<num>, 228;
+ // inline asm
  ld.local.u32 %r<num>, [%rd4+72];
  add.s32 %r<num>, %r<num>, %r<num>;

So if I read it right, we replace "and, not, and immediate, or" with
"mov immediate, lop3". I can't see why that would decrease speed by 1%.
Even if the version with no inline PTX does end up as LOP3 (it should) -
why does the explicit version get slower?

Since we don't have CUDA 7.5 installed on super, I can't look at the
resulting ISA - ptxas won't assemble this one, for some reason not even
the version without inline lop3.lut. It does assemble some other kernels,
and I have seen separate logic instructions in PTX end up as LOP3 in the
ISA. But for this comparison I'll need to continue my digging somewhere
else, later.

magnum
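
For reference, the immediate 228 (0xE4) in the lop3.b32 line of the diff can be checked on the host. The sketch below is a plain C software model of lop3, not the lop3_lut() function from de6c7c6; the names lop3_emulated, lut_for_select and select_plain are made up for illustration. It assumes the usual way of deriving the truth-table immediate, which is to apply the wanted function to the constants 0xF0, 0xCC and 0xAA, and it assumes the third operand acts as the selector, since the anonymized registers in the diff don't show the operand order.

```c
#include <stdint.h>

/* Software model of PTX lop3.b32 (illustrative, not the john-dev code):
 * for every bit position, the three input bits (a, b, c) form an index
 * 0..7 into the 8-bit truth table `lut`; the table bit is the result. */
static uint32_t lop3_emulated(uint32_t a, uint32_t b, uint32_t c, uint8_t lut)
{
    uint32_t r = 0;
    for (int bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2) |
                       (((b >> bit) & 1u) << 1) |
                        ((c >> bit) & 1u);
        r |= (uint32_t)((lut >> idx) & 1u) << bit;
    }
    return r;
}

/* Derive the immediate for "c ? a : b" by applying that function to the
 * three standard truth-table constants. */
static uint8_t lut_for_select(void)
{
    const uint8_t ta = 0xF0, tb = 0xCC, tc = 0xAA;
    return (uint8_t)((ta & tc) | (tb & (uint8_t)~tc));
}

/* The same select written as separate logic ops, i.e. the shape of the
 * "and, not, and, or" sequence on the removed side of the diff. */
static uint32_t select_plain(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & c) | (b & ~c);
}
```

Under these assumptions lut_for_select() yields 228, and lop3_emulated(a, b, c, 228) agrees with select_plain(a, b, c) for any inputs, so the two sides of the diff compute the same value; this says nothing about why the explicit version is slower.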