john-dev - Re: LOP3.LUT



Message-ID: <698e0e81b96d435566dd109f4cef695f@smtp.hushmail.com>
Date: Wed, 7 Oct 2015 01:47:06 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: LOP3.LUT

On 2015-10-06 02:32, magnum wrote:
> I implemented a shared lop3_lut(a, b, c, imm) function in de6c7c6 but
> it's not enabled anywhere yet: I only tested md5crypt so far and it got
> about 5% performance loss. I also tried only using it for one function
> at a time but any of them results in performance loss - even F and G
> which are both pure bitselects otherwise. I was expecting no difference
> at all, at worst.

Here's a PTX diff with *only* F changed from bitselect() to inline asm
(I replaced all register numbers with <num> to keep the diff simple):

@@ -190,142 +190,130 @@
         add.s32         %r<num>, %r<num>, -117830708;
         shf.l.wrap.b32  %r<num>, %r<num>, %r<num>, 12;
         add.s32         %r<num>, %r<num>, %r<num>;
-       and.b32         %r<num>, %r<num>, %r<num>;
-       not.b32         %r<num>, %r<num>;
-       and.b32         %r<num>, %r<num>, -271733879;
-       or.b32          %r<num>, %r<num>, %r<num>;
+       mov.u32         %r<num>, -271733879;
+       // inline asm
+       lop3.b32 %r<num>, %r<num>, %r<num>, %r<num>, 228;
+       // inline asm
         ld.local.u32    %r<num>, [%rd4+72];
         add.s32         %r<num>, %r<num>, %r<num>;

So if I read it right, we replace "and, not, and immediate, or" with "mov
immediate, lop3". I can't see why that would decrease speed by 1%.
Even if the version with no inline PTX does end up as LOP3 (it should),
why does the explicit version get slower?
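
For reference, the immediate 228 (0xE4) in the lop3.b32 line can be
checked on the host. The following is a minimal C sketch, not code from
the john tree: lop3_emulate() and lut_select_a_b_by_c() are hypothetical
helper names, and since the register numbers are stripped from the diff,
the operand order (third operand acting as the selector) is an assumption.
It relies on the PTX convention that the table is obtained by applying the
desired function to the constants 0xF0, 0xCC and 0xAA.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a three-input LUT operation (hypothetical helper).
 * For every bit position, the bits of a, b and c form a 3-bit index,
 * with a as the most significant bit, that selects one bit of imm. */
static uint32_t lop3_emulate(uint32_t a, uint32_t b, uint32_t c, uint8_t imm)
{
    uint32_t r = 0;
    for (int bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2) |
                       (((b >> bit) & 1u) << 1) |
                        ((c >> bit) & 1u);
        r |= (uint32_t)((imm >> idx) & 1u) << bit;
    }
    return r;
}

/* Derive the table for the per-bit select "c ? a : b", that is
 * (a & c) | (b & ~c), by applying it to the constants 0xF0, 0xCC, 0xAA. */
static uint8_t lut_select_a_b_by_c(void)
{
    const uint8_t ta = 0xF0, tb = 0xCC, tc = 0xAA;
    return (uint8_t)((ta & tc) | (tb & (uint8_t)~tc));
}

/* Self-check: the derived table is 228, and the model agrees with the
 * plain and/not/and/or formulation on an arbitrary input. */
static void lop3_selfcheck(void)
{
    const uint32_t a = 0xEFCDAB89u, b = 0x98BADCFEu, c = 0x10325476u;
    assert(lut_select_a_b_by_c() == 228);
    assert(lop3_emulate(a, b, c, 228) == ((a & c) | (b & ~c)));
}
```

Under that assumption, 228 encodes a pure bit select, so the single lop3
computes the same value as the four logic instructions it replaces in
the diff.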

Since we don't have CUDA 7.5 installed on super, I can't look at the
resulting ISA: ptxas won't assemble this kernel, and for some reason not
even the version without inline lop3.lut. It does assemble some other
kernels, and I have seen separate logic instructions in PTX end up as
LOP3 in the ISA. But for this comparison I'll need to continue
digging somewhere else, later.

magnum
