Date: Tue, 6 Oct 2015 02:32:17 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: LOP3.LUT (was: Re: SHA-1 H())

On 2015-09-14 23:06, Solar Designer wrote:
> On Mon, Sep 14, 2015 at 10:39:40PM +0200, magnum wrote:
>> BTW do you think we could use inline PTX to define a LOP3.LUT
>> instruction on nvidia, like you did with the funnel shifts?
>
> Yes, I thought of this too. We could want to check the generated code
> first (it might already be using LOP3.LUT everywhere it should), or we
> could just do the inline asm right away to ensure we'll always have
> LOP3.LUT there no matter how the compiler might be changed.

I implemented a shared lop3_lut(a, b, c, imm) function in de6c7c6 but
it's not enabled anywhere yet: I only tested md5crypt so far and it got
about 5% performance loss. I also tried only using it for one function
at a time but any of them results in performance loss - even F and G
which are both pure bitselects otherwise. I was expecting no difference
at all, at worst.

>> Or would it
>> possibly be worse than having the optimizer miss one or two, due to the
>> caveats of inline asm?
>
> I saw no drawbacks from using inline PTX asm, since instruction
> scheduling is performed in the PTX to ISA translation anyway.
>
> This is very different from inline asm in C code compiled for a CPU,
> where using inline asm for tiny pieces of code (such as for individual
> instructions) breaks the C compiler's instruction scheduling.

Something did not end up well. I'll compare resulting PTX and ISA and
try to figure out what happens.

magnum
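
For reference, the lop3_lut(a, b, c, imm) idea discussed above can be sketched portably in C. This is only an illustration and not the code from de6c7c6: the macro names MD5_*_LUT and the function lop3_lut_ref are hypothetical. It shows how the 8-bit immediate for a three-input boolean function can be derived by evaluating that function on the constants 0xF0, 0xCC and 0xAA (the convention described in the PTX ISA documentation for lop3), and what the instruction computes bit by bit. On the GPU the loop would presumably be replaced by a single `lop3.b32 d, a, b, c, immLut;` issued through inline PTX.

```c
#include <stdint.h>

/* Truth-table constants: evaluating a three-input boolean expression
 * on these three bytes yields the LUT immediate for that expression. */
#define LUT_A 0xF0u
#define LUT_B 0xCCu
#define LUT_C 0xAAu

/* MD5 round functions expressed as immediates (hypothetical names).
 * F and G are the two bitselects mentioned in the message. */
#define MD5_F_LUT ((uint8_t)((LUT_A & LUT_B) | (~LUT_A & LUT_C))) /* 0xCA */
#define MD5_G_LUT ((uint8_t)((LUT_A & LUT_C) | (LUT_B & ~LUT_C))) /* 0xE4 */
#define MD5_H_LUT ((uint8_t)(LUT_A ^ LUT_B ^ LUT_C))              /* 0x96 */
#define MD5_I_LUT ((uint8_t)(LUT_B ^ (LUT_A | ~LUT_C)))           /* 0x39 */

/* CPU reference for a 32-bit three-input LUT operation: for every bit
 * position, the bits of a, b and c form a 3-bit index (a is the most
 * significant), and the result bit is that bit of imm. */
static uint32_t lop3_lut_ref(uint32_t a, uint32_t b, uint32_t c, uint8_t imm)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t idx = (((a >> i) & 1u) << 2) |
                       (((b >> i) & 1u) << 1) |
                        ((c >> i) & 1u);
        r |= (uint32_t)((imm >> idx) & 1u) << i;
    }
    return r;
}
```

With this reference, lop3_lut_ref(x, y, z, MD5_F_LUT) gives the same value as (x & y) | (~x & z), which makes it usable as a check on immediates before comparing the generated PTX and ISA.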