Date: Tue, 6 Oct 2015 02:32:17 +0200
From: magnum <>
Subject: LOP3.LUT (was: Re: SHA-1 H())

On 2015-09-14 23:06, Solar Designer wrote:
> On Mon, Sep 14, 2015 at 10:39:40PM +0200, magnum wrote:
>> BTW do you think we could use inline PTX to define a LOP3.LUT
>> instruction on nvidia, like you did with the funnel shifts?
> Yes, I thought of this too.  We could want to check the generated code
> first (it might already be using LOP3.LUT everywhere it should), or we
> could just do the inline asm right away to ensure we'll always have
> LOP3.LUT there no matter how the compiler might be changed.

I implemented a shared lop3_lut(a, b, c, imm) function in de6c7c6, but
it's not enabled anywhere yet: I have only tested md5crypt so far, and
it ran about 5% slower. I also tried using it for just one function at
a time, but every one of them loses performance, even F and G, which
are otherwise both pure bitselects. At worst, I was expecting no
difference at all.
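
For reference, the immediate for LOP3.LUT is normally derived by
evaluating the wanted expression on the constants 0xF0, 0xCC and 0xAA.
Below is a minimal host-side C sketch of that, with a bit-by-bit
reference emulation that is handy for checking immediates. It is not the
code from de6c7c6: the names lop3_ref, lop3_selfcheck and LUT_MD5_* are
made up for illustration, and the round functions are assumed to be the
textbook MD5 ones.

```c
#include <assert.h>
#include <stdint.h>

/* Truth-table inputs: evaluating f(TA, TB, TC) yields the 8-bit immediate. */
#define LOP3_TA 0xF0u
#define LOP3_TB 0xCCu
#define LOP3_TC 0xAAu

/* Hypothetical names; textbook MD5 round functions expressed as immediates. */
#define LUT_MD5_F ((((LOP3_TA) & (LOP3_TB)) | (~(LOP3_TA) & (LOP3_TC))) & 0xFFu)
#define LUT_MD5_G ((((LOP3_TA) & (LOP3_TC)) | ((LOP3_TB) & ~(LOP3_TC))) & 0xFFu)
#define LUT_MD5_H (((LOP3_TA) ^ (LOP3_TB) ^ (LOP3_TC)) & 0xFFu)
#define LUT_MD5_I (((LOP3_TB) ^ ((LOP3_TA) | ~(LOP3_TC))) & 0xFFu)

/*
 * Software model of a three-input LUT operation: for every bit position,
 * the bits of a, b and c form a 3-bit index (a is the most significant)
 * that selects one bit of the immediate.
 */
static uint32_t lop3_ref(uint32_t a, uint32_t b, uint32_t c, uint32_t imm)
{
    uint32_t r = 0;
    int i;

    for (i = 0; i < 32; i++) {
        uint32_t idx = (((a >> i) & 1u) << 2) |
                       (((b >> i) & 1u) << 1) |
                        ((c >> i) & 1u);
        r |= ((imm >> idx) & 1u) << i;
    }
    return r;
}

/* Sanity check: the model must agree with the plain C expressions. */
static void lop3_selfcheck(void)
{
    const uint32_t x = 0x01234567u, y = 0x89abcdefu, z = 0xfedcba98u;

    assert(lop3_ref(x, y, z, LUT_MD5_F) == ((x & y) | (~x & z)));
    assert(lop3_ref(x, y, z, LUT_MD5_G) == ((x & z) | (y & ~z)));
    assert(lop3_ref(x, y, z, LUT_MD5_H) == (x ^ y ^ z));
    assert(lop3_ref(x, y, z, LUT_MD5_I) == (y ^ (x | ~z)));
}
```

With these definitions F and G come out as 0xCA and 0xE4, so a kernel
would pass those as imm. The emulation itself is only meant for host-side
verification, not for use in a kernel.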

>> Or would it
>> possibly be worse than having the optimizer miss one or two, due to the
>> caveats of inline asm?
> I saw no drawbacks from using inline PTX asm, since instruction
> scheduling is performed in the PTX to ISA translation anyway.
> This is very different from inline asm in C code compiled for a CPU,
> where using inline asm for tiny pieces of code (such as for individual
> instructions) breaks the C compiler's instruction scheduling.

Something clearly went wrong here. I'll compare the resulting PTX and
ISA and try to figure out what is happening.

