john-dev - Re: Interleaving of intrinsics



Message-ID: <20150602110124.GA20487@openwall.com>
Date: Tue, 2 Jun 2015 14:01:25 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

magnum -

On Mon, Jun 01, 2015 at 01:37:14PM +0200, magnum wrote:
> Or perhaps as soon as we use interleaving, things like tmp[SIMD_PARA] 
> end up being stack arrays? That should hurt a lot.

This is quite possible.  In general, one of the things limiting the
interleaving factor is register pressure - and the compiler might in
fact do a worse job at register allocation when we use arrays.

> Actually, here's a bug we have: Using the wide loops as in SHA2, we 
> don't need to use "tmp[i]" at all - we do fine with just "tmp".

Huh?  Doesn't this defeat interleaving, replacing it with sequential
processing, because our source code sort of hints to the compiler to
reuse the same register across instances?  Or are we hoping that the
compiler or the CPU will recognize that we're reusing the variable, and
actually allocate a new register or a new rename register, respectively?
The compiler might, and a CPU capable of register renaming at all
probably will, but didn't we intend to reduce rather than increase our
reliance on luck?
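
To make the two forms concrete, here is a minimal sketch in plain C.  It
assumes scalar uint32_t values standing in for SIMD vectors, and PARA, rol()
and both step functions are hypothetical names, not the code from the commit.
Both forms compute the same values; what differs is only the hint the source
gives about whether the temporaries are distinct.

```c
#include <assert.h>
#include <stdint.h>

#define PARA 3 /* hypothetical interleave factor, standing in for SIMD_PARA */

/* Hypothetical round primitive; a scalar stands in for a SIMD vector. */
static uint32_t rol(uint32_t x, int n)
{
	return (x << n) | (x >> (32 - n));
}

/* Narrow loops: the temporary must survive from one loop to the next,
 * so each instance needs its own tmp[i]. */
void step_array_tmp(uint32_t a[PARA], const uint32_t b[PARA])
{
	uint32_t tmp[PARA];
	int i;

	for (i = 0; i < PARA; i++)
		tmp[i] = a[i] ^ b[i];
	for (i = 0; i < PARA; i++)
		a[i] = rol(tmp[i], 5) + b[i];
}

/* Wide loop: a single tmp is written and read by every instance in turn.
 * Whether the instances still overlap is up to the compiler and the CPU. */
void step_shared_tmp(uint32_t a[PARA], const uint32_t b[PARA])
{
	uint32_t tmp;
	int i;

	for (i = 0; i < PARA; i++) {
		tmp = a[i] ^ b[i];
		a[i] = rol(tmp, 5) + b[i];
	}
}

/* The results are identical; only register allocation may differ. */
void check_same_result(void)
{
	uint32_t a1[PARA] = {1, 2, 3}, a2[PARA] = {1, 2, 3};
	const uint32_t b[PARA] = {0x11111111u, 0x22222222u, 0x33333333u};
	int i;

	step_array_tmp(a1, b);
	step_shared_tmp(a2, b);
	for (i = 0; i < PARA; i++)
		assert(a1[i] == a2[i]);
}
```

Comparing the generated assembly for the two functions (e.g. with gcc -S)
is probably the only way to tell what a given compiler actually does here.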

I just took a look at commit cde0fb470f35ef6dc5949d3b11137dd27ca2672b,
and it does look as problematic as I had thought from reading your
message. :-(

> I tried this but there was very little difference (but to the better).

This suggests one of two things, or maybe an in-between:

1. Interleaving in that code never really worked, so breaking it further
does not hurt further.

-OR-

2. The compiler and/or the CPU are so good that interleaving still works
despite our trying so hard to kill it.

We could also try making the temporary scope of those variables
explicit, by defining them inside e.g. a SHA512_PARA_DO(i) { ... }
block, etc.  Then the compiler, after having unrolled this loop, might
have a better idea that it can substitute different registers for the
different loop iterations.  Or it might not.
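
A minimal sketch of that temporary-scope form, again in plain C with scalars
standing in for vectors.  PARA_DO here is a hypothetical macro merely modeled
on SHA512_PARA_DO, and the step itself is made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define PARA 2 /* hypothetical interleave factor */

/* Hypothetical loop macro, modeled on SHA512_PARA_DO(i). */
#define PARA_DO(i) for ((i) = 0; (i) < PARA; (i)++)

void step_scoped_tmp(uint64_t a[PARA], const uint64_t b[PARA])
{
	int i;

	PARA_DO(i) {
		/* tmp is declared inside the block, so its lifetime visibly
		 * ends with each iteration and nothing carries over. */
		uint64_t tmp = a[i] + b[i];
		a[i] = tmp ^ (tmp >> 7);
	}
}

void check_scoped_tmp(void)
{
	uint64_t a[PARA] = {0x100, 0x1000};
	const uint64_t b[PARA] = {0, 0};

	step_scoped_tmp(a, b);
	assert(a[0] == 0x102);  /* 0x100 ^ (0x100 >> 7) */
	assert(a[1] == 0x1020); /* 0x1000 ^ (0x1000 >> 7) */
}
```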

IIRC, I did try experimenting with the temporary scope approach when I
first introduced BF_X2, and it didn't work as well as keeping separate
variables at the outer scope.

> I tried changing MD4/5 and SHA1 to use fewer, wider loops similar to 
> SHA2 and consequently use single temps instead of arrays. There was 
> about 4% boost for MD4/MD5 but SHA1 got slightly worse. Why?

Luck.

Would it be reasonable for us to try my usual approach, with separate
variables at the outer scope (inside the hashing function, but not
inside the individual steps)?  And if those are in fact separate
variables rather than array elements, this implies manual or cpp-level
loop unrolling.
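
Roughly, this could look like the sketch below.  It is an illustration
only: the step, the constants, and the names STEP, STEP_X2, hash_x1 and
hash_x2 are all hypothetical, and scalars again stand in for vectors.  The
point is that each instance gets its own named variables at function scope,
and cpp token pasting does the "unrolling".

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rol(uint32_t x, int n)
{
	return (x << n) | (x >> (32 - n));
}

/* One step for one instance; cpp pastes the instance suffix in. */
#define STEP(n, k) \
	tmp##n = a##n ^ (k); \
	a##n = rol(tmp##n, 7) + b##n;

/* cpp-level unrolling over two instances, no loop and no arrays. */
#define STEP_X2(k) \
	STEP(0, k) \
	STEP(1, k)

/* Two interleaved instances with separate variables at the outer scope. */
void hash_x2(uint32_t out[2], const uint32_t in[2])
{
	uint32_t a0 = in[0], b0 = ~in[0], tmp0;
	uint32_t a1 = in[1], b1 = ~in[1], tmp1;

	STEP_X2(0x5a827999u)
	STEP_X2(0x6ed9eba1u)

	out[0] = a0;
	out[1] = a1;
}

/* Single-instance reference for comparison. */
static uint32_t hash_x1(uint32_t in)
{
	uint32_t a = in, b = ~in, tmp;

	tmp = a ^ 0x5a827999u; a = rol(tmp, 7) + b;
	tmp = a ^ 0x6ed9eba1u; a = rol(tmp, 7) + b;
	return a;
}

void check_hash_x2(void)
{
	const uint32_t in[2] = {1, 2};
	uint32_t out[2];

	hash_x2(out, in);
	assert(out[0] == hash_x1(in[0]));
	assert(out[1] == hash_x1(in[1]));
}
```

Going to a higher interleave factor means adding more named variables and a
wider STEP_Xn macro, which is the manual part of the cost.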

Alexander
