Date: Thu, 11 Jun 2015 19:52:31 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

On 2015-06-11 15:58, Lei Zhang wrote:
>
>> On Jun 11, 2015, at 3:30 PM, magnum <john.magnum@...hmail.com> wrote:
>>
>> Now we're getting somewhere. What if you build the "unrolled" topic branch instead, using para 2 (I don't think I've added code for higher para yet)? That branch is manually unrolled. How many vmovdqu do you see in it? Do you see other differences compared to the bleeding code (at the same para)?
>
> The manually unrolled version generates significantly longer asm code, with ~8000 instructions in SSESHA256body. This number is ~5000 in the auto-unrolled version.
>
> The number of vmovdqu instructions is also much higher: 1170, versus only 260 in the auto-unrolled version. Register pressure seems to be much heavier when the loops are fully unrolled.
>
> Then performance (pbkdf2-hmac-sha256, x2):
>
> [auto-unrolled]
> Raw:	235 c/s real, 235 c/s virtual
> [fully-unrolled]
> Raw:	133 c/s real, 133 c/s virtual
>
> Specs: laptop, icc, OpenMP disabled, Turbo Boost disabled
>
> The fully unrolled version is much slower (133 vs. 235 c/s, about 43% lower). I think register pressure is playing a big role here.

The "unrolled" branch is manually interleaved x2, with no loop constructs
and with non-array temp variables. We'll see what Solar says, but I presume
this means we can just forget about interleaving SHA-2 on x86. From what
I've gathered, I believe the same goes for SHA-1.
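
To make the two styles concrete, here is a minimal sketch of a x2
interleave written both ways. It is not the John the Ripper code: the
function names, the PARA macro value and the choice of the SHA-256 sigma0
step as the workload are assumptions made for illustration only. The first
variant uses array temporaries and short loops that the compiler is free to
unroll; the second has no loop constructs and uses non-array temporaries
with the two streams alternated by hand.

```c
#include <emmintrin.h> /* SSE2 intrinsics (x86 only) */
#include <stdint.h>

/* Per-lane 32-bit rotate right; SSE2 has no rotate, so use two shifts and an OR. */
#define ROR32(x, n) \
    _mm_or_si128(_mm_srli_epi32((x), (n)), _mm_slli_epi32((x), 32 - (n)))

#define PARA 2 /* hypothetical interleave factor: independent vectors in flight */

/* Scalar reference: SHA-256 sigma0(x) = ror(x,7) ^ ror(x,18) ^ (x >> 3). */
static uint32_t sigma0_scalar(uint32_t x)
{
    return ((x >> 7) | (x << 25)) ^ ((x >> 18) | (x << 14)) ^ (x >> 3);
}

/* Loop style: array temporaries, short loops the compiler may unroll. */
static void sigma0_x2_loop(uint32_t *out, const uint32_t *in)
{
    __m128i x[PARA], t[PARA];
    int i;

    for (i = 0; i < PARA; i++)
        x[i] = _mm_loadu_si128((const __m128i *)(in + 4 * i));
    for (i = 0; i < PARA; i++)
        t[i] = _mm_xor_si128(ROR32(x[i], 7), ROR32(x[i], 18));
    for (i = 0; i < PARA; i++)
        t[i] = _mm_xor_si128(t[i], _mm_srli_epi32(x[i], 3));
    for (i = 0; i < PARA; i++)
        _mm_storeu_si128((__m128i *)(out + 4 * i), t[i]);
}

/* Manual style: no loops, non-array temporaries, streams alternated by hand. */
static void sigma0_x2_manual(uint32_t *out, const uint32_t *in)
{
    __m128i x0 = _mm_loadu_si128((const __m128i *)in);
    __m128i x1 = _mm_loadu_si128((const __m128i *)(in + 4));
    __m128i t0 = _mm_xor_si128(ROR32(x0, 7), ROR32(x0, 18));
    __m128i t1 = _mm_xor_si128(ROR32(x1, 7), ROR32(x1, 18));

    t0 = _mm_xor_si128(t0, _mm_srli_epi32(x0, 3));
    t1 = _mm_xor_si128(t1, _mm_srli_epi32(x1, 3));
    _mm_storeu_si128((__m128i *)out, t0);
    _mm_storeu_si128((__m128i *)(out + 4), t1);
}
```

Both functions take 8 input words and produce 8 output words, and they
should give identical results. A step this small will not show the
register pressure discussed above; the point is only the shape of the
source. The comparison in this thread was made on the full SHA-256 body by
inspecting the generated asm.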

magnum

