Date: Tue, 9 Jun 2015 20:46:32 +0800
From: Lei Zhang <>
Subject: Re: Interleaving of intrinsics

> On Jun 4, 2015, at 10:22 PM, magnum <> wrote:
> On 2015-06-02 13:01, Solar Designer wrote:
>> Would it be reasonable for us to try my usual approach, with separate
>> variables at the outer scope (inside the hashing function, but not
>> inside the individual steps)?  And if those are in fact separate
>> variables rather than array elements, this implies manual or cpp level
>> loop unrolling.
> I tried this out with MD5 and SHA256 in a topic branch. It doesn't seem to make any difference compared to loops and arrays.
> Also, other tests (before that) indicate per-line loops vs. block loops for interleaving does not make any difference either, at least not for gcc. Perhaps it does for icc (as tested on super), but all results are so fluctuating and inconclusive I just get more confused the more I test. Perhaps turbo boost and stuff are playing up.
> Perhaps Lei can make some conclusions from generated asm code. I think that's the only way of telling what actually happens.
> Maybe we under-estimate the compilers. I'm starting to think MD4 and MD5 interleaves fine poorly coded or not, while SHA1/SHA2 formats simply does not interleave well regardless of coding. If that's the case it would be a relief in a way: We could just keep the readable and straight-forward code...
> magnum

I checked the 'size' output for sse-intrinsics.o built with clang and with icc at interleaving factors 1 to 4 (x1.o to x4.o below).

lei-mac:src lei$ size clang/*
__TEXT	__DATA	__OBJC	others	dec	hex
122863	0	0	26572	149435	247bb	clang/x1.o
127951	0	0	28699	156650	263ea	clang/x2.o
128479	0	0	28614	157093	265a5	clang/x3.o
127679	0	0	28527	156206	2622e	clang/x4.o

lei-mac:src lei$ size icc/*
__TEXT	__DATA	__OBJC	others	dec	hex
102084	7545	0	50442	160071	27147	icc/x1.o
113012	9799	0	49375	172186	2a09a	icc/x2.o
113348	9799	0	51275	174422	2a956	icc/x3.o
114740	9799	0	53235	177774	2b66e	icc/x4.o

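With clang, __TEXT shrinks from 128479 at x3 to 127679 at x4, so it seems clang refuses to unroll some loops once the interleaving factor is increased to 4. With icc, __TEXT keeps growing through x4, so icc appears to unroll just fine.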

icc has a fairly unusual feature: it can produce an optimization report while compiling. I went through the reports icc gave for each interleaving factor and counted the number of loops that were fully unrolled.

interleaving    loops unrolled
x1              215
x2              225
x3              225
x4              225

The number of fully unrolled loops stays at 225 for interleaving factors 2 to 4. It is a bit lower (215) at x1, which I guess is because icc thinks some loops that iterate only once are not worth unrolling.

I haven't experimented with gcc yet, but I think it's quite possible that the *_PARA_DO() approach doesn't always end up as fully unrolled code, in which case explicit unrolling may be needed for interleaving. I'm not yet clear on the precise performance implications.
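
To make the distinction concrete, here is a minimal sketch of the two coding styles being compared. It is not the actual sse-intrinsics.c code: PARA, PARA_DO(), step_loop() and step_unrolled() are made-up names, plain uint32_t stands in for a SIMD vector type so that it compiles anywhere, and the step is only MD5-like (F function, add, rotate by 7, add).

```c
#include <stdint.h>

#define PARA 3  /* hypothetical interleaving factor, fixed at compile time */

/* Loop helper in the spirit of *_PARA_DO(); this exact name is made up. */
#define PARA_DO(i) for ((i) = 0; (i) < PARA; (i)++)

/* MD5's F function and a 32-bit rotate; uint32_t stands in for a vector. */
#define F(b, c, d) ((d) ^ ((b) & ((c) ^ (d))))
#define ROTL(v, s) (((v) << (s)) | ((v) >> (32 - (s))))

/* Variant 1: per-line loops over arrays. The PARA independent streams are
 * only interleaved in the generated code if the compiler fully unrolls
 * each of these short loops. */
static void step_loop(uint32_t a[PARA], const uint32_t b[PARA],
                      const uint32_t c[PARA], const uint32_t d[PARA],
                      const uint32_t x[PARA])
{
    unsigned int i;

    PARA_DO(i) a[i] += F(b[i], c[i], d[i]) + x[i];
    PARA_DO(i) a[i] = ROTL(a[i], 7);
    PARA_DO(i) a[i] += b[i];
}

/* Variant 2: explicit unrolling with separate variables at function scope.
 * The interleaving is spelled out in the source, so it does not depend on
 * the compiler's unrolling decisions. */
static void step_unrolled(uint32_t a[PARA], const uint32_t b[PARA],
                          const uint32_t c[PARA], const uint32_t d[PARA],
                          const uint32_t x[PARA])
{
    uint32_t a0 = a[0], a1 = a[1], a2 = a[2];

    a0 += F(b[0], c[0], d[0]) + x[0];
    a1 += F(b[1], c[1], d[1]) + x[1];
    a2 += F(b[2], c[2], d[2]) + x[2];
    a0 = ROTL(a0, 7);
    a1 = ROTL(a1, 7);
    a2 = ROTL(a2, 7);
    a0 += b[0];
    a1 += b[1];
    a2 += b[2];

    a[0] = a0;
    a[1] = a1;
    a[2] = a2;
}
```

Both functions compute the same result. Comparing the asm that a given compiler emits for them (for example with -S) would show whether the loop variant was in fact fully unrolled.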

