Date: Wed, 19 May 2010 08:07:28 +0200
From: Simon Marechal
Subject: Re: C compiler generated SSE2 code

On 19/05/2010 00:38, Solar Designer wrote:
>> This does speak for itself :) The icc does disentangle the whole stuff, 
>> but is still faster with 3 loops (only 2 in the sample).
> I think you need to disentangle the source code rather than leave that
> for the compiler.  Specifically, I'd remove the "unneeded" MD5_PARA_DO
> loops.  Instead, I'd define macros around primitives such as xor, which
> would perform the required number of instances of the operation.  They
> would use constants for the array indices - or, if that does not work
> well enough, even use individual local variables instead of array
> elements.  This is more similar to what I have in MD5_std.c, where I use
> separate local variables for the two instances of MD5:
> 	MD5_word a0, b0 = Cb, c0 = Cc, d0;
> 	MD5_word a1, b1, c1, d1;
> 	MD5_word u, v;
> I understand that you like to be able to easily adjust the number of
> instances that you mix, but you'll have to achieve that by defining your
> xor, etc. macros differently for common instance counts (say, 2 vs. 3).
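
A minimal sketch of the macro approach described in the quoted text, for
illustration only. It assumes plain uint32_t words instead of SSE2 vectors
(in the real code each operation would presumably be an intrinsic such as
_mm_xor_si128), and the names N_INST, VXOR, VAND and md5_f_all are
hypothetical, not taken from the actual source.

```c
#include <stdint.h>

/* Hypothetical instance count; not a name from the real code. */
#ifndef N_INST
#define N_INST 2
#endif

/*
 * One definition of each primitive per supported instance count.
 * The array indices are constants, so there is no loop left for the
 * compiler to unroll or "disentangle".
 */
#if N_INST == 2
#define VXOR(d, a, b) do { \
	(d)[0] = (a)[0] ^ (b)[0]; \
	(d)[1] = (a)[1] ^ (b)[1]; \
} while (0)
#define VAND(d, a, b) do { \
	(d)[0] = (a)[0] & (b)[0]; \
	(d)[1] = (a)[1] & (b)[1]; \
} while (0)
#elif N_INST == 3
#define VXOR(d, a, b) do { \
	(d)[0] = (a)[0] ^ (b)[0]; \
	(d)[1] = (a)[1] ^ (b)[1]; \
	(d)[2] = (a)[2] ^ (b)[2]; \
} while (0)
#define VAND(d, a, b) do { \
	(d)[0] = (a)[0] & (b)[0]; \
	(d)[1] = (a)[1] & (b)[1]; \
	(d)[2] = (a)[2] & (b)[2]; \
} while (0)
#else
#error "add VXOR/VAND definitions for this N_INST"
#endif

/*
 * MD5 round 1 function F(x, y, z) = z ^ (x & (y ^ z)),
 * applied to all N_INST instances at once.
 */
static void md5_f_all(uint32_t out[N_INST], const uint32_t x[N_INST],
    const uint32_t y[N_INST], const uint32_t z[N_INST])
{
	uint32_t t[N_INST];

	VXOR(t, y, z);   /* t = y ^ z */
	VAND(t, x, t);   /* t = x & (y ^ z) */
	VXOR(out, z, t); /* out = z ^ (x & (y ^ z)) */
}
```

Building with -DN_INST=3 selects the three-instance definitions, so the
instance count stays adjustable at the cost of one macro set per count.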

IIRC, with 3 instances icc does not disentangle the code, and it
generates something far more efficient than with 2. I noticed this when
looking at the compiled code of BarsWF and wondering how the author got
such good register scheduling.

Without the "PARA_DO" stuff, the code might become friendlier to gcc. I
still believe that shipping an efficient .S file, generated with icc for
example, would be better. But I'm not sure about licensing issues with
the free icc version ...
