Date: Tue, 21 May 2013 18:32:48 -0400
From: Alain Espinosa <>
Subject: Re: 5x intrinsics?

On 5/21/13, magnum <> wrote:
> I see Alain's NT format is "5x" for 32-bit SSE2 builds, ie. it does 4x in
> SSE2 plus 1x in non-SSE. I presume these are interleaved for hiding latency
> so doing that extra 1x more or less for free. Would this be theoretically
> and practically worthwhile for the intrinsics? Maybe it'd just get very
> messy. I can't remember any discussion on this matter...

In my testing on a Pentium 4 this gives a very small speedup. With
faster SSE engines (beginning with Core 2 Duo) the 32-bit 5x
implementation will 'probably' be slower than an SSE2-only
implementation. In 64-bit builds we interleave 2 SSE2 streams (2*4x),
which results in a good speedup. I tried a 3*4x SSE2 implementation
but there wasn't any performance gain (I tried this on Core 2 Duos).
Again, with more vector ports in recent CPUs we may test this again.
An improvement over the 64-bit SSE2 implementation is the use of
non-destructive source operands with AVX. Also to consider with
upcoming Intel CPUs is an AVX2 implementation with 4*8x (using
non-destructive source operands and some temporary memory use for
rotating). It will probably provide a speedup, given that those CPUs
have more ports and a better memory engine.
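
To make the 2*4x idea concrete, here is a minimal sketch with SSE2
intrinsics. It is not the actual NT format code: the names ROL32X4, F4,
md4_step3_scalar, md4_step3_2x4 and md4_step3_selfcheck are made up for
this example, and it covers only a single MD4 round-1 step with a
rotate of 3. The point is that the two 4x streams have no data
dependencies on each other, so alternating their instructions gives the
CPU independent work to overlap. The scalar function is the same step
for one candidate, which is roughly what the extra 1x stream in a 5x
build would compute.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* SSE2 has no 32-bit rotate, so build it from two shifts and an OR.
 * n must be a compile-time constant in the range 1..31. */
#define ROL32X4(x, n) \
    _mm_or_si128(_mm_slli_epi32((x), (n)), _mm_srli_epi32((x), 32 - (n)))

/* MD4 round-1 function F(b,c,d) = (b & c) | (~b & d),
 * written in the equivalent form d ^ (b & (c ^ d)). */
#define F4(b, c, d) \
    _mm_xor_si128((d), _mm_and_si128((b), _mm_xor_si128((c), (d))))

/* Scalar version of the step a = rol(a + F(b,c,d) + x, 3), one candidate. */
static uint32_t md4_step3_scalar(uint32_t a, uint32_t b, uint32_t c,
                                 uint32_t d, uint32_t x)
{
    uint32_t t = a + (d ^ (b & (c ^ d))) + x;
    return (t << 3) | (t >> 29);
}

/* The same step on two independent SSE2 vectors (2*4x = 8 candidates).
 * Statements for stream 0 and stream 1 alternate in the source; the
 * compiler and CPU are still free to reorder them. */
static void md4_step3_2x4(__m128i *a0, __m128i b0, __m128i c0, __m128i d0,
                          __m128i x0,
                          __m128i *a1, __m128i b1, __m128i c1, __m128i d1,
                          __m128i x1)
{
    __m128i t0 = F4(b0, c0, d0);
    __m128i t1 = F4(b1, c1, d1);
    t0 = _mm_add_epi32(t0, x0);
    t1 = _mm_add_epi32(t1, x1);
    t0 = _mm_add_epi32(t0, *a0);
    t1 = _mm_add_epi32(t1, *a1);
    *a0 = ROL32X4(t0, 3);
    *a1 = ROL32X4(t1, 3);
}

/* Usage example: run 8 lanes through the interleaved step and compare
 * every lane against the scalar version. Input values are arbitrary. */
static void md4_step3_selfcheck(void)
{
    uint32_t a[8], b[8], c[8], d[8], x[8], out[8];
    int i;

    for (i = 0; i < 8; i++) {
        a[i] = 0x67452301u + (uint32_t)i * 0x01010101u;
        b[i] = 0xefcdab89u ^ ((uint32_t)i * 0x9e3779b9u);
        c[i] = 0x98badcfeu + (uint32_t)i;
        d[i] = 0x10325476u - (uint32_t)i;
        x[i] = (uint32_t)i * 0xdeadbeefu;
    }

    __m128i a0 = _mm_loadu_si128((const __m128i *)&a[0]);
    __m128i a1 = _mm_loadu_si128((const __m128i *)&a[4]);

    md4_step3_2x4(&a0,
                  _mm_loadu_si128((const __m128i *)&b[0]),
                  _mm_loadu_si128((const __m128i *)&c[0]),
                  _mm_loadu_si128((const __m128i *)&d[0]),
                  _mm_loadu_si128((const __m128i *)&x[0]),
                  &a1,
                  _mm_loadu_si128((const __m128i *)&b[4]),
                  _mm_loadu_si128((const __m128i *)&c[4]),
                  _mm_loadu_si128((const __m128i *)&d[4]),
                  _mm_loadu_si128((const __m128i *)&x[4]));

    _mm_storeu_si128((__m128i *)&out[0], a0);
    _mm_storeu_si128((__m128i *)&out[4], a1);

    for (i = 0; i < 8; i++)
        assert(out[i] == md4_step3_scalar(a[i], b[i], c[i], d[i], x[i]));
}
```

Whether interleaving actually helps depends on the CPU and on how many
registers are available. Two streams fit comfortably in the 16 XMM
registers of a 64-bit build, while a third stream increases register
pressure, which may be one reason 3*4x showed no gain on Core 2.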

