john-dev - Re: 5x intrinsics?


Message-ID: <20130521213922.GA2588@openwall.com>
Date: Wed, 22 May 2013 01:39:22 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: 5x intrinsics?

On Tue, May 21, 2013 at 11:17:02PM +0200, magnum wrote:
> I see Alain's NT format is "5x" for 32-bit SSE2 builds, i.e. it does 4x in SSE2 plus 1x in non-SSE code. I presume these are interleaved to hide latency, so the extra 1x comes more or less for free. Would this be theoretically and practically worthwhile for the intrinsics? Maybe it'd just get very messy. I can't remember any discussion on this matter.

Apparently, it did provide some advantage in Alain's testing, but when I
tried 5x and other modes for bitslice DES, they were actually slower on
the build targets and CPUs I tested on (mostly 64-bit, though).  You may
easily test this again by editing the #if's in x86-sse.h and x86-64.h as
appropriate.  The corresponding code in DES_bs_b.c is all in place.
Even weird combinations like this are supported:

#elif 0
/* 384-bit as 256+64+64 */
#define DES_BS_NO_AVX128
#define DES_BS_VECTOR_SIZE              8
#define DES_BS_VECTOR                   6
#define DES_BS_ALGORITHM_NAME           "DES 256/256 AVX-16 + 64/64 MMX + 64/64"

Indeed, even if some combination is slightly slower for DES, it doesn't
mean it won't be slightly faster for e.g. MD4/MD5/SHA-1 as implemented
in sse-intrinsics.c - so we may add support for such things in there and
give it a try.

> Perhaps the 64-bit CPU's SSE2 registers are not actually separate from the GP ones?

They are separate.  Also, starting with Pentium 4, SSE* registers are
separate from MMX, so we may inter-mix SSE2+ and MMX instructions.
(IIRC, on Pentium 3's implementation of SSE, the registers overlapped
with MMX's.)  The 384-bit weirdness above actually works correctly; it
just was not any faster on the Core i7-2600K that I tested it on.

> I'm not good at these things but I guess that could be the reason this is only done in 32-bit code.

Indeed: in 32-bit mode, we have only 8 registers of each type, so hiding
the latencies in this way becomes more important (than it is with 16
registers, where we can avoid the stalls by better instruction scheduling
within instances using registers of the same type).
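To make the "5x" idea concrete, here is a minimal sketch of one MD4
round-1 step computed for five independent instances: four in one SSE2
register and one in general-purpose registers, with the two instruction
streams interleaved in the source.  It is an illustration only, not code
from Alain's NT format or from sse-intrinsics.c; the function names
md4_step_scalar and md4_step_5x are hypothetical, and a compiler is free
to reschedule the statements, so any gain has to be measured.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Reference: a = rol(a + F(b,c,d) + x, s), with F = d ^ (b & (c ^ d)). */
static uint32_t md4_step_scalar(uint32_t a, uint32_t b, uint32_t c,
                                uint32_t d, uint32_t x, int s)
{
    uint32_t t = a + (d ^ (b & (c ^ d))) + x;
    return (t << s) | (t >> (32 - s));
}

/*
 * Hypothetical "5x" step: lanes 0..3 go through SSE2, lane 4 through
 * scalar registers.  Requires 0 < s < 32.  Updates a[0..4] in place.
 */
void md4_step_5x(uint32_t a[5], const uint32_t b[5], const uint32_t c[5],
                 const uint32_t d[5], const uint32_t x[5], int s)
{
    assert(s > 0 && s < 32);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_loadu_si128((const __m128i *)c);
    __m128i vd = _mm_loadu_si128((const __m128i *)d);
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vl = _mm_cvtsi32_si128(s);      /* left shift count */
    __m128i vr = _mm_cvtsi32_si128(32 - s); /* right shift count */
    uint32_t sa = a[4], sb = b[4], sc = c[4], sd = d[4], sx = x[4];
    __m128i vt;
    uint32_t st;

    /* Vector and scalar operations alternate; they share no registers. */
    vt = _mm_xor_si128(vc, vd);     st = sc ^ sd;
    vt = _mm_and_si128(vt, vb);     st &= sb;
    vt = _mm_xor_si128(vt, vd);     st ^= sd;
    va = _mm_add_epi32(va, vt);     sa += st;
    va = _mm_add_epi32(va, vx);     sa += sx;
    va = _mm_or_si128(_mm_sll_epi32(va, vl), _mm_srl_epi32(va, vr));
    sa = (sa << s) | (sa >> (32 - s));

    _mm_storeu_si128((__m128i *)a, va);
    a[4] = sa;
#else
    /* Fallback without SSE2: plain scalar code for all five lanes. */
    int i;
    for (i = 0; i < 5; i++)
        a[i] = md4_step_scalar(a[i], b[i], c[i], d[i], x[i], s);
#endif
}
```

Calling md4_step_5x() on five-element arrays should give the same
results as calling md4_step_scalar() on each element.  Whether the fifth
lane is actually "free" depends on the CPU and on how many registers of
each type are available.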

Alexander
