john-users - C compiler generated SSE2 code (was: clang benchmarks)

Message-ID: <20100517233248.GB9735@openwall.com>
Date: Tue, 18 May 2010 03:32:48 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: C compiler generated SSE2 code (was: clang benchmarks)

On Mon, May 17, 2010 at 11:55:18AM +0200, bartavelle@...quise.net wrote:
> Now with my MD5 implementation (uses SSE intrinsics):

Can you upload it to the wiki, please?

http://openwall.info/wiki/john/patches

> gcc 15696
> icc 32364
> clang 19644

It's a bit weird that gcc performs so poorly here.  I am getting
near-perfect SSE2 code for bitslice DES (an unreleased revision of the
source code) with gcc 4.5.0.  With properly tweaked compiler options
(primarily to control function inlining), it slightly outperforms the
hand-crafted SSE2 assembly code currently in JtR, in fact.

Also, I found gcc's "statement expressions" extension very handy for
mixing SSE2/MMX/native instructions (for virtual vector sizes of 192 and
256 bits) in expressions and function calls:

http://gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/Statement-Exprs.html

Surprisingly, this extension is supported by Sun Studio as well (I did
not check other compilers).

It let me define things such as:

#define x(p, q) ({ vtype t; vxor(t, *(vtype *)&e[p] ed, *(vtype *)&k[q] kd); t; })
#define y(p, q) ({ vtype t; vxor(t, *(vtype *)&b[p] bd, *(vtype *)&k[q] kd); t; })
#define z(r) ((vtype *)&b[r] bd)

and call the DES S-box functions (usually to be inlined) as:

	s1(x(0, 0), x(1, 1), x(2, 2),
		x(3, 3), x(4, 4), x(5, 5),
		z(40), z(48), z(54), z(62));
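
A minimal sketch of how such macros behave, assuming gcc or clang (statement expressions are an extension, not standard C).  The two-word vtype, the e[] and k[] arrays, fold(), and check_stmt_expr() are hypothetical stand-ins chosen for illustration, not the actual JtR definitions:

```c
#include <assert.h>

/* Hypothetical 128-bit "virtual vector" built from two native words;
 * it stands in for a vtype that mixes SSE2/MMX/native parts. */
typedef struct {
	unsigned long long lo, hi;
} vtype;

/* Multi-statement XOR in the style of vxor(): fine inside ({ ... }),
 * but not usable as a plain expression on its own. */
#define vxor(dst, a, b) \
	(dst).lo = (a).lo ^ (b).lo; \
	(dst).hi = (a).hi ^ (b).hi;

static vtype e[2], k[2];	/* hypothetical data and key arrays */

/* Statement expression: the last statement (t) is the value of the
 * whole ({ ... }) construct, so x() can be used as a function argument.
 * Each expansion gets its own t in its own scope. */
#define x(p, q) ({ vtype t; vxor(t, e[p], k[q]); t; })

/* Takes vectors by value, like the S-box calls in the message */
static unsigned long long fold(vtype a, vtype b)
{
	return a.lo ^ a.hi ^ b.lo ^ b.hi;
}

static void check_stmt_expr(void)
{
	e[0] = (vtype){ 0x0f, 0xf0 };
	e[1] = (vtype){ 0x01, 0x10 };
	k[0] = (vtype){ 0xff, 0xff };
	k[1] = (vtype){ 0x11, 0x11 };

	/* x(0, 0) is { 0xf0, 0x0f } and x(1, 1) is { 0x10, 0x01 } */
	assert(fold(x(0, 0), x(1, 1)) == 0xee);
}
```

The point of the construct is that a macro made of several statements can still yield a value, which is what lets the multi-part XOR appear directly in an argument list.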

vtype, vxor(), etc. could be defined as:

#elif DES_BS_VECTOR == 4
typedef struct {
	__m128i f;
	__m64 g;
	long h;
} vtype;

#define vxor(dst, a, b) \
	(dst).f = _mm_xor_si128((a).f, (b).f); \
	(dst).g = _mm_xor_si64((a).g, (b).g); \
	(dst).h = (a).h ^ (b).h;

That's for 256-bit vectors (128 + 64 + 64 bits, assuming a 64-bit long).
A minor difficulty with 192-bit vectors was
that they needed to be 256-bit aligned (for the SSE2 portion to be
128-bit aligned), which required changes to other source files - yet I
got around this difficulty and tried those out as well (both kinds).
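
A rough sketch of the alignment issue, assuming gcc or clang for the aligned attribute.  The type names and check_v192_layout() are hypothetical, and plain 64-bit integers stand in for __m128i and __m64 so that the sketch builds on any target:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 192-bit vector: a 128-bit part plus a 64-bit part. */
typedef struct {
	uint64_t f[2];	/* would be __m128i in the SSE2 build */
	uint64_t g;	/* would be __m64 */
} v192_packed;

/* Same fields, but every object is forced onto a 32-byte (256-bit)
 * boundary, which also pads sizeof from 24 to 32 bytes. */
typedef struct {
	uint64_t f[2];
	uint64_t g;
} __attribute__((aligned(32))) v192_aligned;

static void check_v192_layout(void)
{
	static v192_aligned b[4];
	int i;

	assert(sizeof(v192_packed) == 24);
	assert(sizeof(v192_aligned) == 32);

	/* With a 24-byte stride, element 1 would start at offset 24,
	 * which is not a multiple of 16. */
	assert(sizeof(v192_packed) % 16 != 0);

	/* With the padded type, the 128-bit part of every array element
	 * is 16-byte aligned. */
	for (i = 0; i < 4; i++)
		assert((uintptr_t)&b[i].f % 16 == 0);
}
```

The cost is 8 bytes of padding per vector, so any code that indexes these arrays by a hard-coded element size has to change too.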

Overall, this did not provide a speedup (on most CPUs the code became
slower per-bit, although this could be different on future CPUs), but I
was pleased with the low cost (my time) of this experiment.  The
assembly code generated by gcc looked reasonable (a nice mix of
instructions, no obviously unneeded moves).

This approach could be of more benefit for other hash types, where
there's insufficient parallelism otherwise (perhaps the DES S-boxes had
sufficient parallelism to almost fully exploit SSE2, which is why mixing
in 64-bit instructions would slow things down per-bit most of the time).

> gcc version 4.3.2 (Debian 4.3.2-1.1)

You might want to try 4.5.0 (build it from source).

http://openwall.info/wiki/internal/gcc-local-build

On the other hand, with properly tuned source code with SSE2 intrinsics,
even going from 4.5.0 to 3.4.5 (yes, that old!) resulted in only a 10%
slowdown at bitslice DES for me.  So perhaps there's something to tweak
in your source code to make it gcc-friendly.

One curious feature of gcc (not found in Sun Studio at least) is that it
is able to generate SSE2 instructions for 128-bit bitwise ops even if
the source code does not use intrinsics explicitly (that is, if it uses
the usual C operators for the bitwise ops, but the operands are of
128-bit vector types).  gcc 4.5.0 generates almost(?) the same code
(near-perfect) whether you use intrinsics or not.  On the other hand,
without explicit use of intrinsics, gcc 3.4.5 generates awful code.
Sun Studio refuses to compile such source code (but works fine when the
source code does use intrinsics).
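
A small sketch of the intrinsics-free style, assuming a compiler that implements gcc's vector_size attribute (gcc and clang do).  The v4u32 type, sel(), and check_vector_ops() are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit vector of four 32-bit lanes, declared with gcc's
 * vector_size attribute instead of <emmintrin.h> types. */
typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* Ordinary C operators applied to vector operands: a bitwise select
 * that takes bits from a where mask is 1 and from b where it is 0.
 * On an SSE2 target the compiler may turn these into pand/pandn/por,
 * but that is up to the compiler version and options. */
static v4u32 sel(v4u32 a, v4u32 b, v4u32 mask)
{
	return (a & mask) | (b & ~mask);
}

static void check_vector_ops(void)
{
	v4u32 a = { 0xffffffffu, 0, 0x0f0f0f0fu, 1 };
	v4u32 b = { 0, 0xffffffffu, 0xf0f0f0f0u, 2 };
	v4u32 m = { 0xffffffffu, 0xffffffffu, 0x00ff00ffu, 0 };
	/* A union is used to read the lanes back without relying on
	 * vector subscripting, which older gcc versions lack. */
	union { v4u32 v; uint32_t w[4]; } r;

	r.v = sel(a, b, m);
	assert(r.w[0] == 0xffffffffu);
	assert(r.w[1] == 0);
	assert(r.w[2] == 0xf00ff00fu);
	assert(r.w[3] == 2);
}
```

Whether the generated code matches the intrinsics version has to be checked in the assembly output for the specific gcc version.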

Alexander