Date: Tue, 8 Sep 2015 13:17:14 +0300
From: Solar Designer <>
Subject: Re: md5crypt mmxput*()

On Sat, Sep 05, 2015 at 05:51:51AM +0300, Solar Designer wrote:
> magnum, Jim, Simon, Lei -
> Some speedup for md5crypt on CPU might be possible through vectorizing
> the mmxput*() functions, or through use of SHLD/SHRD instructions
> (available since 386) or other archs' equivalents (I think ARM has this
> too) in mmxput3() when not vectorized (somehow gcc does not do it for
> us).  These functions are similar to buf_update() in,
> where I've added uses of amd_bitalign() and NVIDIA's funnel shifter
> recently (analogous to SHLD/SHRD), and which obviously is processed on
> the SIMD units on GPUs (can do it on CPUs as well, although no SHLD/SHRD
> then, unless a given CPU architecture has them in SIMD form as well -
> need to look into that).

I looked into making mmxput3() use SHLD/SHRD, and found this comment:

"We can convince the current compiler to generate shrd by constructing
((((unsigned long long)a)<<32) | b) >> n"

I tried doing this, but since I'm on x86_64 I actually got a 64-bit
shift instead.  Good news: it's also faster.  Before:

Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 XOP 4x2]... (8xOMP) DONE
Raw:    228864 c/s real, 28608 c/s virtual

After:

Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 XOP 4x2]... (8xOMP) DONE
Raw:    231424 c/s real, 28928 c/s virtual

Patch attached.  I only tested this on bull so far.  I put an #if in
there enabling the 64-bit approach on any 64-bit arch as well as on
32-bit x86 (expecting that SHRD would be generated, as per the comment
above) - but this needs to be tested on different machines and with
different versions of gcc (I wouldn't be surprised if there's a
regression with some old version).
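For reference, the construct from that comment can be sketched on its own
(funnel_shr() is an invented helper name, not code from the tree):

```c
#include <stdint.h>

/*
 * Standalone sketch of the "((((unsigned long long)a)<<32) | b) >> n"
 * construct.  On x86_64 the compiler emits a single 64-bit SHR; on
 * 32-bit x86 it can emit SHRD instead.  In mmxput3() the count is
 * 32 - dec with dec in 8..24, so n stays within 1..31.
 */
static uint32_t funnel_shr(uint32_t hi, uint32_t lo, unsigned n)
{
	/* Concatenate hi:lo into 64 bits, shift right, keep low 32 bits */
	return (uint32_t)((((uint64_t)hi << 32) | lo) >> n);
}
```

Unlike the two-shift form, this also stays well-defined for n == 0 or
n == 32, where a 32-bit shift by 32 would be undefined behavior in C.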

I think further speedup is possible by using a switch statement to make
the shift counts into constants (we have an if anyway, we'll just
replace it with a switch) like has.  And indeed by
vectorizing this.  But for a trivial patch, the above speedup isn't bad.

In fact, with a switch there might not be a speedup from the 64-bit
shift anymore (except on 32-bit x86, where it should enable SHRD).
I think the speedup comes from one of the shift counts becoming a
constant now (the constant 32), but with a switch all of them would be
constants anyway.  So maybe that #if would need to be revised then.
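To illustrate the switch idea in isolation (made-up names, not the actual
mmxput3() code, and assuming dec only takes the values 8, 16, 24 in the
unaligned path):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Dispatch on the shift count so that each case shifts by a
 * compile-time constant, letting the compiler emit immediate-count
 * shifts instead of variable-count ones.
 */
static uint32_t combine(uint32_t hi, uint32_t lo, unsigned dec)
{
	switch (dec) {
	case 8:
		return (hi << 8) | (lo >> 24);
	case 16:
		return (hi << 16) | (lo >> 16);
	case 24:
		return (hi << 24) | (lo >> 8);
	default:
		/* dec == 0 would be the aligned path, handled elsewhere */
		assert(0);
		return 0;
	}
}
```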

If we're not going to vectorize this soon, then maybe the next steps are
to try switch and then to tune the #if based on testing on different
CPUs and compiler versions.

... or just test the attached patch a bit more and commit it.

Maybe we should enable the 64-bit approach for ARM as well (IIRC, it has
an instruction similar to the 386's SHRD).


diff --git a/src/simd-intrinsics.c b/src/simd-intrinsics.c
index 1f863d5..7307bb8 100644
--- a/src/simd-intrinsics.c
+++ b/src/simd-intrinsics.c
@@ -456,19 +456,24 @@ static MAYBE_INLINE void mmxput3(void *buf, unsigned int bid,
 				noffd = noff & (~3);
+#if (ARCH_SIZE >= 8) || defined(__i386__)
+#define BITALIGN(hi, lo, s) ((((uint64_t)(hi) << 32) | (lo)) >> (s))
+#else
+#define BITALIGN(hi, lo, s) (((hi) << (32 - (s))) | ((lo) >> (s)))
+#endif
 				((unsigned int*)(nbuf+noffd*VS32))[i+0*VS32] &=
 				((unsigned int*)(nbuf+noffd*VS32))[i+0*VS32] |=
 					(((unsigned int*)src)[i+j*4*VS32+0*VS32] << dec);
-				((unsigned int*)(nbuf+noffd*VS32))[i+1*VS32] =
-					(((unsigned int*)src)[i+j*4*VS32+1*VS32] << dec) |
-					(((unsigned int*)src)[i+j*4*VS32+0*VS32] >> (32-dec));
-				((unsigned int*)(nbuf+noffd*VS32))[i+2*VS32] =
-					(((unsigned int*)src)[i+j*4*VS32+2*VS32] << dec) |
-					(((unsigned int*)src)[i+j*4*VS32+1*VS32] >> (32-dec));
-				((unsigned int*)(nbuf+noffd*VS32))[i+3*VS32] =
-					(((unsigned int*)src)[i+j*4*VS32+3*VS32] << dec) |
-					(((unsigned int*)src)[i+j*4*VS32+2*VS32] >> (32-dec));
+				((unsigned int*)(nbuf+noffd*VS32))[i+1*VS32] = BITALIGN(
+					((unsigned int*)src)[i+j*4*VS32+1*VS32],
+					((unsigned int*)src)[i+j*4*VS32+0*VS32], 32 - dec);
+				((unsigned int*)(nbuf+noffd*VS32))[i+2*VS32] = BITALIGN(
+					((unsigned int*)src)[i+j*4*VS32+2*VS32],
+					((unsigned int*)src)[i+j*4*VS32+1*VS32], 32 - dec);
+				((unsigned int*)(nbuf+noffd*VS32))[i+3*VS32] = BITALIGN(
+					((unsigned int*)src)[i+j*4*VS32+3*VS32],
+					((unsigned int*)src)[i+j*4*VS32+2*VS32], 32 - dec);
 				((unsigned int*)(nbuf+noffd*VS32))[i+4*VS32] &=
 				((unsigned int*)(nbuf+noffd*VS32))[i+4*VS32] |=
