Date: Wed, 21 Mar 2012 09:13:54 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: MD5 optimizations

magnum, all -

While searching the web to see whether folks are already using SSSE3
PSHUFB for S-boxes (it turns out they are, at least for AES), I found
these forum threads:

http://hashcat.net/forum/thread-153.html
http://www.freerainbowtables.com/phpBB3/viewtopic.php?f=6&t=904&start=60#p15387
http://www.cryptohaze.com/forum/viewtopic.php?f=4&t=147#p865

These mention some MD5 optimizations that I had not previously considered.

There are some rotates by 16 in MD5.  The attached patch optimizes those
for SSE2 (two instructions) and SSSE3 (one instruction).  Either of
these gives a speedup of around 1% on the 2xE5420 system I use for
testing, with the SSSE3 version being slightly faster.  (There's little
point in testing this on my Bulldozer, because it is XOP-capable and is
likely faster running the XOP code instead.)
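
A minimal scalar sketch of why the rotate by 16 is special, modeling a
single 32-bit lane in portable C rather than using the intrinsics from
the patch.  The function names here are illustrative only.  A rotate by
16 is just a swap of the two 16-bit halves (what the SSE2 word shuffles
do per lane), or equivalently a byte permutation with source indices
2, 3, 0, 1 (what the PSHUFB mask encodes per lane):

```c
#include <stdint.h>

/* Generic 32-bit rotate left, valid for 0 < s < 32. */
static uint32_t rotl32(uint32_t x, unsigned s)
{
	return (x << s) | (x >> (32 - s));
}

/* Rotate by 16 as a swap of 16-bit halves (models the SSE2 path). */
static uint32_t rot16_swap_halves(uint32_t x)
{
	uint16_t lo = (uint16_t)x;
	uint16_t hi = (uint16_t)(x >> 16);

	return ((uint32_t)lo << 16) | hi;
}

/* Rotate by 16 as a byte permutation (models the PSHUFB path). */
static uint32_t rot16_byte_shuffle(uint32_t x)
{
	/* Destination byte i takes source byte idx[i], byte 0 being the LSB. */
	static const unsigned char idx[4] = { 2, 3, 0, 1 };
	unsigned char src[4];
	uint32_t r = 0;
	int i;

	for (i = 0; i < 4; i++)
		src[i] = (unsigned char)(x >> (8 * i));
	for (i = 0; i < 4; i++)
		r |= (uint32_t)src[idx[i]] << (8 * i);
	return r;
}

/* Returns 1 if all three forms agree on the given value. */
static int rot16_forms_agree(uint32_t x)
{
	uint32_t ref = rotl32(x, 16);

	return rot16_swap_halves(x) == ref && rot16_byte_shuffle(x) == ref;
}
```

For example, all three forms map 0x12345678 to 0x56781234.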

I only tested this with gcc so far, so the resulting code is slower than
the icc precompiled version in all of my tests.  I think the *.S files
need to be re-generated with icc (perhaps for SSE2 only for simplicity?)
after applying this patch.

Another possible optimization is a common subexpression elimination in
round 3:

http://hashcat.net/forum/thread-153-post-709.html#pid709

but it might not always be helpful ("The second optimization is not good
because it trades a PXOR rd,rs for a MOVDQA rd,rs. Now with AVX this
might be useful because it let's you do VPXOR rd,rs1,rs2 (rd = rs1^rs2).
Even with AVX you need more registers because you now need two temp
registers per interlace ..." from a comment by Sc00bz).
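
A minimal scalar sketch of this kind of elimination, assuming the shared
term is the XOR of the two state words that stay unchanged between
consecutive steps (round 3 uses H(x,y,z) = x ^ y ^ z).  The function
names and the four-step framing are illustrative and are not taken from
the linked post or from the JtR source:

```c
#include <stdint.h>

/* Generic 32-bit rotate left, valid for 0 < s < 32. */
static uint32_t rotl32(uint32_t x, unsigned s)
{
	return (x << s) | (x >> (32 - s));
}

/* Four consecutive round 3 steps, with H recomputed in full each time. */
static void round3_plain(uint32_t s[4], const uint32_t m[4],
    const uint32_t k[4])
{
	uint32_t a = s[0], b = s[1], c = s[2], d = s[3];

	a = b + rotl32(a + (b ^ c ^ d) + m[0] + k[0], 4);
	d = a + rotl32(d + (a ^ b ^ c) + m[1] + k[1], 11);
	c = d + rotl32(c + (d ^ a ^ b) + m[2] + k[2], 16);
	b = c + rotl32(b + (c ^ d ^ a) + m[3] + k[3], 23);

	s[0] = a; s[1] = b; s[2] = c; s[3] = d;
}

/* Same steps, reusing one XOR per pair of steps via a temporary. */
static void round3_cse(uint32_t s[4], const uint32_t m[4],
    const uint32_t k[4])
{
	uint32_t a = s[0], b = s[1], c = s[2], d = s[3];
	uint32_t t;

	t = b ^ c;	/* b and c are unchanged by step 1 */
	a = b + rotl32(a + (t ^ d) + m[0] + k[0], 4);
	d = a + rotl32(d + (a ^ t) + m[1] + k[1], 11);
	t = d ^ a;	/* d and a are unchanged by step 3 */
	c = d + rotl32(c + (t ^ b) + m[2] + k[2], 16);
	b = c + rotl32(b + (c ^ t) + m[3] + k[3], 23);

	s[0] = a; s[1] = b; s[2] = c; s[3] = d;
}
```

In this form one XOR is saved per pair of steps, at the cost of keeping
the temporary live across two steps, which is where the register
pressure concern in the quoted comment comes from.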

I haven't tried that one out yet.  Anyone?

Alexander

diff --git a/src/sse-intrinsics.c b/src/sse-intrinsics.c
index 9cb301a..3edfa05 100644
--- a/src/sse-intrinsics.c
+++ b/src/sse-intrinsics.c
@@ -17,8 +17,18 @@
 #include "MD5_std.h"
 
 #ifndef __XOP__
+#ifdef __SSSE3__
+#include <tmmintrin.h>
+#define rot16_mask _mm_set_epi64x(0x0d0c0f0e09080b0aLL, 0x0504070601000302LL)
 #define _mm_roti_epi32(a, s) \
-	_mm_or_si128(_mm_slli_epi32((a), (s)), _mm_srli_epi32((a), 32-(s)))
+	((s) == 16 ? _mm_shuffle_epi8((a), rot16_mask) : \
+	_mm_or_si128(_mm_slli_epi32((a), (s)), _mm_srli_epi32((a), 32-(s))))
+#else
+#define _mm_roti_epi32(a, s) \
+	((s) == 16 ? \
+	_mm_shufflelo_epi16(_mm_shufflehi_epi16((a), 0xb1), 0xb1) : \
+	_mm_or_si128(_mm_slli_epi32((a), (s)), _mm_srli_epi32((a), 32-(s))))
+#endif
 #endif
 
 #ifndef MMX_COEF
