Date: Sat, 5 Sep 2015 09:09:12 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: MD4 G()

magnum, Sayantan -

MD4 G() is the same as SHA-2 Maj(), yet we've been using an unoptimized
expression for it so far.
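
To see why the two forms agree: OpenCL's bitselect(a, b, c) takes each
result bit from b where c is 1 and from a where c is 0.  With
c = z ^ x, bits where x and z agree take x (which is then the majority),
and bits where they differ take y (the tie-breaker).  Below is a minimal
host-side C sketch of that check.  bitselect32() and the
MD4_G_SELECT / MD4_G_PLAIN names are hypothetical stand-ins for
illustration, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical host-side stand-in for OpenCL bitselect(a, b, c):
 * each result bit comes from b where c is 1, and from a where c is 0.
 */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
	return (a & ~c) | (b & c);
}

/* The two MD4 G() forms from the patch, renamed so both can coexist */
#define MD4_G_SELECT(x, y, z)	bitselect32((x), (y), (z) ^ (x))
#define MD4_G_PLAIN(x, y, z)	(((x) & ((y) | (z))) | ((y) & (z)))

/* Asserts that both forms agree on all 8 per-bit input combinations. */
static void md4_g_selftest(void)
{
	/* Bit i of x, y, z holds row i of the 3-input truth table */
	const uint32_t x = 0xF0, y = 0xCC, z = 0xAA;

	assert(MD4_G_SELECT(x, y, z) == MD4_G_PLAIN(x, y, z));
}
```

Since both forms are purely bitwise, agreement on the 8 truth table rows
implies agreement for all 32-bit inputs.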

The attached patch improves the speed for pbkdf2-hmac-md4-opencl on
Tahiti from:

Local worksize (LWS) 64, global worksize (GWS) 524288
DONE
Speed for cost 1 (iterations) of 1000
Raw:    3994K c/s real, 104857K c/s virtual

to:

Local worksize (LWS) 64, global worksize (GWS) 524288
DONE
Speed for cost 1 (iterations) of 1000
Raw:    4537K c/s real, 94371K c/s virtual

or if I let it auto-tune to higher GWS (which it previously would not):

Local worksize (LWS) 64, global worksize (GWS) 2097152
DONE
Speed for cost 1 (iterations) of 1000
Raw:    4592K c/s real, 125829K c/s virtual

On one core of an FX-8120, I got an improvement (with the previously
posted patch) from:

Benchmarking: Raw-MD4 [MD4 128/128 XOP 4x2]... DONE
Raw:    36863K c/s real, 36863K c/s virtual

to:

Benchmarking: Raw-MD4 [MD4 128/128 XOP 4x2]... DONE
Raw:    39233K c/s real, 39233K c/s virtual

although some of the speedup, namely to:

Benchmarking: Raw-MD4 [MD4 128/128 XOP 4x2]... DONE
Raw:    37509K c/s real, 37509K c/s virtual

came from enabling use of H2, which was previously disabled for 2x
interleaving.  The new speed of 39233K is finally better than raw-md5's,
which is at most (over several benchmark invocations):

Benchmarking: Raw-MD5 [MD5 128/128 XOP 4x2]... DONE
Raw:    37918K c/s real, 37918K c/s virtual

Yet the difference is surprisingly small, suggesting that there's
still room for speeding up our MD4 on CPU.

It may be worth experimenting with different orderings of x, y, z to
G().  Maybe some of the 6 will result in lower optimal GWS and/or better
performance than others.  (The same applies to SHA-1 and SHA-2.)
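
Reordering is safe because the majority function is symmetric in its
arguments: all 6 orderings return the same value, and only the operands
feeding the XOR and the select (and thus possibly the generated code)
change.  A rough host-side C sketch of that claim, where bitselect32(),
MAJ() and maj_orderings_agree() are hypothetical names used only for
illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for OpenCL bitselect(a, b, c) */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
	return (a & ~c) | (b & c);
}

/* Same shape as the patched MD4_G() */
#define MAJ(x, y, z)	bitselect32((x), (y), (z) ^ (x))

/* Returns 1 if all 6 argument orderings give the same result. */
static int maj_orderings_agree(uint32_t x, uint32_t y, uint32_t z)
{
	const uint32_t ref = MAJ(x, y, z);

	return MAJ(x, z, y) == ref && MAJ(y, x, z) == ref &&
	    MAJ(y, z, x) == ref && MAJ(z, x, y) == ref &&
	    MAJ(z, y, x) == ref;
}
```

This only confirms that the orderings are interchangeable.  Which one is
fastest on a given device still has to be measured.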

nt_kernel.cl and mscash_kernel.cl (any others?) will need separate
patches.  mscash_kernel.cl doesn't even use bitselect() for F(), and
doesn't use rotate().  They should be made to use opencl_md4.h macros.
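
For reference, here is a host-side C sketch of what those two built-ins
compute, so the equivalence with the plain expressions can be checked.
bitselect32() and rotate32() are hypothetical stand-ins for the OpenCL
built-ins, and the plain F() form is the one from opencl_md4.h.  I have
not checked which exact expressions mscash_kernel.cl uses:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for OpenCL bitselect(a, b, c) */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
	return (a & ~c) | (b & c);
}

/*
 * Hypothetical stand-in for OpenCL rotate(x, s), a left rotation.
 * This shift-and-OR form is only valid for 0 < s < 32.
 */
static uint32_t rotate32(uint32_t x, unsigned int s)
{
	return (x << s) | (x >> (32 - s));
}

/* The two MD4 F() forms from opencl_md4.h, renamed to coexist */
#define MD4_F_SELECT(x, y, z)	bitselect32((z), (y), (x))
#define MD4_F_PLAIN(x, y, z)	((z) ^ ((x) & ((y) ^ (z))))

/* Asserts that both F() forms agree on all 8 per-bit input combinations. */
static void md4_f_selftest(void)
{
	/* Bit i of x, y, z holds row i of the 3-input truth table */
	const uint32_t x = 0xF0, y = 0xCC, z = 0xAA;

	assert(MD4_F_SELECT(x, y, z) == MD4_F_PLAIN(x, y, z));
}
```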

Alexander

diff --git a/src/opencl_md4.h b/src/opencl_md4.h
index cbd5c2d..4a698e3 100644
--- a/src/opencl_md4.h
+++ b/src/opencl_md4.h
@@ -16,20 +16,23 @@
 
 #include "opencl_misc.h"
 
+/* The basic MD4 functions */
 #ifdef USE_BITSELECT
 #define MD4_F(x, y, z)	bitselect((z), (y), (x))
 #else
 #define MD4_F(x, y, z)	((z) ^ ((x) & ((y) ^ (z))))
 #endif
 
+#ifdef USE_BITSELECT
+#define MD4_G(x, y, z)	bitselect((x), (y), (z) ^ (x))
+#else
+#define MD4_G(x, y, z)	(((x) & ((y) | (z))) | ((y) & (z)))
+#endif
+
 #define MD4_H(x, y, z)	(((x) ^ (y)) ^ (z))
 #define MD4_H2(x, y, z)	((x) ^ ((y) ^ (z)))
 
 
-/* The basic MD4 functions */
-#define MD4_G(x, y, z)	(((x) & ((y) | (z))) | ((y) & (z)))
-
-
 /* The MD4 transformation for all three rounds. */
 #define MD4STEP(f, a, b, c, d, x, s)  \
 	(a) += f((b), (c), (d)) + (x); \
