
Date: Sat, 5 Sep 2015 09:09:12 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: MD4 G()

magnum, Sayantan -

MD4 G() is the same as SHA-2 Maj(), yet we've been using an unoptimized
expression for it so far. The attached patch improves the speed for
pbkdf2-hmac-md4-opencl on Tahiti from:

Local worksize (LWS) 64, global worksize (GWS) 524288
DONE
Speed for cost 1 (iterations) of 1000
Raw:    3994K c/s real, 104857K c/s virtual

to:

Local worksize (LWS) 64, global worksize (GWS) 524288
DONE
Speed for cost 1 (iterations) of 1000
Raw:    4537K c/s real, 94371K c/s virtual

or, if I let it autotune to a higher GWS (which it previously would not):

Local worksize (LWS) 64, global worksize (GWS) 2097152
DONE
Speed for cost 1 (iterations) of 1000
Raw:    4592K c/s real, 125829K c/s virtual

On one core in FX-8120, I got an improvement (with the previously posted
patch) from:

Benchmarking: Raw-MD4 [MD4 128/128 XOP 4x2]... DONE
Raw:    36863K c/s real, 36863K c/s virtual

to:

Benchmarking: Raw-MD4 [MD4 128/128 XOP 4x2]... DONE
Raw:    39233K c/s real, 39233K c/s virtual

although some of the speedup, namely to:

Benchmarking: Raw-MD4 [MD4 128/128 XOP 4x2]... DONE
Raw:    37509K c/s real, 37509K c/s virtual

came from enabling use of H2, which was previously disabled for 2x
interleaving.

The new speed of 39233K is finally better than raw-md5's, which is at most
(over several benchmark invocations):

Benchmarking: Raw-MD5 [MD5 128/128 XOP 4x2]... DONE
Raw:    37918K c/s real, 37918K c/s virtual

Yet the difference is surprisingly small, suggesting that there's still
room for speeding up our MD4 on CPU.

It may be worth experimenting with different orderings of x, y, z to G().
Maybe some of the 6 will result in a lower optimal GWS and/or better
performance than others. (The same applies to SHA-1 and SHA-2.)

nt_kernel.cl and mscash_kernel.cl (any others?) will need separate
patches. mscash_kernel.cl doesn't even use bitselect() for F(), and
doesn't use rotate().
They should be made to use opencl_md4.h macros.

Alexander

diff --git a/src/opencl_md4.h b/src/opencl_md4.h
index cbd5c2d..4a698e3 100644
--- a/src/opencl_md4.h
+++ b/src/opencl_md4.h
@@ -16,20 +16,23 @@
 
 #include "opencl_misc.h"
 
+/* The basic MD4 functions */
 #ifdef USE_BITSELECT
 #define MD4_F(x, y, z) bitselect((z), (y), (x))
 #else
 #define MD4_F(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
 #endif
 
+#ifdef USE_BITSELECT
+#define MD4_G(x, y, z) bitselect((x), (y), (z) ^ (x))
+#else
+#define MD4_G(x, y, z) (((x) & ((y) | (z))) | ((y) & (z)))
+#endif
+
 #define MD4_H(x, y, z) (((x) ^ (y)) ^ (z))
 #define MD4_H2(x, y, z) ((x) ^ ((y) ^ (z)))
 
 
-/* The basic MD4 functions */
-#define MD4_G(x, y, z) (((x) & ((y) | (z))) | ((y) & (z)))
-
-
 /* The MD4 transformation for all three rounds. */
 #define MD4STEP(f, a, b, c, d, x, s) \
 	(a) += f((b), (c), (d)) + (x); \