
Date: Wed, 2 Sep 2015 18:52:25 +0300
From: Solar Designer <solar@...nwall.com>
To: johndev@...ts.openwall.com
Subject: Re: SHA1 H()
magnum, Lei 
On Wed, Sep 02, 2015 at 06:20:25PM +0300, Solar Designer wrote:
> SHA1's H() aka F3() is the same as SHA2's Maj()
And it turns out that while we appear to be optimally using bitselect()
or vcmov() for Maj(), the fallback expressions that we use vary across
source files and are not always optimal:
[solar@...er src]$ grep r 'define .*Maj' .
./unused/sha512_kernel.cl:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./rawSHA512_ng_fmt_plug.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./cuda_cryptsha512.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./opencl/pwsafe_kernel.cl:#define Maj(x,y,z) (bitselect(y, x,(z^y)))
./opencl/pwsafe_kernel.cl:#define Maj(x, y, z) ((y & z)  (x & (y  z)))
./opencl_sha512.h: #define Maj(x,y,z) bitselect(x, y, z ^ x)
./opencl_sha512.h: #define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./cuda_cryptsha256.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./escrypt/sha256.c:#define Maj(x, y, z) ((x & (y  z))  (y & z))
./cuda_xsha512.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./opencl_sha2.h:#define Maj(x, y, z) bitselect(x, y, z ^ x)
./opencl_sha2.h:#define Maj(x, y, z) ((x & y)  (z & (x  y)))
./rawSHA256_ng_fmt_plug.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./opencl_sha256.h: #define Maj(x, y, z) bitselect(x, y, z ^ x)
./opencl_sha256.h: #define Maj(x, y, z) ((x & y) ^ (x & z) ^ (y & z))
./simdintrinsics.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./simdintrinsics.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./cuda_pwsafe.h:#define Maj(x, y, z) ((y & z)  (x & (y  z)))
./cuda_rawsha256.h:#define Maj(x,y,z) ( (x & y)  (z & (x  y)) )
./cuda_rawsha512.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
As you can see, some of these use 5 operations instead of 4, and some
use the parallelismlacking approach with possibly emulated vcmov().
I think we should standardize on the parallelismenabled 4 operation
expression for when there's no native bitselect() or vcmov()  for both
SHA1 and SHA2 in the same way.
This means we should probably check for AMD or at least NVIDIA Maxwell,
falling back to the 4 operation expression on older NVIDIA GPUs. (Need
to check which expression gets compiled to LOP3.LUT on Maxwell, though.)
A curious aspect is that Maj() is invariant with respect to the ordering
of its arguments. We can see it in the grep output above: some of the
expressions are the same except that they have x, y, z reordered in
different ways. We could test all 6 possible orderings in different
contexts (SHA1 vs. SHA256 vs. SHA512, and different OpenCL kernels,
etc.) and see which is faster where (this might in fact differ).
Attached to this message is a program I used to search for possible
optimized expressions like this. No new findings from it, but it did
remind me of the issues I described in these two messages. I was hoping
it might find a 2 operation expression for MD5's I(), but no luck.
It doesn't yet test two bitselect()'s per expression, though  this is
worth adding and trying again (many possibilities to test there).
Oh, and of course on Maxwell and AVX512 MD5's I() (and all others) is
just one operation, if the compiler manages. We should check the
generated code for our kernels on Maxwell.
Lei  I think you haven't gotten around to introducing AVX512 ternary
logic intrinsics yet, have you? Unfortunately, we don't yet have
hardware to test them on, but you could test on Intel SDE or/and by
emulating them with macros.
Alexander
View attachment "search3.c" of type "text/xc" (2281 bytes)
Powered by blists  more mailing lists