Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 2 Sep 2015 18:52:25 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: SHA-1 H()

magnum, Lei -

On Wed, Sep 02, 2015 at 06:20:25PM +0300, Solar Designer wrote:
> SHA-1's H() aka F3() is the same as SHA-2's Maj()

And it turns out that while we appear to be optimally using bitselect()
or vcmov() for Maj(), the fallback expressions that we use vary across
source files and are not always optimal:

[solar@...er src]$ grep -r 'define .*Maj' .
./unused/sha512_kernel.cl:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./rawSHA512_ng_fmt_plug.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./cuda_cryptsha512.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./opencl/pwsafe_kernel.cl:#define Maj(x,y,z) (bitselect(y, x,(z^y)))
./opencl/pwsafe_kernel.cl:#define Maj(x, y, z) ((y & z) | (x & (y | z)))
./opencl_sha512.h:      #define Maj(x,y,z)      bitselect(x, y, z ^ x)
./opencl_sha512.h:        #define Maj(x,y,z)      ((x & y) ^ (x & z) ^ (y & z))
./cuda_cryptsha256.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./escrypt/sha256.c:#define Maj(x, y, z) ((x & (y | z)) | (y & z))
./cuda_xsha512.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./opencl_sha2.h:#define Maj(x, y, z)    bitselect(x, y, z ^ x)
./opencl_sha2.h:#define Maj(x, y, z)    ((x & y) | (z & (x | y)))
./rawSHA256_ng_fmt_plug.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./opencl_sha256.h:      #define Maj(x, y, z)    bitselect(x, y, z ^ x)
./opencl_sha256.h:      #define Maj(x, y, z)    ((x & y) ^ (x & z) ^ (y & z))
./simd-intrinsics.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./simd-intrinsics.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./cuda_pwsafe.h:#define Maj(x, y, z) ((y & z) | (x & (y | z)))
./cuda_rawsha256.h:#define Maj(x,y,z) ( (x & y) | (z & (x | y)) )
./cuda_rawsha512.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))

As you can see, some of these use 5 operations instead of 4, and some
use the parallelism-lacking approach with possibly emulated vcmov().

I think we should standardize on the parallelism-enabled 4 operation
expression for when there's no native bitselect() or vcmov() - for both
SHA-1 and SHA-2 in the same way.

This means we should probably check for AMD or at least NVIDIA Maxwell,
falling back to the 4 operation expression on older NVIDIA GPUs.  (Need
to check which expression gets compiled to LOP3.LUT on Maxwell, though.)

A curious aspect is that Maj() is invariant with respect to the ordering
of its arguments.  We can see it in the grep output above: some of the
expressions are the same except that they have x, y, z re-ordered in
different ways.  We could test all 6 possible orderings in different
contexts (SHA-1 vs. SHA-256 vs. SHA-512, and different OpenCL kernels,
etc.) and see which is faster where (this might in fact differ).

Attached to this message is a program I used to search for possible
optimized expressions like this.  No new findings from it, but it did
remind me of the issues I described in these two messages.  I was hoping
it might find a 2 operation expression for MD5's I(), but no luck.
It doesn't yet test two bitselect()'s per expression, though - this is
worth adding and trying again (many possibilities to test there).

Oh, and of course on Maxwell and AVX-512 MD5's I() (and all others) is
just one operation, if the compiler manages.  We should check the
generated code for our kernels on Maxwell.

Lei - I think you haven't gotten around to introducing AVX-512 ternary
logic intrinsics yet, have you?  Unfortunately, we don't yet have
hardware to test them on, but you could test on Intel SDE or/and by
emulating them with macros.

Alexander

View attachment "search3.c" of type "text/x-c" (2281 bytes)

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.