john-dev - SHA-1 H()

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150902152025.GA24897@openwall.com>
Date: Wed, 2 Sep 2015 18:20:25 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: SHA-1 H()

magnum, Lei -

SHA-1's H() aka F3() is the same as SHA-2's Maj(), yet somehow we're
using less optimal expressions for it on systems with bitselect().

In opencl_sha1.h we have:

#define F3(x, y, z)     (bitselect(x, y, z) ^ bitselect(x, 0U, y))

I've just tried changing this to:

#define F3(x, y, z)     bitselect(x, y, (z) ^ (x))

and got some speedup for pbkdf2-hmac-sha1-opencl on GCN (1200K to
1228K c/s).

The same pattern is also seen in:

[solar@...er src]$ grep -r 'bitselect.*\^.*bitselect' .
./opencl_sha1.h:#define F3(x, y, z)     (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/gpg_kernel.cl:#define F(x, y, z)       (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/rar_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/rar_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/sha1_kernel.cl:#define F3(x, y, z)     (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/salted_sha_kernel.cl:#define F3(x, y, z)       (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/pbkdf2_kernel.cl:#define F(x, y, z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))
./opencl/pbkdf2_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))

and maybe elsewhere, if written slightly differently.

In simd-intrinsics.c we have:

#if __XOP__
#define SHA1_H(x,y,z)                               \
    tmp[i] = vcmov((x[i]),(y[i]),(z[i]));           \
    tmp[i] = vxor((tmp[i]),vandnot((x[i]),(y[i])));
#else
#define SHA1_H(x,y,z)                                       \
    tmp[i] = vand((x[i]),(y[i]));                           \
    tmp[i] = vor((tmp[i]),vand(vor((x[i]),(y[i])),(z[i])));
#endif

This is suboptimal in two ways:

1. It doesn't use the more optimal expression above (can do 2 operations
instead of 3).

2. The check for __XOP__ prevents this optimization from being used for
other archs where we have non-emulated vcmov().  This is currently NEON
and AltiVec.

While we could simply drop the check for __XOP__ once we've optimized
the expression since it'd be 4 operations with emulated vcmov(), which
is same count as the current #else branch, I suggest that we don't,
because the 4 operations in #else include some parallelism whereas our
vcmov() emulation does not.

So I think we should either enhance the check with also checking for
__ARM_NEON__ and __ALTIVEC__, or introduce some generic way of checking
for non-emulated vcmov() (e.g., a macro defined in pseudo_intrinsics.h
that would indicate that vcmov() is emulated, so we'd avoid it then).

The same applies to rawSHA1_ng_fmt_plug.c where we have:

#define R3(W, A, B, C, D, E) do {                                   \
        E   = vadd_epi32(E, K);                                     \
        E   = vadd_epi32(E, vxor(vcmov(D, B, C), vandnot(D, B)));   \
        E   = vadd_epi32(E, W);                                     \
        B   = vroti_epi32(B, 30);                                   \
        E   = vadd_epi32(E, vroti_epi32(A, 5));                     \
    } while (false)

In fact, it's even worse there: as currently written, this expands to 5
operations when vcmov() is emulated, instead of 4.  I think we should
put an #if around R3 and define it in two different ways: using the more
optimal 2 operations expression when vcmov() is non-emulated, and using
the 4 operations expression with parallelism (same as the current #else
branch in simd-intrinsics.c) when vcmov() is emulated.

magnum, will you take care of this, please?  And test on GPUs and on XOP.

Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
fixes, please?

Thanks,

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.