Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date: Wed, 2 Sep 2015 18:20:25 +0300
From: Solar Designer <>
Subject: SHA-1 H()

magnum, Lei -

SHA-1's H() aka F3() is the same as SHA-2's Maj(), yet somehow we're
using less optimal expressions for it on systems with bitselect().

In opencl_sha1.h we have:

#define F3(x, y, z)     (bitselect(x, y, z) ^ bitselect(x, 0U, y))

I've just tried changing this to:

#define F3(x, y, z)     bitselect(x, y, (z) ^ (x))

and got some speedup for pbkdf2-hmac-sha1-opencl on GCN (1200K to
1228K c/s).

The same pattern is also seen in:

[ src]$ grep -r 'bitselect.*\^.*bitselect' .
./opencl_sha1.h:#define F3(x, y, z)     (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/ F(x, y, z)       (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/ F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/ F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/ F3(x, y, z)     (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/ F3(x, y, z)       (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/ F(x, y, z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))
./opencl/ F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))

and maybe elsewhere, if written slightly differently.

In simd-intrinsics.c we have:

#if __XOP__
#define SHA1_H(x,y,z)                               \
    tmp[i] = vcmov((x[i]),(y[i]),(z[i]));           \
    tmp[i] = vxor((tmp[i]),vandnot((x[i]),(y[i])));
#define SHA1_H(x,y,z)                                       \
    tmp[i] = vand((x[i]),(y[i]));                           \
    tmp[i] = vor((tmp[i]),vand(vor((x[i]),(y[i])),(z[i])));

This is suboptimal in two ways:

1. It doesn't use the more optimal expression above (can do 2 operations
instead of 3).

2. The check for __XOP__ prevents this optimization from being used for
other archs where we have non-emulated vcmov().  This is currently NEON
and AltiVec.

While we could simply drop the check for __XOP__ once we've optimized
the expression since it'd be 4 operations with emulated vcmov(), which
is same count as the current #else branch, I suggest that we don't,
because the 4 operations in #else include some parallelism whereas our
vcmov() emulation does not.

So I think we should either enhance the check with also checking for
__ARM_NEON__ and __ALTIVEC__, or introduce some generic way of checking
for non-emulated vcmov() (e.g., a macro defined in pseudo_intrinsics.h
that would indicate that vcmov() is emulated, so we'd avoid it then).

The same applies to rawSHA1_ng_fmt_plug.c where we have:

#define R3(W, A, B, C, D, E) do {                                   \
        E   = vadd_epi32(E, K);                                     \
        E   = vadd_epi32(E, vxor(vcmov(D, B, C), vandnot(D, B)));   \
        E   = vadd_epi32(E, W);                                     \
        B   = vroti_epi32(B, 30);                                   \
        E   = vadd_epi32(E, vroti_epi32(A, 5));                     \
    } while (false)

In fact, it's even worse there: as currently written, this expands to 5
operations when vcmov() is emulated, instead of 4.  I think we should
put an #if around R3 and define it in two different ways: using the more
optimal 2 operations expression when vcmov() is non-emulated, and using
the 4 operations expression with parallelism (same as the current #else
branch in simd-intrinsics.c) when vcmov() is emulated.

magnum, will you take care of this, please?  And test on GPUs and on XOP.

Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
fixes, please?



Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.