
Date: Wed, 2 Sep 2015 18:20:25 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: SHA1 H()

magnum, Lei,

SHA1's H() aka F3() is the same as SHA2's Maj(), yet somehow we're
using less optimal expressions for it on systems with bitselect().
In opencl_sha1.h we have:

#define F3(x, y, z)	(bitselect(x, y, z) ^ bitselect(x, 0U, y))

I've just tried changing this to:

#define F3(x, y, z)	bitselect(x, y, (z) ^ (x))

and got some speedup for pbkdf2-hmac-sha1-opencl on GCN (1200K to
1228K c/s).  The same pattern is also seen in:

[solar@...er src]$ grep -r 'bitselect.*\^.*bitselect' .
./opencl_sha1.h:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/gpg_kernel.cl:#define F(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/rar_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/rar_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/sha1_kernel.cl:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/salted_sha_kernel.cl:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/pbkdf2_kernel.cl:#define F(x, y, z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))
./opencl/pbkdf2_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))

and maybe elsewhere, if written slightly differently.

In simd-intrinsics.c we have:

#if __XOP__
#define SHA1_H(x,y,z)                                       \
	tmp[i] = vcmov((x[i]),(y[i]),(z[i]));               \
	tmp[i] = vxor((tmp[i]),vandnot((x[i]),(y[i])));
#else
#define SHA1_H(x,y,z)                                       \
	tmp[i] = vand((x[i]),(y[i]));                       \
	tmp[i] = vor((tmp[i]),vand(vor((x[i]),(y[i])),(z[i])));
#endif

This is suboptimal in two ways:

1. It doesn't use the more optimal expression above (which can do 2
operations instead of 3).

2. The check for __XOP__ prevents this optimization from being used
for other archs where we have non-emulated vcmov().  This is currently
NEON and AltiVec.
While we could simply drop the check for __XOP__ once we've optimized
the expression, since it'd be 4 operations with emulated vcmov(), which
is the same count as the current #else branch, I suggest that we don't,
because the 4 operations in #else include some parallelism whereas our
vcmov() emulation does not.  So I think we should either extend the
check to also test for __ARM_NEON__ and __ALTIVEC__, or introduce some
generic way of checking for non-emulated vcmov() (e.g., a macro defined
in pseudo_intrinsics.h that would indicate that vcmov() is emulated, so
we'd avoid it then).

The same applies to rawSHA1_ng_fmt_plug.c, where we have:

#define R3(W, A, B, C, D, E) do {                                   \
	E = vadd_epi32(E, K);                                       \
	E = vadd_epi32(E, vxor(vcmov(D, B, C), vandnot(D, B)));     \
	E = vadd_epi32(E, W);                                       \
	B = vroti_epi32(B, 30);                                     \
	E = vadd_epi32(E, vroti_epi32(A, 5));                       \
} while (false)

In fact, it's even worse there: as currently written, this expands to
5 operations when vcmov() is emulated, instead of 4.  I think we should
put an #if around R3 and define it in two different ways: using the
more optimal 2-operation expression when vcmov() is non-emulated, and
using the 4-operation expression with parallelism (same as the current
#else branch in simd-intrinsics.c) when vcmov() is emulated.

magnum, will you take care of this, please?  And test on GPUs and on
XOP.

Lei, will you test/benchmark on NEON and AltiVec once magnum commits
the fixes, please?

Thanks,

Alexander