Date: Mon, 9 Jul 2012 10:30:29 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Cc: Tavis Ormandy <taviso@...xchg8b.com>
Subject: Re: Rotate and bitselect investigation

magnum, Sayantan, Tavis -

On Mon, Jul 09, 2012 at 10:15:54AM +0530, Sayantan Datta wrote:
> On Mon, Jul 9, 2012 at 6:27 AM, Solar Designer <solar@...nwall.com> wrote:
> > On Mon, Jul 09, 2012 at 01:24:14AM +0530, Sayantan Datta wrote:
> > > I was able to squeeze in another bitselect optimization in the kernel
> > > resulting in 2% performance jump.
> >
> > Where did you put that other bitselect()?
>
> F(x,y,z) ((x & y) | (z & (x | y)))==F(x,y,z) (bitselect(x, y, z) ^
> bitselect(x, (uint)0, y))

Wow. I wonder if this trick for SHA-1 was known at all. Not to us, it
seems.

The second bitselect() is essentially an and-not, so the speed might be
better if it's written as such (if there's an and-not instruction).

Also, I guess this change should hurt on NVIDIA (does it?), so you'll
need to wrap it in some #ifdef.

Anyway, I've just tried it on CPU (XOP). Patch attached. Here are the
speeds (best of several invocations in each case):

Before:

Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 XOP intrinsics 4x]... DONE
Raw:    28925K c/s real, 28925K c/s virtual

After:

Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 XOP intrinsics 4x]... DONE
Raw:    28435K c/s real, 28435K c/s virtual

Somehow a slowdown here, even though one instruction is saved. I guess
we incur some data dependency stall, which apparently could be avoided
with greater parallelism (interleaved instructions for two sets of SHA-1
computations - that is, for 8 of them at once).

With the sse-intrinsics.c code, there's actually a speedup from an
equivalent change:

Before:

Benchmarking: Raw SHA-1 [128/128 XOP intrinsics 8x]... DONE
Raw:    22221K c/s real, 22221K c/s virtual

Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [128/128 XOP intrinsics 8x]... (8xOMP) DONE
Raw:    4515 c/s real, 562 c/s virtual

After:

Benchmarking: Raw SHA-1 [128/128 XOP intrinsics 8x]... DONE
Raw:    23629K c/s real, 23629K c/s virtual

Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [128/128 XOP intrinsics 8x]... (8xOMP) DONE
Raw:    4698 c/s real, 586 c/s virtual

I guess we should commit at least the change to sse-intrinsics.c. As to
the change to rawSHA1_ng_fmt.c, maybe we should commit it in
commented-out form - e.g., replace the "#ifdef __XOP__" with "#if 0" for
now, just to have this alternative code recorded.

Finally, I think more OpenCL kernels may benefit from this - essentially
all where we have SHA-1.

Thanks,

Alexander

View attachment "john-sha1-r3-bitselect.diff" of type "text/plain" (2716 bytes)
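
For anyone wanting to check the quoted identity by hand, here is a minimal C sketch. It is not the attached patch and not the OpenCL kernel code. bitselect32() is a hypothetical stand-in that assumes OpenCL's bitselect(a, b, c) semantics (take the bit from b where c has a 1, from a otherwise), and the function names are made up for illustration. It compares the plain majority function with the two-bitselect form and with the variant where the second bitselect() is written as an and-not.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for OpenCL's bitselect(a, b, c): each result bit
   comes from b where the matching bit of c is 1, and from a otherwise. */
static inline uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* SHA-1 majority function F as written in the quoted message. */
static inline uint32_t maj_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (z & (x | y));
}

/* Two-bitselect form from the quoted message. */
static inline uint32_t maj_bitselect(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect32(x, y, z) ^ bitselect32(x, 0, y);
}

/* Same form with the second bitselect written directly as an and-not. */
static inline uint32_t maj_andnot(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect32(x, y, z) ^ (x & ~y);
}

/* Returns 1 if all three forms agree on every combination of input bits. */
static inline int maj_forms_agree(void)
{
    /* Bit lanes 0..7 of these words enumerate all 8 (x, y, z) combinations,
       so one evaluation covers the whole truth table. */
    const uint32_t x = 0xF0, y = 0xCC, z = 0xAA;
    const uint32_t want = maj_ref(x, y, z);

    return maj_bitselect(x, y, z) == want && maj_andnot(x, y, z) == want;
}

/* Aborts via assert() if the forms ever disagree. */
static inline void check_maj_forms(void)
{
    assert(maj_forms_agree());
}
```

Calling check_maj_forms() from a test program's main() is enough to exercise it. This only shows that the forms compute the same function. Whether either one is faster depends on the target's instruction set and scheduling, as the benchmark figures above suggest.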