Date: Tue, 8 Sep 2015 13:17:14 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: md5crypt mmxput*()

On Sat, Sep 05, 2015 at 05:51:51AM +0300, Solar Designer wrote:
> magnum, Jim, Simon, Lei -
>
> Some speedup for md5crypt on CPU might be possible through vectorizing
> the mmxput*() functions, or through use of SHLD/SHRD instructions
> (available since 386) or other archs' equivalents (I think ARM has this
> too) in mmxput3() when not vectorized (somehow gcc does not do it for
> us).  These functions are similar to buf_update() in cryptmd5_kernel.cl,
> where I've added uses of amd_bitalign() and NVIDIA's funnel shifter
> recently (analogous to SHLD/SHRD), and which obviously is processed on
> the SIMD units on GPUs (can do it on CPUs as well, although no SHLD/SHRD
> then, unless a given CPU architecture has them in SIMD form as well -
> need to look into that).

I looked into making mmxput3() use SHLD/SHRD, and found this comment:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55583#c6

"We can convince the current compiler to generate shrd by constructing
((((unsigned long long)a)<<32) | b) >> n"

I tried doing this, but since I'm on x86_64 I actually got a 64-bit
shift instead.  Good news: it's also faster.

Before:

Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 XOP 4x2]... (8xOMP) DONE
Raw:    228864 c/s real, 28608 c/s virtual

After:

Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 XOP 4x2]... (8xOMP) DONE
Raw:    231424 c/s real, 28928 c/s virtual

Patch attached.  I only tested this on bull so far.  I put an #if in
there enabling the 64-bit approach on any 64-bit arch as well as on
32-bit x86 (expecting that SHRD would be generated, as per the comment
above) - but this needs to be tested on different machines and with
different versions of gcc (I wouldn't be surprised if there's a
regression with some old version).
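As a minimal standalone sketch of the gcc bugzilla trick (not the
actual mmxput3() code - the function name and values here are made up
for illustration): combining two 32-bit words into a 64-bit value and
shifting by a runtime count lets gcc emit a single 64-bit SHR on
x86_64, or SHRD on 32-bit x86, instead of separate shift/or sequences:

```c
#include <stdint.h>

/* Hypothetical helper, illustrating the construct from the gcc
 * bugzilla comment: ((((unsigned long long)a)<<32) | b) >> n.
 * For n in 0..31 this extracts 32 bits spanning the boundary
 * between hi and lo.  On x86_64, gcc compiles the 64-bit shift
 * to one instruction; on 32-bit x86 it can become SHRD. */
static uint32_t funnel_shift_right(uint32_t hi, uint32_t lo, unsigned n)
{
    return (uint32_t)(((((uint64_t)hi) << 32) | lo) >> n);
}
```

For example, funnel_shift_right(0x12345678, 0x9abcdef0, 8) takes the
low byte of hi together with the top three bytes of lo.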
I think further speedup is possible by using a switch statement to make
the shift counts into constants (we have an if anyway, we'll just
replace it with a switch) like cryptmd5_kernel.cl has.  And indeed by
vectorizing this.  But for a trivial patch, the above speedup isn't bad.

In fact, with a switch there might not be a speedup from the 64-bit
shift anymore (except on 32-bit x86, where it should enable SHRD).  I
think the speedup comes from one of the shift counts becoming a
constant now (the constant 32), but with a switch all of them would be
constants anyway.  So maybe that #if would need to be revised then.

If we're not going to vectorize this soon, then maybe the next steps
are to try the switch and then to tune the #if based on testing on
different CPUs and compiler versions. ... or just test the attached
patch a bit more and commit it.  Maybe we should enable the 64-bit
approach for ARM as well (IIRC, it has an instruction similar to the
386's SHRD).

Alexander

View attachment "john-md5crypt-bitalign.diff" of type "text/plain" (1804 bytes)
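[A hedged sketch of the switch idea above - names and offsets are
illustrative, not taken from the actual mmxput3() code.  Dispatching on
the byte offset lets the compiler see every shift count as a
compile-time constant, so each case can use an immediate-operand shift
rather than a variable-count one:]

```c
#include <stdint.h>

/* Hypothetical sketch: select the 32 bits at a given byte offset
 * within a 64-bit combined value.  With a switch, each shift count
 * (8, 16, 24) is a constant, which is what a variable-count shift
 * would otherwise prevent the compiler from exploiting. */
static uint32_t shift_by_offset(uint64_t combined, unsigned byte_off)
{
    switch (byte_off) {
    case 1:  return (uint32_t)(combined >> 8);
    case 2:  return (uint32_t)(combined >> 16);
    case 3:  return (uint32_t)(combined >> 24);
    default: return (uint32_t)combined;   /* offset 0: no shift */
    }
}
```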