Date: Tue, 8 Sep 2015 13:17:14 +0300
From: Solar Designer <>
Subject: Re: md5crypt mmxput*()

On Sat, Sep 05, 2015 at 05:51:51AM +0300, Solar Designer wrote:
> magnum, Jim, Simon, Lei -
> Some speedup for md5crypt on CPU might be possible through vectorizing
> the mmxput*() functions, or through use of SHLD/SHRD instructions
> (available since 386) or other archs' equivalents (I think ARM has this
> too) in mmxput3() when not vectorized (somehow gcc does not do it for
> us).  These functions are similar to buf_update() in,
> where I've added uses of amd_bitalign() and NVIDIA's funnel shifter
> recently (analogous to SHLD/SHRD), and which obviously is processed on
> the SIMD units on GPUs (can do it on CPUs as well, although no SHLD/SHRD
> then, unless a given CPU architecture has them in SIMD form as well -
> need to look into that).

I looked into making mmxput3() use SHLD/SHRD, and found this comment:

"We can convince the current compiler to generate shrd by constructing
((((unsigned long long)a)<<32) | b) >> n"

I tried doing this, but since I'm on x86_64 I actually got a 64-bit
shift instead.  Good news: it's also faster.  Before:

Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 XOP 4x2]... (8xOMP) DONE
Raw:    228864 c/s real, 28608 c/s virtual

After:

Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 XOP 4x2]... (8xOMP) DONE
Raw:    231424 c/s real, 28928 c/s virtual

Patch attached.  I only tested this on bull so far.  I put an #if in
there enabling the 64-bit approach on any 64-bit arch as well as on
32-bit x86 (expecting that SHRD would be generated, as per the comment
above) - but this needs to be tested on different machines and with
different versions of gcc (I wouldn't be surprised if there's a
regression with some old version).
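For reference, the construct from that comment can be sketched on its own
(funnel_shr() is an invented helper name, not code from the tree):

```c
#include <stdint.h>

/*
 * Standalone sketch of the "((((unsigned long long)a)<<32) | b) >> n"
 * construct.  On x86_64 the compiler emits a single 64-bit SHR; on
 * 32-bit x86 it can emit SHRD instead.  In mmxput3() the count is
 * 32 - dec with dec in 8..24, so n stays within 1..31.
 */
static uint32_t funnel_shr(uint32_t hi, uint32_t lo, unsigned n)
{
	/* Concatenate hi:lo into 64 bits, shift right, keep low 32 bits */
	return (uint32_t)((((uint64_t)hi << 32) | lo) >> n);
}
```

Unlike the two-shift form, this also stays well-defined for n == 0 or
n == 32, where a 32-bit shift by 32 would be undefined behavior in C.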

I think further speedup is possible by using a switch statement to make
the shift counts into constants (we have an if anyway, we'll just
replace it with a switch) like has.  And indeed by
vectorizing this.  But for a trivial patch, the above speedup isn't bad.

In fact, with a switch there might not be a speedup from the 64-bit
shift anymore (except on 32-bit x86, where it should enable SHRD).
I think the speedup comes from one of the shift counts becoming a
constant now (the constant 32), but with a switch all of them would be
constants anyway.  So maybe that #if would need to be revised then.
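To illustrate the switch idea in isolation (made-up names, not the actual
mmxput3() code, and assuming dec only takes the values 8, 16, 24 in the
unaligned path):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Dispatch on the shift count so that each case shifts by a
 * compile-time constant, letting the compiler emit immediate-count
 * shifts instead of variable-count ones.
 */
static uint32_t combine(uint32_t hi, uint32_t lo, unsigned dec)
{
	switch (dec) {
	case 8:
		return (hi << 8) | (lo >> 24);
	case 16:
		return (hi << 16) | (lo >> 16);
	case 24:
		return (hi << 24) | (lo >> 8);
	default:
		/* dec == 0 would be the aligned path, handled elsewhere */
		assert(0);
		return 0;
	}
}
```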

If we're not going to vectorize this soon, then maybe the next steps are
to try switch and then to tune the #if based on testing on different
CPUs and compiler versions.

... or just test the attached patch a bit more and commit it.

Maybe we should enable the 64-bit approach for ARM as well (IIRC, it has
an instruction similar to the 386's SHRD).


diff --git a/src/simd-intrinsics.c b/src/simd-intrinsics.c
index 1f863d5..7307bb8 100644
--- a/src/simd-intrinsics.c
+++ b/src/simd-intrinsics.c
@@ -456,19 +456,24 @@ static MAYBE_INLINE void mmxput3(void *buf, unsigned int bid,
 				noffd = noff & (~3);
+#if (ARCH_SIZE >= 8) || defined(__i386__)
+#define BITALIGN(hi, lo, s) ((((uint64_t)(hi) << 32) | (lo)) >> (s))
+#else
+#define BITALIGN(hi, lo, s) (((hi) << (32 - (s))) | ((lo) >> (s)))
+#endif
 				((unsigned int*)(nbuf+noffd*VS32))[i+0*VS32] &=
 				((unsigned int*)(nbuf+noffd*VS32))[i+0*VS32] |=
 					(((unsigned int*)src)[i+j*4*VS32+0*VS32] << dec);
-				((unsigned int*)(nbuf+noffd*VS32))[i+1*VS32] =
-					(((unsigned int*)src)[i+j*4*VS32+1*VS32] << dec) |
-					(((unsigned int*)src)[i+j*4*VS32+0*VS32] >> (32-dec));
-				((unsigned int*)(nbuf+noffd*VS32))[i+2*VS32] =
-					(((unsigned int*)src)[i+j*4*VS32+2*VS32] << dec) |
-					(((unsigned int*)src)[i+j*4*VS32+1*VS32] >> (32-dec));
-				((unsigned int*)(nbuf+noffd*VS32))[i+3*VS32] =
-					(((unsigned int*)src)[i+j*4*VS32+3*VS32] << dec) |
-					(((unsigned int*)src)[i+j*4*VS32+2*VS32] >> (32-dec));
+				((unsigned int*)(nbuf+noffd*VS32))[i+1*VS32] = BITALIGN(
+					((unsigned int*)src)[i+j*4*VS32+1*VS32],
+					((unsigned int*)src)[i+j*4*VS32+0*VS32], 32 - dec);
+				((unsigned int*)(nbuf+noffd*VS32))[i+2*VS32] = BITALIGN(
+					((unsigned int*)src)[i+j*4*VS32+2*VS32],
+					((unsigned int*)src)[i+j*4*VS32+1*VS32], 32 - dec);
+				((unsigned int*)(nbuf+noffd*VS32))[i+3*VS32] = BITALIGN(
+					((unsigned int*)src)[i+j*4*VS32+3*VS32],
+					((unsigned int*)src)[i+j*4*VS32+2*VS32], 32 - dec);
 				((unsigned int*)(nbuf+noffd*VS32))[i+4*VS32] &=
 				((unsigned int*)(nbuf+noffd*VS32))[i+4*VS32] |=
