john-dev - Re: MD5 timings (x96-32)



Message-ID: <20120708062040.GC28346@openwall.com>
Date: Sun, 8 Jul 2012 10:20:40 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: MD5 timings (x96-32)

On Sat, Jul 07, 2012 at 04:57:08PM -0500, jfoug wrote:
> Also, I think we should make changes in the x86-mmx.h and x86-sse.h, to have
> these build types use MD5_X2 and not asm.

We'll need to combine this with a gcc version check like I do for BF_X2,
but with different gcc versions.  Here's what I get on Core i7 920 (best
of several runs since the machine is a server with some light load):

gcc 3.4.5, x1 - 8100 c/s (asm code, reference speed)
gcc 3.4.5, x2 - 5500 c/s
gcc 4.0.0, x2 - 8250 c/s
gcc 4.1.0, x2 - 8200 c/s
gcc 4.2.0, x2 - 7940 c/s
gcc 4.5.0, x2 - 8750 c/s

Same binaries copied to Pentium 3, 1.0 GHz, no load:

gcc 3.4.5, x1 - 2462 c/s (asm code, reference speed)
gcc 3.4.5, x2 - 1824 c/s
gcc 4.0.0, x2 - 2666 c/s
gcc 4.1.0, x2 - 2582 c/s
gcc 4.2.0, x2 - 2520 c/s
gcc 4.5.0, x2 - 2670 c/s

So there's a slight speedup with this change when using gcc 4+ (up to
roughly 8% over the asm code in the runs above), but regressions are
possible (seen with 4.2.0 on the Core i7 so far) and more testing is
needed (more gcc versions, more CPUs).

I guess a similar speedup is possible by tuning the x1 asm code for PPro
family CPUs (it's currently tuned for the original Pentium), similarly
to how we have two code versions for Blowfish in x86.S (with runtime
detection).  I am unhappy about spending time on this mostly legacy
code, but trying lots of gcc versions on lots of CPUs is the same thing
in this respect.

A related possible optimization is common subexpression elimination in
8 out of 16 steps in MD5's round 3.  It can have a non-obvious effect on
performance too, since on one hand it's one fewer XOR per step, but on
the other hand a register is tied up holding this value between steps.
Attached is an experimental patch for this, also experimenting with LEA
instead of ADD in _some_ places (would likely need to do it in all if
beneficial on target CPUs).  I tried this a while ago, and abandoned it
for lack of speedup (roughly same speed on P3), although I guess that
some speedup could be obtained with more effort.  Just to make it clear:
this patch is _not_ meant to be committed into any tree.  We're just
discussing.

Alexander

View attachment "john-x86-md5.diff" of type "text/plain" (2413 bytes)
