john-dev - MD4/MD5 round 3 common XOR (was: bitslice MD*/SHA*, AVX2)

Message-ID: <20150314040000.GA9163@openwall.com>
Date: Sat, 14 Mar 2015 07:00:00 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: MD4/MD5 round 3 common XOR (was: bitslice MD*/SHA*, AVX2)

On Fri, Mar 13, 2015 at 09:01:25AM +0100, magnum wrote:
> On 2015-03-11 23:07, Solar Designer wrote:
> > On Wed, Mar 11, 2015 at 10:45:19PM +0100, magnum wrote:
> >> On 2015-03-11 22:21, Solar Designer wrote:
> >>> In my testing, this might not be beneficial on 2-operand archs such as
> >>> plain x86, but it should be on 3-operand archs such as AVX.  So we
> >>> should update the code in sse-intrinsics.c, and benchmark.  And we should
> >>> update the plain C code anyway, such as for non-x86 archs (which are
> >>> mostly 3-operand RISC).
> >>>
> >>> magnum, Jim?
> >>
> >> Yeah... unless we have some GSoC candidate wanting to show his/her
> >> teeth? That would be a good start!
> > 
> > OK, I don't mind keeping this on hold until GSoC student application
> > period ends.  Would you track it, so it doesn't get forgotten in case no
> > GSoC candidate takes care of it?
> 
> Out of curiosity I did some experiments with sse-intrinsics.c and I only
> see regression when trying to implement this. Does that make sense? I
> also tried with no interleaving, still a regression. Could this somehow
> break some other optimization made by the compiler? In the MD4 case I
> didn't even have to add a new temp variable, it already has tmp2 free to
> use at that place.
> 
> It doesn't get much slower, but always definitely slower.

As long as you're building for AVX or better and have enough registers,
that's unexpected.  Yes, if you're observing this anyway, it might be
breaking other optimizations.

Like I wrote, on 2-operand archs, including SSE2, this might hurt
performance, as it might be replacing XORs with MOVs and eating up
extra registers.  Also, on register-starved archs, such as x86 in 32-bit
mode, the extra register pressure might hurt performance.  But on
x86-64 with AVX or better, it should be beneficial.
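
For reference, here is a minimal scalar sketch of the transformation under
discussion.  Round 3 of MD4 and MD5 uses H(x, y, z) = x ^ y ^ z, and two
consecutive steps share two of their three inputs, so one XOR can be
computed once and used twice.  The function names, the 4-step grouping,
and passing the message words and constants as arrays are assumptions
made for illustration; this is not the code in sse-intrinsics.c.

```c
#include <stdint.h>

/* Rotate left; s must be in 1..31. */
static uint32_t rotl32(uint32_t x, unsigned s)
{
	return (x << s) | (x >> (32 - s));
}

/* Four consecutive MD5-style round 3 steps, each XORing all three inputs. */
static void round3_plain(uint32_t v[4], const uint32_t x[4],
    const uint32_t k[4])
{
	uint32_t a = v[0], b = v[1], c = v[2], d = v[3];

	a = b + rotl32(a + (b ^ c ^ d) + x[0] + k[0], 4);
	d = a + rotl32(d + (a ^ b ^ c) + x[1] + k[1], 11);
	c = d + rotl32(c + (d ^ a ^ b) + x[2] + k[2], 16);
	b = c + rotl32(b + (c ^ d ^ a) + x[3] + k[3], 23);

	v[0] = a; v[1] = b; v[2] = c; v[3] = d;
}

/* Same four steps with the common XOR hoisted: 6 XORs instead of 8. */
static void round3_common_xor(uint32_t v[4], const uint32_t x[4],
    const uint32_t k[4])
{
	uint32_t a = v[0], b = v[1], c = v[2], d = v[3];
	uint32_t t;

	t = b ^ c;	/* b and c are unchanged across steps 1 and 2 */
	a = b + rotl32(a + (t ^ d) + x[0] + k[0], 4);
	d = a + rotl32(d + (a ^ t) + x[1] + k[1], 11);

	t = d ^ a;	/* d and a are unchanged across steps 3 and 4 */
	c = d + rotl32(c + (t ^ b) + x[2] + k[2], 16);
	b = c + rotl32(b + (c ^ t) + x[3] + k[3], 23);

	v[0] = a; v[1] = b; v[2] = c; v[3] = d;
}
```

With 3-operand instructions, t ^ d and a ^ t can each write a new
register and leave t intact.  With 2-operand instructions, the first use
would overwrite t unless it is copied first, which is where the extra
MOVs come from.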

A month or so ago, atom said on Twitter that it was always beneficial
for him, though.  Oh, maybe that's because he wasn't testing on anything
older than Bulldozer or Ivy Bridge, and these have free MOVs via
register renaming?

Maybe your reuse of "tmp2", or the compiler's reuse of a register of its
choosing, causes extra anti-dependencies?

http://en.wikipedia.org/wiki/Data_dependency#Anti-dependency

I don't know how well specific CPUs' register renaming deals with
these.

You could try introducing a new variable for this, to possibly make it
more likely that the compiler would allocate a new register if it can.
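
To illustrate the suggestion, here is a hedged scalar sketch of four
MD4-style round 3 steps written two ways: once overwriting a single
scratch variable (standing in for the "tmp2" mentioned above), and once
with a separate variable per shared XOR.  The function names and the
variables bc and da are hypothetical, and the real code uses SIMD
intrinsics rather than uint32_t.  A compiler may well generate identical
code for both, so the generated assembly is worth checking.

```c
#include <stdint.h>

#define K3 0x6ed9eba1u	/* MD4 round 3 additive constant */

/* Rotate left; s must be in 1..31. */
static uint32_t rotl32(uint32_t x, unsigned s)
{
	return (x << s) | (x >> (32 - s));
}

/* One scratch variable reused for both shared XORs. */
static void md4_r3_reused_tmp(uint32_t v[4], const uint32_t x[4])
{
	uint32_t a = v[0], b = v[1], c = v[2], d = v[3];
	uint32_t tmp2;

	tmp2 = b ^ c;
	a = rotl32(a + (tmp2 ^ d) + x[0] + K3, 3);
	d = rotl32(d + (a ^ tmp2) + x[1] + K3, 9);
	tmp2 = d ^ a;	/* write after the reads above: anti-dependency */
	c = rotl32(c + (tmp2 ^ b) + x[2] + K3, 11);
	b = rotl32(b + (c ^ tmp2) + x[3] + K3, 15);

	v[0] = a; v[1] = b; v[2] = c; v[3] = d;
}

/* A separate variable per shared XOR; no name is ever overwritten. */
static void md4_r3_fresh_tmp(uint32_t v[4], const uint32_t x[4])
{
	uint32_t a = v[0], b = v[1], c = v[2], d = v[3];

	const uint32_t bc = b ^ c;
	a = rotl32(a + (bc ^ d) + x[0] + K3, 3);
	d = rotl32(d + (a ^ bc) + x[1] + K3, 9);
	const uint32_t da = d ^ a;
	c = rotl32(c + (da ^ b) + x[2] + K3, 11);
	b = rotl32(b + (c ^ da) + x[3] + K3, 15);

	v[0] = a; v[1] = b; v[2] = c; v[3] = d;
}
```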

Alexander
