john-dev - Re: MD5 on XOP, NEON, AltiVec



Message-ID: <20150905043429.GA24746@openwall.com>
Date: Sat, 5 Sep 2015 07:34:29 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: MD5 on XOP, NEON, AltiVec

On Sat, Sep 05, 2015 at 07:17:49AM +0300, Solar Designer wrote:
> I sort of found it: somehow the code handling SSEi_FLAT_OUT, when
> compiled in, changes the stack frame layout in such a way that
> performance drops.  I wasn't yet able to tell why it drops.  The
> offsets look properly aligned to me either way.

BTW, the code size with SSEi_FLAT_OUT is:

$ nm -S simd-intrinsics.o | fgrep -w T
[...]
0000000000000000 0000000000002daf T SIMDmd5body
0000000000002dc0 0000000000003f9f T md5cryptsse

without SSEi_FLAT_OUT it becomes:

0000000000000000 0000000000002a38 T SSEmd5body
0000000000002a40 0000000000003f9f T md5cryptsse

That's 27982 vs. 27095 bytes total for the two functions.  In Bulldozer,
we have 64 KiB 2-way L1i shared by the two "cores" in a module.  So in
terms of sheer size, this should fit.  It is possible that the extra 887
bytes result in overlap with something else that we use (2-way
associativity is very low), but I would expect md5crypt's 1000
iterations to take long enough for this effect to be insignificant.
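
For reference, the totals can be reproduced from the second column of
the nm -S listings with plain shell arithmetic.  This is only a sketch:
the variable names are made up for illustration, and the hex values are
copied by hand from the output shown in this message.

```shell
# Sizes are the second column of `nm -S`, in hex.
with_flat_out=$(( 0x2daf + 0x3f9f ))     # both functions, SSEi_FLAT_OUT compiled in
without_flat_out=$(( 0x2a38 + 0x3f9f ))  # both functions, SSEi_FLAT_OUT compiled out
extra=$(( with_flat_out - without_flat_out ))
l1i=$(( 64 * 1024 ))                     # Bulldozer L1i per module, in bytes

echo "with: $with_flat_out  without: $without_flat_out  extra: $extra"
echo "L1i bytes left over: $(( l1i - with_flat_out ))"
```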

Thus, code size doesn't appear to be the cause.  However, we should keep
in mind that our code here grew this large, and maybe support builds
with less function inlining for CPUs with 16 KiB instruction caches
(which CPUs are those these days? some ARM chips perhaps?).
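
A check like this could be scripted.  The following is a rough sketch,
not anything that exists in the tree: text_size_check is a hypothetical
helper that sums the sizes of all global text (T) symbols in nm -S
output read from stdin and compares the total against a cache budget in
bytes.  Note that it counts every T symbol in the object file, not just
the two functions listed above.

```shell
# Hypothetical helper: sum sizes of T symbols from `nm -S` output on stdin
# and compare the total against an instruction cache budget (bytes).
text_size_check() {
    budget=$1
    total=0
    while read -r addr size type name; do
        # Skip anything that is not a sized global text symbol
        [ "$type" = "T" ] || continue
        # The size column is hex without a 0x prefix
        total=$(( total + 0x$size ))
    done
    if [ "$total" -le "$budget" ]; then
        echo "$total bytes: fits in $budget"
    else
        echo "$total bytes: exceeds $budget by $(( total - budget ))"
    fi
}
```

It would be used as "nm -S simd-intrinsics.o | text_size_check 16384"
for a 16 KiB instruction cache.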

Alexander
