john-dev - intrinsics: speed up for linux-x86-64-native


Message-ID: <20120914130809.GA14967@debian>
Date: Fri, 14 Sep 2012 17:08:09 +0400
From: Aleksey Cherepanov <aleksey.4erepanov@...il.com>
To: john-dev@...ts.openwall.com
Subject: intrinsics: speed up for linux-x86-64-native

(I caught a cold, so I postponed my todos and am having fun learning
intrinsics at the moment.)

Looking over sse-intrinsics.c, I noticed a weird thing: there are
multiple MD5_PARA_DO loops where it would be possible to write a single
loop over everything and avoid the tmp variable. I eliminated some of
the loops and got a speedup. But when I merged them all into one loop
per MD5_STEP, I got a significant slowdown.
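
To make the two shapes concrete, here is a minimal sketch of one MD5
round-1 step written both ways. It is not the code from
sse-intrinsics.c: PARA, md5_f, rotl32, step_split, step_merged and
steps_agree are hypothetical names made up for this illustration, and
PARA only stands in for MD5_SSE_PARA. Both variants compute the same
values; only the loop structure differs.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

#define PARA 3  /* hypothetical stand-in for MD5_SSE_PARA */

/* MD5 F function, (x & y) | (~x & z), on four 32-bit lanes */
static __m128i md5_f(__m128i x, __m128i y, __m128i z)
{
    return _mm_or_si128(_mm_and_si128(x, y), _mm_andnot_si128(x, z));
}

/* Rotate each 32-bit lane left by s bits (0 < s < 32) */
static __m128i rotl32(__m128i v, int s)
{
    return _mm_or_si128(_mm_sll_epi32(v, _mm_cvtsi32_si128(s)),
                        _mm_srl_epi32(v, _mm_cvtsi32_si128(32 - s)));
}

/* Variant A: one short loop per operation, F staged through tmp[] */
static void step_split(__m128i a[PARA], const __m128i b[PARA],
                       const __m128i c[PARA], const __m128i d[PARA],
                       const __m128i x[PARA], uint32_t t, int s)
{
    __m128i tmp[PARA];
    int i;
    for (i = 0; i < PARA; i++) a[i] = _mm_add_epi32(a[i], _mm_set1_epi32((int)t));
    for (i = 0; i < PARA; i++) tmp[i] = md5_f(b[i], c[i], d[i]);
    for (i = 0; i < PARA; i++) a[i] = _mm_add_epi32(a[i], tmp[i]);
    for (i = 0; i < PARA; i++) a[i] = _mm_add_epi32(a[i], x[i]);
    for (i = 0; i < PARA; i++) a[i] = rotl32(a[i], s);
    for (i = 0; i < PARA; i++) a[i] = _mm_add_epi32(a[i], b[i]);
}

/* Variant B: a single loop per step, no tmp[] array */
static void step_merged(__m128i a[PARA], const __m128i b[PARA],
                        const __m128i c[PARA], const __m128i d[PARA],
                        const __m128i x[PARA], uint32_t t, int s)
{
    const __m128i k = _mm_set1_epi32((int)t);
    int i;
    for (i = 0; i < PARA; i++) {
        __m128i v = _mm_add_epi32(a[i], k);
        v = _mm_add_epi32(v, md5_f(b[i], c[i], d[i]));
        v = _mm_add_epi32(v, x[i]);
        a[i] = _mm_add_epi32(rotl32(v, s), b[i]);
    }
}

/* Scalar reference for one lane: b + rotl(a + F(b,c,d) + x + t, s) */
static uint32_t scalar_step(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                            uint32_t x, uint32_t t, int s)
{
    uint32_t v = a + ((b & c) | (~b & d)) + x + t;
    return b + ((v << s) | (v >> (32 - s)));
}

/* Returns 1 if both variants agree with each other and with the scalar step */
static int steps_agree(void)
{
    __m128i a1[PARA], a2[PARA], b[PARA], c[PARA], d[PARA], x[PARA];
    int i, ok = 1;
    for (i = 0; i < PARA; i++) {
        a1[i] = a2[i] = _mm_set1_epi32(0x67452301);
        b[i] = _mm_set1_epi32((int)0xefcdab89u);
        c[i] = _mm_set1_epi32((int)0x98badcfeu);
        d[i] = _mm_set1_epi32(0x10325476);
        x[i] = _mm_set1_epi32(i + 1);  /* arbitrary message words */
    }
    step_split(a1, b, c, d, x, 0xd76aa478u, 7);
    step_merged(a2, b, c, d, x, 0xd76aa478u, 7);
    for (i = 0; i < PARA; i++) {
        uint32_t ref = scalar_step(0x67452301u, 0xefcdab89u, 0x98badcfeu,
                                   0x10325476u, (uint32_t)(i + 1), 0xd76aa478u, 7);
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(a1[i], a2[i])) != 0xFFFF) ok = 0;
        if ((uint32_t)_mm_cvtsi128_si32(a1[i]) != ref) ok = 0;
    }
    return ok;
}
```

Which variant is faster is exactly the open question here; the sketch
only shows that the results are identical, so any speed difference comes
from scheduling and register use, not from different arithmetic.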

The empirical observation is that feeding the result of one
_mm_add_epi32 directly into another _mm_add_epi32 is a bad idea, while
mixing different instructions is OK (all the other instructions are
logical ones, which should be trivial and fast under any circumstances,
except perhaps andnot, I guess). So I guess the slowdown is caused by
instruction latencies (a hint from Alexander Cherepanov).
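
As an illustration of the latency argument, the sketch below sums the
same four operands in two ways. The function names (sum_chained,
sum_balanced, lane0) are hypothetical and this is not taken from the
patch. In the chained form every _mm_add_epi32 waits for the previous
one; in the balanced form the first two additions do not depend on each
other. Since 32-bit addition wraps and is associative, both give the
same result. Whether the second form is actually faster depends on the
CPU and on how the compiler schedules or reassociates the code, so it
should be treated as something to measure, not as a rule.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Chained: each add consumes the previous add's result (depth 3) */
static __m128i sum_chained(__m128i a, __m128i k, __m128i f, __m128i x)
{
    __m128i v = _mm_add_epi32(a, k);
    v = _mm_add_epi32(v, f);
    return _mm_add_epi32(v, x);
}

/* Balanced: two independent adds, then one combining add (depth 2) */
static __m128i sum_balanced(__m128i a, __m128i k, __m128i f, __m128i x)
{
    __m128i lo = _mm_add_epi32(a, k);  /* independent of hi */
    __m128i hi = _mm_add_epi32(f, x);  /* independent of lo */
    return _mm_add_epi32(lo, hi);
}

/* Extract the lowest 32-bit lane for inspection */
static int lane0(__m128i v)
{
    return _mm_cvtsi128_si32(v);
}
```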

I found https://developer.apple.com/hardwaredrivers/ve/sse.html : the
section "Pipelines, Latencies and Unrolling" explains a bit about
optimization, but I got a slowdown again when trying to merge and/or
unroll the remaining loops. How can I tell whether I have exhausted the
cache?

I also tried changing the MD5_SSE_PARA value, but with no luck.

I attach a patch with the best combination I found without intrusive
changes (i.e. without splitting the MD5_{F,G,H,I} functions to mix their
instructions with the additions). It gives a noticeable speedup for
md5-based formats, though the speed is still far from that of the
intrinsics compiled with icc (obtained from the unpatched intrinsics).

I also quickly tried the same for md4 (the nt format) and got speed
equal to that of icc's intrinsics in many combinations, including one
loop per step.

I did not try sha1 because I am afraid I am on the wrong track: do my
changes perhaps only help on my computer/compiler setup (Core i7,
gcc 4.7.1)? What are the reasons for having so many loops?

If everything is OK, I could prepare patches for md4 and sha1. Should
I? Should I also try more intrusive changes?

Thanks!

-- 
Regards,
Aleksey Cherepanov

View attachment "t.patch" of type "text/x-diff" (2518 bytes)
