john-dev - bitslice MD*/SHA*, AVX2

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150311205509.GA19294@openwall.com>
Date: Wed, 11 Mar 2015 23:55:09 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: bitslice MD*/SHA*, AVX2

Hi,

So I was thinking whether to drop trying bitslice SHA-2 from the "John
the Ripper SIMD support enhancements" / "Better SIMD implementations of
SHA-2 hashes and higher-level schemes" GSoC sub-task description, since
this portion is too(?) unlikely to result in better performance on
current hardware.

I dug up the latest revision of my md5slice.c experiment (attached to
this message) and tried it out on our i7-4770K.  Here's the speed with
128-bit AVX, building with "gcc version 4.6.3 (Ubuntu/Linaro
4.6.3-1ubuntu5)":

solar@...l:~/md5slice$ gcc md5slice.c -o md5slice -Wall -s -O3 -fomit-frame-pointer -funroll-loops -DVECTOR -march=native
solar@...l:~/md5slice$ time ./md5slice
vector size = 128 bits
c09c4c1f 21876746 18aed2 70b452f0

real    0m0.630s
user    0m0.628s
sys     0m0.000s

That's for 10 million MD5s, giving us a speed of 10M/0.63 = ~15.9M MD5/s.

To try AVX2, I changed one line:

typedef element vector __attribute__ ((vector_size (16)));

to:

typedef element vector __attribute__ ((vector_size (32)));

and built with:

solar@...l:~/md5slice$ export PATH=~gcc/gcc-4.9-20130707/bin:$PATH
solar@...l:~/md5slice$ gcc md5slice.c -o md5slice -Wall -s -O3 -fomit-frame-pointer -funroll-loops -DVECTOR -march=native

This gave "warning: always_inline function might not be inlinable" about
FF(), I(), H(), F(), add32r(), add32c(), add32() - but then it built
fine.  The speed is:

solar@...l:~/md5slice$ time ./md5slice
vector size = 256 bits
c09c4c1f 21876746 18aed2 70b452f0

real    0m0.354s
user    0m0.348s
sys     0m0.004s

which corresponds to 10M/0.35 = ~28.5 MD5/s.

For comparison, a non-OpenMP build of john-1.8.0-jumbo-1 on this machine
achieves:

solar@...l:~/j/john-1.8.0-jumbo-1/run$ ./john -te -form=raw-md5
Benchmarking: Raw-MD5 [MD5 128/128 AVX 12x]... DONE
Raw:    31923K c/s real, 31923K c/s virtual

So the speed is finally comparable.  That's AVX vs. AVX2, so a
straightforward implementation of MD5 with AVX2 will likely run faster
yet, but the results so far are not conclusively anti-bitslice.

And the ease of upgrading from 128-bit AVX to 256-bit AVX2 with a
one-line change (a two-character change, even) is appealing.  BTW, I did
review the generated assembly code to confirm that it was 128-bit AVX
and 256-bit AVX2 in these two tests as expected.

There's definitely some further room for optimization in md5slice.c -
for example, it does not yet take advantage of the common subexpressions
in the invocations of HH() (we can save 8*32 XORs).  FF() has everything
merged into it manually and specialized, whereas GG(), HH(), and II()
use other functions.  Also, explicit (such as via cpp or with a script)
unrolling and inlining might (or might not) work better than gcc's (the
various *ptr++'s will be replaced with explicit array accesses at fixed
indices, which I think gcc does for us anyway).

Overall, based on this test, I think I'll leave the task listed.  And we
might even need to give it a try for MD4, MD5, and SHA-1 if AVX2 somehow
does not speedup straightforward implementations of them a lot (although
from this test I expect that it will, and the bitslice implementations
will remain much slower than straightforward).

Alexander

View attachment "md5slice.c" of type "text/x-c" (10797 bytes)
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.