Date: Wed, 11 Mar 2015 23:55:09 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: bitslice MD*/SHA*, AVX2

Hi,

I was considering whether to drop bitslice SHA-2 from the "John the Ripper SIMD support enhancements" / "Better SIMD implementations of SHA-2 hashes and higher-level schemes" GSoC sub-task description, since this portion is too(?) unlikely to result in better performance on current hardware.

I dug up the latest revision of my md5slice.c experiment (attached to this message) and tried it out on our i7-4770K. Here's the speed with 128-bit AVX, building with "gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)":

solar@...l:~/md5slice$ gcc md5slice.c -o md5slice -Wall -s -O3 -fomit-frame-pointer -funroll-loops -DVECTOR -march=native
solar@...l:~/md5slice$ time ./md5slice
vector size = 128 bits
c09c4c1f 21876746 18aed2 70b452f0

real	0m0.630s
user	0m0.628s
sys	0m0.000s

That's for 10 million MD5s, giving us a speed of 10M/0.63 = ~15.9M MD5/s.

To try AVX2, I changed one line:

typedef element vector __attribute__ ((vector_size (16)));

to:

typedef element vector __attribute__ ((vector_size (32)));

and built with:

solar@...l:~/md5slice$ export PATH=~gcc/gcc-4.9-20130707/bin:$PATH
solar@...l:~/md5slice$ gcc md5slice.c -o md5slice -Wall -s -O3 -fomit-frame-pointer -funroll-loops -DVECTOR -march=native

This gave "warning: always_inline function might not be inlinable" about FF(), I(), H(), F(), add32r(), add32c(), add32() - but then it built fine. The speed is:

solar@...l:~/md5slice$ time ./md5slice
vector size = 256 bits
c09c4c1f 21876746 18aed2 70b452f0

real	0m0.354s
user	0m0.348s
sys	0m0.004s

which corresponds to 10M/0.35 = ~28.5M MD5/s.

For comparison, a non-OpenMP build of john-1.8.0-jumbo-1 on this machine achieves:

solar@...l:~/j/john-1.8.0-jumbo-1/run$ ./john -te -form=raw-md5
Benchmarking: Raw-MD5 [MD5 128/128 AVX 12x]... DONE
Raw:	31923K c/s real, 31923K c/s virtual

So the speeds are finally comparable. That's AVX vs.
AVX2, so a straightforward implementation of MD5 with AVX2 will likely run faster yet, but the results so far are not conclusively anti-bitslice. And the ease of upgrading from 128-bit AVX to 256-bit AVX2 with a one-line change (a two-character change, even) is appealing.

BTW, I did review the generated assembly code to confirm that it was 128-bit AVX and 256-bit AVX2 in these two tests as expected.

There's definitely some further room for optimization in md5slice.c - for example, it does not yet take advantage of the common subexpressions in the invocations of HH() (we can save 8*32 XORs). FF() has everything merged into it manually and specialized, whereas GG(), HH(), and II() use other functions. Also, explicit (such as via cpp or with a script) unrolling and inlining might (or might not) work better than gcc's (the various *ptr++'s will be replaced with explicit array accesses at fixed indices, which I think gcc does for us anyway).

Overall, based on this test, I think I'll leave the task listed. And we might even need to give bitslicing a try for MD4, MD5, and SHA-1 if AVX2 somehow does not speed up straightforward implementations of them a lot (although, judging from this test, I expect that it will, and the bitslice implementations will remain much slower than the straightforward ones).

Alexander

View attachment "md5slice.c" of type "text/x-c" (10797 bytes)