john-dev - [patch] sse/xop implementation of raw-sha256

Message-ID: <51644911.2040707@bindshell.nl>
Date: Tue, 09 Apr 2013 10:00:01 -0700
From: Jeremi Gosney <epixoip@...dshell.nl>
To: john-dev@...ts.openwall.com
Subject: [patch] sse/xop implementation of raw-sha256

I was surprised to learn there was no SIMD implementation of SHA2 in
JTR, so I wrote one. I've completed SHA256, and have attached a patch
for it. I will begin work on SHA512 shortly.

A few notes...

I implemented this as a separate plugin from rawSHA256_fmt_plug.c and
sse-intrinsics.c to aid in development and testing. I presume that if
you accept this patch you will want to roll this into the existing
files, and probably standardize some of the variable/macro names as well.

This implementation is not spectacular. It performs decently on AMD,
less so on Intel. Still, I think it's a good starting point, and those
who are more experienced with JTR and/or better versed in SSE will
likely be able to improve upon it quite a bit.

I don't particularly like the way I'm loading plains in crypt_all (lines
305-315), but I couldn't think of a better solution. I think something
more clever could be done, though.
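
For readers without the patch at hand, here is a minimal sketch of what
loading plains for a 4-wide SIMD hash generally involves. It is not the
code from lines 305-315 of the patch. The helper names (load_be32,
pad_block, load_plains) and the word-major w[16][4] layout are
assumptions made for illustration, and it only handles keys that fit in
one 64-byte block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4        /* candidates per 128-bit vector (the "4x" in the labels) */
#define BLOCK_WORDS 16 /* one 64-byte SHA-256 block as 32-bit words */

/* Read a 32-bit big-endian word, the byte order SHA-256 uses. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Pad one key into a single SHA-256 block; only keys up to 55 bytes fit. */
static void pad_block(const char *key, unsigned char block[64])
{
    size_t len = strlen(key);

    assert(len <= 55);
    memset(block, 0, 64);
    memcpy(block, key, len);
    block[len] = 0x80;                           /* mandatory 1 bit */
    block[62] = (unsigned char)((len * 8) >> 8); /* bit length, big-endian */
    block[63] = (unsigned char)((len * 8) & 0xff);
}

/* Hypothetical helper: w[i][lane] receives word i of candidate `lane`. */
static void load_plains(const char *keys[LANES],
                        uint32_t w[BLOCK_WORDS][LANES])
{
    unsigned char block[64];

    for (int lane = 0; lane < LANES; lane++) {
        pad_block(keys[lane], block);
        for (int i = 0; i < BLOCK_WORDS; i++)
            w[i][lane] = load_be32(block + 4 * i);
    }
}
```

With this layout each row w[i] is 16 contiguous bytes, so it can be
fetched into one xmm register with a single load (for example
_mm_loadu_si128). The scalar transpose is the costly part, and it is the
kind of step a cleverer scheme would try to avoid.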

I got a huge performance hit when I attempted to do interleaving, and I
presume that is because SHA2 needs far more registers and puts more
pressure on memory than MD4/MD5/SHA1. Even just doing 2x4 resulted in
way too much spilling and incurred an 80% performance loss compared to
not interleaving, so I abandoned interleaving.

I also had the idea of keeping B-H in xmm registers, unloading only A
in crypt_all for get_hash and cmp_all, and unloading B-H only if we get
to cmp_one(). In theory that should have been faster, since it would
save seven MOVDQA instructions that are rarely needed, but because of
the register pressure, holding onto those seven registers resulted in a
performance hit as well.
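
The two-stage comparison behind that idea can be sketched as follows.
This is an illustration only, not code from the patch: it works on
digest words already stored in memory, and the function names
(match_first_word, full_match) and the word-major digest[8][4] layout
are assumptions. The first function plays the role of a cmp_all-style
check on A alone, the second the role of a cmp_one-style check on all
eight words.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define LANES 4        /* candidates per 128-bit vector */
#define DIGEST_WORDS 8 /* A..H */

/* Cheap first stage: bit `lane` of the result is set when that lane's
 * first digest word (A) equals the target's first word. */
static int match_first_word(const uint32_t a[LANES], uint32_t target0)
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vt = _mm_set1_epi32((int)target0);
    __m128i eq = _mm_cmpeq_epi32(va, vt);

    /* One sign bit per 32-bit lane gives a 4-bit mask. */
    return _mm_movemask_ps(_mm_castsi128_ps(eq));
#else
    int mask = 0;

    for (int lane = 0; lane < LANES; lane++)
        if (a[lane] == target0)
            mask |= 1 << lane;
    return mask;
#endif
}

/* Expensive second stage, only reached after a first-word hit:
 * compare all eight words of one lane against the target. */
static int full_match(uint32_t digest[DIGEST_WORDS][LANES], int lane,
                      const uint32_t target[DIGEST_WORDS])
{
    assert(lane >= 0 && lane < LANES);
    for (int i = 0; i < DIGEST_WORDS; i++)
        if (digest[i][lane] != target[i])
            return 0;
    return 1;
}
```

Because a random candidate matches the first word only about once in
2^32 tries, the second stage almost never runs, which is why B-H rarely
need to leave their registers at all.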

Here are some benchmarks from a couple of my systems:

~ 4.1x speed-up on AMD FX-4100, single core

Benchmarking: Raw SHA-256 [32/64 OpenSSL]... DONE
Raw:    2164K c/s real, 2164K c/s virtual
 
Benchmarking: Raw SHA-256 [128/128 XOP instrinsics 4x]... DONE
Raw:    8928K c/s real, 8928K c/s virtual


~ 2.4x speed-up on Intel Xeon X7350, single core

Benchmarking: Raw SHA-256 [32/64 OpenSSL]... DONE
Raw:    2279K c/s real, 2279K c/s virtual
 
Benchmarking: Raw SHA-256 [128/128 SSSE3 instrinsics 4x]... DONE
Raw:    5431K c/s real, 5431K c/s virtual


However, it doesn't scale very well with MPI on AMD, and I'm not sure
why. Only a ~3.3x speed-up on the FX-4100 with 4xMPI

Benchmarking: Raw SHA-256 [32/64 OpenSSL]... (4xMPI) DONE
Raw:    6965K c/s real, 7255K c/s virtual
 
Benchmarking: Raw SHA-256 [128/128 XOP instrinsics 4x]... (4xMPI) DONE
Raw:    23558K c/s real, 24539K c/s virtual


It scales okay on Intel, though: still a ~2.4x speed-up on the X7350 with 4xMPI

Benchmarking: Raw SHA-256 [32/64 OpenSSL]... (4xMPI) DONE
Raw:    9117K c/s real, 9117K c/s virtual

Benchmarking: Raw SHA-256 [128/128 SSSE3 instrinsics 4x]... (4xMPI) DONE
Raw:    21740K c/s real, 21740K c/s virtual
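
The speed-up and scaling figures can be rechecked from the "real" c/s
numbers above with a few divisions. The sketch below does only that
arithmetic. The struct and function names are made up for illustration,
and the values are copied from the benchmark output in K c/s.

```c
#include <assert.h>
#include <stdio.h>

/* Real c/s figures in thousands, copied from the output above. */
struct bench {
    const char *cpu;
    double openssl_1, simd_1; /* single core */
    double openssl_4, simd_4; /* 4xMPI */
};

static const struct bench results[2] = {
    {"AMD FX-4100",      2164, 8928, 6965, 23558},
    {"Intel Xeon X7350", 2279, 5431, 9117, 21740},
};

/* Ratio of two c/s figures; both must be positive. */
static double ratio(double numerator_cs, double denominator_cs)
{
    assert(numerator_cs > 0.0 && denominator_cs > 0.0);
    return numerator_cs / denominator_cs;
}

/* Print SIMD-vs-OpenSSL speed-up and 1-to-4 process scaling per CPU. */
static void print_report(void)
{
    for (int i = 0; i < 2; i++) {
        const struct bench *b = &results[i];

        printf("%s\n", b->cpu);
        printf("  speed-up, 1 core:  %.2fx\n", ratio(b->simd_1, b->openssl_1));
        printf("  speed-up, 4xMPI:   %.2fx\n", ratio(b->simd_4, b->openssl_4));
        printf("  OpenSSL scaling:   %.2fx\n", ratio(b->openssl_4, b->openssl_1));
        printf("  SIMD scaling:      %.2fx\n", ratio(b->simd_4, b->simd_1));
    }
}
```

Calling print_report() from a main() shows where the AMD gap comes
from: going from one process to four, the XOP build gains about 2.6x
while the OpenSSL build gains about 3.2x, whereas on the X7350 both
builds gain about 4.0x.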


Regards,
epixoip

View attachment "rawSHA256_ng_fmt_plug.diff" of type "text/plain" (19563 bytes)