Date: Tue, 09 Apr 2013 10:00:01 -0700
From: Jeremi Gosney <epixoip@...dshell.nl>
To: john-dev@...ts.openwall.com
Subject: [patch] sse/xop implementation of raw-sha256

I was surprised to learn there was no SIMD implementation of SHA2 in JTR, so I wrote one. I've completed SHA256, and have attached a patch for it. I will begin work on SHA512 shortly.

A few notes:

I implemented this as a separate plugin from rawSHA256_fmt_plug.c and sse-intrinsics.c to aid in development and testing. I presume that if you accept this patch you will want to roll this into the existing files, and probably standardize some of the variable/macro names as well.

This implementation is not incredibly spectacular. It shows decent performance on AMD, not so much on Intel. But I think it's a good starting point, and those who are more experienced with JTR and/or better versed in SSE will likely be able to improve upon it quite a bit.

I don't particularly like the way I'm loading plains in crypt_all (lines 305-315), but I couldn't think of a better solution. I think something more clever could be done, though.

I got a huge performance hit when I attempted to do interleaving, and I presume that is because SHA2 needs far more registers and puts more pressure on memory than MD4/MD5/SHA1. Even just doing 2x4 resulted in way too much spilling and incurred an 80% performance loss over not interleaving, so I abandoned interleaving.

I also had this idea of keeping B-H in xmm registers, only unloading A in crypt_all for get_hash and cmp_all, and only unloading B-H if we get to cmp_one(). Theoretically that should have been faster, since it would save seven MOVDQA stores that are unlikely to be needed, but due to the register pressure, holding onto those seven registers resulted in a performance hit as well.

Here are some benchmarks from a couple of my systems:

~4.1x speed-up on AMD FX-4100, single core

Benchmarking: Raw SHA-256 [32/64 OpenSSL]... DONE
Raw:    2164K c/s real, 2164K c/s virtual

Benchmarking: Raw SHA-256 [128/128 XOP instrinsics 4x]... DONE
Raw:    8928K c/s real, 8928K c/s virtual

~2.4x speed-up on Intel Xeon X7350, single core

Benchmarking: Raw SHA-256 [32/64 OpenSSL]... DONE
Raw:    2279K c/s real, 2279K c/s virtual

Benchmarking: Raw SHA-256 [128/128 SSSE3 instrinsics 4x]... DONE
Raw:    5431K c/s real, 5431K c/s virtual

But it doesn't scale very well with MPI on AMD, and I'm not sure why. Only ~3.3x speed-up on FX-4100 with 4xMPI

Benchmarking: Raw SHA-256 [32/64 OpenSSL]... (4xMPI) DONE
Raw:    6965K c/s real, 7255K c/s virtual

Benchmarking: Raw SHA-256 [128/128 XOP instrinsics 4x]... (4xMPI) DONE
Raw:    23558K c/s real, 24539K c/s virtual

Scales okay on Intel though, still ~2.4x speed-up on X7350 with 4xMPI

Benchmarking: Raw SHA-256 [32/64 OpenSSL]... (4xMPI) DONE
Raw:    9117K c/s real, 9117K c/s virtual

Benchmarking: Raw SHA-256 [128/128 SSSE3 instrinsics 4x]... (4xMPI) DONE
Raw:    21740K c/s real, 21740K c/s virtual

Regards,
epixoip

View attachment "rawSHA256_ng_fmt_plug.diff" of type "text/plain" (19563 bytes)
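For readers unfamiliar with the "4x" layout in the benchmark labels above, here is a minimal sketch in portable C of what it usually means: four candidates are processed at once, with word i of each candidate sitting side by side in one 128-bit vector. This is not the code from the attached patch. The names LANES, WORDS, load_be32, interleave_blocks, rotr32 and big_sigma0_x4 are hypothetical, and plain arrays of four uint32_t stand in for XMM registers. It assumes the four input blocks are already padded to 64 bytes.

```c
#include <stdint.h>
#include <stddef.h>

#define LANES 4   /* assumed: four 32-bit lanes per 128-bit vector ("4x") */
#define WORDS 16  /* one 64-byte SHA-256 block is 16 32-bit words */

/* Hypothetical helper: read one big-endian 32-bit word from a byte buffer. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Interleave four already-padded 64-byte blocks so that out[i][lane]
 * holds word i of candidate `lane`. Each out[i] models one XMM register. */
static void interleave_blocks(unsigned char blocks[LANES][64],
                              uint32_t out[WORDS][LANES])
{
    for (size_t i = 0; i < WORDS; i++)
        for (size_t lane = 0; lane < LANES; lane++)
            out[i][lane] = load_be32(blocks[lane] + 4 * i);
}

/* Rotate right by n bits, 0 < n < 32, built from two shifts and an OR. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* SHA-256 big Sigma0, ROTR2 ^ ROTR13 ^ ROTR22, applied to all four lanes.
 * In a real SIMD build this loop would be a handful of vector instructions. */
static void big_sigma0_x4(const uint32_t a[LANES], uint32_t r[LANES])
{
    for (size_t lane = 0; lane < LANES; lane++)
        r[lane] = rotr32(a[lane], 2) ^ rotr32(a[lane], 13) ^
                  rotr32(a[lane], 22);
}
```

In an SSE2 build the rotate is typically composed from two packed shifts and an OR, as rotr32 does here, while XOP offers a packed rotate directly (exposed as the _mm_roti_epi32 intrinsic).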