Date: Mon, 27 Apr 2015 05:38:58 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] JtR SIMD support enhancements

On Sun, Apr 26, 2015 at 07:30:24PM +0200, magnum wrote:
> On Thu, Apr 23, 2015 at 11:35:44PM +0800, Lei Zhang wrote:
> > Benchmarking: nt2, NT [MD4 512/512 MIC 16x]... DONE
> > Raw: 4907K c/s real, 4907K c/s virtual
>
> So this is one 1 GHz core running AVX-512. Here's my 2.3 GHz laptop
> running AVX:
>
> Benchmarking: nt2, NT [MD4 128/128 AVX 4x3]... DONE
> Raw: 43748K c/s real, 43748K c/s virtual
>
> Accounting only for clock and width, the MIC should do 83519K. What is
> it that makes it 17x slower than that figure?

You should also account for the peak instruction issue rate.  A MIC
core can only issue one SIMD instruction per cycle (when running 4
threads; at most half that when running 1 thread).  Your laptop can
probably issue two (even when running 1 thread).  Also, "2.3 GHz" is
probably the non-turbo rate, and you probably get ~3.5 GHz when running
1 thread.  Interleaving is being used on your CPU, but not in the build
for MIC.  max_keys_per_crypt might be non-optimal for MIC.

And once again, most importantly, single-thread performance is not what
we should optimize for, especially not on MIC.  Those cores are
designed to work well when running at least 2 threads each, and
preferably 4.

> Comparing a slow hash and threads, here's MIC's md5crypt:
>
> Benchmarking: md5crypt, crypt(3) $1$ [MD5 512/512 MIC 16x]... (240xOMP) DONE
> Raw: 864932 c/s real, 3627 c/s virtual
>
> And here's mine (4 cores + HT):
>
> $ ../run/john -test -form:md5crypt
> Will run 8 OpenMP threads
> Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 AVX 4x3]... (8xOMP) DONE
> Raw: 159360 c/s real, 22102 c/s virtual

Again, this is too good for "2.3 GHz".  Your CPU almost certainly has
turbo, even with 8 threads.

> MIC should do 9127K if only clock and width was different. The real
> figure is 10x slower than that. Why?
You appear to be assuming MIC's 4 threads/core would let it achieve
multiple SIMD instructions/cycle/core, whereas in reality they merely
enable it to achieve one SIMD instruction/cycle/core.

Here's what we get on super:

[solar@...er src]$ ../run/john -te -form=md5crypt
Will run 32 OpenMP threads
Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 AVX 4x3]... (32xOMP) DONE
Raw:	539136 c/s real, 16853 c/s virtual

539136 * 60*512*1.053 / (16*128*2*3.0) = 1419276 c/s

So assuming a 3.0 GHz turbo clock rate on super (that's max turbo for
super's CPUs when all cores are in use), we get 1419K c/s expected on
our Xeon Phi 5110P.  We've actually achieved 865K.  That's not bad,
considering that we're not using interleaving on Xeon Phi yet (we
should try!), and that once we do we'll have less L1 data cache space
per 32-bit SIMD lane than we do on host CPUs.  It's 32 KB per core
either way, but on MIC we'll be running more instances/core in the
wider SIMD and with more threads/core.

For NTLM, we could expect something like this per core:

[solar@...er src]$ ../run/john -te -form=nt
Benchmarking: NT [MD4 128/128 X2 SSE2-16]... DONE
Raw:	44068K c/s real, 44513K c/s virtual
[solar@...er src]$ OMP_NUM_THREADS=1 ../run/john -te -form=nt2
Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: nt2, NT [MD4 128/128 AVX 4x3]... DONE
Raw:	37257K c/s real, 37257K c/s virtual

44000 * 512*1.053 / (128*2*3.3) = 28080K c/s

but that's when running 4 threads/core on MIC, when we have
interleaving, and once we've tuned max_keys_per_crypt (together with
the interleaving factor) - and ignoring the cache size per instance
difference.  Our current 4907K is "only" 5.7 times lower.  Having
played with Xeon Phi a bit before, I didn't expect the single-thread
non-tuned speeds to be better than that.  Clearly, there's room for
improvement, and we should proceed with that, but it's more like ~6x
for fast hashes and ~1.5x for md5crypt.
Now, for descrypt there's theoretically more room for improvement:

[solar@...er src]$ ../run/john -te -form=descrypt
Will run 32 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [DES 128/128 AVX-16]... (32xOMP) DONE
Many salts:	73007K c/s real, 2286K c/s virtual
Only one salt:	38283K c/s real, 1207K c/s virtual

73007 * 60*512*1.053 / (16*128*2*3.0) = 192191K c/s

but we're only getting:

[user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te=1 -form=descrypt
Will run 240 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [512/512]... DONE
Many salts:	84811K c/s real, 354016 c/s virtual
Only one salt:	7710K c/s real, 60023 c/s virtual

so 192191/84811 = 2.27x worse than "expected".  That's disappointing,
but I couldn't improve that speed further so far.  I think it's a
combination of lower cache size per instance, Xeon Phi's worse scalar
performance, and Amdahl's law (especially for the single-salt speed).

It's also useful to look at:

[solar@...er run]$ ../run/john -te -form=bsdicrypt
Will run 32 OpenMP threads
Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [DES 128/128 AVX-16]... (32xOMP) DONE
Many salts:	2457K c/s real, 76704 c/s virtual
Only one salt:	1492K c/s real, 47030 c/s virtual

[user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te=1 -form=bsdicrypt
Will run 240 OpenMP threads
Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [512/512]... DONE
Many salts:	2958K c/s real, 12377 c/s virtual
Only one salt:	413042 c/s real, 4113 c/s virtual

The single-salt performance impact of needing to run a larger number of
slower threads is lower for this slower hash, as expected.  Also, the
multi-salt benchmark shows Xeon Phi working a little bit better than in
the previous test:

84811/73007 = 1.16
2958/2457 = 1.20

This is also quite disappointing, though, since twice better
performance was theoretically possible (not considering the cache size
per instance difference).
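For reference, the descrypt shortfall and the two multi-salt ratios above check out numerically (a sketch, not in the original mail; all figures are from the text):

```python
# Sketch: verifying the descrypt scaling estimate and the multi-salt
# MIC-vs-host ratios quoted above.  All numbers are from the mail.

expected = 73007 * 60*512*1.053 / (16*128*2*3.0)  # "expected" K c/s on the Phi
measured = 84811                                   # actual K c/s, many salts
print(round(expected))                # 192191
print(round(expected / measured, 2))  # 2.27 ("2.27x worse than expected")

print(round(84811 / 73007, 2))  # descrypt multi-salt MIC/host ratio: 1.16
print(round(2958 / 2457, 2))    # bsdicrypt multi-salt MIC/host ratio: 1.2
```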
Yet this is the fastest bsdicrypt multi-salt cracker that we currently
have for our hardware.  We don't have bsdicrypt supported on GPU yet,
right?

Alexander