john-dev - Re: [GSoC] JtR SIMD support enhancements

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150427023857.GA27427@openwall.com>
Date: Mon, 27 Apr 2015 05:38:58 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] JtR SIMD support enhancements

On Sun, Apr 26, 2015 at 07:30:24PM +0200, magnum wrote:
> On Thu, Apr 23, 2015 at 11:35:44PM +0800, Lei Zhang wrote:
> >Benchmarking: nt2, NT [MD4 512/512 MIC 16x]... DONE
> >Raw:	4907K c/s real, 4907K c/s virtual
> 
> So this is one 1 GHz core running AVX-512. Here's my 2.3 GHz laptop 
> running AVX:
> 
> Benchmarking: nt2, NT [MD4 128/128 AVX 4x3]... DONE
> Raw:	43748K c/s real, 43748K c/s virtual
> 
> Accounting only for clock and width, the MIC should do 83519K. What is 
> it that makes it 17x slower than that figure?

You should also account for the peak instruction issue rate.  MIC core
can only issue one SIMD instruction per cycle (when running 4 threads;
or at most half that when running 1 thread).  Your laptop can probably
issue two (even when running 1 thread).  Also, "2.3 GHz" is probably the
non-turbo rate, and you probably get ~3.5 GHz when running 1 thread.

Interleaving is being used on your CPU, but not in the build for MIC.
max_keys_per_crypt might be non-optimal for MIC.

And once again, most importantly, single-thread performance is not what
we should optimize for, especially not on MIC.  Those cores are designed
to work well when running at least 2 threads each, and preferably 4.

> Comparing a slow hash and threads, here's MIC's md5crypt:
> 
> Benchmarking: md5crypt, crypt(3) $1$ [MD5 512/512 MIC 16x]... (240xOMP) DONE
> Raw:	864932 c/s real, 3627 c/s virtual
> 
> And here's mine (4 cores + HT):
> 
> $ ../run/john -test -form:md5crypt
> Will run 8 OpenMP threads
> Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 AVX 4x3]... (8xOMP) DONE
> Raw:	159360 c/s real, 22102 c/s virtual

Again, this is too good for "2.3 GHz".  Your CPU almost certainly has
turbo, even with 8 threads.

> MIC should do 9127K if only clock and width was different. The real 
> figure is 10x slower than that. Why?

You appear to be assuming MIC's 4 threads/core would let it achieve
multiple SIMD instructions/cycle/core, whereas in reality they merely
enable it to achieve one SIMD instruction/cycle/core.

Here's what we get on super:

[solar@...er src]$ ../run/john -te -form=md5crypt
Will run 32 OpenMP threads
Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 AVX 4x3]... (32xOMP) DONE
Raw:    539136 c/s real, 16853 c/s virtual

539136 * 60*512*1.053 / (16*128*2*3.0) = 1419276 c/s

So assuming 3.0 GHz turbo clock rate on super (that's max turbo for
super's CPUs when all cores are in use), we get 1419K c/s expected on
our Xeon Phi 5110P.  We've actually achieved 865K.  That's not bad,
considering that we're not using interleaving on Xeon Phi yet (we should
try!), and that once we do we'll have less L1 data cache space per
32-bit SIMD lane than we do on host CPUs.  It's 32 KB per core either
way, but on MIC we'll be running more instances/core in the wider SIMD
and with more threads/core.

For NTLM, we could expect something like this per core:

[solar@...er src]$ ../run/john -te -form=nt
Benchmarking: NT [MD4 128/128 X2 SSE2-16]... DONE
Raw:    44068K c/s real, 44513K c/s virtual

[solar@...er src]$ OMP_NUM_THREADS=1 ../run/john -te -form=nt2
Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: nt2, NT [MD4 128/128 AVX 4x3]... DONE
Raw:    37257K c/s real, 37257K c/s virtual

44000 * 512*1.053 / (128*2*3.3) = 28080K c/s

but that's when running 4 threads/core on MIC, when we have
interleaving, and once we've tuned max_keys_per_crypt (together with the
interleaving factor) - and ignoring the cache size per instance
difference.  Our current 4907K is "only" 5.7 times lower.  Having played
with Xeon Phi a bit before, I didn't expect the single-thread non-tuned
speeds to be better than that.

Clearly, there's room for improvement, and we should proceed with that,
but it's more like ~6x for fast hashes and ~1.5x for md5crypt.

Now, for descrypt there's theoretically more room for improvement:

[solar@...er src]$ ../run/john -te -form=descrypt
Will run 32 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [DES 128/128 AVX-16]... (32xOMP) DONE
Many salts:     73007K c/s real, 2286K c/s virtual
Only one salt:  38283K c/s real, 1207K c/s virtual

73007 * 60*512*1.053 / (16*128*2*3.0) = 192191K c/s

but we're only getting:

[user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te=1 -form=descrypt
Will run 240 OpenMP threads
Benchmarking: descrypt, traditional crypt(3) [512/512]... DONE
Many salts:     84811K c/s real, 354016 c/s virtual
Only one salt:  7710K c/s real, 60023 c/s virtual

so 192191/84811 = 2.27x worse than "expected".  That's disappointing,
but I couldn't improve that speed further so far.  I think it's a
combination of lower cache size per instance, Xeon Phi's worse scalar
performance, and Amdahl's law (especially for the single-salt speed).

It's also useful to look at:

[solar@...er run]$ ../run/john -te -form=bsdicrypt
Will run 32 OpenMP threads
Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [DES 128/128 AVX-16]... (32xOMP) DONE
Many salts:     2457K c/s real, 76704 c/s virtual
Only one salt:  1492K c/s real, 47030 c/s virtual

[user@...er-mic0 user]$ LD_LIBRARY_PATH=. ./john -te=1 -form=bsdicrypt
Will run 240 OpenMP threads
Benchmarking: bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations) [512/512]... DONE
Many salts:     2958K c/s real, 12377 c/s virtual
Only one salt:  413042 c/s real, 4113 c/s virtual

The single-salt performance impact of needing to run a larger number of
slower threads is lower for this slower hash, as expected.  Also, the
multi-salt benchmark shows Xeon Phi work a little bit better than in the
previous test:

84811/73007 = 1.16
2958/2457 = 1.20

This is also quite disappointing, though, since twice better performance
was theoretically possible (not considering the cache size per instance
difference).  Yet this is the fastest bsdicrypt multi-salt cracker that
we currently have for our hardware.  We don't have bsdicrypt supported
on GPU yet, right?

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.