john-users - Re: OpenMP vs lots of cores

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20100630213807.GA17386@openwall.com>
Date: Thu, 1 Jul 2010 01:38:07 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: OpenMP vs lots of cores

On Wed, Jun 30, 2010 at 09:55:42AM +0200, Magnum, P.I. wrote:
> I haven't had the time to try the OpenMP stuff but I sure will.

Please do.

> How far will it scale in the current versions? I mean, what if you had
> 96 cores on a single host? It won't use all of them, right?

It should.  I am sometimes hard-coding the number 96 (or 192) because it
is a multiple of 32 and 3 - so it'll work reasonably well for 2, 3, 4,
6, 8, 12, 16, 24, or 32 logical CPUs.  This was in fact tested for 32
logical CPUs and 32 threads here:

http://www.openwall.com/lists/john-users/2010/05/16/2

It scaled perfectly (considering that the CPU was quad-core, but capable
of running 8 threads per core simultaneously).  It should similarly
scale to 32 distinct CPU cores.

Maybe I should be using 192 to accommodate 8-core versions of UltraSPARC T2
(64 logical CPUs per chip).

Of the hashes currently parallelized with OpenMP (including with recent
patches), DES-based crypt(3) does not have its key setup parallelized.
This means that it won't scale well for the single-salt case - in fact,
for that special case there's only significant benefit when going to 2
or 3 threads.  But attacking a single salt is not the case I primarily
care about.  As you have seen, for the multi-salt case 8 threads work
well, and more should too.

In general, only some parts get parallelized, whereas some others remain
sequential.  When the hash type, input file, and/or invocation options
are such that much time is spent in the sequential parts, the partial
parallelization won't provide its "full" benefit.  Luckily, for slow
hash types - such as bcrypt - only a negligible portion of time is spent
on generating new candidate passwords to try, etc. - and even going to
96 threads for the parallel parts does not change this picture.  So you
may expect to get almost a 96x speedup on a 96-core system - vs. a
single-thread run of the same build.  That's under no unrelated load.
The speedup vs. a non-OpenMP build will usually be less, because thread
safety of the code usually has its cost - on the order of 10%.

For not-so-slow hash types, but with multiple salts, the same effect is
achieved due to the salts - e.g., in my test runs (that I posted the
output of in here) there were around 1450 different salts, which means
that only one candidate password needed to be generated (and "set up"
for use by the parallelized hashing code) per 1450 hashes computed.
So the candidate passwords generation and key setup cost was negligible,
even if not parallelized and even with a lot of threads handling the
parallelized hashing part.

For fast and saltless hashes, things are a lot worse.  Key setup (if
applicable) will need to be parallelized as well (such as for LM hashes),
and even then scalability will be very limited as long as candidate
passwords are only generated by a single thread.  My expectation is
that with some effort this could be reasonable on 4- or 8-core, but not
far beyond that.  To get further, multiple candidate password streams
will have to be used (similar to what the MPI patch does).  This is one
of the reasons why I am focusing on slow and/or salted hashes now, for
which good scalability with the OpenMP approach may be achieved.

> Or would it, given enough what? Hashes? Salts?

It would use all cores it has "thread slots" for (96 or 192 in the
current code).  It needs enough candidate passwords for that, but you'll
almost always have those (or you won't care about speeding things up).

It doesn't strictly need many salts - it will use all cores anyway - but
the "overhead" of non-parallelized parts is lower with more salts.

> I suppose even if it can use all of them, 
> the overhead will at some point make it ineffective,

In some cases, yes.  I've provided some examples above.  There are also
cases (already supported) that should scale to 96+ cores well.

> perhaps even just slow it down.

It is possible to come up with such examples.  For example, the current
LM hash code actually often becomes slower with more threads running -
it was not yet parallelized properly, though (and I am unsure if I will
do that or not, especially speaking of OpenMP in particular).  For LM
hashes, key setup is the costly operation, whereas the current code only
parallelizes the actual DES encryptions, which are relatively cheap.
So that's more of an implementation shortcoming (I just did not care
about those hashes yet - I parallelized the bitslice DES code with
crypt(3) hashes in mind, not LM).  I imagine that examples not relying
on shortcomings of a specific implementation could be identified as well.

> Obviously though, MPI and MP could be used together. Say we have 96 
> cores and fire off 12 MPI nodes using 8 MP threads each. Naturally some 
> testing would be needed to find the best mix.

I'd expect it to make sense to use different approaches at different
levels - e.g., use the MPI patch (or whatever will hopefully replace it)
to distribute the workload across multiple hosts, but JtR's OpenMP
support (or whatever will replace it) within individual hosts.  The
rationale for using a mix like this would be to manage fewer nodes from
the console.  It'd be painful to explicitly treat a single UltraSPARC T2
CPU as 32 or 64 nodes.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.