john-users - Re: Hyperthreading / fork versus mpi / instruction sets?

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20170911163621.GA4271@openwall.com>
Date: Mon, 11 Sep 2017 18:36:21 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Hyperthreading / fork versus mpi / instruction sets?

On Sun, Sep 10, 2017 at 09:40:25PM +0200, spam@...lab.nl wrote:
> I'm using john on one system with an enormous amount of CPUs. The CPUs
> support hyperthreading. I'm trying to figure out what the fastest
> combination of settings is.

Typically it is: hyperthreading enabled and "--fork".

> Hyperthreading:
> As far as I understand this is beneficial only if the cracking code is not
> 100% optimized. So in theory: not useful, each HT thread cannot do anything
> useful since the 'real' core is fully saturated. Practice: run john on HT
> cores as well to optimize CPU utilization. Even if the cracking code is 100%
> optimized it won't harm me (no disadvantages). Two questions: (1) is this
> correct?

This is a start towards an understanding, but it is over-simplification.

Other factors at play are register pressure (for a given algorithm on a
given instruction set architecture) and maximum instruction decode and
issue rate from one hardware thread (whether it's the same as what the
core is capable of from multiple threads or not).

Speaking of register pressure, sometimes it isn't possible to optimally
utilize a core from just one thread because the number of registers
needed for frequent storage and retrieval of intermediate results for a
would-be-optimal number of interleaved instances of the algorithm
exceeds what's available in ISA.  Then we have to choose between
interleaving fewer instances or spilling registers to memory (actually
L1 cache), or a mix of both.  (Register renaming - that is, transparent
use of a larger number of hardware registers than are exposed via the
ISA - helps a bit, but it can't help avoid spills to memory/cache in the
ISA level code.)

Why interleave multiple instances rather than implement just one?
That's to keep the core's execution units busy doing useful work at all
times, including during previously issued instructions' latencies.

With 2 or more hardware threads/core, we effectively get more ISA level
registers for the interleaving: each hardware thread has its own set of
ISA level registers.  (Yes, they share the same pool of hardware
registers, but it's large enough.  It's ISA registers that are the
scarce resources.)  We can then optimally use fewer interleaved
instances of the algorithm per thread, and thus not run into register
pressure and not incur spills.

Sometimes optimal interleaving factor (in the ISA level code) varies
depending on whether you intend to run e.g. 1 or 2 threads/core.  For
example, our bcrypt code with 2x interleaving is more optimal for Intel
CPUs with HT enabled, and our bcrypt code with 3x interleaving is more
optimal for Intel CPUs with HT disabled.  Since both kinds of
CPUs/settings exist, it is tough for us to provide universally optimal
defaults/build (would need runtime detection and code dispatching), and
currently we choose the 2x interleaving when building for modern x86
CPUs assuming that it's typical to have HT enabled and considering that
it's slightly higher performance in absolute terms.  Unfortunately, the
performance hit this decision of ours has on the HT-disabled CPUs is
pretty bad.  (Frank recently committed a fix so that advanced users can
use "make generic" to have bcrypt's interleaving factor tuned for their
CPU at build time.  But that's a special case, and it hurts performance
at other JtR formats.)  I use this example here to illustrate that the
problem is complicated.

Speaking of maximum instruction decode and issue rates, for
general-purpose x86 CPUs these tend to be available even from just one
hardware thread.  (Indeed, Intel sells crippled versions of their CPUs
that are not HT-capable, yet with optimal code those can usually be
almost as fast as their HT-capable elder siblings.)  However, this isn't
always the case.  For Intel's Xeon Phi (at least first generation; I am
not familiar enough with second generation yet), you actually have to
run at least 2 threads/core (and up to 4) to even have a chance of using
the cores fully.  Similarly, IBM's POWER chips (and I think Oracle's
recent SPARC chips) assume use of multiple hardware threads/core in
order to have a chance of full utilization:

http://www.openwall.com/lists/john-dev/2015/10/19/1

For HPC workloads, it is fairly common to disable HT.  That's largely
wrong for those as well, but it's also different.  The causes include
poor scalability of the algorithms or their implementations (so that
having e.g. twice more logical CPUs which combined are just slightly
faster becomes a bad idea) and the algorithms being memory-bound (thus
heavily competing for shared caches and memory, and not benefiting as
much from greater processing power without a corresponding increase in
cache sizes and memory bandwidth).  This may be right given fixed code
with no intent to optimize, but it may be wrong to give up on a chance
to optimize and make beneficial use of multiple hardware threads, even
for those workloads.  For memory-bound workloads, multiple hardware
threads per core help avoid keeping the core idle while one of the
threads is waiting for memory.  Care may need to be taken to minimize
competition for caches - e.g., on one rather crazy occasion I ended up
having the application parse /proc/cpuinfo, assign CPU affinities, and
use per hardware core mutexes to avoid having two hardware threads of
the same core getting into the same code section at once, where one
thread's exclusive use of that core's portion of L3 cache was more
beneficial for overall performance.  I understand that few developers
would discover opportunities like this and go for this sort of effort,
and accept the need to reconsider such decisions after each hardware
upgrade, and thus many HPC clusters run with HT disabled, thereby
wasting some of their performance potential.  Luckily, we don't run into
this sort of dilemmas for most functionality in JtR.

> And (2) any advice about the number of extra HT processes to
> assign? Use all? Use just say one or two to compensate for a small fraction
> of non-perfect code?

Use all.

"Use just say one or two" can't possibly fully compensate for
"non-perfect code" anyway (even if this were the only problem HT was
meant to solve, which isn't the case), because the code will run on each
core.  So when you run extra threads on some cores, but not on all at
the same time, you have this compensation on some, but not on all.

> Fork vs. MPI:
> I've mentioned that there is a number of hash formats that support MPI and
> that john runs those hash types on MPI by default. Furthermore I've seen
> that forked parallel processing (--fork=n) is possible for all hash types.
> AFAIK, MPI is typically used in network connected multi-system environments.
> Forking is done on one machine. My assumption is that forking is more
> efficient than MPI because of less overhead (= faster). However MPI might
> allow more granular control, rescheduling during the cracking process to get
> maximum efficiency, but *only* useful if MPI latency is extremely low
> compared the cracking speed. My questions questions: (1) is this correct?

No.  As you've realized, you confused MPI and OpenMP.

> Furthermore: (2) what's the best approach for fast hashes (e.g. raw-md5) and

"--fork", or use "--format=raw-md5-opencl --mask=..." with a GPU.

> (3) what's the best approach for slow hashes (e.g. bcrypt)?

Also "--fork", but the relative slowdown of OpenMP is smaller for slow
hashes than it is for fast hashes.

That's in terms of c/s rate.  There are some other factors at play: how
optimal the order of candidate passwords tested is, whether all jobs
complete nearly simultaneously or not, ease of use.  OpenMP may be
better at these (albeit not always as in some cases it increases the
amount of buffering), but usually not to a sufficient extent to prefer
it over "--fork".

The rescheduling you mentioned is typically disabled by default anyway,
and is limited by the program's architecture (frequent sync points).
With libgomp (what gcc uses), you can enable it with an environment
variable, but it might make sense only when attacking slow hashes (or
else the rescheduling is relatively too expensive on its own) on a
system that has other load on it - which is a bad idea anyway.  OpenMP
is particularly sensitive to other load.  If you have any unrelated
load, then either avoid OpenMP or lower its thread count (leaving some
logical CPUs idle some of the time).  So this is actually another reason
to prefer "--fork", where getting out of sync doesn't result in any loss
in c/s rate and doesn't need any rescheduling, but instead may result in
the child processes terminating at different times.  Hopefully, such
distractions even out over a long period of time, and don't result in
too different termination times.

Unfortunately, the current JtR architecture is far from optimal at its
use of thread-level parallelism.  Some of the above tradeoffs could be
reduced or avoided with a different architecture.

> Instruction sets:
> I've mentioned that john contains *a lot* of instruction set specific
> optimized code (e.g. SSE4.x/AVXx). Older multi CPU Xeon E5 and E7 systems
> are quite cheap nowadays, and looking at absolute performance (in general)
> they're still extremely fast (still in the list of fastest processors).
> However they lack e.g. AVX2 support. Right now it's very difficult for me to
> figure out what's the best choice, without buying massively expensive new
> CPUs. Question: is there a benchmark available of something like the latest
> fancy everything supporting CPU versus different instruction sets builds on
> this system, so one can figure out what the advantage of buying new CPUs is,
> from a john cracking perspective?

We have some benchmarks on the wiki, but we haven't kept them up to date:

http://openwall.info/wiki/john/benchmarks

AVX2 almost doubles the speeds (relative to AVX) for formats where we
have SIMD code.  It makes no difference for formats where we don't
(e.g., it won't help for bcrypt).

AVX-512 is a further speedup, but you're unlikely to find those CPUs
cheap yet.

> Any other considerations or wise advice, especially concerning maximizing
> CPU cracking is also more than welcome! Thank you.

For CPUs, consider getting engineering samples of pairs of Xeon E5 v3 or
newer off eBay.  Also consider GPUs.  With CPUs, you'll get more JtR
formats, and bcrypt will perform better (but lose to FPGAs).  With GPUs,
you'll get better speeds per dollar and per Watt for most of the formats
where we have OpenCL code (and you'll also use Hashcat), but for fast
hashes you'll get decent speeds only as long as you use on-device
candidate password generation (for JtR this means at least partial use
of "--mask").  Avoid first-generation Xeon Phi even if cheap.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.