Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date: Sun, 12 Nov 2017 19:19:13 +0100
From: Solar Designer <>
Subject: max_keys_per_crypt tuning


As recently discussed on these GitHub issues:

Poor OpenMP scalability of MultiBit format

Auto-tune instead of OMP_SCALE

our current default/initial MAX_KEYS_PER_CRYPT for non-OpenMP builds,
its value for the case of running 1 thread with OpenMP builds, and
finally OMP_SCALE for running multiple threads are currently very wrong
for many of JtR jumbo formats.

While ideally we'd rework JtR OpenMP support entirely, first reworking
JtR formats interface, this isn't going to happen at least in time for
the next jumbo release (and would be too invasive a change), and besides
as I point out above some of this is wrong even without OpenMP.

Here's what I think our current approach should be:

The default/initial MAX_KEYS_PER_CRYPT used for non-OpenMP builds and
when running 1 thread with OpenMP builds should be no less than
CRK_PREFETCH, which is currently 64 in our typical builds, for fast hash
formats where having many hashes (per salt, where applicable) is a
possibility, because otherwise cracker.c's prefetching of bitmap cache
lines can't work well.  And it's sometimes important: per my recent
testing on an i7-4770K running JtR core tree against 20M LM hashes,
disabling of prefetching has a 2.5x performance impact.

Another reason to use a value significantly larger than 1 even in those
builds is to amortize function call overhead, etc.  Even before OpenMP
support and prefetching were introduced, we commonly ended up tuning a
format's MAX_KEYS_PER_CRYPT to 96 or 128 or so.  We probably need to
perform such tuning, but also ensure the resulting value is no less than
64 unless the hash type is slow, where bitmap prefetching would be
relatively insignificant anyway.

For slow hashes like scrypt, it is OK to have MAX_KEYS_PER_CRYPT of 1,
or whichever is the number of hashes being computed simultaneously by
our implementation (e.g., 2 or 3 per thread for our bcrypt code).
For those, setting MAX_KEYS_PER_CRYPT to be as low as we can given our
code may even be preferable as it improves the interactivity and lets
JtR move to further candidate passwords sooner when running against many
salts (desirable when the candidate passwords are in an optimized order).

Currently, in many formats we only support max_keys_per_crypt > 1 when
running OpenMP, with code like:

static int crypt_all(int *pcount, struct db_salt *salt)
	const int count = *pcount;
	int index = 0;
#ifdef _OPENMP
#pragma omp parallel for
	for (index = 0; index < count; index++)

(example taken from dahua_fmt_plug.c).  This is wrong, as it doesn't let
us set non-OpenMP max_keys_per_crypt to be higher than 1.  Loops like
this should either be compiled unconditionally, or logic like this may
be used:

if defined(_OPENMP) || MAX_KEYS_PER_CRYPT > 1
	for (index = 0; index < count; index++)

(example taken from cq_fmt_plug.c).  I think unconditional is preferred,
since either the format is fast enough that it should always have
max_keys_per_crypt > 1, or it's slow enough that the loop overhead does
not matter.

Then, OMP_SCALE - the additional scaling factor to compensate for OpenMP
thread desynchronization (to reduce the relative wait times for getting
the threads back in sync) - should only be applied when actually running
more than 1 thread.  Currently we usually also apply it for the 1 thread
case, which I now think is wrong.  We just got this fixed in
multibit_fmt_plug.c with commit 00655ad63366.  We need changes like this:

-       self->params.min_keys_per_crypt *= omp_t;
-       omp_t *= OMP_SCALE;
-       self->params.max_keys_per_crypt *= omp_t;
+       if (omp_t > 1) {
+               self->params.min_keys_per_crypt *= omp_t;
+               omp_t *= OMP_SCALE;
+               self->params.max_keys_per_crypt *= omp_t;
+       }

We might also want to move common code like this into a function (maybe
in formats.c) that we'd use from formats' init(), instead of keeping it
as duplicate code like we have now.

After correcting the above, we need to (re-)adjust the split between
MAX_KEYS_PER_CRYPT and OMP_SCALE.  For example, now we might have
MAX_KEYS_PER_CRYPT 1 (which also kills prefetching) and OMP_SCALE 128
for a certain format, whereas more optimal might be MAX_KEYS_PER_CRYPT
64 and OMP_SCALE 2 (it was per my testing of multibit_fmt_plug.c).  Then
we'd have max_keys_per_crypt = 64 for non-OpenMP and single thread, but
256 for 2 threads, 512 for 4 threads, etc.

Ideally, such (re-)adjustments require running benchmarks for
non-OpenMP, for 1 thread, and for multiple threads.  Ideally not only
"--test", but also actual cracking with varying hash per salt counts.
That's a lot of work, so we'd want to automate it, even if as a result
we end up merely adjusting the hard-coded values.

For now, though, things are so bad that even simply setting each fast
format's values to MAX_KEYS_PER_CRYPT 64 and OMP_SCALE 2 (along with
making the code changes proposed above) would likely be an improvement.
We could then relbench jumbo as a whole from before and after this
mass-change, separately for non-OpenMP and OpenMP builds and for varying
thread counts (1, 2, 4, and on to 32 or whatever the number of logical
CPUs on the test system is).  relbench should detect any regressions,
which we'd need to deal with manually.

Any takers?


Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.