john-users - 1.7.6-jumbo-11 adds MSCash2 with OpenMP

Message-ID: <20110204213651.GA19428@openwall.com>
Date: Sat, 5 Feb 2011 00:36:51 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: 1.7.6-jumbo-11 adds MSCash2 with OpenMP

Hi,

I've just released JtR 1.7.6-jumbo-11:

http://www.openwall.com/john/#contrib

The changes since -jumbo-9 are:

The x86-64-specific NTLM cmp_all() bug has been fixed.  The bug was
discovered and the patch was proposed by bartavelle (thanks!).  It
could result in some NTLM hashes not getting cracked when they should
have been.  More info here:

http://www.openwall.com/lists/john-users/2010/12/17/7
http://www.openwall.com/lists/john-users/2010/12/17/9

I enhanced the self-tests so that they would now detect the NTLM bug
above.  This ended up exposing another bug: in -jumbo-11, "md5-gen"
fails self-test for the 5th test hash/password when built with
-mmx or -sse2 targets (but not when using SSE2 on x86-64).  Moreover,
after this failed test, the very next "format" being tested
segfaults.  I left this issue unfixed in -jumbo-11, hoping that
JimF (the author of the "generic MD5" code in the jumbo patch) will take
a look. ;-)

The patch adding support for MSCash2 (Domain Cached Credentials of
modern Windows systems) contributed by S3nf (thanks!) has been merged:

http://www.openwall.com/lists/john-users/2010/12/26/1

I made minor changes to the MSCash2 code.  First, in -jumbo-10 (which
only existed for 5 hours before being moved to historical/) I changed
MS_NUM_KEYS in mscash2_fmt.c from 64 to 1.  I think the setting of 64
was blindly inherited from mscash_fmt.c, but it made no sense for the
slow hash that MSCash2 is, and it resulted in slow benchmarking (could
be tens of seconds for MSCash2 alone).

Then, I actually made use of the code's support for handling multiple
passwords at once to introduce optional OpenMP parallelization. :-)
In -jumbo-11, MS_NUM_KEYS is set to 1 in default builds, but to 24 when
building with OpenMP support enabled (in the Makefile, as documented in
doc/README for the official 1.7.6).  Of course, I also added
"#pragma omp parallel" directives and made the necessary adjustments
to variable declarations and the code.
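
As a rough illustration of the pattern described above (not the actual
mscash2_fmt.c code), the sketch below picks a batch size at compile time
and hashes each candidate in its own loop iteration.  NUM_KEYS,
toy_slow_hash() and hash_batch() are hypothetical names, and the "hash"
is just deterministic busy work standing in for a slow hash:

```c
#include <stdint.h>

/* Hypothetical batch size, mirroring the MS_NUM_KEYS idea:
 * 1 candidate per call in a plain build, 24 in an OpenMP build. */
#ifdef _OPENMP
#define NUM_KEYS 24
#else
#define NUM_KEYS 1
#endif

/* Stand-in for a slow hash.  This is NOT MSCash2; the iteration
 * count is arbitrary and only makes each call reasonably costly. */
static uint32_t toy_slow_hash(uint32_t key)
{
    uint32_t h = key;
    for (int i = 0; i < 10000; i++)
        h = h * 1664525u + 1013904223u;
    return h;
}

/* Hash a full batch.  Each iteration reads keys[i] and writes out[i]
 * only, so iterations are independent and can run on separate threads. */
void hash_batch(const uint32_t keys[NUM_KEYS], uint32_t out[NUM_KEYS])
{
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int i = 0; i < NUM_KEYS; i++)
        out[i] = toy_slow_hash(keys[i]);
}
```

With gcc, compiling with -fopenmp defines _OPENMP and so selects the
larger batch; without it the same source builds as a sequential loop
over a single candidate.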

Trying to actually run the code with multiple threads uncovered what
looked like a minor bug in mscash2_fmt.c: PBKDF2_DCC2().  The line:

		out[16] ^= temp[16];

was probably included in error (processing 17 bytes instead of 16).
Removing this line made OpenMP-enabled builds work.
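
For illustration only, here is a minimal sketch of XORing one block
into a 16-byte accumulator with the bound made explicit.  DIGEST_LEN
and xor_accumulate() are hypothetical names, not taken from
mscash2_fmt.c, and the post does not say why the extra byte broke
multi-threaded runs; the sketch merely shows indices 0..15 being the
only ones touched:

```c
#include <stddef.h>
#include <stdint.h>

#define DIGEST_LEN 16   /* bytes to process, per the description above */

/* XOR temp into out for exactly DIGEST_LEN bytes (indices 0..15).
 * Index 16 is deliberately never read or written. */
void xor_accumulate(uint8_t out[DIGEST_LEN], const uint8_t temp[DIGEST_LEN])
{
    for (size_t i = 0; i < DIGEST_LEN; i++)
        out[i] ^= temp[i];
}
```

Writing the loop against a named length, rather than unrolling it by
hand, makes an off-by-one such as a stray 17th byte harder to introduce.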

I also improved the code to only process up to the supplied "count" of
candidate passwords, not always MS_NUM_KEYS (which was wasteful during
self-tests, with some uses by "single crack" mode, and when processing
the very last bunch of candidate passwords in any mode).  Finally, I set
MIN_KEYS_PER_CRYPT to 1 in all cases, although it'd be better to adjust
it from init() to match the actual number of threads, like BF_fmt.c
does.  This is something to improve in a later revision (patches for
this are welcome; please test those with "single crack" mode).
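
A minimal sketch of both ideas, under stated assumptions: struct
toy_params, toy_init() and toy_crypt_all() are hypothetical stand-ins
and do not reproduce JtR's format interface or what BF_fmt.c actually
does.  The init function asks for at least one candidate per thread,
and the crypt function stops at the supplied count:

```c
#include <assert.h>
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_KEYS 24   /* capacity of the static buffers (hypothetical) */

/* Hypothetical stand-in for the per-format tunables. */
struct toy_params {
    int min_keys_per_crypt;
    int max_keys_per_crypt;
};

static uint32_t toy_keys[MAX_KEYS], toy_out[MAX_KEYS];

/* Request at least one candidate per thread, capped by buffer size. */
void toy_init(struct toy_params *p)
{
    int threads = 1;
#ifdef _OPENMP
    threads = omp_get_max_threads();
#endif
    if (threads > MAX_KEYS)
        threads = MAX_KEYS;
    p->min_keys_per_crypt = threads;
    p->max_keys_per_crypt = MAX_KEYS;
}

/* Process only the first `count` candidates, not the whole buffer. */
void toy_crypt_all(int count)
{
    assert(count >= 0 && count <= MAX_KEYS);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int i = 0; i < count; i++)
        toy_out[i] = toy_keys[i] * 2654435761u;  /* placeholder, not a real hash */
}
```

In a build without OpenMP this degenerates to a minimum of one key per
call, which matches the "1 in all cases" setting described above.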

Here are a couple of benchmark results from a Core i7 920 2.67 GHz
server under some load.  Without OpenMP:

Benchmarking: M$ Cache Hash 2 [Generic 1x]... DONE
Raw:    94.0 c/s real, 94.0 c/s virtual

With OpenMP (8 threads; the CPU is quad-core with SMT):

Benchmarking: M$ Cache Hash 2 [Generic 1x]... DONE
Raw:    362 c/s real, 47.5 c/s virtual

Both builds were made with gcc 4.5.0.  Of course, this is very far from
optimal.  Further speedup is possible by re-arranging the data layout such
that each thread's temporary and result array elements are further apart
from other threads' (not on the same memory page).  This can probably be
achieved by replacing the many arrays with an array of structs.  Also,
the code is currently not optimized and makes no use of SSE2, so the
single-process performance can probably be improved a lot (and then
per-thread performance will improve as well).
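
The array-of-structs idea might look roughly like the sketch below.
Everything here is an assumption made for illustration: the field
names and sizes, NUM_SLOTS, and especially SLOT_ALIGN, which is set to
64 bytes (a common x86 cache line size) rather than a full page:

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_ALIGN 64   /* assumed spacing between slots, in bytes */
#define NUM_SLOTS  24   /* hypothetical number of candidates per batch */

/* One candidate's data kept together, instead of separate key[],
 * temp[] and out[] arrays that interleave different threads' data. */
struct key_slot {
    alignas(SLOT_ALIGN) uint8_t key[32];  /* candidate (hypothetical size) */
    uint8_t temp[16];                     /* per-candidate scratch */
    uint8_t out[16];                      /* per-candidate result */
};

static struct key_slot slots[NUM_SLOTS];

/* Distance in bytes between consecutive slots. */
size_t slot_stride(void)
{
    return (size_t)((char *)&slots[1] - (char *)&slots[0]);
}
```

Because the struct's alignment forces its size up to a multiple of
SLOT_ALIGN, two different slots never share a SLOT_ALIGN-sized block.
A larger value, such as a page size, would follow the suggestion above
more literally, at the cost of more memory per candidate.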

While at it, I similarly parallelized the original mscash_fmt.c, because
this was just as easy to do.  Its MS_NUM_KEYS is now set to 96 (was 64),
and MIN_KEYS_PER_CRYPT is set to 1 (the same further enhancement that I
described above is desirable here as well).  Without OpenMP, I get:

Benchmarking: M$ Cache Hash [Generic 1x]... DONE
Many salts:     15112K c/s real, 15112K c/s virtual
Only one salt:  6043K c/s real, 6104K c/s virtual

With an OpenMP-enabled build and "OMP_NUM_THREADS=3 GOMP_SPINCOUNT=10000",
this improves to:

Benchmarking: M$ Cache Hash [Generic 1x]... DONE
Many salts:     30755K c/s real, 10389K c/s virtual
Only one salt:  11575K c/s real, 3858K c/s virtual

Going beyond 3 threads provides almost no further improvement.  I blame
the thread-unfriendly data layout for this (see above), as well as
frequent switches between sequential and parallel execution resulting
from this hash type being so fast (even for 96 instances of the hash).
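
To put the "real" c/s figures quoted above in perspective, speedup can
be taken as parallel throughput divided by serial throughput, and
efficiency as that ratio divided by the thread count.  The helper
names below are hypothetical; the inputs are the benchmark numbers
from this message:

```c
/* Ratio of parallel to serial throughput ("real" c/s figures). */
double speedup(double parallel_cps, double serial_cps)
{
    return parallel_cps / serial_cps;
}

/* Speedup divided by thread count; 1.0 would be perfect scaling. */
double efficiency(double parallel_cps, double serial_cps, int threads)
{
    return speedup(parallel_cps, serial_cps) / threads;
}
```

With these definitions, MSCash2 gets about 3.85x from 8 threads
(362 vs. 94.0 c/s), while MSCash with 3 threads gets about 2.04x for
many salts (30755K vs. 15112K) and about 1.92x for one salt (11575K vs.
6043K).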

Maybe someone else in here will play with this mscash* stuff further,
improving the data layout and making MIN_KEYS_PER_CRYPT (or rather the
corresponding struct field) depend on the number of threads.

As usual, feedback is welcome.

Alexander