announce - [openwall-announce] Owl-current on CD; JtR DES crypt(3) and LM hash speedup

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20100704231124.GA31023@openwall.com>
Date: Mon, 5 Jul 2010 03:11:24 +0400
From: Solar Designer <solar@...nwall.com>
To: announce@...ts.openwall.com
Subject: [openwall-announce] Owl-current on CD; JtR DES crypt(3) and LM hash speedup

Hi,

As usual, this is a cumulative announcement for several things at once.
These were previously tweeted about - http://twitter.com/openwall - and
posted on the news page - http://www.openwall.com/news

For this announcement, I'll group them into two categories:

1. It is now possible to get Openwall GNU/*/Linux -current snapshots on
CD (with delivery worldwide) - 32-bit and/or 64-bit (your choice).  The
pricing starts at $9.35 (which just covers our costs), but you're
encouraged to pick a more expensive option (which supports our project):

http://www.openwall.com/Owl/order

The intent is to keep recent -current snapshots available for purchase
on CD along with releases, although that will depend on demand or lack
thereof.  Previously, only the last release was available for purchase
on CD.

2. John the Ripper's bitslice DES code is being re-worked much further,
resulting in greater ease of use on multi-core systems, as well as in
major per-core speedups at LM hashes.

This includes optional OpenMP parallelization, which allows a single
"john" process, invoked in the usual manner, to take advantage of
multiple CPU cores when auditing DES-based Unix crypt(3) hashes.
(JtR 1.7.6 release only supported this kind of parallelization for
certain other/slower hash types.)

This also includes a new vectorization- and parallelization-friendly
key setup algorithm, which makes LM hash computations more than twice
faster per-core (as tested on x86-64) and allows for parallelization of
DES-based crypt(3) hash computations even for the single-salt case
(including parallelization of most of the key setup "overhead").

The current in-development yet publicly released patches for JtR 1.7.6
achieve the following performance numbers on a single Core i7 920 2.67 GHz
CPU (quad-core capable of running 2 threads per core):

LM hashes, single process, single thread (no OpenMP), "--test" - 39M c/s
... ditto, actual "incremental" mode run (more "overhead") - 30M+ c/s
... 8 simultaneous processes, combined "--test" speeds - 173M c/s
LM hashes, single process, 8 threads (OpenMP), "--test" - 65M c/s
... ditto, actual "incremental" mode run (more "overhead") - 45M+ c/s

DES crypt(3), 1 process, 8 threads (OpenMP), "--test", multi-salt - 10.2M c/s
... ditto, actual "incremental" mode run (more "overhead") - 10.0M+ c/s
DES crypt(3), 1 process, 8 threads (OpenMP), "--test", single salt - 8.6M c/s
... ditto, actual "incremental" mode run (more "overhead") - 8.1M+ c/s

These numbers for DES crypt(3) correspond to an OpenMP parallelization
efficiency of 80% to 90% (vs. multiple separate processes running the
non-OpenMP build with separate candidate password streams) - e.g., the
same system would do 11.5M c/s combined for multi-salt with separate
processes.  This slight efficiency loss may be compensated for by the
greater ease of use (just one JtR invocation to manage instead of 8) and
by likely more optimal order in which candidate passwords are tried when
there's just one stream of those.

Finally, here are some more exciting performance numbers for a dual Xeon
X5460 3.16 GHz server (8 CPU cores total) under light unrelated load:

LM hashes, single process, single thread (no OpenMP), "--test" - 45M c/s
... 8 simultaneous processes, combined "--test" speeds - 356M c/s
LM hashes, single process, 8 threads (OpenMP), "--test" - 64M c/s

The 356M c/s figure is pretty exciting.  Previously, one would expect
this kind of performance from a GPU, but here it is achieved with two
CPUs found in a single system, and even under light unrelated load.

The 64M c/s figure for the OpenMP build is pretty good, but not exciting -
we've already seen better speed for a single Core i7.  Unrelated system
load truly kills OpenMP performance in many cases, and the efficiency of
LM hash parallelization with OpenMP is not great anyway.

Now to DES-based crypt(3) on the dual Xeon:

DES crypt(3), 1 process, 8 threads (OpenMP), "--test", multi-salt - 21M c/s
DES crypt(3), 1 process, 8 threads (OpenMP), "--test", single salt - 15.5M c/s

The OpenMP parallelization efficiency is 67% to 86% - that is, even
better speeds may be achieved with 8 simultaneous processes - such as
24M c/s for multi-salt and 23M c/s for single salt.

The patches may be found at:

http://openwall.info/wiki/john/patches

Here are some john-users postings with even more detail and substantiation
for the performance numbers given above:

http://www.openwall.com/lists/john-users/2010/07/03/1
http://www.openwall.com/lists/john-users/2010/06/30/2

As usual, feedback is welcome - on the john-users list, please.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.