announce - [openwall-announce] John the Ripper 1.7.9

Message-ID: <20111123155924.GA4554@openwall.com>
Date: Wed, 23 Nov 2011 19:59:24 +0400
From: Solar Designer <solar@...nwall.com>
To: announce@...ts.openwall.com, john-users@...ts.openwall.com
Subject: [openwall-announce] John the Ripper 1.7.9

Hi,

I've released John the Ripper 1.7.9 today.  Please download it from the
usual location:

http://www.openwall.com/john/

(A -jumbo based on 1.7.9 will be available a bit later.)

This release completes the DES speedup work sponsored by Rapid7:

http://www.openwall.com/lists/announce/2011/06/22/1

Most importantly, functionality of the -omp-des* patches has been
reimplemented in the main source code tree, improving upon the best
properties of the -omp-des-4 and -omp-des-7 patches at once.  Thus,
there are no longer any -omp-des* patches for 1.7.9.

I'd like to thank Nicholas J. Kain for his help in figuring out the
cause of a performance regression with the -omp-des-7 patch, which has
helped me avoid this issue in the reimplementation.  I would also like
to thank magnum and Anatoly Pugachev for their help in testing the
1.7.8.x development versions.  The new speeds on Core i7-2600K 3.4 GHz
(actually 3.5 GHz due to Turbo Boost) are:

1 thread:
Benchmarking: Traditional DES [128/128 BS AVX-16]... DONE
Many salts:     5802K c/s real, 5861K c/s virtual
Only one salt:  5491K c/s real, 5546K c/s virtual

8 threads (on 4 physical cores):
Benchmarking: Traditional DES [128/128 BS AVX-16]... DONE
Many salts:     22773K c/s real, 2843K c/s virtual
Only one salt:  18284K c/s real, 2291K c/s virtual

1 thread:
Benchmarking: LM DES [128/128 BS AVX-16]... DONE
Raw:    71238K c/s real, 71238K c/s virtual

4 threads:
Benchmarking: LM DES [128/128 BS AVX-16]... DONE
Raw:    108199K c/s real, 27117K c/s virtual

DES-based crypt(3) scales pretty well, whereas LM is too fast for that -
but we get decent speeds anyway.  I'll include more benchmarks below
(including for a 64-way machine).
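
To make the scaling claim concrete, here is a minimal sketch that turns the
"real" c/s figures quoted above into speedup ratios.  The dictionary layout
and the speedup() helper are illustrative only and are not part of John.
Note that the 8-thread DES runs use 4 physical cores, so a ratio close to 4
is about as good as can be expected there.

```python
# "real" c/s figures in thousands, copied from the benchmark output above:
# (1-thread speed, multi-thread speed, thread count)
results = {
    "Traditional DES, many salts": (5802, 22773, 8),
    "Traditional DES, one salt": (5491, 18284, 8),
    "LM DES": (71238, 108199, 4),
}


def speedup(single, multi):
    """Ratio of multi-threaded to single-threaded throughput."""
    return multi / single


for name, (single, multi, threads) in results.items():
    print(f"{name}: {speedup(single, multi):.2f}x with {threads} threads")
```

DES-based crypt(3) with many salts comes out at roughly 3.9x, whereas LM
gains only about 1.5x from 4 threads.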

There are many other enhancements in 1.7.9 as well.  Here's a summary
from doc/CHANGES with additional comments for this announcement (I put
those in parentheses):

* Added optional parallelization of the MD5-based crypt(3) code with OpenMP.

(Yes, this is similar to the change introduced in 1.7.8-jumbo-8 by magnum,
but it's also different and both changes will co-exist in a -jumbo
rebased on 1.7.9.  1.7.8-jumbo-8's MD5-crypt code provides better speed
on typical x86/SSE2 machines, whereas 1.7.9's is more portable.)

* Added optional parallelization of the bitslice DES code with OpenMP.

(This is what I started this message with.)

* Replaced the bitslice DES key setup algorithm with a faster one, which
significantly improves performance at LM hashes, as well as at DES-based
crypt(3) hashes when there's just one salt (or very few salts).

(This is the 1.7.8-fast-des-key-setup-3 patch reimplemented in a
portable fashion, as well as optimized for specific architectures,
including with assembly code for x86-64/SSE2, x86/SSE2, and x86/MMX.
The patch is no longer needed.)
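
For readers unfamiliar with bitslicing: a bitslice implementation keeps bit b
of many candidate keys together in one machine word, so key setup amounts to
transposing a matrix of bits.  The sketch below shows only that transposition
in its most naive form.  It is not the algorithm used in 1.7.9 or in the
fast-des-key-setup patch, and the function names are made up for this
illustration.

```python
def bitslice(values, width):
    """Transpose integers into `width` bit planes.

    Bit i of planes[b] equals bit b of values[i].
    """
    planes = [0] * width
    for i, v in enumerate(values):
        for b in range(width):
            if (v >> b) & 1:
                planes[b] |= 1 << i
    return planes


def unbitslice(planes, count):
    """Inverse of bitslice(): recover the original integers."""
    values = []
    for i in range(count):
        v = 0
        for b, plane in enumerate(planes):
            v |= ((plane >> i) & 1) << b
        values.append(v)
    return values


# Four 7-bit characters, one per "key slot".
keys = [ord(c) for c in "john"]
planes = bitslice(keys, 7)
assert unbitslice(planes, len(keys)) == keys
```

The double loop touches every bit of every key individually, which is why a
faster key setup matters most when there is little other work per key, as
with LM hashes or a single salt.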

* Optimized the DES S-box x86-64 (16-register SSE2) assembly code.

(This achieves about a 3% speedup at bitslice DES on Core 2'ish CPUs.)

* Added support for 10-character DES-based tripcodes (not optimized yet).

(This was originally a proof of concept patch I posted in response to a
message on john-users, then it made its way into -jumbo, and now into
the main tree.  Optimizing it is a next step.)

* Added support for the "$2y$" prefix of bcrypt hashes.

(These are treated the same as "$2a$" in JtR.)

* Added two more hash table sizes (16M and 128M entries) for faster
processing of very large numbers of hashes per salt (over 1M).

(John may kind of waste memory when you load over a million hashes
now - it may even trade an extra gigabyte for a very slight speedup.
The rationale here is that computers do have gigabytes of RAM these days
and we'd better put it to use.  If you need to be loading millions of
hashes, yet don't want to let John trade RAM for speed like this, use
"--save-memory=2".)

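As a back-of-the-envelope check on the "extra gigabyte" remark, the sketch
below estimates the size of the two new hash tables.  It assumes that "16M"
and "128M" are powers of two and that each entry is a single 8-byte pointer
on a 64-bit build.  Both are assumptions made for this illustration, and
John's actual memory layout may differ.

```python
POINTER_BYTES = 8  # assumption: 64-bit build, one pointer per table entry


def table_bytes(entries, entry_bytes=POINTER_BYTES):
    """Memory taken by a flat table of `entries` fixed-size slots."""
    return entries * entry_bytes


for label, entries in (("16M", 16 * 2**20), ("128M", 128 * 2**20)):
    mib = table_bytes(entries) / 2**20
    print(f"{label} entries: {mib:.0f} MiB")
```

Under these assumptions the 128M-entry table alone takes 1 GiB.
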
* Added two pre-defined external mode variables: "abort" and "status",
which let an external mode request the current cracking session to be
aborted or the status line to be displayed, respectively.

(There are usage examples in the default john.conf included in 1.7.9.)

* Made some minor optimizations to external mode function calls and
virtual machine implementation.

(Just slightly faster external mode processing.)

* The "--make-charset" option now uses floating-point rather than 64-bit
integer operations, which allows for larger CHARSET_* settings in params.h.

(This deals with the common request where people want incremental mode
to use a larger character set and/or password length.  This is now
easier to do, being able to adjust the CHARSET_* settings almost
arbitrarily - e.g., the full 8-bit character set and lengths up to 16
(and even more) may be enabled at once.  A rebuild of John and
regeneration of .chr files are still needed after such changes, though.)
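
To see why 64-bit integers become a limit, consider how quickly the number of
candidate passwords grows with character set size and length.  The sketch
below is only an illustration of that growth.  The announcement does not say
which quantities "--make-charset" actually computes, and the 95-character,
length-8 setting is assumed here as a typical printable-ASCII configuration.

```python
UINT64_MAX = 2**64 - 1


def keyspace(charset_size, length):
    """Number of strings of exactly `length` characters."""
    return charset_size ** length


def fits_uint64(n):
    """True if n can be held in an unsigned 64-bit integer."""
    return n <= UINT64_MAX


for size, length in ((95, 8), (95, 10), (256, 16)):
    n = keyspace(size, length)
    print(f"{size} chars, length {length}: "
          f"fits in 64 bits: {fits_uint64(n)}, as float: {float(n):.3e}")
```

A double-precision float represents such counts only approximately, but its
range is far beyond 2^64, which is what makes the larger settings workable.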

* Added runtime detection of Intel AVX and AMD XOP instruction set
extensions, with optional fallback to an alternate program binary.

(Previously, when an -avx or -xop build was run on a CPU not supporting
these instruction set extensions or under an operating system not
saving/restoring the registers on context switches, the program would
crash.  Now it prints a nice "Sorry ..." message, or it can even
transparently invoke a fallback binary.  The latter functionality is
made use of in john.spec for the RPM package of John in Owl:
http://cvsweb.openwall.com/cgi/cvsweb.cgi/Owl/packages/john/ )
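
The fallback idea can be sketched as a small launcher-style check.  This is
not how John implements it (John performs its own detection at startup).
The sketch instead parses Linux /proc/cpuinfo-style text, and the binary
names "john-avx" and "john-sse2" are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def pick_binary(flags, needed="avx",
                preferred="john-avx", fallback="john-sse2"):
    """Choose the optimized binary only if the CPU advertises the feature."""
    return preferred if needed in flags else fallback


sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx\n"
print(pick_binary(cpu_flags(sample)))

# On Linux one could feed it the real data, for example:
#   with open("/proc/cpuinfo") as f:
#       print(pick_binary(cpu_flags(f.read())))
```

A real launcher would then exec the chosen binary with the original
arguments, so the user never notices which build is running.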

* In OpenMP-enabled builds, added support for fallback to a non-OpenMP
build when the requested thread count is 1.

(OpenMP-enabled builds are often suboptimal when running just one
thread, which they may sometimes have to do, e.g. because the system
actually has only one logical CPU.  Now a binary package of John, such
as Owl's, is able to make such builds transparently invoke a non-OpenMP
build for slightly better performance.  This is in fact currently made
use of in john.spec on Owl available at the URL above.)
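
The decision described above can be sketched as follows.  OMP_NUM_THREADS is
the standard OpenMP environment variable, but the helper functions and the
binary names "john-omp" and "john-non-omp" are hypothetical, and John's own
logic may differ in detail.

```python
import os


def requested_threads(env=os.environ, logical_cpus=None):
    """Thread count an OpenMP program would typically end up with."""
    value = env.get("OMP_NUM_THREADS", "").strip()
    if value.isdigit() and int(value) > 0:
        return int(value)
    # Without an explicit setting, assume one thread per logical CPU.
    return logical_cpus or os.cpu_count() or 1


def pick_build(threads, omp_build="john-omp", plain_build="john-non-omp"):
    """Prefer the non-OpenMP build when only one thread would run."""
    return plain_build if threads == 1 else omp_build


print(pick_build(requested_threads({"OMP_NUM_THREADS": "1"}, logical_cpus=8)))
print(pick_build(requested_threads({}, logical_cpus=8)))
```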

* Added relbench, a Perl script to compare two "john --test" benchmark
runs, such as for different machines, "make" targets, C compilers,
optimization options, and/or versions of John the Ripper.

(This was introduced in 1.7.8-jumbo-8 and announced with a lot of detail
previously:
http://www.openwall.com/lists/announce/2011/11/09/1
1.7.9 includes a slightly newer revision of the script (correcting an
issue reported by JimF) and it has the script documented in doc/OPTIONS,
as well as in a lengthy comment in the script itself.)

* Additional public lists of "top N passwords" have been merged into the
bundled common passwords list, and some insufficiently common passwords
were removed from the list.

(Most importantly, RockYou top 1000 and Gawker top 250 lists were used
to make John's password.lst hopefully more suitable for website
passwords while also keeping it suitable for operating system passwords.
Common passwords that rank high on multiple lists are listed closer to
the beginning of password.lst, whereas some passwords that were seen on
operating system accounts but turned out to be extremely uncommon on
websites are now moved to the end of the list.)

* Many minor enhancements and a few bug fixes were made.

(Obviously, I can't and shouldn't document every individual change here.)

Now some more benchmarks.  Core i7-2600K, stock clock rate and Turbo
Boost settings (so should be 3.5 GHz here), 8 threads:

Benchmarking: Traditional DES [128/128 BS AVX-16]... DONE
Many salts:     22773K c/s real, 2843K c/s virtual
Only one salt:  18284K c/s real, 2291K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS AVX-16]... DONE
Many salts:     741376 c/s real, 93020 c/s virtual
Only one salt:  626566 c/s real, 79104 c/s virtual

Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    66914 c/s real, 8343 c/s virtual

Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    4800 c/s real, 606 c/s virtual

Benchmarking: LM DES [128/128 BS AVX-16]... DONE
Raw:    88834K c/s real, 11146K c/s virtual

(4 threads was faster for LM here.)

Dual Xeon E5420 (8 cores total, running 8 threads), 2.5 GHz:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     20334K c/s real, 2546K c/s virtual
Only one salt:  15499K c/s real, 1936K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     654869 c/s real, 82001 c/s virtual
Only one salt:  558284 c/s real, 69785 c/s virtual

Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    85844 c/s real, 10727 c/s virtual

Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    5135 c/s real, 642 c/s virtual

Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    54027K c/s real, 6753K c/s virtual

For comparison, a non-OpenMP build on the same machine (using one core):

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     2787K c/s real, 2787K c/s virtual
Only one salt:  2676K c/s real, 2676K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     89472 c/s real, 88586 c/s virtual
Only one salt:  87168 c/s real, 86304 c/s virtual

Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    10768 c/s real, 10768 c/s virtual

Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    658 c/s real, 658 c/s virtual

Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    38236K c/s real, 38619K c/s virtual

Comparing these with relbench:

Number of benchmarks:           7
Minimum:                        1.41299 real, 0.17486 virtual
Maximum:                        7.97214 real, 0.99619 virtual
Median:                         7.29602 real, 0.91353 virtual
Median absolute deviation:      0.67612 real, 0.08267 virtual
Geometric mean:                 5.60660 real, 0.70207 virtual
Geometric standard deviation:   1.77219 real, 1.78048 virtual

Excluding LM and single salt benchmarks:

Number of benchmarks:           4
Minimum:                        7.29602 real, 0.91353 virtual
Maximum:                        7.97214 real, 0.99619 virtual
Median:                         7.55772 real, 0.95035 virtual
Median absolute deviation:      0.25385 real, 0.03054 virtual
Geometric mean:                 7.59208 real, 0.95215 virtual
Geometric standard deviation:   1.03971 real, 1.03654 virtual

So for password security auditing (thus running on many salts at once)
against these four hash types (the crypt(3) varieties), we get a median
and mean speedup of over 7.5x when using John's OpenMP parallelization
on an 8-core machine without SMT.  (It is important to note that the
machine was under no other load, though.  Unfortunately, OpenMP tends to
be very sensitive to other load.)
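
Several of the relbench figures above can be reproduced by hand from the
"real" c/s values of the two dual Xeon E5420 runs.  The sketch below is an
independent recomputation and not the relbench script itself.  It leaves out
the 4-benchmark median because the script's handling of an even number of
benchmarks is not described here.

```python
import math
from statistics import median

# (OpenMP build, non-OpenMP build) "real" c/s on the dual Xeon E5420 above.
real_cps = {
    "DES, many salts": (20334000, 2787000),
    "DES, one salt": (15499000, 2676000),
    "BSDI, many salts": (654869, 89472),
    "BSDI, one salt": (558284, 87168),
    "FreeBSD MD5": (85844, 10768),
    "OpenBSD Blowfish": (5135, 658),
    "LM DES": (54027000, 38236000),
}


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = {name: omp / plain for name, (omp, plain) in real_cps.items()}
# The four benchmarks left after excluding LM and the single-salt runs.
many_salts = [r for name, r in ratios.items()
              if name != "LM DES" and "one salt" not in name]

print(f"min {min(ratios.values()):.5f}, max {max(ratios.values()):.5f}")
print(f"median of 7: {median(ratios.values()):.5f}")
print(f"geometric mean of 7: {geometric_mean(ratios.values()):.5f}")
print(f"geometric mean of 4: {geometric_mean(many_salts):.5f}")
```

The minimum is the LM ratio and the maximum is the FreeBSD MD5 ratio, which
matches the relbench output quoted above.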

SPARC Enterprise M8000 server, 8 SPARC64-VII CPUs at 2880 MHz, Sun
Studio 12.2 compiler.  These are quad-core CPUs with 2 threads per core,
so 32 cores and 64 threads total (actually running 64 threads here):

Benchmarking: Traditional DES [64/64 BS]... DONE
Many salts:     25664K c/s real, 756852 c/s virtual
Only one salt:  11066K c/s real, 728273 c/s virtual

Benchmarking: BSDI DES (x725) [64/64 BS]... DONE
Many salts:     1118K c/s real, 24811 c/s virtual
Only one salt:  694930 c/s real, 24535 c/s virtual

Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:    156659 c/s real, 3075 c/s virtual

Benchmarking: OpenBSD Blowfish (x32) [32/64]... DONE
Raw:    9657 c/s real, 242 c/s virtual

Benchmarking: LM DES [64/64 BS]... DONE
Raw:    16246K c/s real, 5860K c/s virtual

Some of these speeds are impressive, yet they're comparable to those of a
much smaller x86 machine (with between 4 and 16 cores, depending on the
test).  Clearly, the lack of wider than 64-bit vectors in SPARC hurts
bitslice DES speeds, big-endianness is unfriendly to MD5 (and vice
versa), and LM does not scale at all (maybe a result of higher latency
interconnect between the CPUs here, although I am just guessing).
The Blowfish speed is pretty good, though - a dual Xeon X5650 machine
(12 cores, 24 threads total) achieves a similar speed (with 1.7.8 here,
but 1.7.9 should be the same at this):

Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:    10800 c/s real, 449 c/s virtual

So 32 SPARC cores are similar to 12 x86 cores on this one.

The MD5 speed on the SPARC is roughly twice that of the 8-core x86
machine benchmarked above, so it could correspond to a 16-core one, but if
we recall that -jumbo has faster code, which would not run on a SPARC
(no equivalent to 128-bit SSE2 vectors there), it's not so great.  To be
fair, a lesser speedup could be obtained with 64-bit VIS vectors (if
anyone bothers implementing that), though.  And, of course, applying
OpenMP to individual relatively low-level loops only is not a very
efficient way to parallelize John; I opted for it for now in part
because it allowed me to preserve the usual end-user behavior of John -
almost like when it's running on a single CPU.  Unlike the x86 systems
benchmarked above, this same SPARC server would likely achieve a much
better combined performance with many individual non-OpenMP John
processes.  The real to virtual time ratios are still significantly
below 64, which indicates that there's idle CPU time left.
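
The real to virtual ratio is CPU time consumed per unit of wall-clock time,
so with 64 threads fully busy it would approach 64.  A minimal sketch that
computes it from the SPARC figures quoted above (the dictionary and the
"busy" percentage are just one way of presenting the numbers):

```python
# (real c/s, virtual c/s) from the SPARC Enterprise M8000 run above.
sparc = {
    "Traditional DES, many salts": (25664000, 756852),
    "Traditional DES, one salt": (11066000, 728273),
    "BSDI DES, many salts": (1118000, 24811),
    "BSDI DES, one salt": (694930, 24535),
    "FreeBSD MD5": (156659, 3075),
    "OpenBSD Blowfish": (9657, 242),
    "LM DES": (16246000, 5860000),
}
THREADS = 64

ratios = {name: real / virtual for name, (real, virtual) in sparc.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} of {THREADS} ({ratio / THREADS:.0%} busy)")
```

Even the best case (FreeBSD MD5, about 51) stays well short of 64, and LM is
below 3.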

Yet I think it is good that the main tree's code is portable, allowing
such beasts to be put to use if they would otherwise be idle. ;-)  Also,
this serves well to get John's code changes tested and to improve its
overall quality for the more common systems.  Some bugs were in fact
discovered and fixed prior to the 1.7.9 release due to such testing.
It also makes John reusable as an OpenMP benchmark.

As usual, any feedback is very welcome.

Alexander