john-users - John the Ripper 1.7.9-jumbo-6

Message-ID: <20120629154921.GA31976@openwall.com>
Date: Fri, 29 Jun 2012 19:49:21 +0400
From: Solar Designer <solar@...nwall.com>
To: announce@...ts.openwall.com, john-users@...ts.openwall.com
Subject: John the Ripper 1.7.9-jumbo-6

Hi,

We released John the Ripper 1.7.9-jumbo-6 earlier today.  This is a
"community-enhanced" version, which includes many contributions from JtR
community members - in fact, that's what it primarily consists of.  It's
been half a year since 1.7.9-jumbo-5, which is a lot of time, and a lot
has been added to jumbo since then.  Even though it's just a one digit
change in the version number, this is in fact the biggest single jumbo
update we've made so far.  It appears that between -5 and -6 the source
code grew by over 1 MB, or by over 40,000 lines of code (and that's not
including lines that were changed as opposed to added).  The biggest new
thing is integrated GPU support, both CUDA and OpenCL - although for a
subset of the hash and non-hash types only, not for all that are
supported on CPU.  (Also, it is efficient only for so-called "slow"
hashes now, and for the "non-hashes" that we chose to support on GPU.
For "fast" hashes, it is just a development milestone, albeit a
desirable one as well.)  The other biggest new thing is the addition of
support for many more "non-hashes" and hashes (see below).

You may download John the Ripper 1.7.9-jumbo-6 at the usual place:

http://www.openwall.com/john/

With so many changes, even pushing this release out was difficult.
Despite the statement that "jumbo is buggy by definition", we did try
to eliminate as many bugs as we reasonably could - but after a week of
mad testing and bug-fixing, I chose to release the tree as-is, only
documenting the remaining known bugs (below and in doc/BUGS).  Still, we
ended up posting over 1200 messages to john-dev in June - even though in
prior months we did not even hit 500.  Indeed, we did run plenty of
tests and fix plenty of bugs, which you won't see in this release.

I've included a lengthy description of some of the changes below, and
below that I'll add some benchmark results that I find curious (such as
for bcrypt on CPU vs. GPU).

Direct code contributors to 1.7.9-jumbo-6 (since 1.7.9-jumbo-5), by
commit count:

magnum
Dhiru Kholia
Frank Dittrich
JimF (Jim Fougeron)
myrice (Dongdong Li)
Claudio Andre
Lukas Odzioba
Solar Designer
Sayantan Datta
Samuele Giovanni Tonon
Tavis Ormandy
bartavelle (Simon Marechal)
Sergey V
bizonix
Robert Veznaver
Andras

New non-hashes:
* Mac OS X keychains [OpenMP]  (Dhiru)
  - based on research from extractkeychain.py by Matt Johnston
* KeePass 1.x files [OpenMP]  (Dhiru)
  - keepass2john is based on ideas from kppy by Karsten-Kai Koenig
    http://gitorious.org/kppy/kppy
* Password Safe [OpenMP, CUDA, OpenCL]  (Dhiru, Lukas)
* ODF files [OpenMP]  (Dhiru)
* Office 2007/2010 documents [OpenMP]  (Dhiru)
  - office2john is based on test-dump-msole.c by Jody Goldberg and
  OoXmlCrypto.cs by Lyquidity Solutions Limited
* Mozilla Firefox, Thunderbird, SeaMonkey master passwords [OpenMP]  (Dhiru)
  - based on FireMaster and FireMasterLinux
    http://code.google.com/p/rainbowsandpwnies/wiki/FiremasterLinux
* RAR -p mode encrypted archives  (magnum)
  - RAR -hp mode was supported previously; now both modes are supported

New challenge/responses, MACs:
* WPA-PSK [OpenMP, CUDA, OpenCL]  (Lukas, Solar)
  - CPU code is loosely based on Aircrack-ng
    http://www.aircrack-ng.org
    http://openwall.info/wiki/john/WPA-PSK
* VNC challenge/response authentication [OpenMP]  (Dhiru)
  - based on VNCcrack by Jack Lloyd
    http://www.randombit.net/code/vnccrack/
* SIP challenge/response authentication [OpenMP]  (Dhiru)
  - based on SIPcrack by Martin J. Muench
* HMAC-SHA-1, HMAC-SHA-224, HMAC-SHA-256, HMAC-SHA-384, HMAC-SHA-512  (magnum)

New hashes:
* IBM RACF [OpenMP]  (Dhiru)
  - thanks to Nigel Pentland (author of CRACF) and Main Framed for providing
  algorithm details, sample code, sample RACF binary database, test vectors
* sha512crypt (SHA-crypt) [OpenMP, CUDA, OpenCL]  (magnum, Lukas, Claudio)
  - previously supported in 1.7.6+ only via "generic crypt(3)" interface
* sha256crypt (SHA-crypt) [OpenMP, CUDA]  (magnum, Lukas)
  - previously supported in 1.7.6+ only via "generic crypt(3)" interface
* DragonFly BSD SHA-256 and SHA-512 based hashes [OpenMP]  (magnum)
* Django 1.4 [OpenMP]  (Dhiru)
* Drupal 7 $S$ phpass-like (based on SHA-512) [OpenMP]  (magnum)
* WoltLab Burning Board 3 [OpenMP]  (Dhiru)
* New EPiServer default (based on SHA-256) [OpenMP]  (Dhiru)
* GOST R 34.11-94 [OpenMP]  (Dhiru, Sergey V, JimF)
* MD4 support in "dynamic" hashes (user-configurable)  (JimF)
  - previously, only MD5 and SHA-1 were supported in "dynamic"
* Raw-SHA1-LinkedIn (raw SHA-1 with first 20 bits zeroed)  (JimF)
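
The "first 20 bits zeroed" variant in the last entry above can be illustrated
with a short Python sketch.  This is not JtR's implementation, just the idea:
20 bits are 5 hex digits, so a candidate matches when everything after the
first 5 hex digits of its SHA-1 agrees with the stored value.  The function
names and the UTF-8 encoding of candidates are assumptions made for this
example.

```python
import hashlib

MASKED_HEX_DIGITS = 5  # 20 bits = 5 hex digits


def sha1_hex(password: str) -> str:
    """Plain raw SHA-1 of the candidate, as lowercase hex (UTF-8 assumed)."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def mask_leading_bits(digest_hex: str) -> str:
    """Zero the first 20 bits of a 40-digit hex SHA-1 value."""
    return "0" * MASKED_HEX_DIGITS + digest_hex[MASKED_HEX_DIGITS:]


def matches_masked(candidate: str, stored_hex: str) -> bool:
    """Compare only the 140 bits that survive the masking."""
    tail = sha1_hex(candidate)[MASKED_HEX_DIGITS:]
    return tail == stored_hex.lower()[MASKED_HEX_DIGITS:]


if __name__ == "__main__":
    stored = mask_leading_bits(sha1_hex("password"))
    print(stored)
    print(matches_masked("password", stored))
    print(matches_masked("letmein", stored))
```

Because the comparison ignores the first 20 bits entirely, this sketch would
also accept an unmasked hash of the same password.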

Alternate implementations for previously supported hashes:
* Faster raw SHA-1 (raw-sha1-ng, password length up to 15)  (Tavis)

OpenMP support in new formats:
* Mac OS X keychains  (Dhiru)
* KeePass 1.x files  (Dhiru)
* Password Safe  (Lukas)
* ODF files  (Dhiru)
* Office 2007/2010 documents  (Dhiru)
* Mozilla Firefox, Thunderbird, SeaMonkey master passwords  (Dhiru)
* WPA-PSK  (Solar)
* VNC challenge/response authentication  (Dhiru)
* SIP challenge/response authentication  (Dhiru)
* IBM RACF  (Dhiru)
* DragonFly BSD SHA-256 and SHA-512 based hashes  (magnum)
* Django 1.4  (Dhiru)
* Drupal 7 $S$ phpass-like (based on SHA-512)  (magnum)
* WoltLab Burning Board 3  (Dhiru)
* New EPiServer default (based on SHA-256)  (Dhiru)
* GOST R 34.11-94  (Dhiru, JimF)

OpenMP support for previously supported hashes that lacked it:
* Mac OS X 10.4 - 10.6 salted SHA-1  (magnum)
* DES-based tripcodes  (Solar)
* Invision Power Board 2.x salted MD5  (magnum)
* HTTP Digest access authentication MD5  (magnum)
* MySQL (old)  (Solar)

CUDA support for:
* phpass MD5-based "portable hashes"  (Lukas)
* md5crypt (FreeBSD-style MD5-based crypt(3) hashes)  (Lukas)
* sha512crypt (glibc 2.7+ SHA-crypt)  (Lukas)
* sha256crypt (glibc 2.7+ SHA-crypt)  (Lukas)
* Password Safe  (Lukas)
* WPA-PSK  (Lukas)
* Raw SHA-224, raw SHA-256 [inefficient]  (Lukas)
* MSCash (DCC) [not working reliably yet]  (Lukas)
* MSCash2 (DCC2) [not working reliably yet]  (Lukas)
* Raw SHA-512 [not working reliably yet]  (myrice)
* Mac OS X 10.7 salted SHA-512 [not working reliably yet]  (myrice)
  - we have already identified the problem with the above two, and a post
  1.7.9-jumbo-6 fix should be available shortly - please ask on john-users if
  interested in trying it out

OpenCL support for:
* phpass MD5-based "portable hashes"  (Lukas)
* md5crypt (FreeBSD-style MD5-based crypt(3) hashes)  (Lukas)
* sha512crypt (glibc 2.7+ SHA-crypt)  (Claudio)
  - suitable for NVIDIA cards, faster than the CUDA implementation above
  http://openwall.info/wiki/john/OpenCL-SHA-512
* bcrypt (OpenBSD-style Blowfish-based crypt(3) hashes)  (Sayantan)
  - pre-configured for AMD Radeon HD 7970, will likely fail on others unless
  WORK_GROUP_SIZE is adjusted in opencl_bf_std.h and opencl/bf_kernel.cl;
  the achieved level of performance is CPU-like (bcrypt is known to be
  somewhat GPU-unfriendly - a lot more than SHA-512)
  http://openwall.info/wiki/john/GPU/bcrypt
* MSCash2 (DCC2)  (Sayantan)
  - with optional and experimental multi-GPU support as a compile-time hack
  (even AMD+NVIDIA mix), by editing init() in opencl_mscash2_fmt.c
* Password Safe  (Lukas)
* WPA-PSK  (Lukas)
* RAR  (magnum)
* MySQL 4.1 double-SHA-1 [inefficient]  (Samuele)
* Netscape LDAP salted SHA-1 (SSHA) [inefficient]  (Samuele)
* NTLM [inefficient]  (Samuele)
* Raw MD5 [inefficient]  (Dhiru, Samuele)
* Raw SHA-1 [inefficient]  (Samuele)
* Raw SHA-512 [not working properly yet]  (myrice)
* Mac OS X 10.7 salted SHA-512 [not working properly yet]  (myrice)
  - we have already identified the problem with the above two, and a post
  1.7.9-jumbo-6 fix should be available shortly - please ask on john-users if
  interested in trying it out

Several of these require byte-addressable store (any NVIDIA card, but
only 5000 series or newer if AMD/ATI).  Also, OpenCL kernels for "slow"
hashes/non-hashes (e.g. RAR) may cause "ASIC hang" on certain AMD/ATI
cards with recent driver versions.  We'll try to address these issues in
a future version.

AMD XOP (Bulldozer) support added for:
* Many hashes based on MD4, MD5, SHA-1  (Solar)

Uses of SIMD (MMX assembly, SSE2/AVX/XOP intrinsics) added for:
* Mac OS X 10.4 - 10.6 salted SHA-1  (magnum)
* Invision Power Board 2.x salted MD5  (magnum)
* HTTP Digest access authentication MD5  (magnum)
* SAP CODVN B (BCODE) MD5  (magnum)
* SAP CODVN F/G (PASSCODE) SHA-1  (magnum)
* Oracle 11  (magnum)

Other optimizations:
* Reduced memory usage for raw-md4, raw-md5, raw-sha1, and nt2  (magnum)
* Prefer CommonCrypto over OpenSSL on Mac OS X 10.7  (Dhiru)
* New SSE2 intrinsics code for SHA-1  (JimF, magnum)
* Smarter use of SSE2 and SSSE3 intrinsics (the latter only if enabled in the
compiler at build time) to implement some bit rotates for MD5, SHA-1  (Solar)
* Assorted optimizations for raw SHA-1 and HMAC-MD5  (magnum)
* In RAR format, added inline storing of RAR data in JtR input file when the
original file is small enough  (magnum)
* Added use of the bitslice DES implementation for tripcodes  (Solar)
* Raw-MD5-unicode made "thick" again (that is, not building upon "dynamic"),
using much faster code  (magnum)
* Assorted performance tweaks in "salted-sha1" (SSHA)  (magnum)
* Added functions for larger hash tables to several formats  (magnum, Solar)

Other assorted enhancements:
* linux-*-gpu (both CUDA and OpenCL at once), linux-*-cuda, linux-*-opencl,
macosx-x86-64-opencl make targets  (magnum et al.)
* linux-*-native make targets (pass -march=native to gcc)  (magnum)
* New option: --dupe-suppression (for wordlist mode)  (magnum)
* New option: --loopback[=FILE] (implies --dupe-suppression)  (magnum)
* New option: --max-run-time=N for graceful exit after N seconds  (magnum)
* New option: --log-stderr  (magnum)
* New option: --regenerate-lost-salts=N for cracking hashes where we do not
have the salt and essentially need to crack it as well  (JimF)
* New unlisted option: --list (for bash completion, GUI, etc.)  (magnum)
* --list=[encodings|opencl-devices]  (magnum)
* --list=cuda-devices  (Lukas)
* --list=format-details  (Frank)
* --list=subformats  (magnum)
* New unlisted option: --length=N for reducing maximum plaintext length of a
format, mostly for testing purposes  (magnum)
* Enhanced parameter syntax for --markov: may refer to a configuration file
section, may specify the start and/or end in percent of total  (Frank)
* Make incremental mode restore ETA figures  (JimF)
* In "dynamic", support NUL octets in constants  (JimF)
* In "salted-sha1" (SSHA), support any salt length  (magnum)
* Use comment and home directory fields from PWDUMP-style input  (magnum)
* Sort the format names list in "john" usage output alphabetically  (magnum)
* New john.conf options subsection "MPI"  (magnum)
* New john.conf config item CrackStatus under Options:Jumbo  (magnum)
* \xNN escape sequence to specify arbitrary characters in rules  (JimF)
* New rule command _N to reject a word unless it is of length N  (JimF)
* Extra wordlist rule sections: Extra, Single-Extra, Jumbo  (magnum)
* Enhanced "Double" external mode sample  (JimF)
* Source $JOHN/john.local.conf by default  (magnum)
* Many format and algorithm names have been changed for consistency  (Solar)
* When intrinsics are in use, the reported algorithm name now tells which ones
(SSE2, AVX, or XOP)  (Solar)
* benchmark-unify: a Perl script to unify benchmark output of different
versions of JtR for use with relbench  (Frank)
* Per-benchmark speed ratio output added to relbench  (Frank)
* bash completion for JtR (to install: "sudo make bash-completion")  (Frank)
* New program: raw2dyna (helper to convert raw hashes to "dynamic")  (JimF)
* New program: pass_gen.pl (generates hashes from plaintexts)  (JimF, magnum)
* Many code changes made, many bugs fixed, many new bugs introduced  (all)

Now the promised benchmarks.  Here's the overall speed change from
1.7.9-jumbo-5 to 1.7.9-jumbo-6 on one core of an FX-8120 (should be 4.0 GHz
turbo),
after running through benchmark-unify and relbench (yet about 50 of the
new version's benchmark results could not be directly compared against
results of the previous version, and thus are excluded):

Number of benchmarks:           151
Minimum:                        0.84668 real, 0.84668 virtual
Maximum:                        10.92416 real, 10.92416 virtual
Median:                         1.10800 real, 1.10800 virtual
Median absolute deviation:      0.12531 real, 0.12369 virtual
Geometric mean:                 1.26217 real, 1.26284 virtual
Geometric standard deviation:   1.47239 real, 1.47274 virtual

Ditto for OpenMP-enabled builds (8 threads, should be 3.1 GHz):

Number of benchmarks:           151
Minimum:                        0.94616 real, 0.48341 virtual
Maximum:                        24.19709 real, 4.29610 virtual
Median:                         1.17609 real, 1.05964 virtual
Median absolute deviation:      0.17436 real, 0.11465 virtual
Geometric mean:                 1.35493 real, 1.17097 virtual
Geometric standard deviation:   1.71505 real, 1.36577 virtual

These show that overall we do indeed have a speedup, and that's without
any GPU stuff.
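
For readers unfamiliar with the summary figures above, here is a minimal
Python sketch of how such statistics can be computed from a list of
per-benchmark speed ratios.  It is not relbench's code; the function name, the
sample ratios, and the choice of the population form of the standard deviation
are all assumptions made for illustration.

```python
import math
import statistics


def summarize(ratios):
    """Return summary statistics for a list of positive speed ratios."""
    logs = [math.log(r) for r in ratios]
    median = statistics.median(ratios)
    return {
        "count": len(ratios),
        "min": min(ratios),
        "max": max(ratios),
        "median": median,
        # Median absolute deviation: median distance from the median.
        "mad": statistics.median([abs(r - median) for r in ratios]),
        # Geometric mean: exponentiated arithmetic mean of the logs.
        "geo_mean": math.exp(sum(logs) / len(logs)),
        # Geometric standard deviation (population form, an assumption).
        "geo_sd": math.exp(statistics.pstdev(logs)),
    }


if __name__ == "__main__":
    # Made-up ratios (new c/s divided by old c/s), for illustration only.
    sample = [0.85, 1.0, 1.1, 1.25, 10.9]
    for key, value in summarize(sample).items():
        print(f"{key:9s} {value:.5f}")
```

The geometric mean suits ratios because they combine multiplicatively: a 2x
speedup and a 2x slowdown average out to 1.0, whereas the arithmetic mean
would report 1.25.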

Also curious is the speedup due to OpenMP in 1.7.9-jumbo-6 (same version in
both cases), on the same CPU (8 threads):

Number of benchmarks:           202
Minimum:                        0.76235 real, 0.09553 virtual
Maximum:                        30.51791 real, 3.81904 virtual
Median:                         1.01479 real, 0.98287 virtual
Median absolute deviation:      0.02747 real, 0.03514 virtual
Geometric mean:                 1.71441 real, 0.77454 virtual
Geometric standard deviation:   2.08823 real, 1.50966 virtual

The 30x maximum speedup (with only 8 threads) is indeed abnormal; it is for:

Ratio:  30.51791 real, 3.81904 virtual  SIP MD5:Raw

We'll correct the non-OpenMP performance for SIP in the next version.
For the rest, the maximum speedup is 6.13x for SSH, which is great
(considering that the CPU clock rate reduces with more threads running,
and that this is a 4-module CPU rather than a true 8-core).  Here are
the top 10 OpenMP performers (excluding SIP):

Ratio:  6.13093 real, 0.77210 virtual   SSH RSA/DSA (one 2048-bit RSA and one 1024-bit DSA key):Raw
Ratio:  6.05882 real, 0.75737 virtual   NTLMv2 C/R MD4 HMAC-MD5:Many salts
Ratio:  6.04342 real, 0.75548 virtual   LMv2 C/R MD4 HMAC-MD5:Many salts
Ratio:  5.92830 real, 0.74108 virtual   GOST R 34.11-94:Raw
Ratio:  5.81605 real, 0.73986 virtual   sha256crypt (rounds=5000):Raw
Ratio:  5.65289 real, 0.70523 virtual   sha512crypt (rounds=5000):Raw
Ratio:  5.63333 real, 0.72034 virtual   Drupal 7 $S$ SHA-512 (x16385):Raw
Ratio:  5.56435 real, 0.69937 virtual   OpenBSD Blowfish (x32):Raw
Ratio:  5.50484 real, 0.69682 virtual   Password Safe SHA-256:Raw
Ratio:  5.49613 real, 0.68814 virtual   Sybase ASE salted SHA-256:Many salts

The worst regression is for:

Ratio:  0.76235 real, 0.09553 virtual   LM DES:Raw

It is known that our current LM hash code does not scale well, and is
very fast even with one thread (close to the bottleneck of the current
interface).  It is in fact better not to use OpenMP for LM hashes yet,
or to keep the thread count low (e.g., 4 would behave better than 8).
The low median and mean speedup are because many hashes still lack
OpenMP support - mostly the "fast" ones, where we'd bump into the
bottleneck anyway.  We might deal with this later.  For "slow" hashes,
the speedup with OpenMP is close to perfect (5x to 6x for this CPU).
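
The "real" and "virtual" ratios above are related in a simple way.  The
Python sketch below assumes that "virtual" figures are measured against CPU
time summed over all threads, so with 8 threads the virtual ratio should be
close to the real ratio divided by 8.  That quotient is the per-thread
efficiency.  The ratios are copied from the results above; the function name
is made up for this example.

```python
THREADS = 8

# (real ratio, virtual ratio) copied from the OpenMP results above.
RESULTS = {
    "SSH RSA/DSA": (6.13093, 0.77210),
    "sha512crypt": (5.65289, 0.70523),
    "LM DES": (0.76235, 0.09553),
}


def parallel_efficiency(real_speedup: float, threads: int = THREADS) -> float:
    """Fraction of ideal linear scaling achieved with the given threads."""
    return real_speedup / threads


if __name__ == "__main__":
    for name, (real, virtual) in RESULTS.items():
        eff = parallel_efficiency(real)
        print(f"{name:12s} real/8 = {eff:.3f}, reported virtual = {virtual:.3f}")
```

An efficiency of 1.0 would mean perfect scaling across 8 threads.  On this
CPU that is not attainable, for the reasons given above (lower clock rate
with more threads running, 4 modules rather than 8 full cores).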

Now to the new stuff.  The effect of XOP (make linux-x86-64-xop):

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=md5
Benchmarking: FreeBSD MD5 [128/128 XOP intrinsics 8x]... (8xOMP) DONE
Raw:    204600 c/s real, 25625 c/s virtual

-5 achieved at most:

user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=md5
Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    158208 c/s real, 19751 c/s virtual

with "make linux-x86-64i" (icc precompiled SSE2 intrinsics), and only:

user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=md5
Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    141312 c/s real, 17664 c/s virtual

with "make linux-x86-64-xop" because it did not yet use XOP for MD5 (nor
for MD4 and SHA-1), only knowing how to use it for DES (which it did).

So we got an over 20% speedup due to XOP here.

Similarly, for raw SHA-1, the best result with -5 is:

user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=raw-sha1
Benchmarking: Raw SHA-1 [SSE2i 8x]... DONE
Raw:    13067K c/s real, 13067K c/s virtual

whereas -6 does, with JimF's and magnum's optimizations and with XOP:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=raw-sha1
Benchmarking: Raw SHA-1 [128/128 XOP intrinsics 8x]... DONE
Raw:    23461K c/s real, 23698K c/s virtual

and with Tavis' contribution:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=raw-sha1-ng
Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 XOP intrinsics 4x]... DONE
Raw:    28024K c/s real, 28024K c/s virtual

So that's an over 2x speedup if we can accept the length 15 limit, or
an almost 80% speedup otherwise.

Note: all of the raw SHA-1 benchmarks above are for one CPU core, not
for the entire chip (no OpenMP for fast hashes like this yet, but
there's MPI and there are always separate process invocations...)
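
The speedup figures quoted above follow directly from the c/s numbers in the
benchmark output.  Here is a small Python sketch of the arithmetic; the
function names are made up for this example, and the raw SHA-1 values are
given in thousands of c/s, as printed.

```python
def speedup(new_cps: float, old_cps: float) -> float:
    """Ratio of new speed to old speed."""
    return new_cps / old_cps


def percent_gain(new_cps: float, old_cps: float) -> float:
    """Speedup expressed as a percentage increase over the old speed."""
    return (speedup(new_cps, old_cps) - 1.0) * 100.0


# (label, new c/s, old c/s), copied from the benchmark output above.
COMPARISONS = [
    ("md5crypt: -6 XOP vs. -5 icc SSE2", 204600, 158208),
    ("raw-sha1: -6 XOP vs. -5", 23461, 13067),
    ("raw-sha1-ng: -6 XOP vs. -5 raw-sha1", 28024, 13067),
]

if __name__ == "__main__":
    for label, new, old in COMPARISONS:
        print(f"{label}: {speedup(new, old):.2f}x "
              f"({percent_gain(new, old):.0f}% faster)")
```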

On to more important stuff, sha512crypt on CPU vs. GPU:

For reference, here's what we would get with the previous version, using
the glibc implementation of SHA-crypt:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=crypt -sub=sha512crypt
Benchmarking: generic crypt(3) SHA-512 rounds=5000 [?/64]... (8xOMP) DONE
Many salts:     1518 c/s real, 189 c/s virtual
Only one salt:  1515 c/s real, 189 c/s virtual

Now we also have a builtin implementation, although it nevertheless uses
OpenSSL for the SHA-512 primitive (it doesn't have its own SHA-512 yet -
adding that and making use of SIMD would provide much additional
speedup, this is a to-do item for us):

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt
Benchmarking: sha512crypt (rounds=5000) [64/64]... (8xOMP) DONE
Raw:    2045 c/s real, 256 c/s virtual

So it is about 35% faster.  Let's try GPUs, first GTX 570 1600 MHz
(a card that is vendor-overclocked to that frequency):

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-cuda
Benchmarking: sha512crypt (rounds=5000) [CUDA]... DONE
Raw:    3833 c/s real, 3833 c/s virtual

Another 2x speedup here, but that's still not it.  Let's see:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl
OpenCL platform 0: NVIDIA CUDA, 1 device(s).
Using device 0: GeForce GTX 570
Building the kernel, this could take a while
Local work size (LWS) 512, global work size (GWS) 7680
Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
Raw:    11405 c/s real, 11349 c/s virtual

And now this is it - Claudio's OpenCL code is really good on NVIDIA,
giving us a 5.5x speedup over CPU.  (SHA-512 is not as GPU-friendly as
e.g. MD5, but is friendly enough for some decent speedup.)

Let's also try AMD Radeon HD 7970 (normally a faster card), at stock clocks:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Building the kernel, this could take a while
Elapsed time: 17 seconds
Local work size (LWS) 32, global work size (GWS) 16384
Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
Raw:    5144 c/s real, 3276K c/s virtual

Not as much luck here yet.  Finally, for comparison and to show how any
one of the three OpenCL devices may be accessed from john's command-line
with --platform and --device options, the same OpenCL code on the CPU:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl -pla=1 -dev=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 1: AMD FX(tm)-8120 Eight-Core Processor
Local work size (LWS) 1, global work size (GWS) 1024
Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
Raw:    1850 c/s real, 233 c/s virtual

This shows that the code is indeed pretty efficient - almost reaching
OpenSSL's specialized code speed.

Now to bcrypt.  This CPU is pretty good at it:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... (8xOMP) DONE
Raw:    5300 c/s real, 664 c/s virtual

(FWIW, with overclocking I was able to get this to about 5650 c/s, but
not more - bumping into 125 W TDP.  The above is at stock clocks.)

This is for "$2a$05" or only 32 iterations, which is used as baseline
for benchmarks for historical reasons.  Actual systems often use
"$2a$08" (8 times slower) to "$2a$10" (32 times slower) these days.

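The relation between the cost setting and the slowdown quoted above is a
power of two: a cost of N means 2^N iterations, so "$2a$05" is 32 iterations
and each increment doubles the work.  The Python sketch below shows the
arithmetic and a rough speed estimate scaled from the 5300 c/s baseline.  The
function names are made up for this example, and the estimate ignores any
fixed per-hash overhead.

```python
BASELINE_COST = 5        # "$2a$05", the benchmark baseline
BASELINE_CPS = 5300.0    # c/s at the baseline, from the benchmark above


def parse_cost(prefix: str) -> int:
    """Extract the cost from a prefix such as "$2a$08"."""
    return int(prefix.split("$")[2])


def iterations(cost: int) -> int:
    """Number of iterations for a given cost setting (2 to the cost)."""
    return 1 << cost


def slowdown(cost: int, baseline: int = BASELINE_COST) -> float:
    """How many times slower than the baseline cost."""
    return iterations(cost) / iterations(baseline)


if __name__ == "__main__":
    for prefix in ("$2a$05", "$2a$08", "$2a$10"):
        cost = parse_cost(prefix)
        factor = slowdown(cost)
        print(f"{prefix}: {iterations(cost)} iterations, {factor:.0f}x slower, "
              f"roughly {BASELINE_CPS / factor:.0f} c/s")
```
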
Anyway, the reference cracking speed for bcrypt above is higher than the
speed for sha512crypt on the same CPU (with the current code at least,
which admittedly can be optimized much further).  Can we make it even
higher on a GPU?  Maybe, but not yet, not with the current code:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
****Please see 'opencl_bf_std.h' for device specific optimizations****
Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
Raw:    4143 c/s real, 238933 c/s virtual

user@...l:~/john-1.7.9-jumbo-6/run$ DISPLAY=:0 aticonfig --od-enable --od-setclocks=1225,1375
AMD Overdrive(TM) enabled

Default Adapter - AMD Radeon HD 7900 Series
                  New Core Peak   : 1225
                  New Memory Peak : 1375
user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
****Please see 'opencl_bf_std.h' for device specific optimizations****
Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
Raw:    5471 c/s real, 358400 c/s virtual

It's only with a 30% overclock that the high-end GPU gets to the same
level of performance as the 2-3 times cheaper CPU.  BTW, the GPU stays
cool with this overclock (73 C with stock cooling when running bf-opencl
for a while), precisely because we have to heavily under-utilize it due
to it not having enough local memory to accommodate as many parallel
bcrypt computations as we'd need for full occupancy and to hide memory
access latencies.

Maybe more optimal code will achieve better results, though.

The NVIDIA card also has no luck competing with the CPU at bcrypt yet:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl
OpenCL platform 0: NVIDIA CUDA, 1 device(s).
Using device 0: GeForce GTX 570
****Please see 'opencl_bf_std.h' for device specific optimizations****
Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
Raw:    1137 c/s real, 1137 c/s virtual

Some tuning could provide better numbers, but they stay a lot lower than
the CPU's and HD 7970's anyway (for the current code).

Some other GPU benchmarks where I think we achieve decent performance
(not exactly the best, but on par with competing tools that had GPU
support for longer):

GTX 570 1600 MHz:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=phpass-cuda
Benchmarking: phpass MD5 ($P$9 lengths 1 to 15) [CUDA]... DONE
Raw:    510171 c/s real, 507581 c/s virtual

HD 7970 925 MHz (stock):

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=mscash2-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Optimal Work Group Size:256
Kernel Execution Speed (Higher is better):1.403044
Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
Raw:    92467 c/s real, 92142 c/s virtual

1225 MHz:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=mscash2-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Optimal Work Group Size:128
Kernel Execution Speed (Higher is better):1.856949
Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
Raw:    121644 c/s real, 121644 c/s virtual

(would overheat if actually used? this is not bcrypt anymore)

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=rar
OpenCL platform 0: NVIDIA CUDA, 1 device(s).
Using device 0: GeForce GTX 570
Optimal keys per crypt 32768
(to avoid this test on next run, put "rar_GWS = 32768" in john.conf, section [Options:OpenCL])
Local worksize (LWS) 64, Global worksize (GWS) 32768
Benchmarking: RAR3 SHA-1 AES (6 characters) [OpenCL]... (8xOMP) DONE
Raw:    4380 c/s real, 4334 c/s virtual

The HD 7970 card is back to stock clocks here:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=rar -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Optimal keys per crypt 65536
(to avoid this test on next run, put "rar_GWS = 65536" in john.conf, section [Options:OpenCL])
Local worksize (LWS) 64, Global worksize (GWS) 65536
Benchmarking: RAR3 SHA-1 AES (6 characters) [OpenCL]... (8xOMP) DONE
Raw:    7162 c/s real, 468114 c/s virtual

WPA-PSK, on CPU:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk
Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [32/64]... (8xOMP) DONE
Raw:    1980 c/s real, 247 c/s virtual

(no SIMD yet; could do several times faster).  CUDA:

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk-cuda
Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [CUDA]... (8xOMP) DONE
Raw:    32385 c/s real, 16695 c/s virtual

OpenCL on the faster card (stock clock):

user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Max local work size 256
Optimal local work size = 256
Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [OpenCL]... (8xOMP) DONE
Raw:    55138 c/s real, 42442 c/s virtual

27x speedup over CPU here, although presumably the CPU code is further
from optimal.
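
As background on why WPA-PSK counts as a "slow" non-hash: the
"PBKDF2-HMAC-SHA-1" in the format name refers to the standard WPA/WPA2
passphrase-to-key mapping, which runs 4096 iterations of PBKDF2 with the SSID
as salt to produce a 256-bit key.  Here is a minimal Python sketch of that
step using the standard library.  It is not JtR's code and it covers only the
key derivation, not the subsequent check against a captured handshake; the
function name and the sample inputs are made up for this example.

```python
import hashlib

WPA_PBKDF2_ITERATIONS = 4096
PMK_LENGTH_BYTES = 32  # 256-bit pairwise master key


def wpa_psk_pmk(passphrase: str, ssid: str) -> bytes:
    """Derive the pairwise master key from a passphrase and an SSID."""
    return hashlib.pbkdf2_hmac(
        "sha1",
        passphrase.encode("utf-8"),
        ssid.encode("utf-8"),  # the network name serves as the salt
        WPA_PBKDF2_ITERATIONS,
        PMK_LENGTH_BYTES,
    )


if __name__ == "__main__":
    # Hypothetical passphrase and SSID, for illustration only.
    print(wpa_psk_pmk("correct horse", "ExampleNet").hex())
```

Since SHA-1 yields 20 bytes per block, a 32-byte key takes two PBKDF2 blocks,
or 8192 HMAC-SHA-1 computations per candidate passphrase.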

...Hey, what are you doing here?  That message was way too long, you
couldn't possibly read this far.  I'll just presume you scrolled to the
end.  There's good stuff you have missed above, so please scroll up. ;-)

As usual, feedback is welcome on the john-users list.  I realize that
we're currently missing usage instructions for much of the new stuff, so
please just ask on john-users - and try to make your questions specific.
That way, code contributors will also be prompted/forced to contribute
documentation, and we'll get it under doc/ and on the wiki - in fact,
you can contribute to that too.

Alexander