Date: Fri, 29 Jun 2012 21:56:02 +0200
From: newangels newangels <contact.newangels@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: John the Ripper 1.7.9-jumbo-6

Hi,

Very nice news, and so many improvements! Thanks a lot to all of you
for the effort and time.

I just tried to compile a GPU-enabled build on Mac OS X Lion, but
unfortunately got an error.

I ran:

make macosx-x86-64-opencl

and I got:

make[1]: *** [common_opencl_pbkdf2.o] Error 1
make: *** [macosx-x86-64-opencl] Error 2

System information:

MacBook Pro 17" / ATI 6750M - 1 GB / SSD - OS X Lion

Can any of you help me with this issue?

Thanks a lot in advance,

Regards,

Donovan

2012/6/29, Solar Designer <solar@...nwall.com>:
> Hi,
>
> We've released John the Ripper 1.7.9-jumbo-6 earlier today.  This is a
> "community-enhanced" version, which includes many contributions from JtR
> community members - in fact, that's what it primarily consists of.  It's
> been half a year since 1.7.9-jumbo-5, which is a lot of time, and a lot
> has been added to jumbo since then.  Even though it's just a one digit
> change in the version number, this is in fact the biggest single jumbo
> update we've made so far.  It appears that between -5 and -6 the source
> code grew by over 1 MB, or by over 40,000 lines of code (and that's not
> including lines that were changed as opposed to added).  The biggest new
> thing is integrated GPU support, both CUDA and OpenCL - although for a
> subset of the hash and non-hash types only, not for all that are
> supported on CPU.  (Also, it is efficient only for so-called "slow"
> hashes now, and for the "non-hashes" that we chose to support on GPU.
> For "fast" hashes, it is just a development milestone, albeit a
> desirable one as well.)  The other biggest new thing is the addition of
> support for many more "non-hashes" and hashes (see below).
>
> You may download John the Ripper 1.7.9-jumbo-6 at the usual place:
>
> http://www.openwall.com/john/
>
> With so many changes, even pushing this release out was difficult.
> Despite the statement that "jumbo is buggy by definition", we did try
> to eliminate as many bugs as we reasonably could - but after a week of
> mad testing and bug-fixing, I chose to release the tree as-is, only
> documenting the remaining known bugs (below and in doc/BUGS).  Still, we
> ended up posting over 1200 messages to john-dev in June - even though in
> prior months we did not even hit 500.  Indeed, we did run plenty of
> tests and fix plenty of bugs, which you won't see in this release.
>
> I've included a lengthy description of some of the changes below, and
> below that I'll add some benchmark results that I find curious (such as
> for bcrypt on CPU vs. GPU).
>
> Direct code contributors to 1.7.9-jumbo-6 (since 1.7.9-jumbo-5), by
> commit count:
>
> magnum
> Dhiru Kholia
> Frank Dittrich
> JimF (Jim Fougeron)
> myrice (Dongdong Li)
> Claudio Andre
> Lukas Odzioba
> Solar Designer
> Sayantan Datta
> Samuele Giovanni Tonon
> Tavis Ormandy
> bartavelle (Simon Marechal)
> Sergey V
> bizonix
> Robert Veznaver
> Andras
>
> New non-hashes:
> * Mac OS X keychains [OpenMP]  (Dhiru)
>   - based on research from extractkeychain.py by Matt Johnston
> * KeePass 1.x files [OpenMP]  (Dhiru)
>   - keepass2john is based on ideas from kppy by Karsten-Kai Koenig
>     http://gitorious.org/kppy/kppy
> * Password Safe [OpenMP, CUDA, OpenCL]  (Dhiru, Lukas)
> * ODF files [OpenMP]  (Dhiru)
> * Office 2007/2010 documents [OpenMP]  (Dhiru)
>   - office2john is based on test-dump-msole.c by Jody Goldberg and
>   OoXmlCrypto.cs by Lyquidity Solutions Limited
> * Mozilla Firefox, Thunderbird, SeaMonkey master passwords [OpenMP]  (Dhiru)
>   - based on FireMaster and FireMasterLinux
>     http://code.google.com/p/rainbowsandpwnies/wiki/FiremasterLinux
> * RAR -p mode encrypted archives  (magnum)
>   - RAR -hp mode was supported previously, now both modes are
>
> New challenge/responses, MACs:
> * WPA-PSK [OpenMP, CUDA, OpenCL]  (Lukas, Solar)
>   - CPU code is loosely based on Aircrack-ng
>     http://www.aircrack-ng.org
>     http://openwall.info/wiki/john/WPA-PSK
> * VNC challenge/response authentication [OpenMP]  (Dhiru)
>   - based on VNCcrack by Jack Lloyd
>     http://www.randombit.net/code/vnccrack/
> * SIP challenge/response authentication [OpenMP]  (Dhiru)
>   - based on SIPcrack by Martin J. Muench
> * HMAC-SHA-1, HMAC-SHA-224, HMAC-SHA-256, HMAC-SHA-384, HMAC-SHA-512  (magnum)
>
> New hashes:
> * IBM RACF [OpenMP]  (Dhiru)
>   - thanks to Nigel Pentland (author of CRACF) and Main Framed for
>   providing algorithm details, sample code, sample RACF binary database,
>   test vectors
> * sha512crypt (SHA-crypt) [OpenMP, CUDA, OpenCL]  (magnum, Lukas, Claudio)
>   - previously supported in 1.7.6+ only via "generic crypt(3)" interface
> * sha256crypt (SHA-crypt) [OpenMP, CUDA]  (magnum, Lukas)
>   - previously supported in 1.7.6+ only via "generic crypt(3)" interface
> * DragonFly BSD SHA-256 and SHA-512 based hashes [OpenMP]  (magnum)
> * Django 1.4 [OpenMP]  (Dhiru)
> * Drupal 7 $S$ phpass-like (based on SHA-512) [OpenMP]  (magnum)
> * WoltLab Burning Board 3 [OpenMP]  (Dhiru)
> * New EPiServer default (based on SHA-256) [OpenMP]  (Dhiru)
> * GOST R 34.11-94 [OpenMP]  (Dhiru, Sergey V, JimF)
> * MD4 support in "dynamic" hashes (user-configurable)  (JimF)
>   - previously, only MD5 and SHA-1 were supported in "dynamic"
> * Raw-SHA1-LinkedIn (raw SHA-1 with first 20 bits zeroed)  (JimF)
>
> Alternate implementations for previously supported hashes:
> * Faster raw SHA-1 (raw-sha1-ng, password length up to 15)  (Tavis)
>
> OpenMP support in new formats:
> * Mac OS X keychains  (Dhiru)
> * KeePass 1.x files  (Dhiru)
> * Password Safe  (Lukas)
> * ODF files  (Dhiru)
> * Office 2007/2010 documents  (Dhiru)
> * Mozilla Firefox, Thunderbird, SeaMonkey master passwords  (Dhiru)
> * WPA-PSK  (Solar)
> * VNC challenge/response authentication  (Dhiru)
> * SIP challenge/response authentication  (Dhiru)
> * IBM RACF  (Dhiru)
> * DragonFly BSD SHA-256 and SHA-512 based hashes  (magnum)
> * Django 1.4  (Dhiru)
> * Drupal 7 $S$ phpass-like (based on SHA-512)  (magnum)
> * WoltLab Burning Board 3  (Dhiru)
> * New EPiServer default (based on SHA-256)  (Dhiru)
> * GOST R 34.11-94  (Dhiru, JimF)
>
> OpenMP support for previously supported hashes that lacked it:
> * Mac OS X 10.4 - 10.6 salted SHA-1  (magnum)
> * DES-based tripcodes  (Solar)
> * Invision Power Board 2.x salted MD5  (magnum)
> * HTTP Digest access authentication MD5  (magnum)
> * MySQL (old)  (Solar)
>
> CUDA support for:
> * phpass MD5-based "portable hashes"  (Lukas)
> * md5crypt (FreeBSD-style MD5-based crypt(3) hashes)  (Lukas)
> * sha512crypt (glibc 2.7+ SHA-crypt)  (Lukas)
> * sha256crypt (glibc 2.7+ SHA-crypt)  (Lukas)
> * Password Safe  (Lukas)
> * WPA-PSK  (Lukas)
> * Raw SHA-224, raw SHA-256 [inefficient]  (Lukas)
> * MSCash (DCC) [not working reliably yet]  (Lukas)
> * MSCash2 (DCC2) [not working reliably yet]  (Lukas)
> * Raw SHA-512 [not working reliably yet]  (myrice)
> * Mac OS X 10.7 salted SHA-512 [not working reliably yet]  (myrice)
>   - we have already identified the problem with the above two, and a post
>   1.7.9-jumbo-6 fix should be available shortly - please ask on john-users
>   if interested in trying it out
>
> OpenCL support for:
> * phpass MD5-based "portable hashes"  (Lukas)
> * md5crypt (FreeBSD-style MD5-based crypt(3) hashes)  (Lukas)
> * sha512crypt (glibc 2.7+ SHA-crypt)  (Claudio)
>   - suitable for NVIDIA cards, faster than the CUDA implementation above
>   http://openwall.info/wiki/john/OpenCL-SHA-512
> * bcrypt (OpenBSD-style Blowfish-based crypt(3) hashes)  (Sayantan)
>   - pre-configured for AMD Radeon HD 7970, will likely fail on others
>   unless WORK_GROUP_SIZE is adjusted in opencl_bf_std.h and
>   opencl/bf_kernel.cl;
>   the achieved level of performance is CPU-like (bcrypt is known to be
>   somewhat GPU-unfriendly - a lot more than SHA-512)
>   http://openwall.info/wiki/john/GPU/bcrypt
> * MSCash2 (DCC2)  (Sayantan)
>   - with optional and experimental multi-GPU support as a compile-time hack
>   (even AMD+NVIDIA mix), by editing init() in opencl_mscash2_fmt.c
> * Password Safe  (Lukas)
> * WPA-PSK  (Lukas)
> * RAR  (magnum)
> * MySQL 4.1 double-SHA-1 [inefficient]  (Samuele)
> * Netscape LDAP salted SHA-1 (SSHA) [inefficient]  (Samuele)
> * NTLM [inefficient]  (Samuele)
> * Raw MD5 [inefficient]  (Dhiru, Samuele)
> * Raw SHA-1 [inefficient]  (Samuele)
> * Raw SHA-512 [not working properly yet]  (myrice)
> * Mac OS X 10.7 salted SHA-512 [not working properly yet]  (myrice)
>   - we have already identified the problem with the above two, and a post
>   1.7.9-jumbo-6 fix should be available shortly - please ask on john-users
>   if interested in trying it out
>
> Several of these require byte-addressable store (any NVIDIA card, but
> only 5000 series or newer if AMD/ATI).  Also, OpenCL kernels for "slow"
> hashes/non-hashes (e.g. RAR) may cause "ASIC hang" on certain AMD/ATI
> cards with recent driver versions.  We'll try to address these issues in
> a future version.
>
> AMD XOP (Bulldozer) support added for:
> * Many hashes based on MD4, MD5, SHA-1  (Solar)
>
> Uses of SIMD (MMX assembly, SSE2/AVX/XOP intrinsics) added for:
> * Mac OS X 10.4 - 10.6 salted SHA-1  (magnum)
> * Invision Power Board 2.x salted MD5  (magnum)
> * HTTP Digest access authentication MD5  (magnum)
> * SAP CODVN B (BCODE) MD5  (magnum)
> * SAP CODVN F/G (PASSCODE) SHA-1  (magnum)
> * Oracle 11  (magnum)
>
> Other optimizations:
> * Reduced memory usage for raw-md4, raw-md5, raw-sha1, and nt2  (magnum)
> * Prefer CommonCrypto over OpenSSL on Mac OS X 10.7  (Dhiru)
> * New SSE2 intrinsics code for SHA-1  (JimF, magnum)
> * Smarter use of SSE2 and SSSE3 intrinsics (the latter only if enabled in
> the compiler at build time) to implement some bit rotates for MD5, SHA-1
> (Solar)
> * Assorted optimizations for raw SHA-1 and HMAC-MD5  (magnum)
> * In RAR format, added inline storing of RAR data in JtR input file when
> the original file is small enough  (magnum)
> * Added use of the bitslice DES implementation for tripcodes  (Solar)
> * Raw-MD5-unicode made "thick" again (that is, not building upon
> "dynamic"), using much faster code  (magnum)
> * Assorted performance tweaks in "salted-sha1" (SSHA)  (magnum)
> * Added functions for larger hash tables to several formats  (magnum, Solar)
>
> Other assorted enhancements:
> * linux-*-gpu (both CUDA and OpenCL at once), linux-*-cuda, linux-*-opencl,
> macosx-x86-64-opencl make targets  (magnum et al.)
> * linux-*-native make targets (pass -march=native to gcc)  (magnum)
> * New option: --dupe-suppression (for wordlist mode)  (magnum)
> * New option: --loopback[=FILE] (implies --dupe-suppression)  (magnum)
> * New option: --max-run-time=N for graceful exit after N seconds  (magnum)
> * New option: --log-stderr  (magnum)
> * New option: --regenerate-lost-salts=N for cracking hashes where we do not
> have the salt and essentially need to crack it as well  (JimF)
> * New unlisted option: --list (for bash completion, GUI, etc.)  (magnum)
> * --list=[encodings|opencl-devices]  (magnum)
> * --list=cuda-devices  (Lukas)
> * --list=format-details  (Frank)
> * --list=subformats  (magnum)
> * New unlisted option: --length=N for reducing maximum plaintext length of
> a format, mostly for testing purposes  (magnum)
> * Enhanced parameter syntax for --markov: may refer to a configuration file
> section, may specify the start and/or end in percent of total  (Frank)
> * Make incremental mode restore ETA figures  (JimF)
> * In "dynamic", support NUL octets in constants  (JimF)
> * In "salted-sha1" (SSHA), support any salt length  (magnum)
> * Use comment and home directory fields from PWDUMP-style input  (magnum)
> * Sort the format names list in "john" usage output alphabetically  (magnum)
> * New john.conf options subsection "MPI"  (magnum)
> * New john.conf config item CrackStatus under Options:Jumbo  (magnum)
> * \xNN escape sequence to specify arbitrary characters in rules  (JimF)
> * New rule command _N to reject a word unless it is of length N  (JimF)
> * Extra wordlist rule sections: Extra, Single-Extra, Jumbo  (magnum)
> * Enhanced "Double" external mode sample  (JimF)
> * Source $JOHN/john.local.conf by default  (magnum)
> * Many format and algorithm names have been changed for consistency  (Solar)
> * When intrinsics are in use, the reported algorithm name now tells which
> ones (SSE2, AVX, or XOP)  (Solar)
> * benchmark-unify: a Perl script to unify benchmark output of different
> versions of JtR for use with relbench  (Frank)
> * Per-benchmark speed ratio output added to relbench  (Frank)
> * bash completion for JtR (to install: "sudo make bash-completion")  (Frank)
> * New program: raw2dyna (helper to convert raw hashes to "dynamic")  (JimF)
> * New program: pass_gen.pl (generates hashes from plaintexts)  (JimF, magnum)
> * Many code changes made, many bugs fixed, many new bugs introduced  (all)
>
> Now the promised benchmarks.  Here's 1.7.9-jumbo-5 to 1.7.9-jumbo-6
> overall speed change on one core in FX-8120 (should be 4.0 GHz turbo),
> after running through benchmark-unify and relbench (yet about 50 of the
> new version's benchmark results could not be directly compared against
> results of the previous version, and thus are excluded):
>
> Number of benchmarks:           151
> Minimum:                        0.84668 real, 0.84668 virtual
> Maximum:                        10.92416 real, 10.92416 virtual
> Median:                         1.10800 real, 1.10800 virtual
> Median absolute deviation:      0.12531 real, 0.12369 virtual
> Geometric mean:                 1.26217 real, 1.26284 virtual
> Geometric standard deviation:   1.47239 real, 1.47274 virtual
>
> Ditto for OpenMP-enabled builds (8 threads, should be 3.1 GHz):
>
> Number of benchmarks:           151
> Minimum:                        0.94616 real, 0.48341 virtual
> Maximum:                        24.19709 real, 4.29610 virtual
> Median:                         1.17609 real, 1.05964 virtual
> Median absolute deviation:      0.17436 real, 0.11465 virtual
> Geometric mean:                 1.35493 real, 1.17097 virtual
> Geometric standard deviation:   1.71505 real, 1.36577 virtual
>
> These show that overall we do indeed have a speedup, and that's without
> any GPU stuff.
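
The summary rows above can be approximated from a plain list of per-benchmark speed ratios. The following is a minimal sketch, assuming the usual textbook definitions of the geometric mean and geometric standard deviation; the exact formulas relbench uses are not stated in this message, and the sample ratios are made up for illustration only.

```python
import math
import statistics

def geometric_mean(ratios):
    """Geometric mean: exp of the arithmetic mean of the logs."""
    logs = [math.log(r) for r in ratios]
    return math.exp(sum(logs) / len(logs))

def geometric_std(ratios):
    """Geometric standard deviation (population form, an assumption)."""
    logs = [math.log(r) for r in ratios]
    mean = sum(logs) / len(logs)
    variance = sum((x - mean) ** 2 for x in logs) / len(logs)
    return math.exp(math.sqrt(variance))

if __name__ == "__main__":
    # Hypothetical new/old speed ratios, NOT taken from the release notes.
    sample = [0.85, 1.0, 1.1, 1.25, 2.0]
    print("Median:                      ", statistics.median(sample))
    print("Geometric mean:              ", round(geometric_mean(sample), 5))
    print("Geometric standard deviation:", round(geometric_std(sample), 5))
```

A geometric mean suits speed ratios because a 2x speedup and a 2x slowdown average out to exactly 1.0, which an arithmetic mean would not give.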
>
> Also curious is speedup due to OpenMP in 1.7.9-jumbo-6 (same version in
> both cases), on the same CPU (8 threads):
>
> Number of benchmarks:           202
> Minimum:                        0.76235 real, 0.09553 virtual
> Maximum:                        30.51791 real, 3.81904 virtual
> Median:                         1.01479 real, 0.98287 virtual
> Median absolute deviation:      0.02747 real, 0.03514 virtual
> Geometric mean:                 1.71441 real, 0.77454 virtual
> Geometric standard deviation:   2.08823 real, 1.50966 virtual
>
> The 30x maximum speedup (with only 8 threads) is indeed abnormal; it is
> for:
>
> Ratio:  30.51791 real, 3.81904 virtual  SIP MD5:Raw
>
> We'll correct the non-OpenMP performance for SIP in the next version.
> For the rest, the maximum speedup is 6.13x for SSH, which is great
> (considering that the CPU clock rate reduces with more threads running,
> and that this is a 4-module CPU rather than a true 8-core).  Here are
> the top 10 OpenMP performers (excluding SIP):
>
> Ratio:  6.13093 real, 0.77210 virtual   SSH RSA/DSA (one 2048-bit RSA and one 1024-bit DSA key):Raw
> Ratio:  6.05882 real, 0.75737 virtual   NTLMv2 C/R MD4 HMAC-MD5:Many salts
> Ratio:  6.04342 real, 0.75548 virtual   LMv2 C/R MD4 HMAC-MD5:Many salts
> Ratio:  5.92830 real, 0.74108 virtual   GOST R 34.11-94:Raw
> Ratio:  5.81605 real, 0.73986 virtual   sha256crypt (rounds=5000):Raw
> Ratio:  5.65289 real, 0.70523 virtual   sha512crypt (rounds=5000):Raw
> Ratio:  5.63333 real, 0.72034 virtual   Drupal 7 $S$ SHA-512 (x16385):Raw
> Ratio:  5.56435 real, 0.69937 virtual   OpenBSD Blowfish (x32):Raw
> Ratio:  5.50484 real, 0.69682 virtual   Password Safe SHA-256:Raw
> Ratio:  5.49613 real, 0.68814 virtual   Sybase ASE salted SHA-256:Many salts
>
> The worst regression is for:
>
> Ratio:  0.76235 real, 0.09553 virtual   LM DES:Raw
>
> It is known that our current LM hash code does not scale well, and is
> very fast even with one thread (close to the bottleneck of the current
> interface).  It is in fact better not to use OpenMP for LM hashes yet,
> or to keep the thread count low (e.g., 4 would behave better than 8).
> The low median and mean speedup are because many hashes still lack
> OpenMP support - mostly the "fast" ones, where we'd bump into the
> bottleneck anyway.  We might deal with this later.  For "slow" hashes,
> the speedup with OpenMP is close to perfect (5x to 6x for this CPU).
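
To put the "real" ratios above in perspective, one can divide each OpenMP speedup by the thread count to get a parallel efficiency figure. This is a minimal sketch; the function name is hypothetical, and the three ratios are copied from the results quoted above for 8 threads.

```python
def parallel_efficiency(real_speedup, threads):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return real_speedup / threads

# "real" OpenMP-vs-single-thread ratios quoted in the message (8 threads).
RESULTS = {
    "SSH RSA/DSA": 6.13093,
    "OpenBSD Blowfish (x32)": 5.56435,
    "LM DES": 0.76235,
}
THREADS = 8

if __name__ == "__main__":
    for name, speedup in RESULTS.items():
        eff = parallel_efficiency(speedup, THREADS)
        print(f"{name:24s} {speedup:8.5f}x  efficiency {eff:.1%}")
```

An efficiency near 75% for the best formats is in line with the remark that this is a 4-module CPU whose clock rate drops as more threads run, while the LM DES figure shows a net slowdown.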
>
> Now to the new stuff.  The effect of XOP (make linux-x86-64-xop):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=md5
> Benchmarking: FreeBSD MD5 [128/128 XOP intrinsics 8x]... (8xOMP) DONE
> Raw:    204600 c/s real, 25625 c/s virtual
>
> -5 achieved at most:
>
> user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=md5
> Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
> Raw:    158208 c/s real, 19751 c/s virtual
>
> with "make linux-x86-64i" (icc precompiled SSE2 intrinsics), and only:
>
> user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=md5
> Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
> Raw:    141312 c/s real, 17664 c/s virtual
>
> with "make linux-x86-64-xop" because it did not yet use XOP for MD5 (nor
> for MD4 and SHA-1), only knowing how to use it for DES (which it did).
>
> So we got an over 20% speedup due to XOP here.
>
> Similarly, for raw SHA-1 best result with -5:
>
> user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=raw-sha1
> Benchmarking: Raw SHA-1 [SSE2i 8x]... DONE
> Raw:    13067K c/s real, 13067K c/s virtual
>
> whereas -6 does, with JimF's and magnum's optimizations and with XOP:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=raw-sha1
> Benchmarking: Raw SHA-1 [128/128 XOP intrinsics 8x]... DONE
> Raw:    23461K c/s real, 23698K c/s virtual
>
> and with Tavis' contribution:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=raw-sha1-ng
> Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 XOP intrinsics 4x]... DONE
> Raw:    28024K c/s real, 28024K c/s virtual
>
> So that's an over 2x speedup if we can accept the length 15 limit, or
> an almost 80% speedup otherwise.
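
The speedups quoted above can be re-derived from the benchmark lines themselves. The sketch below parses the "c/s real" figure from a result line and compares two runs; it assumes that the K suffix means a multiplier of 1000, and the function names are made up for illustration.

```python
import re

# Matches lines such as "Raw:    13067K c/s real, 13067K c/s virtual",
# optionally preceded by an email quote marker ("> ").
_RATE = re.compile(
    r"^\s*(?:>\s*)?(?P<label>[^:]+):\s+"
    r"(?P<value>\d+(?:\.\d+)?)(?P<suffix>[KMG]?) c/s real"
)
# Assumption: suffixes are decimal multipliers.
_MULTIPLIER = {"": 1, "K": 1_000, "M": 1_000_000, "G": 1_000_000_000}

def parse_real_cps(line):
    """Return the 'real' candidates-per-second figure as an integer."""
    match = _RATE.match(line)
    if match is None:
        raise ValueError(f"not a benchmark result line: {line!r}")
    value = float(match.group("value"))
    return int(round(value * _MULTIPLIER[match.group("suffix")]))

def speedup(before_cps, after_cps):
    """Ratio of the new speed to the old one."""
    return after_cps / before_cps

if __name__ == "__main__":
    old = parse_real_cps("Raw:    13067K c/s real, 13067K c/s virtual")
    new = parse_real_cps("Raw:    23461K c/s real, 23698K c/s virtual")
    ng = parse_real_cps("Raw:    28024K c/s real, 28024K c/s virtual")
    print(f"raw-sha1:    {speedup(old, new):.2f}x")
    print(f"raw-sha1-ng: {speedup(old, ng):.2f}x")
```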
>
> Note: all of the raw SHA-1 benchmarks above are for one CPU core, not
> for the entire chip (no OpenMP for fast hashes like this yet, but
> there's MPI and there are always separate process invocations...)
>
> To more important stuff, sha512crypt on CPU vs. GPU:
>
> For reference, here's what we would get with the previous version, using
> the glibc implementation of SHA-crypt:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=crypt -sub=sha512crypt
> Benchmarking: generic crypt(3) SHA-512 rounds=5000 [?/64]... (8xOMP) DONE
> Many salts:     1518 c/s real, 189 c/s virtual
> Only one salt:  1515 c/s real, 189 c/s virtual
>
> Now we also have a built-in implementation, although it nevertheless uses
> OpenSSL for the SHA-512 primitive (it doesn't have its own SHA-512 yet -
> adding that and making use of SIMD would provide much additional
> speedup, this is a to-do item for us):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt
> Benchmarking: sha512crypt (rounds=5000) [64/64]... (8xOMP) DONE
> Raw:    2045 c/s real, 256 c/s virtual
>
> So it is about 35% faster.  Let's try GPUs, first GTX 570 1600 MHz
> (a card that is vendor-overclocked to that frequency):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-cuda
> Benchmarking: sha512crypt (rounds=5000) [CUDA]... DONE
> Raw:    3833 c/s real, 3833 c/s virtual
>
> Another 2x speedup here, but that's still not it.  Let's see:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl
> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
> Using device 0: GeForce GTX 570
> Building the kernel, this could take a while
> Local work size (LWS) 512, global work size (GWS) 7680
> Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
> Raw:    11405 c/s real, 11349 c/s virtual
>
> And now this is it - Claudio's OpenCL code is really good on NVIDIA,
> giving us a 5.5x speedup over CPU.  (SHA-512 is not as GPU-friendly as
> e.g. MD5, but is friendly enough for some decent speedup.)
>
> Let's also try AMD Radeon HD 7970 (normally a faster card), at stock
> clocks:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Building the kernel, this could take a while
> Elapsed time: 17 seconds
> Local work size (LWS) 32, global work size (GWS) 16384
> Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
> Raw:    5144 c/s real, 3276K c/s virtual
>
> Not as much luck here yet.  Finally, for comparison and to show how any
> one of the three OpenCL devices may be accessed from john's command-line
> with --platform and --device options, the same OpenCL code on the CPU:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl -pla=1 -dev=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 1: AMD FX(tm)-8120 Eight-Core Processor
> Local work size (LWS) 1, global work size (GWS) 1024
> Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
> Raw:    1850 c/s real, 233 c/s virtual
>
> This shows that the code is indeed pretty efficient - almost reaching
> OpenSSL's specialized code speed.
>
> Now to bcrypt.  This CPU is pretty good at it:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf
> Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... (8xOMP) DONE
> Raw:    5300 c/s real, 664 c/s virtual
>
> (FWIW, with overclocking I was able to get this to about 5650 c/s, but
> not more - bumping into 125 W TDP.  The above is at stock clocks.)
>
> This is for "$2a$05" or only 32 iterations, which is used as baseline
> for benchmarks for historical reasons.  Actual systems often use
> "$2a$08" (8 times slower) to "$2a$10" (32 times slower) these days.
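
The relationship between the bcrypt cost setting and cracking speed described above can be written down directly: the iteration count is 2 raised to the cost, so each step up halves the speed. This is a minimal sketch with hypothetical function names; the projected speeds assume that throughput scales inversely with the iteration count and ignore any fixed per-hash overhead.

```python
def cost_from_prefix(hash_prefix):
    """Extract the cost from a prefix such as "$2a$08"."""
    parts = hash_prefix.split("$")
    if len(parts) < 3 or not parts[2].isdigit():
        raise ValueError(f"unrecognized bcrypt prefix: {hash_prefix!r}")
    return int(parts[2])

def bcrypt_iterations(cost):
    """Number of iterations for a given cost: 2**cost."""
    return 1 << cost

def slowdown(cost, baseline=5):
    """How many times slower than the benchmark baseline ("$2a$05")."""
    return bcrypt_iterations(cost) / bcrypt_iterations(baseline)

def projected_cps(benchmark_cps, cost, baseline=5):
    """Rough speed estimate at another cost (assumes linear scaling)."""
    return benchmark_cps / slowdown(cost, baseline)

if __name__ == "__main__":
    cpu_cps = 5300  # "real" c/s from the CPU benchmark above, at "$2a$05"
    for prefix in ("$2a$05", "$2a$08", "$2a$10"):
        cost = cost_from_prefix(prefix)
        print(prefix, bcrypt_iterations(cost), "iterations,",
              f"{slowdown(cost):g}x slower,",
              f"about {projected_cps(cpu_cps, cost):.0f} c/s")
```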
>
> Anyway, the reference cracking speed for bcrypt above is higher than the
> speed for sha512crypt on the same CPU (with the current code at least,
> which admittedly can be optimized much further).  Can we make it even
> higher on a GPU?  Maybe, but not yet, not with the current code:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> ****Please see 'opencl_bf_std.h' for device specific optimizations****
> Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
> Raw:    4143 c/s real, 238933 c/s virtual
>
> user@...l:~/john-1.7.9-jumbo-6/run$ DISPLAY=:0 aticonfig --od-enable --od-setclocks=1225,1375
> AMD Overdrive(TM) enabled
>
> Default Adapter - AMD Radeon HD 7900 Series
>                   New Core Peak   : 1225
>                   New Memory Peak : 1375
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> ****Please see 'opencl_bf_std.h' for device specific optimizations****
> Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
> Raw:    5471 c/s real, 358400 c/s virtual
>
> It's only with a 30% overclock that the high-end GPU gets to the same
> level of performance as the 2-3 times cheaper CPU.  BTW, the GPU stays
> cool with this overclock (73 C with stock cooling when running bf-opencl
> for a while), precisely because we have to heavily under-utilize it due
> to it not having enough local memory to accommodate as many parallel
> bcrypt computations as we'd need for full occupancy and to hide memory
> access latencies.
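
A quick check of the overclocking numbers above: the sketch below compares the increase in core clock (925 MHz stock to 1225 MHz) with the increase in measured bf-opencl speed (4143 c/s to 5471 c/s). The figures come from the benchmarks quoted in this message; the function name is hypothetical.

```python
def relative_gain(before, after):
    """Fractional increase from `before` to `after` (0.30 means +30%)."""
    return after / before - 1.0

STOCK_MHZ, OVERCLOCK_MHZ = 925, 1225   # HD 7970 core clock
STOCK_CPS, OVERCLOCK_CPS = 4143, 5471  # bf-opencl "real" c/s
CPU_CPS = 5300                         # FX-8120 bcrypt "real" c/s

if __name__ == "__main__":
    clock_gain = relative_gain(STOCK_MHZ, OVERCLOCK_MHZ)
    speed_gain = relative_gain(STOCK_CPS, OVERCLOCK_CPS)
    print(f"Core clock gain: {clock_gain:.1%}")
    print(f"bcrypt c/s gain: {speed_gain:.1%}")
    print(f"Overclocked GPU vs. CPU: {OVERCLOCK_CPS / CPU_CPS:.2f}x")
```

Both gains come out at roughly 32%, so for this kernel the speed appears to track the core clock almost linearly.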
>
> Maybe more optimal code will achieve better results, though.
>
> The NVIDIA card also has no luck competing with the CPU at bcrypt yet:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl
> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
> Using device 0: GeForce GTX 570
> ****Please see 'opencl_bf_std.h' for device specific optimizations****
> Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
> Raw:    1137 c/s real, 1137 c/s virtual
>
> Some tuning could provide better numbers, but they stay a lot lower than
> the CPU's and HD 7970's anyway (for the current code).
>
> Some other GPU benchmarks where I think we achieve decent performance
> (not exactly the best, but on par with competing tools that had GPU
> support for longer):
>
> GTX 570 1600 MHz:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=phpass-cuda
> Benchmarking: phpass MD5 ($P$9 lengths 1 to 15) [CUDA]... DONE
> Raw:    510171 c/s real, 507581 c/s virtual
>
> HD 7970 925 MHz (stock):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=mscash2-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Optimal Work Group Size:256
> Kernel Execution Speed (Higher is better):1.403044
> Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
> Raw:    92467 c/s real, 92142 c/s virtual
>
> 1225 MHz:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=mscash2-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Optimal Work Group Size:128
> Kernel Execution Speed (Higher is better):1.856949
> Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
> Raw:    121644 c/s real, 121644 c/s virtual
>
> (would overheat if actually used? this is not bcrypt anymore)
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=rar
> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
> Using device 0: GeForce GTX 570
> Optimal keys per crypt 32768
> (to avoid this test on next run, put "rar_GWS = 32768" in john.conf, section [Options:OpenCL])
> Local worksize (LWS) 64, Global worksize (GWS) 32768
> Benchmarking: RAR3 SHA-1 AES (6 characters) [OpenCL]... (8xOMP) DONE
> Raw:    4380 c/s real, 4334 c/s virtual
>
> The HD 7970 card is back to stock clocks here:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=rar -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Optimal keys per crypt 65536
> (to avoid this test on next run, put "rar_GWS = 65536" in john.conf, section [Options:OpenCL])
> Local worksize (LWS) 64, Global worksize (GWS) 65536
> Benchmarking: RAR3 SHA-1 AES (6 characters) [OpenCL]... (8xOMP) DONE
> Raw:    7162 c/s real, 468114 c/s virtual
>
> WPA-PSK, on CPU:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk
> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [32/64]... (8xOMP) DONE
> Raw:    1980 c/s real, 247 c/s virtual
>
> (no SIMD yet; could do several times faster).  CUDA:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk-cuda
> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [CUDA]... (8xOMP) DONE
> Raw:    32385 c/s real, 16695 c/s virtual
>
> OpenCL on the faster card (stock clock):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Max local work size 256
> Optimal local work size = 256
> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [OpenCL]... (8xOMP) DONE
> Raw:    55138 c/s real, 42442 c/s virtual
>
> 27x speedup over CPU here, although presumably the CPU code is further
> from optimal.
>
> ...Hey, what are you doing here?  That message was way too long, you
> couldn't possibly read this far.  I'll just presume you scrolled to the
> end.  There's good stuff you have missed above, so please scroll up. ;-)
>
> As usual, feedback is welcome on the john-users list.  I realize that
> we're currently missing usage instructions for much of the new stuff, so
> please just ask on john-users - and try to make your questions specific.
> That way, code contributors will also be prompted/forced to contribute
> documentation, and we'll get it under doc/ and on the wiki - in fact,
> you can contribute to that too.
>
> Alexander
>
