Date: Fri, 29 Jun 2012 21:56:02 +0200
From: newangels newangels <contact.newangels@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: John the Ripper 1.7.9-jumbo-6
Hi,
Very nice news & so many improvements! Thanks a lot to all of you
for the effort & time.
I just tried to compile a GPU-enabled build on Mac OS X Lion, but
unfortunately got an error.
I run:
make macosx-x86-64-opencl
& I get:
make[1]: *** [common_opencl_pbkdf2.o] Error 1
make: *** [macosx-x86-64-opencl] Error 2
System information:
MacBook Pro 17" / ATI 6750M - 1 GB / SSD - OS X Lion
Can some of you help me with this issue?
Thanks a lot in advance,
Regards,
Donovan
2012/6/29, Solar Designer <solar@...nwall.com>:
> Hi,
>
> We've released John the Ripper 1.7.9-jumbo-6 earlier today. This is a
> "community-enhanced" version, which includes many contributions from JtR
> community members - in fact, that's what it primarily consists of. It's
> been half a year since 1.7.9-jumbo-5, which is a lot of time, and a lot
> has been added to jumbo since then. Even though it's just a one digit
> change in the version number, this is in fact the biggest single jumbo
> update we've made so far. It appears that between -5 and -6 the source
> code grew by over 1 MB, or by over 40,000 lines of code (and that's not
> including lines that were changed as opposed to added). The biggest new
> thing is integrated GPU support, both CUDA and OpenCL - although for a
> subset of the hash and non-hash types only, not for all that are
> supported on CPU. (Also, it is efficient only for so-called "slow"
> hashes now, and for the "non-hashes" that we chose to support on GPU.
> For "fast" hashes, it is just a development milestone, albeit a
> desirable one as well.) The other biggest new thing is the addition of
> support for many more "non-hashes" and hashes (see below).
>
> You may download John the Ripper 1.7.9-jumbo-6 at the usual place:
>
> http://www.openwall.com/john/
>
> With so many changes, even pushing this release out was difficult.
> Despite the statement that "jumbo is buggy by definition", we did try
> to eliminate as many bugs as we reasonably could - but after a week of
> mad testing and bug-fixing, I chose to release the tree as-is, only
> documenting the remaining known bugs (below and in doc/BUGS). Still, we
> ended up posting over 1200 messages to john-dev in June - even though in
> prior months we did not even hit 500. Indeed, we did run plenty of
> tests and fix plenty of bugs, which you won't see in this release.
>
> I've included a lengthy description of some of the changes below, and
> below that I'll add some benchmark results that I find curious (such as
> for bcrypt on CPU vs. GPU).
>
> Direct code contributors to 1.7.9-jumbo-6 (since 1.7.9-jumbo-5), by
> commit count:
>
> magnum
> Dhiru Kholia
> Frank Dittrich
> JimF (Jim Fougeron)
> myrice (Dongdong Li)
> Claudio Andre
> Lukas Odzioba
> Solar Designer
> Sayantan Datta
> Samuele Giovanni Tonon
> Tavis Ormandy
> bartavelle (Simon Marechal)
> Sergey V
> bizonix
> Robert Veznaver
> Andras
>
> New non-hashes:
> * Mac OS X keychains [OpenMP] (Dhiru)
> - based on research from extractkeychain.py by Matt Johnston
> * KeePass 1.x files [OpenMP] (Dhiru)
> - keepass2john is based on ideas from kppy by Karsten-Kai Koenig
> http://gitorious.org/kppy/kppy
> * Password Safe [OpenMP, CUDA, OpenCL] (Dhiru, Lukas)
> * ODF files [OpenMP] (Dhiru)
> * Office 2007/2010 documents [OpenMP] (Dhiru)
> - office2john is based on test-dump-msole.c by Jody Goldberg and
> OoXmlCrypto.cs by Lyquidity Solutions Limited
> * Mozilla Firefox, Thunderbird, SeaMonkey master passwords [OpenMP]
> (Dhiru)
> - based on FireMaster and FireMasterLinux
> http://code.google.com/p/rainbowsandpwnies/wiki/FiremasterLinux
> * RAR -p mode encrypted archives (magnum)
> - RAR -hp mode was supported previously, now both modes are
>
> New challenge/responses, MACs:
> * WPA-PSK [OpenMP, CUDA, OpenCL] (Lukas, Solar)
> - CPU code is loosely based on Aircrack-ng
> http://www.aircrack-ng.org
> http://openwall.info/wiki/john/WPA-PSK
> * VNC challenge/response authentication [OpenMP] (Dhiru)
> - based on VNCcrack by Jack Lloyd
> http://www.randombit.net/code/vnccrack/
> * SIP challenge/response authentication [OpenMP] (Dhiru)
> - based on SIPcrack by Martin J. Muench
> * HMAC-SHA-1, HMAC-SHA-224, HMAC-SHA-256, HMAC-SHA-384, HMAC-SHA-512
> (magnum)
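The HMAC variants added here all use the standard RFC 2104 construction,
H((K xor opad) || H((K xor ipad) || m)). A quick stdlib sketch (key and
message are made up, not from the release) showing both the library
one-liner and the construction these formats actually compute:

```python
import hashlib
import hmac

key = b"secret-key"
msg = b"challenge data"

# Library one-liner for each newly supported variant.
for alg in ("sha1", "sha224", "sha256", "sha384", "sha512"):
    print(alg, hmac.new(key, msg, getattr(hashlib, alg)).hexdigest())

# Manual RFC 2104 construction (SHA-256 case) for comparison.
block_size = 64  # bytes for SHA-1/224/256 (SHA-384/512 use 128)
k = key.ljust(block_size, b"\x00")  # short keys are zero-padded
ipad = bytes(b ^ 0x36 for b in k)
opad = bytes(b ^ 0x5C for b in k)
inner = hashlib.sha256(ipad + msg).digest()
manual = hashlib.sha256(opad + inner).hexdigest()
assert manual == hmac.new(key, msg, hashlib.sha256).hexdigest()
```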
>
> New hashes:
> * IBM RACF [OpenMP] (Dhiru)
> - thanks to Nigel Pentland (author of CRACF) and Main Framed for providing
> algorithm details, sample code, sample RACF binary database, test vectors
> * sha512crypt (SHA-crypt) [OpenMP, CUDA, OpenCL] (magnum, Lukas, Claudio)
> - previously supported in 1.7.6+ only via "generic crypt(3)" interface
> * sha256crypt (SHA-crypt) [OpenMP, CUDA] (magnum, Lukas)
> - previously supported in 1.7.6+ only via "generic crypt(3)" interface
> * DragonFly BSD SHA-256 and SHA-512 based hashes [OpenMP] (magnum)
> * Django 1.4 [OpenMP] (Dhiru)
> * Drupal 7 $S$ phpass-like (based on SHA-512) [OpenMP] (magnum)
> * WoltLab Burning Board 3 [OpenMP] (Dhiru)
> * New EPiServer default (based on SHA-256) [OpenMP] (Dhiru)
> * GOST R 34.11-94 [OpenMP] (Dhiru, Sergey V, JimF)
> * MD4 support in "dynamic" hashes (user-configurable) (JimF)
> - previously, only MD5 and SHA-1 were supported in "dynamic"
> * Raw-SHA1-LinkedIn (raw SHA-1 with first 20 bits zeroed) (JimF)
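The Raw-SHA1-LinkedIn format handles hashes whose first 20 bits were zeroed
out, as in the 2012 LinkedIn leak; matching then only compares the remaining
140 bits. A minimal sketch of the idea (password and candidates are made up):

```python
import hashlib

def mask20(hex_digest: str) -> str:
    """Zero the first 20 bits (5 hex digits) of a SHA-1 hex digest."""
    return "00000" + hex_digest[5:]

# Hypothetical leaked (masked) hash for the password "secret123":
leaked = mask20(hashlib.sha1(b"secret123").hexdigest())

# Cracking compares only the unmasked 140 bits:
for candidate in (b"letmein", b"secret123", b"password1"):
    if mask20(hashlib.sha1(candidate).hexdigest()) == leaked:
        print("cracked:", candidate.decode())  # prints "cracked: secret123"
```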
>
> Alternate implementations for previously supported hashes:
> * Faster raw SHA-1 (raw-sha1-ng, password length up to 15) (Tavis)
>
> OpenMP support in new formats:
> * Mac OS X keychains (Dhiru)
> * KeePass 1.x files (Dhiru)
> * Password Safe (Lukas)
> * ODF files (Dhiru)
> * Office 2007/2010 documents (Dhiru)
> * Mozilla Firefox, Thunderbird, SeaMonkey master passwords (Dhiru)
> * WPA-PSK (Solar)
> * VNC challenge/response authentication (Dhiru)
> * SIP challenge/response authentication (Dhiru)
> * IBM RACF (Dhiru)
> * DragonFly BSD SHA-256 and SHA-512 based hashes (magnum)
> * Django 1.4 (Dhiru)
> * Drupal 7 $S$ phpass-like (based on SHA-512) (magnum)
> * WoltLab Burning Board 3 (Dhiru)
> * New EPiServer default (based on SHA-256) (Dhiru)
> * GOST R 34.11-94 (Dhiru, JimF)
>
> OpenMP support for previously supported hashes that lacked it:
> * Mac OS X 10.4 - 10.6 salted SHA-1 (magnum)
> * DES-based tripcodes (Solar)
> * Invision Power Board 2.x salted MD5 (magnum)
> * HTTP Digest access authentication MD5 (magnum)
> * MySQL (old) (Solar)
>
> CUDA support for:
> * phpass MD5-based "portable hashes" (Lukas)
> * md5crypt (FreeBSD-style MD5-based crypt(3) hashes) (Lukas)
> * sha512crypt (glibc 2.7+ SHA-crypt) (Lukas)
> * sha256crypt (glibc 2.7+ SHA-crypt) (Lukas)
> * Password Safe (Lukas)
> * WPA-PSK (Lukas)
> * Raw SHA-224, raw SHA-256 [inefficient] (Lukas)
> * MSCash (DCC) [not working reliably yet] (Lukas)
> * MSCash2 (DCC2) [not working reliably yet] (Lukas)
> * Raw SHA-512 [not working reliably yet] (myrice)
> * Mac OS X 10.7 salted SHA-512 [not working reliably yet] (myrice)
> - we have already identified the problem with the above two, and a
> post-1.7.9-jumbo-6 fix should be available shortly - please ask on
> john-users if interested in trying it out
>
> OpenCL support for:
> * phpass MD5-based "portable hashes" (Lukas)
> * md5crypt (FreeBSD-style MD5-based crypt(3) hashes) (Lukas)
> * sha512crypt (glibc 2.7+ SHA-crypt) (Claudio)
> - suitable for NVIDIA cards, faster than the CUDA implementation above
> http://openwall.info/wiki/john/OpenCL-SHA-512
> * bcrypt (OpenBSD-style Blowfish-based crypt(3) hashes) (Sayantan)
> - pre-configured for AMD Radeon HD 7970, will likely fail on others unless
> WORK_GROUP_SIZE is adjusted in opencl_bf_std.h and opencl/bf_kernel.cl;
> the achieved level of performance is CPU-like (bcrypt is known to be
> somewhat GPU-unfriendly - a lot more than SHA-512)
> http://openwall.info/wiki/john/GPU/bcrypt
> * MSCash2 (DCC2) (Sayantan)
> - with optional and experimental multi-GPU support as a compile-time hack
> (even AMD+NVIDIA mix), by editing init() in opencl_mscash2_fmt.c
> * Password Safe (Lukas)
> * WPA-PSK (Lukas)
> * RAR (magnum)
> * MySQL 4.1 double-SHA-1 [inefficient] (Samuele)
> * Netscape LDAP salted SHA-1 (SSHA) [inefficient] (Samuele)
> * NTLM [inefficient] (Samuele)
> * Raw MD5 [inefficient] (Dhiru, Samuele)
> * Raw SHA-1 [inefficient] (Samuele)
> * Raw SHA-512 [not working properly yet] (myrice)
> * Mac OS X 10.7 salted SHA-512 [not working properly yet] (myrice)
> - we have already identified the problem with the above two, and a
> post-1.7.9-jumbo-6 fix should be available shortly - please ask on
> john-users if interested in trying it out
>
> Several of these require byte-addressable store (any NVIDIA card, but
> only 5000 series or newer if AMD/ATI). Also, OpenCL kernels for "slow"
> hashes/non-hashes (e.g. RAR) may cause "ASIC hang" on certain AMD/ATI
> cards with recent driver versions. We'll try to address these issues in
> a future version.
>
> AMD XOP (Bulldozer) support added for:
> * Many hashes based on MD4, MD5, SHA-1 (Solar)
>
> Uses of SIMD (MMX assembly, SSE2/AVX/XOP intrinsics) added for:
> * Mac OS X 10.4 - 10.6 salted SHA-1 (magnum)
> * Invision Power Board 2.x salted MD5 (magnum)
> * HTTP Digest access authentication MD5 (magnum)
> * SAP CODVN B (BCODE) MD5 (magnum)
> * SAP CODVN F/G (PASSCODE) SHA-1 (magnum)
> * Oracle 11 (magnum)
>
> Other optimizations:
> * Reduced memory usage for raw-md4, raw-md5, raw-sha1, and nt2 (magnum)
> * Prefer CommonCrypto over OpenSSL on Mac OS X 10.7 (Dhiru)
> * New SSE2 intrinsics code for SHA-1 (JimF, magnum)
> * Smarter use of SSE2 and SSSE3 intrinsics (the latter only if enabled in
> the compiler at build time) to implement some bit rotates for MD5, SHA-1
> (Solar)
> * Assorted optimizations for raw SHA-1 and HMAC-MD5 (magnum)
> * In RAR format, added inline storing of RAR data in JtR input file when
> the original file is small enough (magnum)
> * Added use of the bitslice DES implementation for tripcodes (Solar)
> * Raw-MD5-unicode made "thick" again (that is, not building upon
> "dynamic"), using much faster code (magnum)
> * Assorted performance tweaks in "salted-sha1" (SSHA) (magnum)
> * Added functions for larger hash tables to several formats (magnum,
> Solar)
>
> Other assorted enhancements:
> * linux-*-gpu (both CUDA and OpenCL at once), linux-*-cuda, linux-*-opencl,
> macosx-x86-64-opencl make targets (magnum et al.)
> * linux-*-native make targets (pass -march=native to gcc) (magnum)
> * New option: --dupe-suppression (for wordlist mode) (magnum)
> * New option: --loopback[=FILE] (implies --dupe-suppression) (magnum)
> * New option: --max-run-time=N for graceful exit after N seconds (magnum)
> * New option: --log-stderr (magnum)
> * New option: --regenerate-lost-salts=N for cracking hashes where we do not
> have the salt and essentially need to crack it as well (JimF)
> * New unlisted option: --list (for bash completion, GUI, etc.) (magnum)
> * --list=[encodings|opencl-devices] (magnum)
> * --list=cuda-devices (Lukas)
> * --list=format-details (Frank)
> * --list=subformats (magnum)
> * New unlisted option: --length=N for reducing maximum plaintext length
> of a format, mostly for testing purposes (magnum)
> * Enhanced parameter syntax for --markov: may refer to a configuration file
> section, may specify the start and/or end in percent of total (Frank)
> * Make incremental mode restore ETA figures (JimF)
> * In "dynamic", support NUL octets in constants (JimF)
> * In "salted-sha1" (SSHA), support any salt length (magnum)
> * Use comment and home directory fields from PWDUMP-style input (magnum)
> * Sort the format names list in "john" usage output alphabetically
> (magnum)
> * New john.conf options subsection "MPI" (magnum)
> * New john.conf config item CrackStatus under Options:Jumbo (magnum)
> * \xNN escape sequence to specify arbitrary characters in rules (JimF)
> * New rule command _N to reject a word unless it is of length N (JimF)
> * Extra wordlist rule sections: Extra, Single-Extra, Jumbo (magnum)
> * Enhanced "Double" external mode sample (JimF)
> * Source $JOHN/john.local.conf by default (magnum)
> * Many format and algorithm names have been changed for consistency
> (Solar)
> * When intrinsics are in use, the reported algorithm name now tells
> which ones (SSE2, AVX, or XOP) (Solar)
> * benchmark-unify: a Perl script to unify benchmark output of different
> versions of JtR for use with relbench (Frank)
> * Per-benchmark speed ratio output added to relbench (Frank)
> * bash completion for JtR (to install: "sudo make bash-completion")
> (Frank)
> * New program: raw2dyna (helper to convert raw hashes to "dynamic") (JimF)
> * New program: pass_gen.pl (generates hashes from plaintexts) (JimF,
> magnum)
> * Many code changes made, many bugs fixed, many new bugs introduced (all)
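Several of the new configuration items above (CrackStatus, the OpenCL
options section also used for the rar_GWS hint shown later in this message)
can be collected into $JOHN/john.local.conf, which this release sources by
default. A hypothetical fragment - the values are illustrative, not
recommendations:

```ini
# $JOHN/john.local.conf is sourced by default in 1.7.9-jumbo-6,
# so local tweaks survive edits to the main john.conf.

[Options:Jumbo]
# New config item mentioned in the announcement (value illustrative).
CrackStatus = Y

[Options:OpenCL]
# Cache the autotuned RAR global work size (card-specific; this is
# the value the GTX 570 benchmark below suggests).
rar_GWS = 32768
```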
>
> Now the promised benchmarks. Here's the 1.7.9-jumbo-5 to 1.7.9-jumbo-6
> overall speed change on one core of an FX-8120 (should be 4.0 GHz turbo),
> after running through benchmark-unify and relbench (about 50 of the
> new version's benchmark results could not be directly compared against
> results of the previous version, and thus are excluded):
>
> Number of benchmarks: 151
> Minimum: 0.84668 real, 0.84668 virtual
> Maximum: 10.92416 real, 10.92416 virtual
> Median: 1.10800 real, 1.10800 virtual
> Median absolute deviation: 0.12531 real, 0.12369 virtual
> Geometric mean: 1.26217 real, 1.26284 virtual
> Geometric standard deviation: 1.47239 real, 1.47274 virtual
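For reference, summary statistics of this kind can be reproduced from the
per-benchmark speed ratios with the standard definitions (the ratios below
are illustrative, not the actual 151 data points):

```python
import math
from statistics import median

# Hypothetical per-benchmark speed ratios (new c/s divided by old c/s).
ratios = [0.85, 0.98, 1.05, 1.11, 1.26, 1.47, 2.30, 10.9]

med = median(ratios)
# Median absolute deviation: median of |ratio - median|.
mad = median(abs(r - med) for r in ratios)
# Geometric mean: exp of the mean of logs - the natural average for ratios.
gmean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
# Geometric standard deviation: exp of the standard deviation of logs.
gsd = math.exp(math.sqrt(sum((math.log(r) - math.log(gmean)) ** 2
                             for r in ratios) / len(ratios)))

print(f"Minimum: {min(ratios)}  Maximum: {max(ratios)}")
print(f"Median: {med}  MAD: {mad}")
print(f"Geometric mean: {gmean:.5f}  Geometric stddev: {gsd:.5f}")
```

The geometric mean is the right aggregate here because the data are ratios:
a 2x speedup and a 2x slowdown should average out to 1.0, which an
arithmetic mean would not give.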
>
> Ditto for OpenMP-enabled builds (8 threads, should be 3.1 GHz):
>
> Number of benchmarks: 151
> Minimum: 0.94616 real, 0.48341 virtual
> Maximum: 24.19709 real, 4.29610 virtual
> Median: 1.17609 real, 1.05964 virtual
> Median absolute deviation: 0.17436 real, 0.11465 virtual
> Geometric mean: 1.35493 real, 1.17097 virtual
> Geometric standard deviation: 1.71505 real, 1.36577 virtual
>
> These show that overall we do indeed have a speedup, and that's without
> any GPU stuff.
>
> Also curious is speedup due to OpenMP in 1.7.9-jumbo-6 (same version in
> both cases), on the same CPU (8 threads):
>
> Number of benchmarks: 202
> Minimum: 0.76235 real, 0.09553 virtual
> Maximum: 30.51791 real, 3.81904 virtual
> Median: 1.01479 real, 0.98287 virtual
> Median absolute deviation: 0.02747 real, 0.03514 virtual
> Geometric mean: 1.71441 real, 0.77454 virtual
> Geometric standard deviation: 2.08823 real, 1.50966 virtual
>
> The 30x maximum speedup (with only 8 threads) is indeed abnormal; it is
> for:
>
> Ratio: 30.51791 real, 3.81904 virtual SIP MD5:Raw
>
> We'll correct the non-OpenMP performance for SIP in the next version.
> For the rest, the maximum speedup is 6.13x for SSH, which is great
> (considering that the CPU clock rate reduces with more threads running,
> and that this is a 4-module CPU rather than a true 8-core). Here are
> the top 10 OpenMP performers (excluding SIP):
>
> Ratio: 6.13093 real, 0.77210 virtual SSH RSA/DSA (one 2048-bit RSA and
> one 1024-bit DSA key):Raw
> Ratio: 6.05882 real, 0.75737 virtual NTLMv2 C/R MD4 HMAC-MD5:Many salts
> Ratio: 6.04342 real, 0.75548 virtual LMv2 C/R MD4 HMAC-MD5:Many salts
> Ratio: 5.92830 real, 0.74108 virtual GOST R 34.11-94:Raw
> Ratio: 5.81605 real, 0.73986 virtual sha256crypt (rounds=5000):Raw
> Ratio: 5.65289 real, 0.70523 virtual sha512crypt (rounds=5000):Raw
> Ratio: 5.63333 real, 0.72034 virtual Drupal 7 $S$ SHA-512 (x16385):Raw
> Ratio: 5.56435 real, 0.69937 virtual OpenBSD Blowfish (x32):Raw
> Ratio: 5.50484 real, 0.69682 virtual Password Safe SHA-256:Raw
> Ratio: 5.49613 real, 0.68814 virtual Sybase ASE salted SHA-256:Many salts
>
> The worst regression is for:
>
> Ratio: 0.76235 real, 0.09553 virtual LM DES:Raw
>
> It is known that our current LM hash code does not scale well, and is
> very fast even with one thread (close to the bottleneck of the current
> interface). It is in fact better not to use OpenMP for LM hashes yet,
> or to keep the thread count low (e.g., 4 would behave better than 8).
> The low median and mean speedup are because many hashes still lack
> OpenMP support - mostly the "fast" ones, where we'd bump into the
> bottleneck anyway. We might deal with this later. For "slow" hashes,
> the speedup with OpenMP is close to perfect (5x to 6x for this CPU).
>
> Now to the new stuff. The effect of XOP (make linux-x86-64-xop):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=md5
> Benchmarking: FreeBSD MD5 [128/128 XOP intrinsics 8x]... (8xOMP) DONE
> Raw: 204600 c/s real, 25625 c/s virtual
>
> -5 achieved at most:
>
> user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=md5
> Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
> Raw: 158208 c/s real, 19751 c/s virtual
>
> with "make linux-x86-64i" (icc precompiled SSE2 intrinsics), and only:
>
> user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=md5
> Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
> Raw: 141312 c/s real, 17664 c/s virtual
>
> with "make linux-x86-64-xop" because it did not yet use XOP for MD5 (nor
> for MD4 and SHA-1), only knowing how to use it for DES (which it did).
>
> So we got an over 20% speedup due to XOP here.
>
> Similarly, for raw SHA-1 best result with -5:
>
> user@...l:~/john-1.7.9-jumbo-5/run$ ./john -te -fo=raw-sha1
> Benchmarking: Raw SHA-1 [SSE2i 8x]... DONE
> Raw: 13067K c/s real, 13067K c/s virtual
>
> whereas -6 does, with JimF's and magnum's optimizations and with XOP:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=raw-sha1
> Benchmarking: Raw SHA-1 [128/128 XOP intrinsics 8x]... DONE
> Raw: 23461K c/s real, 23698K c/s virtual
>
> and with Tavis' contribution:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=raw-sha1-ng
> Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 XOP intrinsics 4x]... DONE
> Raw: 28024K c/s real, 28024K c/s virtual
>
> So that's an over 2x speedup if we can accept the length 15 limit, or
> an almost 80% speedup otherwise.
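The two figures follow directly from the c/s numbers quoted above:

```python
# Speedups computed from the raw-SHA-1 benchmark figures above (Kc/s).
base = 13067  # -5, SSE2i 8x
xop = 23461   # -6, XOP intrinsics 8x
ng = 28024    # -6, raw-sha1-ng (length <= 15)

print(f"raw-sha1:    {xop / base:.2f}x")  # ~1.80x - "almost 80%"
print(f"raw-sha1-ng: {ng / base:.2f}x")   # ~2.14x - "over 2x"
```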
>
> Note: all of the raw SHA-1 benchmarks above are for one CPU core, not
> for the entire chip (no OpenMP for fast hashes like this yet, but
> there's MPI and there are always separate process invocations...)
>
> To more important stuff, sha512crypt on CPU vs. GPU:
>
> For reference, here's what we would get with the previous version, using
> the glibc implementation of SHA-crypt:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=crypt -sub=sha512crypt
> Benchmarking: generic crypt(3) SHA-512 rounds=5000 [?/64]... (8xOMP) DONE
> Many salts: 1518 c/s real, 189 c/s virtual
> Only one salt: 1515 c/s real, 189 c/s virtual
>
> Now we also have a builtin implementation, although it nevertheless uses
> OpenSSL for the SHA-512 primitive (it doesn't have its own SHA-512 yet -
> adding that and making use of SIMD would provide much additional
> speedup; this is a to-do item for us):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt
> Benchmarking: sha512crypt (rounds=5000) [64/64]... (8xOMP) DONE
> Raw: 2045 c/s real, 256 c/s virtual
>
> So it is about 35% faster. Let's try GPUs, first GTX 570 1600 MHz
> (a card that is vendor-overclocked to that frequency):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-cuda
> Benchmarking: sha512crypt (rounds=5000) [CUDA]... DONE
> Raw: 3833 c/s real, 3833 c/s virtual
>
> Another 2x speedup here, but that's still not it. Let's see:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl
> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
> Using device 0: GeForce GTX 570
> Building the kernel, this could take a while
> Local work size (LWS) 512, global work size (GWS) 7680
> Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
> Raw: 11405 c/s real, 11349 c/s virtual
>
> And now this is it - Claudio's OpenCL code is really good on NVIDIA,
> giving us a 5.5x speedup over CPU. (SHA-512 is not as GPU-friendly as
> e.g. MD5, but is friendly enough for some decent speedup.)
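The chain of speedups in this sha512crypt walkthrough, worked out from the
c/s figures above:

```python
# c/s numbers from the sha512crypt benchmarks above (FX-8120 / GTX 570).
glibc = 1515    # generic crypt(3), one salt
builtin = 2045  # new builtin CPU format
cuda = 3833     # CUDA on GTX 570
opencl = 11405  # OpenCL on GTX 570

print(f"builtin vs glibc: {builtin / glibc:.2f}x")  # ~1.35x - "about 35%"
print(f"CUDA vs builtin:  {cuda / builtin:.2f}x")   # ~1.87x
print(f"OpenCL vs CPU:    {opencl / builtin:.2f}x") # ~5.58x - "5.5x"
```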
>
> Let's also try AMD Radeon HD 7970 (normally a faster card), at stock
> clocks:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl
> -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Building the kernel, this could take a while
> Elapsed time: 17 seconds
> Local work size (LWS) 32, global work size (GWS) 16384
> Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
> Raw: 5144 c/s real, 3276K c/s virtual
>
> Not as much luck here yet. Finally, for comparison and to show how any
> one of the three OpenCL devices may be accessed from john's command-line
> with --platform and --device options, the same OpenCL code on the CPU:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=sha512crypt-opencl -pla=1
> -dev=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 1: AMD FX(tm)-8120 Eight-Core Processor
> Local work size (LWS) 1, global work size (GWS) 1024
> Benchmarking: sha512crypt (rounds=5000) [OpenCL]... DONE
> Raw: 1850 c/s real, 233 c/s virtual
>
> This shows that the code is indeed pretty efficient - almost reaching
> OpenSSL's specialized code speed.
>
> Now to bcrypt. This CPU is pretty good at it:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf
> Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... (8xOMP) DONE
> Raw: 5300 c/s real, 664 c/s virtual
>
> (FWIW, with overclocking I was able to get this to about 5650 c/s, but
> not more - bumping into 125 W TDP. The above is at stock clocks.)
>
> This is for "$2a$05", or only 32 iterations, which is used as the baseline
> for benchmarks for historical reasons. Actual systems often use
> "$2a$08" (8 times slower) to "$2a$10" (32 times slower) these days.
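The bcrypt cost parameter is a base-2 logarithm of the iteration count, so
those slowdown factors follow directly; the derived c/s figures below are
extrapolations from the benchmark above, not measurements:

```python
def bcrypt_iterations(cost: int) -> int:
    # "$2a$NN" encodes cost NN; the expensive key setup runs 2**NN rounds.
    return 2 ** cost

base_cost, base_speed = 5, 5300  # the "$2a$05" CPU benchmark above, c/s

for cost in (5, 8, 10):
    slowdown = bcrypt_iterations(cost) // bcrypt_iterations(base_cost)
    print(f"$2a${cost:02d}: {slowdown:2d}x slower, "
          f"~{base_speed // slowdown} c/s extrapolated")
```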
>
> Anyway, the reference cracking speed for bcrypt above is higher than the
> speed for sha512crypt on the same CPU (with the current code at least,
> which admittedly can be optimized much further). Can we make it even
> higher on a GPU? Maybe, but not yet, not with the current code:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> ****Please see 'opencl_bf_std.h' for device specific optimizations****
> Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
> Raw: 4143 c/s real, 238933 c/s virtual
>
> user@...l:~/john-1.7.9-jumbo-6/run$ DISPLAY=:0 aticonfig --od-enable
> --od-setclocks=1225,1375
> AMD Overdrive(TM) enabled
>
> Default Adapter - AMD Radeon HD 7900 Series
> New Core Peak : 1225
> New Memory Peak : 1375
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> ****Please see 'opencl_bf_std.h' for device specific optimizations****
> Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
> Raw: 5471 c/s real, 358400 c/s virtual
>
> It's only with a 30% overclock that the high-end GPU gets to the same
> level of performance as the 2-3 times cheaper CPU. BTW, the GPU stays
> cool with this overclock (73 C with stock cooling when running bf-opencl
> for a while), precisely because we have to heavily under-utilize it due
> to it not having enough local memory to accommodate as many parallel
> bcrypt computations as we'd need for full occupancy and to hide memory
> access latencies.
>
> Better-optimized code may achieve better results, though.
>
> The NVIDIA card also has no luck competing with the CPU at bcrypt yet:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=bf-opencl
> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
> Using device 0: GeForce GTX 570
> ****Please see 'opencl_bf_std.h' for device specific optimizations****
> Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
> Raw: 1137 c/s real, 1137 c/s virtual
>
> Some tuning could provide better numbers, but they stay a lot lower than
> the CPU's and HD 7970's anyway (for the current code).
>
> Some other GPU benchmarks where I think we achieve decent performance
> (not exactly the best, but on par with competing tools that had GPU
> support for longer):
>
> GTX 570 1600 MHz:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=phpass-cuda
> Benchmarking: phpass MD5 ($P$9 lengths 1 to 15) [CUDA]... DONE
> Raw: 510171 c/s real, 507581 c/s virtual
>
> HD 7970 925 MHz (stock):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=mscash2-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Optimal Work Group Size:256
> Kernel Execution Speed (Higher is better):1.403044
> Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
> Raw: 92467 c/s real, 92142 c/s virtual
>
> 1225 MHz:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=mscash2-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Optimal Work Group Size:128
> Kernel Execution Speed (Higher is better):1.856949
> Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
> Raw: 121644 c/s real, 121644 c/s virtual
>
> (would overheat if actually used? this is not bcrypt anymore)
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=rar
> OpenCL platform 0: NVIDIA CUDA, 1 device(s).
> Using device 0: GeForce GTX 570
> Optimal keys per crypt 32768
> (to avoid this test on next run, put "rar_GWS = 32768" in john.conf, section
> [Options:OpenCL])
> Local worksize (LWS) 64, Global worksize (GWS) 32768
> Benchmarking: RAR3 SHA-1 AES (6 characters) [OpenCL]... (8xOMP) DONE
> Raw: 4380 c/s real, 4334 c/s virtual
>
> The HD 7970 card is back to stock clocks here:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=rar -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Optimal keys per crypt 65536
> (to avoid this test on next run, put "rar_GWS = 65536" in john.conf, section
> [Options:OpenCL])
> Local worksize (LWS) 64, Global worksize (GWS) 65536
> Benchmarking: RAR3 SHA-1 AES (6 characters) [OpenCL]... (8xOMP) DONE
> Raw: 7162 c/s real, 468114 c/s virtual
>
> WPA-PSK, on CPU:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk
> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [32/64]... (8xOMP) DONE
> Raw: 1980 c/s real, 247 c/s virtual
>
> (no SIMD yet; could do several times faster). CUDA:
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk-cuda
> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [CUDA]... (8xOMP) DONE
> Raw: 32385 c/s real, 16695 c/s virtual
>
> OpenCL on the faster card (stock clock):
>
> user@...l:~/john-1.7.9-jumbo-6/run$ ./john -te -fo=wpapsk-opencl -pla=1
> OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Tahiti
> Max local work size 256
> Optimal local work size = 256
> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [OpenCL]... (8xOMP) DONE
> Raw: 55138 c/s real, 42442 c/s virtual
>
> 27x speedup over CPU here, although presumably the CPU code is further
> from optimal.
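WPA-PSK is a good fit for GPUs precisely because each candidate passphrase
costs a full PBKDF2-HMAC-SHA-1 derivation over the SSID (4096 iterations,
per IEEE 802.11i). A stdlib sketch of that derivation - the passphrase and
SSID are made up:

```python
import hashlib

def wpa_psk(passphrase: str, ssid: str) -> bytes:
    # PSK = PBKDF2-HMAC-SHA1(passphrase, ssid, 4096 iterations, 32 bytes)
    return hashlib.pbkdf2_hmac("sha1", passphrase.encode(), ssid.encode(),
                               4096, dklen=32)

psk = wpa_psk("correct horse battery", "HomeNetwork")
print(psk.hex())  # 256-bit PSK; this is the per-candidate cost a cracker pays
```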
>
> ...Hey, what are you doing here? That message was way too long, you
> couldn't possibly read this far. I'll just presume you scrolled to the
> end. There's good stuff you have missed above, so please scroll up. ;-)
>
> As usual, feedback is welcome on the john-users list. I realize that
> we're currently missing usage instructions for much of the new stuff, so
> please just ask on john-users - and try to make your questions specific.
> That way, code contributors will also be prompted/forced to contribute
> documentation, and we'll get it under doc/ and on the wiki - in fact,
> you can contribute to that too.
>
> Alexander
>