john-dev - CUDA & OpenCL status

Message-ID: <20120304002658.GA8272@openwall.com>
Date: Sun, 4 Mar 2012 04:26:58 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: CUDA & OpenCL status

Hi,

I just built a new machine for my JtR development and testing.  It is not
cost-effective, but it lets me run certain tests and benchmarks that I
previously could not.  Brief specs: FX-8120 CPU (Bulldozer) with
unlocked multiplier, GTX-570 vendor-overclocked to 1600 MHz, Radeon HD
7970 at stock clocks.  This is on an ASRock 970 Extreme4 motherboard, which
has 3 PCIe 16x slots, but only 2 are usable for dual-slot-wide cards in
the midi-tower case I currently have it in.  The slots operate as 8x+8x+4x
(so I get 8x+8x for the two cards now) unless only one is in use (then
it's real 16x), but I think it won't matter much (folks even use 1x
slots for this purpose).  For the OS, I simply installed Ubuntu 12.04
Beta1, plus NVidia drivers via Ubuntu's proprietary driver installer
mechanism, plus the following:

cudatoolkit_4.1.28_linux_64_ubuntu11.04.run
amd-driver-installer-8.921-x86.x86_64.run
AMD-APP-SDK-v2.6-lnx64.tgz

Of the AMD stuff, I happened to install the driver first, then read in
the SDK's README that it is preferable to install the SDK first and why.
So I tarred up /usr/lib*/lib{amdocl,OpenCL}*.so, installed the SDK, then
restored those files.

Speaking of the CPU, I briefly tried -xop builds with and without
OpenMP, which obviously worked fine.  I also tried overclocking from the
standard 3.1 GHz base and 4.0 GHz turbo up to a maximum of 5.1 GHz (no
turbo then), but of course the CPU was unusable at that frequency
(I could still navigate the BIOS menus and see CPU temperature grow
slowly, but not do much more). ;-)  So I just went back to the standard
frequencies for my CUDA and OpenCL tests.

Here are some results:

For a linux-x86-64-cuda build, I ran:

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te=0
Warning: doing quick benchmarking - the performance numbers will be inaccurate
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     3916K c/s real, 3916K c/s virtual
Only one salt:  3648K c/s real, 3648K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     140800 c/s real, 140800 c/s virtual
Only one salt:  70400 c/s real, 140800 c/s virtual

Benchmarking: FreeBSD MD5 [SSE2i 8x]... Segmentation fault (core dumped)

Oops.  Somehow in the CUDA build, some CPU-only formats fail.  This is
seen in a -gpu build as well (CUDA and NVidia OpenCL at once), where
some CPU-only formats' self-tests fail (IIRC, in that build I was getting
test failures rather than segfaults).  Has anyone else seen this?

Another try:

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te=1
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     4079K c/s real, 4079K c/s virtual
Only one salt:  3866K c/s real, 3866K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     137344 c/s real, 137344 c/s virtual
Only one salt:  134272 c/s real, 134272 c/s virtual

Benchmarking: FreeBSD MD5 [SSE2i 8x]... Segmentation fault (core dumped)

So it's reproducible.
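
Since the first segfault ends the whole -te run, one way to see which
formats are affected is to benchmark each format in its own process.
Below is a minimal sketch in Python (not part of JtR).  It assumes only
the -te=1 and -fo= options used in this message; the helper names
classify and bench_each are made up for illustration, and whether a
self-test "FAILED" shows up in the exit status is an assumption that
should be checked against the actual output.

```python
import signal
import subprocess

JOHN = "../run/john"  # path used in this message; adjust as needed


def classify(returncode):
    """Turn a process exit status into a short human-readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess reports "killed by signal N" as returncode == -N
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal %d" % -returncode
    return "exit %d" % returncode


def bench_each(formats, run=subprocess.run, env=None):
    """Run one benchmark process per format and record how each one ended.

    `run` is injectable so the loop can be exercised without john installed.
    """
    results = {}
    for name in formats:
        proc = run([JOHN, "-te=1", "-fo=" + name], env=env)
        results[name] = classify(proc.returncode)
    return results
```

For the CUDA build above, this could be called with a list of format
names and env=dict(os.environ, LD_LIBRARY_PATH="/usr/local/cuda/lib64"),
so that a crash in one format is reported as e.g. "killed by SIGSEGV"
and the remaining formats still get their turn.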

OK, let's try the CUDA formats specifically, then:

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=cryptmd5-cuda
Benchmarking: cryptmd5-cuda [MD5-based CRYPT]... DONE
Raw:    637952 c/s real, 637952 c/s virtual

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=cryptsha256-cuda
Benchmarking: cryptsha256-cuda [SHA256-based CRYPT]... DONE
Raw:    6892 c/s real, 6833 c/s virtual

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=cryptsha512-cuda
Benchmarking: cryptsha512-cuda [SHA512-based CRYPT]... DONE
Raw:    3840 c/s real, 3840 c/s virtual

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=mscash-cuda
Benchmarking: mscash-cuda len(pass)=8, len(salt)=13 []... DONE
Raw:    17708K c/s real, 17533K c/s virtual

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=mscash2-cuda
Benchmarking: mscash2-cuda [GPU]... DONE
Raw:    7859 c/s real, 7929 c/s virtual

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=phpass-cuda
Benchmarking: phpass-cuda [PORTABLE-MD5]... DONE
Raw:    633061 c/s real, 633061 c/s virtual

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=raw-sha224-cuda
Benchmarking: raw-sha224-cuda [SHA224]... DONE
Raw:    7864K c/s real, 7864K c/s virtual

user@...l:~/john/magnum-jumbo/src$ LD_LIBRARY_PATH=/usr/local/cuda/lib64 ../run/john -te -fo=raw-sha256-cuda
Benchmarking: raw-sha256-cuda [SHA256]... DONE
Raw:    7798K c/s real, 7798K c/s virtual

That's reasonable performance for "slow" hashes and poor performance for
"fast" hashes, as expected.  However, I think there's room for
improvement for SHA-224 and SHA-256 even within the current formats
interface.

The phpass performance almost exactly matches the number for
oclHashcat-plus given on the hashcat website ("653.6k c/s" for
"PC2: Windows 7, 64 bit ForceWare 285.38 1x NVidia gtx570 1600Mhz core
clock"), which is great.  The cryptmd5-cuda number is almost 2 times
lower than hashcat's, and mscash2-cuda is 4 times lower.  (I am
using the published numbers for hashcat for this comparison.)
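
To put a number on "almost exactly matches" for phpass, here is the
arithmetic as a tiny Python sketch.  It uses only the two figures
quoted in this message; the variable names are just for illustration.

```python
# Figures quoted in this message, in candidates per second (c/s).
john_phpass_cuda = 633061   # phpass-cuda benchmark result above
hashcat_phpass = 653.6e3    # "653.6k c/s" published for oclHashcat-plus

ratio = john_phpass_cuda / hashcat_phpass
print("phpass-cuda reaches %.1f%% of the published hashcat speed" % (100 * ratio))
# prints: phpass-cuda reaches 96.9% of the published hashcat speed
```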

With AMD stuff, things get worse:

user@...l:~/john/magnum-jumbo/src$ ../run/john -device=0 -te=1 -fo=phpass-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 2 device(s), using device: <<<Tahiti>>>
Optimal Group work Size = 256
Benchmarking: PHPASS-OPENCL [PORTABLE-MD5]... DONE
Raw:    958464 c/s real, 1409K c/s virtual

That's much better speed (although I think hashcat would do more like
1800k c/s on this card), but it's also the only time I was able to get
this test to pass.  Then it just kept failing:

user@...l:~/john/magnum-jumbo/src$ ../run/john -device=0 -te=1 -fo=phpass-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 2 device(s), using device: <<<Tahiti>>>
Optimal Group work Size = 256
Benchmarking: PHPASS-OPENCL [PORTABLE-MD5]... ../../../thread/semaphore.cpp:87: sem_wait() failed
Aborted (core dumped)
user@...l:~/john/magnum-jumbo/src$ ../run/john -device=0 -te=1 -fo=phpass-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 2 device(s), using device: <<<Tahiti>>>
Optimal Group work Size = 256
Benchmarking: PHPASS-OPENCL [PORTABLE-MD5]... ../../../thread/semaphore.cpp:87: sem_wait() failed
Aborted (core dumped)

cryptmd5-opencl never worked (although I did not try it after a clean
reboot, with no prior attempt to use OpenCL):

user@...l:~/john/magnum-jumbo/src$ ../run/john -device=0 -te=1 -fo=cryptmd5-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 2 device(s), using device: <<<Tahiti>>>
Benchmarking: CRYPTMD5-OPENCL [MD5-based CRYPT]... FAILED (get_hash[0](0))
user@...l:~/john/magnum-jumbo/src$ ../run/john -device=0 -te=1 -fo=cryptmd5-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 2 device(s), using device: <<<Tahiti>>>
Benchmarking: CRYPTMD5-OPENCL [MD5-based CRYPT]... FAILED (get_hash[0](0))

Trying to use the CPU:

user@...l:~/john/magnum-jumbo/src$ ../run/john -device=1 -te=1 -fo=phpass-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 2 device(s), using device: <<<AMD FX(tm)-8120 Eight-Core Processor           >>>
Optimal Group work Size = 2
Benchmarking: PHPASS-OPENCL [PORTABLE-MD5]... ../../../thread/semaphore.cpp:87: sem_wait() failed
Aborted (core dumped)
user@...l:~/john/magnum-jumbo/src$ ../run/john -device=1 -te=1 -fo=phpass-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 2 device(s), using device: <<<AMD FX(tm)-8120 Eight-Core Processor           >>>
Optimal Group work Size = 2
Benchmarking: PHPASS-OPENCL [PORTABLE-MD5]... ../../../thread/semaphore.cpp:87: sem_wait() failed
Aborted (core dumped)

No luck.

Another format:

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=nt-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 2 device(s), using device: <<<Tahiti>>>
Optimal Local work size 64
Benchmarking: NT MD4 [OpenCL 1.0]... DONE
Raw:    26473K c/s real, 28445K c/s virtual

Works, but is inefficient (as expected for a "fast" hash currently).

...but fails on the CPU:

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=nt-opencl -dev=1
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 2 device(s), using device: <<<AMD FX(tm)-8120 Eight-Core Processor           >>>
Optimal Local work size 512
Benchmarking: NT MD4 [OpenCL 1.0]... ../../../thread/semaphore.cpp:87: sem_wait() failed
Aborted (core dumped)

Let's try both again:

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=nt-opencl
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 2 device(s), using device: <<<Tahiti>>>
Optimal Local work size 64
Benchmarking: NT MD4 [OpenCL 1.0]... DONE
Raw:    25954K c/s real, 27306K c/s virtual

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=nt-opencl -dev=1
OpenCL Platforms: 1
OpenCL Platform: <<<AMD Accelerated Parallel Processing>>> 2 device(s), using device: <<<AMD FX(tm)-8120 Eight-Core Processor           >>>
Optimal Local work size 512
Benchmarking: NT MD4 [OpenCL 1.0]... DONE
Raw:    20360K c/s real, 4854K c/s virtual

Both work now, but the performance is still unreasonably low (the simple
"NT" and "NT2" formats for this same hash type are slightly faster).

...
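
To compare runs like the ones above, the benchmark lines can be parsed
into plain numbers.  Below is a minimal Python sketch (not part of
JtR).  It assumes the "label: N[K] c/s real, N[K] c/s virtual" layout
seen in this message and that the K suffix means 1000; parse_cps is a
made-up helper name.  It also assumes that "virtual" is based on
processor time, in which case real/virtual approximates CPU seconds
used per wall-clock second.

```python
import re

# One benchmark result line, e.g. "Raw:    20360K c/s real, 4854K c/s virtual"
LINE = re.compile(
    r"^(?P<label>[^:]+):\s+(?P<real>\d+)(?P<rk>K?) c/s real, "
    r"(?P<virt>\d+)(?P<vk>K?) c/s virtual$"
)


def parse_cps(line):
    """Return (label, real_cps, virtual_cps), or None if the line does not match."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    real = int(m["real"]) * (1000 if m["rk"] else 1)
    virt = int(m["virt"]) * (1000 if m["vk"] else 1)
    return m["label"], real, virt


# The nt-opencl run on the FX-8120 OpenCL device from this message:
label, real, virt = parse_cps("Raw:    20360K c/s real, 4854K c/s virtual")
print(label, real, virt, round(real / virt, 1))
# prints: Raw 20360000 4854000 4.2
```

Under those assumptions, a ratio of about 4.2 would mean the run used
roughly 4.2 CPU seconds per second of wall-clock time.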

The NVidia OpenCL build does not crash on CPU formats when run with -te=0
(unlike the -cuda and -gpu builds), so that's what I did.  Here's the
final portion of the output:

[...]
Benchmarking: dummy [N/A]... DONE
Raw:    115379K c/s real, 115379K c/s virtual

OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>> 1 device(s), using device: <<<GeForce GTX 570>>>
Max Group Work Size 1024 Optimal local work size 64
(to avoid this test on next run do export LWS=64)
Local work size (LWS) 64, Keys per crypt (KPC) 2097152
Benchmarking: Netscape LDAP SSHA OPENCL [salted SHA-1]... DONE
Many salts:     52428K c/s real, 69905K c/s virtual
Only one salt:  41943K c/s real, 41943K c/s virtual

OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>> 1 device(s), using device: <<<GeForce GTX 570>>>
Max Group Work Size 1024 Optimal local work size 128
(to avoid this test on next run do export LWS=128)
Local work size (LWS) 128, Keys per crypt (KPC) 2097152
Benchmarking: Raw MD5 [raw-md5-opencl]... DONE
Raw:    52428K c/s real, 41943K c/s virtual

OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>> 1 device(s), using device: <<<GeForce GTX 570>>>
Optimal Local work size 128
Benchmarking: NT MD4 [OpenCL 1.0]... DONE
Raw:    52428K c/s real, 26214K c/s virtual

OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>> 1 device(s), using device: <<<GeForce GTX 570>>>
Max Group Work Size 1024 Optimal local work size 64
(to avoid this test on next run do export LWS=64)
Local work size (LWS) 64, Keys per crypt (KPC) 2097152
Benchmarking: Raw SHA-1 OpenCL [raw-sha1-opencl]... DONE
Raw:    52428K c/s real, 52428K c/s virtual

OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>> 1 device(s), using device: <<<GeForce GTX 570>>>
OpenCL error (CL_OUT_OF_RESOURCES) in file (cryptmd5_opencl_fmt.c) at line (194)
 - (Set ND range)

Oops, cryptmd5-opencl in that build always fails like that:

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=cryptmd5-opencl
../run/john: /usr/lib/nvidia-current-updates/libOpenCL.so.1: no version information available (required by ../run/john)
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>> 1 device(s), using device: <<<GeForce GTX 570>>>
OpenCL error (CL_OUT_OF_RESOURCES) in file (cryptmd5_opencl_fmt.c) at line (194) - (Set ND range)

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=mysql-sha1-opencl
../run/john: /usr/lib/nvidia-current-updates/libOpenCL.so.1: no version information available (required by ../run/john)
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>> 1 device(s), using device: <<<GeForce GTX 570>>>
Max Group Work Size 1024 Error -5
Optimal local work size 64
(to avoid this test on next run do export LWS=64)
Local work size (LWS) 64, Keys per crypt (KPC) 2097152
Benchmarking: MySQL 4.1 double-SHA-1 [mysql-sha1-opencl]... DONE
Many salts:     23741K c/s real, 23967K c/s virtual
Only one salt:  23967K c/s real, 23741K c/s virtual

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=phpass-opencl
../run/john: /usr/lib/nvidia-current-updates/libOpenCL.so.1: no version information available (required by ../run/john)
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>> 1 device(s), using device: <<<GeForce GTX 570>>>
OpenCL error (CL_OUT_OF_RESOURCES) in file (phpass_opencl_fmt.c) at line (162) - (Run kernel)

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=raw-sha1-opencl
../run/john: /usr/lib/nvidia-current-updates/libOpenCL.so.1: no version information available (required by ../run/john)
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>> 1 device(s), using device: <<<GeForce GTX 570>>>
Max Group Work Size 1024 Optimal local work size 64
(to avoid this test on next run do export LWS=64)
Local work size (LWS) 64, Keys per crypt (KPC) 2097152
Benchmarking: Raw SHA-1 OpenCL [raw-sha1-opencl]... DONE
Raw:    46829K c/s real, 47288K c/s virtual

user@...l:~/john/magnum-jumbo/src$ ../run/john -te -fo=ssha-opencl
../run/john: /usr/lib/nvidia-current-updates/libOpenCL.so.1: no version information available (required by ../run/john)
OpenCL Platforms: 2
OpenCL Platform: <<<NVIDIA CUDA>>> 1 device(s), using device: <<<GeForce GTX 570>>>
Max Group Work Size 1024 Optimal local work size 64
(to avoid this test on next run do export LWS=64)
Local work size (LWS) 64, Keys per crypt (KPC) 2097152
Benchmarking: Netscape LDAP SSHA OPENCL [salted SHA-1]... DONE
Many salts:     67108K c/s real, 65793K c/s virtual
Only one salt:  45680K c/s real, 45680K c/s virtual

So some of them work (or at least pass the test), some don't.  Some
deliver reasonable performance, most don't.  This is fine as a
development milestone, but not surprisingly a lot of further work is
needed after this point.

Thanks for reading this far.

Alexander