john-users - CUDA tweaking to your actual GPU

Message-ID: <04147bf88399d7cb31e370549b1841d8@smtp.hushmail.com>
Date: Sat, 6 Oct 2012 16:15:23 +0200
From: magnum <john.magnum@...hmail.com>
To: "john-users@...ts.openwall.com" <john-users@...ts.openwall.com>
Subject: CUDA tweaking to your actual GPU

When compiling JtR with CUDA support, you are supposed to change NVCC_FLAGS in the Makefile to match your GPU. Here's what happens if you do, or don't. This was tested with a Kepler card, which is "sm_30" (i.e. compute capability 3.0), while the default in the Makefile is "sm_10" (i.e. capability, well, you guessed it, 1.0). The latter will always work, but might not be optimal.
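As a rough illustration of that Makefile change, here is a small Python sketch that rewrites the sm_XX value on an NVCC_FLAGS line for a given compute capability. The helper name and the sample NVCC_FLAGS line are assumptions made up for this example; check the actual line in your Makefile before relying on it.

```python
import re


def retarget_nvcc_flags(makefile_text: str, capability: str) -> str:
    """Replace the sm_XX value on the NVCC_FLAGS line (hypothetical helper).

    capability is a string such as "3.0", which maps to "sm_30".
    """
    major, minor = capability.split(".")
    arch = f"sm_{int(major)}{int(minor)}"

    def swap(match: re.Match) -> str:
        # Only touch sm_XX tokens that sit on the NVCC_FLAGS line itself.
        return re.sub(r"sm_\d+", arch, match.group(0))

    return re.sub(r"^NVCC_FLAGS\s*=.*$", swap, makefile_text, flags=re.MULTILINE)


# Assumed example line; the real Makefile may carry different options.
sample = "NVCC_FLAGS = -c -Xptxas -v -arch sm_10\n"
print(retarget_nvcc_flags(sample, "3.0"), end="")
```

Editing the line by hand works just as well; the point is only that the capability "3.0" becomes the arch name "sm_30".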

I ran all CUDA benchmarks with sm_10 and sm_30, and here's relbench's verdict (a ratio above 1 means sm_30 was faster):

$ ../run/relbench -v sm10.txt sm30.txt 
Ratio:	1.07261 real, 1.06185 virtual	Raw SHA-224:Raw
Ratio:	0.99100 real, 0.99100 virtual	M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1:Raw
Ratio:	1.10947 real, 1.10947 virtual	Password Safe SHA-256:Raw
Ratio:	1.04536 real, 1.04625 virtual	sha256crypt (rounds=5000):Raw
Ratio:	0.89869 real, 0.88997 virtual	phpass MD5 ($P$9 lengths 0 to 15):Raw
Ratio:	1.02549 real, 1.02549 virtual	WPA-PSK PBKDF2-HMAC-SHA-1:Raw
Ratio:	0.79586 real, 0.78033 virtual	Raw SHA-512:Raw
Ratio:	1.34998 real, 1.36361 virtual	M$ Cache Hash MD4:Only one salt
Ratio:	0.88889 real, 0.88081 virtual	md5crypt:Raw
Ratio:	2.73102 real, 2.75858 virtual	M$ Cache Hash MD4:Many salts
Ratio:	0.81805 real, 0.81003 virtual	Mac OS X 10.7+ salted SHA-512:Many salts
Ratio:	1.07281 real, 1.08331 virtual	Raw SHA-256:Raw
Ratio:	1.10343 real, 1.09753 virtual	sha512crypt (rounds=5000):Raw
Ratio:	0.80827 real, 0.80827 virtual	Mac OS X 10.7+ salted SHA-512:Only one salt
Number of benchmarks:		14
Minimum:			0.79586 real, 0.78033 virtual
Maximum:			2.73102 real, 2.75858 virtual
Median:				1.03538 real, 1.03582 virtual
Median absolute deviation:	0.10064 real, 0.10365 virtual
Geometric mean:			1.06194 real, 1.05942 virtual
Geometric standard deviation:	1.34743 real, 1.35503 virtual

The worst slowdown was 20%, for raw-sha512. That's a pity, but the best boost was a whopping 2.7x speedup, for mscash. Both of these are fast formats that JtR currently handles suboptimally, so their results may include a large share of random test variation.
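To make the summary figures easier to check, here is a minimal Python sketch that recomputes them from the "real" ratios listed above. It assumes the median is taken in log space (which appears to match the printed value), and the helper names are mine, not relbench's.

```python
import math
import statistics

# "Real" ratios copied from the relbench output above (sm_30 speed / sm_10 speed).
ratios = [
    ("Raw SHA-224", 1.07261),
    ("M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1", 0.99100),
    ("Password Safe SHA-256", 1.10947),
    ("sha256crypt (rounds=5000)", 1.04536),
    ("phpass MD5", 0.89869),
    ("WPA-PSK PBKDF2-HMAC-SHA-1", 1.02549),
    ("Raw SHA-512", 0.79586),
    ("M$ Cache Hash MD4 (one salt)", 1.34998),
    ("md5crypt", 0.88889),
    ("M$ Cache Hash MD4 (many salts)", 2.73102),
    ("Mac OS X 10.7+ salted SHA-512 (many salts)", 0.81805),
    ("Raw SHA-256", 1.07281),
    ("sha512crypt (rounds=5000)", 1.10343),
    ("Mac OS X 10.7+ salted SHA-512 (one salt)", 0.80827),
]


def geometric_mean(values):
    """Exponential of the mean of the logs; the natural average for ratios."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def log_median(values):
    """Median taken in log space (assumption about how the figure was derived)."""
    return math.exp(statistics.median(math.log(v) for v in values))


def percent_change(ratio):
    """Convert a speed ratio into a signed percentage, e.g. 0.8 -> -20.0."""
    return (ratio - 1.0) * 100.0


all_values = [r for _, r in ratios]
# Same statistic with the two mscash (MD4) entries left out.
without_mscash = [r for name, r in ratios if not name.startswith("M$ Cache Hash MD4")]

gm_all = geometric_mean(all_values)
median_all = log_median(all_values)
gm_without_mscash = geometric_mean(without_mscash)

print(f"geometric mean: {gm_all:.5f}")
print(f"median:         {median_all:.5f}")
print(f"geometric mean without mscash: {gm_without_mscash:.5f}")
print(f"worst: {percent_change(min(all_values)):+.1f}%")
print(f"best:  {percent_change(max(all_values)):+.1f}%")
```

The first two lines should come out close to the 1.06194 and 1.03538 that relbench printed. The third shows how much of the overall gain depends on the two mscash entries.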

I'm not sure what to do with this, but here are some observations:

- This test should also have included tweaking BLOCKS and THREADS, but everything was left at default. Some auto-tuning like we do in OpenCL should be fairly high priority for CUDA. The outcome *may* be totally different with optimal values.

- The interesting formats are the slow ones. Phpass had a 10-11% slowdown, and md5crypt about the same. Mscash2 lost a percent; most others gained, more or less.

- Apparently, this is not super-important. You get up to about 11% gain if we disregard mscash.

- The OpenCL support for nvidia automatically picks your actual architecture, so it will likely have similar gains and losses. It's probably possible to pick an arch per format, if we want to.
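The per-format idea in the last point could look roughly like the following Python sketch, which picks an architecture for each slow format from the measured ratios. The 2% noise margin and the function name are assumptions for illustration only; nothing like this exists in JtR as far as this post goes.

```python
# "Real" ratios (sm_30 speed / sm_10 speed) for the slow formats in the table above.
slow_format_ratios = {
    "phpass MD5": 0.89869,
    "md5crypt": 0.88889,
    "M$ Cache Hash 2 (DCC2)": 0.99100,
    "sha256crypt": 1.04536,
    "sha512crypt": 1.10343,
    "WPA-PSK": 1.02549,
}


def pick_arch(ratio: float, margin: float = 0.02) -> str:
    """Choose an arch from a speed ratio; margin is an assumed noise threshold."""
    if ratio > 1.0 + margin:
        return "sm_30"
    if ratio < 1.0 - margin:
        return "sm_10"
    # Difference is within the assumed noise margin, so either build will do.
    return "either"


choices = {name: pick_arch(ratio) for name, ratio in slow_format_ratios.items()}

for name, arch in choices.items():
    print(f"{name:24s} {arch}")
```

With BLOCKS and THREADS tuned per card the ratios may shift, so any such table would need to be re-measured rather than hard-coded.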

Final note for OS X users: CUDA 5 for OS X has been released. All JtR formats work fine, and the old bug that gave me kernel panics if the GPU was set to dynamic switching is gone.

magnum