Date: Thu, 13 Jan 2011 09:58:24 +0100
From: Samuele Giovanni Tonon <samu@...uxasylum.net>
To: john-users@...ts.openwall.com
Subject: opencl sha1 jtr and others some experiments and some suggestion

Hello all,

first of all, sorry for this long email, but I'm at a dead end and I would like some advice, as well as to share with you some research I did. Also forgive me if I skip over some parts; I can go into the details in another email.

Thanks to the NTLM OpenCL patches for jtr, I've been working a bit on a raw SHA-1 OpenCL implementation for john. Without a good knowledge of OpenCL or jtr, and with my rusty C, I've come up with something that at least doesn't crash and doesn't output gibberish, but it's terribly slow.

So let me show you some benchmarks. This is the "original" raw-sha1:

../run/john -test --format=raw-sha1
Benchmarking: Raw SHA-1 [raw-sha1]... DONE
Raw:            4768K c/s real, 4816K c/s virtual

This is the one with OpenCL:

../run/john -test --format=raw-sha1-opencl
Benchmarking: Raw SHA-1 OpenCL [SHA-1]... OpenCL Platform: <<<ATI Stream>>> and device: <<<Juniper>>>
DONE
Many salts:     1478 c/s real, 1894 c/s virtual
Only one salt:  1961 c/s real, 1903 c/s virtual

Astonishingly slow, isn't it?

Well, having the luck of owning both an ATI card and an AMD processor, I can switch from GPU OpenCL to CPU OpenCL by just changing one parameter in clGetDeviceIDs. These are the results:

../run/john -test --format=raw-sha1-opencl
Benchmarking: Raw SHA-1 OpenCL [SHA-1]... OpenCL Platform: <<<ATI Stream>>> and device: <<<AMD Phenom(tm) II X4 945 Processor>>>
DONE
Many salts:     23833 c/s real, 19376 c/s virtual
Only one salt:  22902 c/s real, 18927 c/s virtual

As you can see, there's a 12x to 16x difference (c/s real) between "OpenCL with GPU" and "OpenCL with CPU", and roughly a 200x difference between "OpenCL with CPU" and the original CPU code. What and where is the main cause of this slowness?
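A quick way to check the ratios between the three john runs is to recompute them from the "c/s real" column. This is only a sketch: the variable and function names are made up for illustration, and it assumes john's "K" suffix means x1000.

```python
# Benchmark figures copied from the john runs above ("c/s real" column).
# Assumption: the "K" suffix in john's output means x1000.
ORIGINAL_CPU = 4768 * 1000                               # raw-sha1, plain CPU code
OPENCL_GPU = {"many salts": 1478, "one salt": 1961}      # Juniper
OPENCL_CPU = {"many salts": 23833, "one salt": 22902}    # Phenom II X4 945


def ratio(fast, slow):
    """Return how many times faster `fast` is than `slow`."""
    return fast / slow


if __name__ == "__main__":
    for key in OPENCL_GPU:
        print(f"{key}: OpenCL CPU vs OpenCL GPU   = "
              f"{ratio(OPENCL_CPU[key], OPENCL_GPU[key]):.1f}x")
        print(f"{key}: original CPU vs OpenCL CPU = "
              f"{ratio(ORIGINAL_CPU, OPENCL_CPU[key]):.0f}x")
        print(f"{key}: original CPU vs OpenCL GPU = "
              f"{ratio(ORIGINAL_CPU, OPENCL_GPU[key]):.0f}x")
```

With these figures the OpenCL CPU run comes out roughly 12x to 16x ahead of the OpenCL GPU run, and the original CPU code is about 200x ahead of the OpenCL CPU run.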
I don't expect the huge numbers I can get from oclhashcat (552M/s), since it's a totally different approach from john, but I don't expect it to go that slow either.

I did a bit of research and came up with two ideas that are bothering me: the slowness could be because I'm not "grouping" many cleartext passwords to process at the same time (hint: that #MD5_NUM_KEYS you can find in rawMD5_opencl_fmt.c), and because of the overhead of exchanging data between GPU and CPU.

I found a simple SHA-1 brute forcer on an OpenCL forum ( http://www.khronos.org/message_boards/viewtopic.php?f=37&t=2260 ). To this I added two kinds of OpenCL kernel: one I found on royger.org (just search "sha1 opencl" on Google) and the other from pyrit (GPU WPA cracker). Both kernels have been revised to fit the code; the pyrit one is a bit faster than the one from royger. These are the results:

##with royger kernel
$./bfsha
OpenCL Platform: <<<ATI Stream>>> and device: <<<Juniper>>>
max group size 256
Computed 26214400.000000 hashes.
65.110000

##with pyrit kernel
$./bfsha
OpenCL Platform: <<<ATI Stream>>> and device: <<<Juniper>>>
max group size 256
Computed 26214400.000000 hashes.
61.570000

So basically only about 4 seconds between the two of them. Now we are running at ~400k/s to ~430k/s, which is a bit better... but, again, if I run on OpenCL CPU instead of GPU this is what I get:

##with royger kernel
$./bfsha 1
Using CPU
OpenCL Platform: <<<ATI Stream>>> and device: <<<AMD Phenom(tm) II X4 945 Processor>>>
max group size 1024
Computed 26214400.000000 hashes.
19.610000

##with pyrit kernel
$./bfsha 1
Using CPU
OpenCL Platform: <<<ATI Stream>>> and device: <<<AMD Phenom(tm) II X4 945 Processor>>>
max group size 1024
Computed 26214400.000000 hashes.
16.820000

Again the CPU is faster, by roughly 3.5x this time! From what I can see, it looks like it's not a CL kernel problem but rather something I'm doing wrong... maybe the CPU <-> GPU data exchange, or something else really big that I'm missing.
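To see how much the "grouping" idea alone could matter, here is a back-of-the-envelope sketch. It converts the bfsha timings into hashes per second, then applies a simple cost model in which every kernel launch pays a fixed overhead plus a per-hash cost. The model and all function names are illustrative assumptions, not measurements. In particular it assumes that the jtr OpenCL format hashes one key per launch (so that 1/1478 s is the cost of a single-key launch), and that the pyrit-kernel bfsha run approximates the pure per-hash cost on the GPU.

```python
HASHES = 26_214_400      # "Computed 26214400.000000 hashes."
GPU_SECONDS = 61.57      # bfsha, pyrit kernel, Juniper
CPU_SECONDS = 16.82      # bfsha, pyrit kernel, Phenom II via OpenCL CPU
JTR_GPU_CPS = 1478       # john raw-sha1-opencl on the GPU, "many salts", c/s real


def throughput(hashes, seconds):
    """Hashes per second for a timed run."""
    return hashes / seconds


def batched_rate(batch, overhead_s, per_hash_s):
    """Hashes per second when each launch handles `batch` keys
    and costs a fixed `overhead_s` plus `per_hash_s` per key."""
    return batch / (overhead_s + batch * per_hash_s)


gpu_rate = throughput(HASHES, GPU_SECONDS)
cpu_rate = throughput(HASHES, CPU_SECONDS)

# Assumption: bfsha's GPU rate is close to the pure per-hash cost.
per_hash = 1.0 / gpu_rate
# Assumption: john launches one kernel per key, so one launch takes 1/1478 s.
overhead = 1.0 / JTR_GPU_CPS - per_hash

if __name__ == "__main__":
    print(f"bfsha GPU: {gpu_rate:,.0f} hashes/s")
    print(f"bfsha CPU: {cpu_rate:,.0f} hashes/s ({cpu_rate / gpu_rate:.1f}x the GPU)")
    for batch in (1, 64, 1024, 65536):
        rate = batched_rate(batch, overhead, per_hash)
        print(f"batch {batch:>6}: {rate:,.0f} hashes/s")
```

Under this model a batch of one reproduces the 1478 c/s figure, and larger batches approach the bfsha GPU rate but can never exceed it, so batching by itself would not explain the remaining gap to the CPU device.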
From the jtr point of view, I'm somehow missing a way to "queue" big chunks of cleartext passwords as I saw in rawMD5_opencl, or maybe I'm missing the whole structure of jtr and how it works.

I'm attaching both pieces of code I wrote. They have been tested on Linux, Debian sid and Ubuntu 10.10, both 64-bit.

For jtr you need the latest jumbo patch plus the OpenCL patch; then add the .cl files to the run/ dir, and manually add the format to the Makefile and register fmt_opencl_rawSHA1 in john.c. It has been tested and works on both NVIDIA and ATI ( make linux-x86-64-opencl ).

As for bfsha, that code runs well on ATI but segfaults on NVIDIA and I don't know why; I guess I should read some more OpenCL documentation.

If you have time, please share your ideas and suggestions. I'm doing this mostly for fun and to spend some of my free time; once there are some "good" benchmarks I'm willing to post the code as a patch.

I know SHA-1 is not the best hash for GPU computing since it's not highly parallel, but I don't expect it to perform this slowly: even if pyrit uses threading and takes advantage of the WPA2 algorithm, that still doesn't explain why it gets these numbers in its benchmark:

#1: 'CAL++ Device #1 'ATI JUNIPER'': 41895.7 PMKs/s (RTT 2.7)
#2: 'CPU-Core (SSE2)': 644.9 PMKs/s (RTT 3.1)
#3: 'CPU-Core (SSE2)': 642.3 PMKs/s (RTT 3.1)
#4: 'CPU-Core (SSE2)': 633.4 PMKs/s (RTT 3.1)

As you can see, even though this is using CAL instead of OpenCL (CAL is about 2 times faster than OpenCL), there is still a huge gap between CPU and GPU.

Many thanks for taking the time to read this big wall of text.

Regards
Samuele

Download attachment "raw-sha-opencl.tar.gz" of type "application/x-compressed-tar" (4313 bytes)
Download attachment "shabf.tar.gz" of type "application/x-compressed-tar" (4956 bytes)
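To put the pyrit figures on the same scale as raw SHA-1 rates, the sketch below converts PMKs/s into an approximate number of SHA-1 block computations per second. It is an estimate under stated assumptions, none of which come from the email itself: the standard WPA/WPA2 derivation (PBKDF2-HMAC-SHA1, 4096 iterations, 256-bit output, hence two 160-bit output blocks) and an implementation that precomputes the HMAC key pads, so each HMAC costs two SHA-1 block computations. The function names are illustrative.

```python
PBKDF2_ITERATIONS = 4096     # assumption: standard WPA/WPA2 iteration count
OUTPUT_BLOCKS = 2            # assumption: 256-bit PMK needs two 160-bit SHA-1 outputs
COMPRESSIONS_PER_HMAC = 2    # assumption: inner + outer hash, key pads precomputed


def sha1_blocks_per_pmk():
    """Approximate SHA-1 block computations needed for one PMK."""
    return PBKDF2_ITERATIONS * OUTPUT_BLOCKS * COMPRESSIONS_PER_HMAC


def equivalent_sha1_rate(pmks_per_second):
    """Approximate SHA-1 block computations per second for a PMK rate."""
    return pmks_per_second * sha1_blocks_per_pmk()


if __name__ == "__main__":
    # PMK rates copied from the pyrit benchmark above.
    for name, pmks in (("ATI JUNIPER (CAL++)", 41895.7), ("CPU-Core (SSE2)", 644.9)):
        print(f"{name}: {equivalent_sha1_rate(pmks) / 1e6:.1f}M SHA-1 blocks/s")
```

Under these assumptions the Juniper figure corresponds to several hundred million SHA-1 block computations per second, the same order of magnitude as the oclhashcat number and far above what either OpenCL experiment reaches.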