john-users - opencl sha1 jtr and others some experiments and some suggestion

Message-ID: <4D2EBEB0.9050200@linuxasylum.net>
Date: Thu, 13 Jan 2011 09:58:24 +0100
From: Samuele Giovanni Tonon <samu@...uxasylum.net>
To: john-users@...ts.openwall.com
Subject: opencl sha1 jtr and others some experiments and some suggestion

Hello all,
first of all, sorry for this long email, but I've hit a dead end and
would like some advice, as well as to share some research I did.

Also forgive me if I skip over some parts; I can go into the details in
another email.


Thanks to the NTLM OpenCL patches for JtR, I've been working on a raw
SHA-1 OpenCL implementation for john.

Without good knowledge of OpenCL or JtR, and with my rusty C, I've
come up with something that at least doesn't crash and doesn't output
gibberish, but it's terribly slow.

So let me show you some benchmarks.

This is the "original" raw-sha1:
../run/john -test --format=raw-sha1
Benchmarking: Raw SHA-1 [raw-sha1]... DONE
Raw:    4768K c/s real, 4816K c/s virtual

This is the one with OpenCL:

../run/john -test --format=raw-sha1-opencl
Benchmarking: Raw SHA-1 OpenCL [SHA-1]... OpenCL Platform: <<<ATI Stream>>> and device: <<<Juniper>>>
DONE
Many salts:     1478 c/s real, 1894 c/s virtual
Only one salt:  1961 c/s real, 1903 c/s virtual

Astonishingly slow, isn't it?

Well, having the luck of owning both an ATI GPU and an AMD processor
lets me switch from GPU OpenCL to CPU OpenCL by changing just one
parameter in clGetDeviceIDs. These are the results:

../run/john -test --format=raw-sha1-opencl
Benchmarking: Raw SHA-1 OpenCL [SHA-1]... OpenCL Platform: <<<ATI Stream>>> and device: <<<AMD Phenom(tm) II X4 945 Processor>>>
DONE
Many salts:     23833 c/s real, 19376 c/s virtual
Only one salt:  22902 c/s real, 18927 c/s virtual

As you can see, OpenCL on the CPU is roughly 12x to 16x faster than
OpenCL on the GPU, yet still about 200x slower than the original CPU
code.
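For reference, the ratios between the benchmark figures above can be
worked out directly. This is only a small arithmetic sketch: it uses
the "real" c/s values as printed, and it assumes that the "K" in
"4768K" means a factor of 1000.

```python
# Benchmark figures copied from the john -test runs above, in c/s (real).
ORIGINAL_CPU = 4768 * 1000  # printed as "4768K"; K is assumed to mean 1000
OPENCL_GPU = {"many salts": 1478, "one salt": 1961}
OPENCL_CPU = {"many salts": 23833, "one salt": 22902}


def ratio(fast, slow):
    """Return how many times faster `fast` is than `slow`."""
    return fast / slow


for label in OPENCL_GPU:
    cpu_vs_gpu = ratio(OPENCL_CPU[label], OPENCL_GPU[label])
    native_vs_cpu = ratio(ORIGINAL_CPU, OPENCL_CPU[label])
    native_vs_gpu = ratio(ORIGINAL_CPU, OPENCL_GPU[label])
    print(f"{label}: OpenCL CPU vs OpenCL GPU {cpu_vs_gpu:.1f}x, "
          f"original vs OpenCL CPU {native_vs_cpu:.0f}x, "
          f"original vs OpenCL GPU {native_vs_gpu:.0f}x")
```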

What, and where, is the main cause of this slowness? I don't expect the
huge numbers I get from oclhashcat (552M/s), since it takes a totally
different approach from john, but I don't expect it to be this slow
either.

I did a bit of research and came up with two ideas that bother me:
the slowness could be because I'm not "grouping" many cleartext
passwords to process at the same time (hint: that #MD5_NUM_KEYS you can
find in rawMD5_opencl_fmt.c), and because of the overhead of
exchanging data between GPU and CPU.
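To illustrate why grouping could matter, here is a toy cost model, not
a measurement. It assumes every kernel launch pays a fixed overhead
(transfer plus launch) and then a small per-hash cost. The function
name and both constants are hypothetical, chosen only to show the shape
of the curve.

```python
def throughput(batch_size, launch_overhead_s, per_hash_s):
    """Hashes per second when each launch processes `batch_size` keys.

    Model: time per launch = fixed overhead + batch_size * per-hash time.
    """
    return batch_size / (launch_overhead_s + batch_size * per_hash_s)


# Hypothetical numbers, for illustration only (not measured on any device):
LAUNCH_OVERHEAD_S = 500e-6  # assumed fixed cost of one transfer + launch
PER_HASH_S = 2e-9           # assumed on-device cost of one hash

for batch in (1, 256, 65536, 1 << 20):
    rate = throughput(batch, LAUNCH_OVERHEAD_S, PER_HASH_S)
    print(f"batch={batch:>8}: {rate:,.0f} hashes/s")
```

In this model a batch of one key can never exceed 1 / overhead hashes
per second, however fast the kernel is, while large batches approach
the 1 / per-hash limit. Whether that is what actually happens in my
code would need real timing of the transfers and the kernel.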

I found a simple SHA-1 brute forcer on an OpenCL forum (
http://www.khronos.org/message_boards/viewtopic.php?f=37&t=2260 ).

To this I added two OpenCL kernels: one I found on royger.org
(just search "sha1 opencl" on Google) and the other from pyrit (a GPU
WPA cracker).

Both kernels have been revised to fit the code; the pyrit one is a bit
faster than the one from royger. These are the results:

##with royger kernel
$./bfsha
OpenCL Platform: <<<ATI Stream>>>
 and device: <<<Juniper>>>
max group size 256
Computed 26214400.000000 hashes.
65.110000

##with pyrit kernel
$./bfsha
OpenCL Platform: <<<ATI Stream>>>
 and device: <<<Juniper>>>
max group size 256
Computed 26214400.000000 hashes.
61.570000

So basically only about 4 seconds between the two of them.
Now we are running at roughly 400k to 430k hashes/s, which is a bit
better, but again, if I run on the OpenCL CPU device instead of the GPU,
this is what I get:

##with royger kernel
$./bfsha 1
Using CPU

OpenCL Platform: <<<ATI Stream>>>
 and device: <<<AMD Phenom(tm) II X4 945 Processor>>>
max group size 1024
Computed 26214400.000000 hashes.
19.610000

##with pyrit kernel
$./bfsha 1
Using CPU

OpenCL Platform: <<<ATI Stream>>>
 and device: <<<AMD Phenom(tm) II X4 945 Processor>>>
max group size 1024
Computed 26214400.000000 hashes.
16.820000

Again the CPU is faster, this time by about 3.5x!
From what I can see, it looks like it's not an OpenCL kernel problem but
rather something I'm doing wrong: maybe the CPU <-> GPU data exchange,
or something else really big that I'm missing.
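For reference, the four bfsha runs above convert to hashes per second
as follows. This is a small arithmetic sketch; it assumes the number
printed after "Computed ... hashes." is the elapsed time in seconds.

```python
HASHES = 26_214_400  # "Computed 26214400.000000 hashes." in every run

# Elapsed time per run as printed; the unit is assumed to be seconds.
RUNS = {
    ("GPU", "royger"): 65.11,
    ("GPU", "pyrit"): 61.57,
    ("CPU", "royger"): 19.61,
    ("CPU", "pyrit"): 16.82,
}


def hashes_per_second(elapsed_s):
    """Throughput of one run, given its elapsed time in seconds."""
    return HASHES / elapsed_s


for (device, kernel), elapsed in RUNS.items():
    rate_k = hashes_per_second(elapsed) / 1e3
    print(f"{device} {kernel}: {rate_k:.0f}k hashes/s")

for kernel in ("royger", "pyrit"):
    speedup = RUNS[("GPU", kernel)] / RUNS[("CPU", kernel)]
    print(f"{kernel}: CPU run is {speedup:.1f}x faster than GPU run")
```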

From the JtR point of view, I'm somehow missing a way to "queue" big
chunks of cleartext passwords, as I saw in rawMD5_opencl, or maybe I'm
missing the whole structure of JtR and how it works.

I'm attaching both pieces of code. They have been tested on Linux
(Debian sid and Ubuntu 10.10, both 64-bit). For JtR you need the latest
jumbo patch plus the OpenCL patch; then add the .cl files to the run/
dir, manually add the format to the Makefile, and register
fmt_opencl_rawSHA1 in john.c. It has been tested and works on both
NVIDIA and ATI (make linux-x86-64-opencl).

As for bfsha, that code runs well on ATI but segfaults on NVIDIA and I
don't know why; I guess I should read some more OpenCL documentation.

If you have time, please share your ideas and suggestions. I'm doing
this mostly for fun and to spend some of my free time; once there are
some "good" benchmarks I'm willing to post it as a patch.

I know SHA-1 is not the best hash for GPU computing since it's not
highly parallel, but I don't expect it to perform this slowly: even if
pyrit uses threading and takes advantage of the WPA2 algorithm, this
still doesn't explain why I get these numbers on its benchmark:

#1: 'CAL++ Device #1 'ATI JUNIPER'': 41895.7 PMKs/s (RTT 2.7)
#2: 'CPU-Core (SSE2)': 644.9 PMKs/s (RTT 3.1)
#3: 'CPU-Core (SSE2)': 642.3 PMKs/s (RTT 3.1)
#4: 'CPU-Core (SSE2)': 633.4 PMKs/s (RTT 3.1)

As you can see, even though I'm using CAL instead of OpenCL here (CAL
is 2 times faster than OpenCL), there is still a huge gap between CPU
and GPU.

Many thanks for taking the time to read this big wall of text.

Regards
Samuele

Download attachment "raw-sha-opencl.tar.gz" of type "application/x-compressed-tar" (4313 bytes)

Download attachment "shabf.tar.gz" of type "application/x-compressed-tar" (4956 bytes)