john-users - Re: GSOC - GPU for hashes

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <AANLkTinDZAEJ1DPtU4Jnxj0KSxTsG8nahLZJr13f1JMx@mail.gmail.com>
Date: Mon, 21 Mar 2011 23:41:24 +0200
From: Milen Rangelov <gat3way@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: GSOC - GPU for hashes

Sorry to hijack a thread here once again...but I am really interested in
development of open-source GPU cracking software  :)

I am developing my own opensource GPU cracker  (much "younger" than JtR
eheh) and it apparently follows somehow similar pattern. I started using
OpenSSL, then done some SSE2 work, then SSE2-vectorized bitslice DES until I
finally got to the point where I had to start implementing GPU cracking. I
guess sharing my experience would help so that you do not repeat my mistakes
(that took me weeks of experimenting and lots of headache and nerves).

First, about choosing the platform. Prior to that, I had some experience
with Brook+ and some time spent on exprimenting with CAL/IL. My idea was to
invest some time improving my CAL knowledge and start learning CUDA, because
it logically seemed the best choice in terms of performance. Switching to
OpenCL was a rather spontaneous thing, initially a product of a mere
curiousity. I don't regret that...OK, except one thing.

OpenCL is rather easy compared to CAL and much more powerful compared to
Brook+. The learning curve is rather steep. CUDA seems to be more powerful
and flexible, yet OpenCL is probably the best way to get into GPGPU stuff.
Next  about the performance. There are some articles, benchmarks and stuff
maintaining the idea that OpenCL is somehow slower than either CUDA or CAL.
On ATI side that's not correct and I have proven it by achieving higher
speeds than ighashgpu and whitepixel, both written in CAL/IL. Well-optimized
OpenCL code can beat hand-written IL kernels and ATI's OpenCL compiler is
rather good at optimizing stuff. OTOH as far as NVidia is concerned, I had
lots of issues. I just cannot reach the top speeds that some CUDA cracking
software achieve no matter how hard I try optimizing the kernels. Also I had
LOTS of issues with opencl on Nvidia. For example I create a number of
threads that operate in their own context and have their own queue. Until
recently, Nvidia's OpenCL implementation was not thread-safe so I was
limited to a single thread/context/queue. Sometimes, clBuildProgram()
crashed on Nvidia for perfectly correct code. Some OCL functions like
rotate() do not work at all on NVidia. You always get surprises like some
function not working properly or another returning wrong results.

Code is portable across platforms with OpenCL, but performance is not, this
forced me to write separate kernels for ATI and NVidia. There are some
hardware differences that forced me to do so. People developing oclHashcat
got further on and wrote a separate CUDA-based program for Nvidia and a
separate OpenCL-based program for ATI. I can well understand this - IMO this
turns out to be the best in terms of performance.

About candidate generation and checking of calculated hashes - I would
advise that you offset this almost completely on GPUs, even for slow hashes.
The host-device transfers are the worst bottleneck, it just completely
sucks. After a lot of redesigning and coding, I got to this point where I
invoke a kernel with NDRange of say 500.000 (that means 500.000*vector_size
~ 2M candidates), check hashes almost completely in GPU code (direct
comparison or bitmaps in case multihashing is used), then just transfer 4
bytes indicating that I have a match or not. If some candidate passes the
GPU checks, I transfer another 500.000*4=2MB over PCIe, then again (the
number of hashes that passed the check)*(digest size). On the host side, I
do an indexed search on the hash list. Most of the time I just transfer 4
bytes after each kernel invocation, especially when cracking a single hash.
This is fast....well not. I have profiled the code using AMD's Stream Kernel
Analyzer. Guess what - calculating 2 million hashes on GPU is almost 25x
faster than transfering just 4 bytes over the PCIe bus !!!!! At least on
Radeon HD6870. That means I waste about 4% of the time  transferring just 4
bytes from GPU to host.  So minimize those transfers as much as possible.
Even for "slow" hashes. PCIe transfers are a real pain. Always transferring
calculated hashes to host for comparison is pure madness, this would never
get you to utilizing even 10% of the GPU resources.

There is another big problem with what you call "slow" hashes. It comes out
from the fact that in GPUs the concept of call stack does not exist. Yes you
can declare functions, but those are always inlined. For an interated
algorithm like MD5(Unix) this means inlining the md5 stuff 1000 times....and
as you might have guessed, compilation time is high. It can take about 20-30
minutes. It is the same problem for OpenCL, IL and I guess CUDA. The only
solution is using precompiled binary kernel and that's where the fun starts.
With CUDA you've got to compile cubins for different compute capability
architectures, for ATI -  binary kernels for literarily all AMD models you
need to support. Integrating this with the autoconf/automake stuff is again
a bitch....I am still working on that. Anyway, I somehow got used to the
thought that my program would build for hours, probably days from code.
Distributing precompiled binaries is something I consider inevitable.

And finally, some algorithms just hate GPUs :) The worst of them is bcrypt
which I doubt will be implemented on GPUs in the next years. But even DES is
bad enough. I have spent some days trying to implement efficiently DES on
OpenCL. The "standart" variant with s-box lookup tables stuffed into
constant memory is just slow. OTOH, implementing bitslice DES on GPUs looks
like a very big challenge, I guess harder than I can handle. There are
numerous problems with bitslice DES on GPUs, first one being the high GPR
usage. Second is that it's hard to convert the candidates input to a bit
vectors as there are no nice native functions like SSE2's _mm_movemask_epi8
:)


Anyway, good luck , hope that my (blatant offtopic) post was useful. I would
really like to see more opensource GPU cracking software. I also hope you
would find better ways to deal with those challenges than me and don't lose
time in repeating my stupid mistakes :)
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.