john-users - Re: GSOC - GPU for hashes

Message-ID: <1301053402.14549.77.camel@cthulhu.linuxasylum.net>
Date: Fri, 25 Mar 2011 12:43:22 +0100
From: Samuele Giovanni Tonon <samu@...uxasylum.net>
To: john-users@...ts.openwall.com
Subject: Re: GSOC - GPU for hashes

On Thu, 2011-03-24 at 19:32 +0100, Łukasz Odzioba wrote: 
> On 21 March 2011 18:50 Samuele Giovanni Tonon <samu@...uxasylum.net> wrote:
> > There are already some patches for OpenCL, though at the moment they
> > are still experiments and far from good, but at least they work.
> > If you want, I can give you the details of the rawSha1 and
> > Nsldap-sha formats I did: what is working and what is troubling me at the moment.
> >
> > Regards
> > Samuele
> 
> 
> Samuele, I will be grateful if you can tell us more about your
> experience in JtR gpu programming.
> Detailed information can help us go deeper into the problem, so when you
> have some free time, please write about it.

Well, this is a long email with a lot of data in it, so take some free
time to examine it and learn from my mistakes :-)

OpenCL is rather easy: the API is well documented and there are roughly
20 functions. The hard part is understanding how the whole thing works,
how data transfer from CPU to GPU works, and that CL code needs to work
on "static" data because it can't access what you have in CPU RAM, so
forget pointers (well, not exactly: there are tricks to do that, but
this is beyond the scope of this email).

Debugging CL code is a real pain: forget a useful printf, forget an
easy-to-use IDE. When everything is syntactically correct and you still
don't get the right result, good luck my friend, it's time to get pen
and paper.

I'll start with some code for salted SHA-1 I did recently. You can put
the files in the src dir after applying the jumbo-12 and OpenCL patches
and they should work fine. I'll use this format because it is the only
one where GPU computing shines compared with the CPU (5 times faster),
and it gives an example of how JtR works on salted hashes.

I'm still learning GPGPU, so maybe some things are obvious to you, but
for the sake of the whole discussion I'll put everything here.
First of all, I'm on ATI with an AMD CPU. This makes it easier for me to
debug the code, because I can emulate the whole thing on the CPU simply
by switching one function call from:
    opencl_init("$JOHN/ssha_opencl_kernel.cl", CL_DEVICE_TYPE_GPU);
to
    opencl_init("$JOHN/ssha_opencl_kernel.cl", CL_DEVICE_TYPE_CPU);
and then I can use gdb/ddd to debug problems.
I have yet to test a good profiler for Linux; I hope someone can point
me to what they are using and whether it is worth it.

The CL kernel has 3 args:
an array of cleartext passwords and salts, a data structure telling the
kernel how many passwords there are (SSHA_NUM_KEYS) and how long they
are (SHA_BLOCK), and the array to put the results in.

The CL kernel is a merge of the one used in pyrit (a WPA cracker) plus
some tweaks I made after looking at the MD5 OpenCL implementation.
Basically I enqueue SSHA_NUM_KEYS entries, each of SHA_BLOCK size, pad
them to fit the SHA-1 algorithm, and return each hash as five uint
words. The output is ordered so that the first word (1/5 of the hash)
of every candidate password comes at the beginning of the buffer, and
the other four words follow.
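
A minimal sketch of that word-major output layout, in plain C rather
than kernel code. The names HASH_WORDS, out_index and store_hash are
hypothetical and not taken from the attached kernel, which may index
its output differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_WORDS 5  /* a SHA-1 digest is five 32-bit words */

/* Index of word `word` of key `key` in a word-major output buffer:
 * all first words come first, then all second words, and so on. */
static size_t out_index(size_t key, size_t word, size_t num_keys)
{
    assert(key < num_keys && word < HASH_WORDS);
    return word * num_keys + key;
}

/* Scatter one five-word digest into the shared output buffer. */
static void store_hash(uint32_t *out, size_t key, size_t num_keys,
                       const uint32_t digest[HASH_WORDS])
{
    for (size_t w = 0; w < HASH_WORDS; w++)
        out[out_index(key, w, num_keys)] = digest[w];
}
```

With this layout the first num_keys entries of the buffer hold the
first word of every hash, so they can be read back as one contiguous
block.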

This way the first check, whether 1/5 of the hash matches, is very
fast. If it matches, I get the remaining 4 words and compare them; if
everything matches, the password has been cracked.
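
The two-stage check can be sketched like this, assuming a word-major
result buffer (word w of key k stored at out[w * num_keys + k]). The
function names are hypothetical, and for simplicity the whole buffer
is assumed to be in host memory already; in the real code only the
first num_keys words would need to be fetched for stage 1.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_WORDS 5  /* a SHA-1 digest is five 32-bit words */

/* Stage 1: scan only the first word of every computed hash.
 * Returns the index of the first key whose first word matches,
 * or -1 if none does. */
static long find_candidate(const uint32_t *out, size_t num_keys,
                           const uint32_t target[HASH_WORDS])
{
    for (size_t k = 0; k < num_keys; k++)
        if (out[k] == target[0])
            return (long)k;
    return -1;
}

/* Stage 2: compare the remaining four words of one candidate.
 * Returns 1 on a full match, 0 otherwise. */
static int confirm_candidate(const uint32_t *out, size_t num_keys,
                             size_t key,
                             const uint32_t target[HASH_WORDS])
{
    for (size_t w = 1; w < HASH_WORDS; w++)
        if (out[w * num_keys + key] != target[w])
            return 0;
    return 1;
}
```

A first-word match alone can be a false positive, which is why stage 2
still has to look at the other four words before reporting a crack.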

This is done because of the slow data transfer between GPU and CPU. As
Milen pointed out, this is the real pain you will be dealing with.

Let me point it out again with a sample from the AMD SDK:
$ ./PCIeBandwidth

Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : Cayman
Host to device : 4.78665 GB/s 
Device to host : 5.64414 GB/s

As you can see, the speed differs depending on whether you "upload"
data to the GPU or "download" it, and you need to take that into
account when you code. You also need to understand that different
video cards have different data transfer rates. This one is an HD 6970,
which I found for about 200 euro; it replaced my "old" 5750, whose
"download" rate was lower than its upload rate.
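
To put numbers like these to use, here is a rough back-of-the-envelope
helper. It is only a sketch under stated assumptions: 1 GB is taken as
1e9 bytes (the SDK sample may define it differently), per-transfer
latency is ignored, and the function names are hypothetical.

```c
#include <stddef.h>

/* Estimated seconds needed to move `bytes` over a link that sustains
 * `gb_per_s` gigabytes per second (assuming 1 GB = 1e9 bytes). */
static double transfer_seconds(size_t bytes, double gb_per_s)
{
    return (double)bytes / (gb_per_s * 1e9);
}

/* Bytes read back per batch when `words_per_key` 32-bit words are
 * fetched for each of `num_keys` keys. */
static size_t readback_bytes(size_t num_keys, size_t words_per_key)
{
    return num_keys * words_per_key * 4;
}
```

For a hypothetical batch of 1,000,000 keys, reading back all five
words is 20 MB, roughly 3.5 ms at the 5.64414 GB/s measured above,
while reading back only the first word is 4 MB, roughly 0.7 ms.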

Even with this monster I can "feel" something is wrong, because the
video card is not stressed enough: when I run pyrit it segfaults
because of heat, while with john running I can watch movies with no
problem.

The code for salted SHA and raw SHA is basically the same, yet OpenCL
shines on salted SHA. Why? Because on the CPU, salted SHA needs an
extra SHA1_Update call to add the salt to the cleartext input.

The john-opencl routine for salted hashes works this way: for every
"chunk" of passwords, try this single salt.
At the moment this is done on the CPU side, in crypt_all:

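    for (i = 0; i < count; i++) {
        /* copy the candidate password into its SHA_BLOCK-sized slot */
        lenpwd = strlen(saved_key[i]);
        memcpy(&(inbuffer[i * SHA_BLOCK]), saved_key[i], SHA_BLOCK);
        /* append the salt right after the password */
        memcpy(&(inbuffer[i * SHA_BLOCK + lenpwd]), (unsigned char *)saved_salt, SALT_SIZE);
        /* mark the end of password + salt with the SHA-1 padding byte */
        lenpwd += SALT_SIZE;
        inbuffer[i * SHA_BLOCK + lenpwd] = 0x80;
    }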
So I send ssha_num_keys * (password + salt) to the kernel to compute.
If I could find a good way to send (ssha_num_keys * password) + salt
and do this
    memcpy(&(inbuffer[i*SHA_BLOCK+lenpwd]), (unsigned char *)saved_salt, SALT_SIZE);
in the kernel routine, I think I could speed things up a little more.
Unfortunately my tests so far have not been successful; any comment or
suggestion is really appreciated.
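
For discussion, here is a plain C emulation of what one work-item
might do if the salt were sent once and appended on the device. This
is not the attached kernel and not a tested solution: append_salt is a
hypothetical name, and the sketch assumes each password slot is
zero-padded so the length can be found by scanning for the first zero
byte. OpenCL C has no memcpy, so a real kernel would copy the salt
with a loop instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append `salt` and the 0x80 SHA-1 padding byte to the password held
 * in `slot` (a zero-padded buffer of `block_size` bytes).
 * Returns the length of password + salt. */
static size_t append_salt(unsigned char *slot, size_t block_size,
                          const unsigned char *salt, size_t salt_size)
{
    size_t len = 0;

    /* password length = offset of the first zero byte in the slot */
    while (len < block_size && slot[len] != 0)
        len++;

    /* password + salt + padding byte must fit in the slot */
    assert(len + salt_size + 1 <= block_size);

    memcpy(slot + len, salt, salt_size);
    slot[len + salt_size] = 0x80;
    return len + salt_size;
}
```

The host would then upload only the passwords per batch plus one
SALT_SIZE buffer per salt, instead of rebuilding and resending every
slot for each salt.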

Also, at the moment john's single mode doesn't work with my OpenCL
code because of some malloc problems, so you need to test it with
incremental and dictionary mode.

I think i have given you enough meat to put on the bbq.

Regards
Samuele 




Download attachment "opencl-ssha.tar.bz2" of type "application/x-bzip-compressed-tar" (4569 bytes)