john-dev - Re: Split kernel for OpenCL WPA-PSK

Message-ID: <4ca093e85c76dbe4d92dd0f42382c10c@smtp.hushmail.com>
Date: Fri, 09 Nov 2012 20:00:50 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Split kernel for OpenCL WPA-PSK

Solar,

how about upgrading Bull to the driver versions recommended by HashCat? 
It's Catalyst 12.8 and nvidia 304.32. Perhaps you could opt to use the 
versions (if not the actual packages) supplied with Ubuntu 12.10: 
Catalyst 12.9 and nvidia 304.43. See below.

On 11/08/2012 08:09 PM, magnum wrote:
> On 8 Nov, 2012, at 19:12 , magnum <john.magnum@...hmail.com> wrote:
>> Using device 0: Tahiti
>> Local worksize (LWS) 192, Global worksize (GWS) 196608
>> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
>> Raw:    66197 c/s real, 137970 c/s virtual
>
>> This code too does over 2.1 billion SHA1/second, but CPU post-processing nearly halves the speed (without OMP). So I'm in the process of moving all of that post-processing to GPU. It's just a couple HMACs more, so I hope to exceed 120K c/s with that in place.
>
> Lol, while digging into that post-processing, I found out that the (CPU side) prf_512() function of wpapsk.h did four times more work than needed. It produced an 80-byte key of which only 16 bytes were needed. Just with this fix, the Tahiti figure went up another 35%:
>
> Using device 0: Tahiti
> Local worksize (LWS) 256, Global worksize (GWS) 262144
> Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
> Raw:    89164 c/s real, 296207 c/s virtual
>
> This will affect CUDA too. Still, I'm proceeding with implementing all of that post-processing on GPU.
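
A minimal sketch of why truncating the PRF early saves work, assuming the standard 802.11i construction in which each HMAC-SHA1 round contributes 20 bytes of output. The names `prf_sha1`, `pmk` and `data` are illustrative and are not taken from wpapsk.h, and the inputs are dummy bytes rather than a real handshake. Four rounds give the 80 bytes mentioned above, while the first 16 bytes come entirely from round zero.

```python
import hashlib
import hmac

LABEL = b"Pairwise key expansion"

def prf_sha1(pmk: bytes, data: bytes, out_len: int) -> bytes:
    """Concatenate HMAC-SHA1(pmk, LABEL || 0x00 || data || counter) rounds."""
    out = b""
    counter = 0
    while len(out) < out_len:
        msg = LABEL + b"\x00" + data + bytes([counter])
        out += hmac.new(pmk, msg, hashlib.sha1).digest()  # 20 bytes per round
        counter += 1
    return out[:out_len]

# Dummy inputs (assumption): 32-byte PMK, 76 bytes of MAC addresses and nonces.
pmk = bytes(range(32))
data = bytes(range(76))

full = prf_sha1(pmk, data, 64)   # four HMAC rounds (80 bytes, cut to 64)
first = prf_sha1(pmk, data, 16)  # a single HMAC round covers 16 bytes
print(first == full[:16])
```

Both calls agree on the leading 16 bytes, so when only those bytes are used, three of the four HMAC rounds can be skipped.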

Done, but not committed yet. It now does NO post processing on CPU:

Using device 0: Tahiti
Local worksize (LWS) 128, Global worksize (GWS) 262144
Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
Raw:    129453 c/s real, 52428K c/s virtual
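
As a rough cross-check, the "over 2.1 billion SHA1/second" quoted earlier lines up with this result under a simple cost model. The model is an assumption, not a measurement: PBKDF2-HMAC-SHA1 with 4096 iterations, two 20-byte output blocks for the 32-byte key, and two SHA-1 compressions per HMAC once the key pads are precomputed. Setup work and the final HMACs are ignored.

```python
# Assumed cost model for one WPA-PSK candidate (not taken from the JtR source).
ITERATIONS = 4096      # PBKDF2 iteration count used by WPA-PSK
BLOCKS = 2             # 32-byte key needs two 20-byte SHA-1 output blocks
SHA1_PER_HMAC = 2      # inner + outer compression with precomputed pads

sha1_per_candidate = ITERATIONS * BLOCKS * SHA1_PER_HMAC

def sha1_rate(candidates_per_second: int) -> int:
    """Approximate SHA-1 compressions per second for a given c/s figure."""
    return candidates_per_second * sha1_per_candidate

print(sha1_per_candidate)            # SHA-1 compressions per candidate
print(sha1_rate(129453))             # rate implied by the benchmark above
print(round(89164 / 66197 - 1, 3))   # relative gain from the prf_512() fix
```

Under these assumptions, 129453 c/s corresponds to about 2.12 billion SHA-1 compressions per second, and the step from 66197 to 89164 c/s is the roughly 35% gain mentioned in the quoted message.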

That 129453 c/s is a tad faster than HashCat, unless Atom has tweaked it 
since the figure I found (otherwise he will now... OTOH I'm not done yet =). 
For comparison, Jumbo-7 does 42723 c/s out of the box. However, I now hit 
another nvidia bug (on the old 295.49):

Using device 0: GeForce GTX 570
Compilation log: ptxas application ptx input, line 11; error   : Module-scoped variables in .local state space are not allowed with ABI
ptxas fatal   : Ptx assembly aborted due to errors

Error building kernel. Returned build code: -42. DEVICE_INFO=130
OpenCL error (CL_INVALID_BINARY) in file (common-opencl.c) at line (151) - (clBuildProgram failed.)

It works fine with Fermi and Kepler with other versions of the driver, 
and even with the *same* version of the driver but using a 9600GT. It 
also works fine with AMD or Intel CPU drivers.

I get the same error with GPG after some performance tweaks. Haven't 
found a workaround yet. I'm SICK of chasing driver bugs. Perhaps I 
should learn CUDA instead. Where do I start?

If only I could figure out how to compile clcc on Bull. As it is, I have 
no way to look at the referenced PTX assembly. Or is there a way to keep 
the temp files?

magnum