john-dev - Re: Lukas - status report #2

Message-ID: <20120501043702.GA10186@openwall.com>
Date: Tue, 1 May 2012 08:37:02 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Lukas - status report #2

Lukas -

On Tue, May 01, 2012 at 06:13:01AM +0200, Lukas Odzioba wrote:
> 2012/5/1 Solar Designer <solar@...nwall.com>:
> > http://openwall.info/wiki/john/WPA-PSK
> >
> > Maybe you can add a specific example to it (a sample input file,
> > commands to run on it, their output) and link to it more prominently?
> 
> Of course I'll add an example.
> Do you have any suggestions on how to make it more prominent? Move it
> to the top of the page, or link to it in the GPU formats table?

Link from the GPU page, link from the tutorials page.  Maybe even have a
separate line for links to non-hashes pages on the main /wiki/john page.

> > Here's what I am getting with the code currently in magnum-jumbo:
> >
> > user@...l:~/john/magnum-jumbo/run$ ./john -te -fo=wpapsk-cuda
> > Benchmarking: wpapsk-cuda [GPU]... DONE
> > Raw:    17341 c/s real, 17341 c/s virtual
> >
> > This is 43% of hashcat's reported speed for this card.
> 
> GTX460 with sm_20 and threads=256 does ~15k, 10k by default.

Oh, I haven't tried any tuning yet.  Please do (on bull) and add to
doc/README-CUDA.  And post about this to john-dev, indeed.

> > user@...l:~/john/magnum-jumbo/run$ ./john -te -fo=wpapsk-opencl -pla=1
> > OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> > Using device 0: Tahiti
> > Optimal Group work Size = 128
> > Benchmarking: wpapsk-opencl [pbkdf2-sha1]... DONE
> > Raw:    64000 c/s real, 531692 c/s virtual
> >
> > This is quite nice.  hashcat is reported to do 158.1k c/s on 5970, so
> > our target speed for 7970 may be about 130k c/s.
> 
> I would be happier to see 80-90k; previously (with just the PMK
> calculation, the most time-consuming part) we had 90% of hashcat's
> speed. For now the difference will be even worse for super fast GPUs
> paired with a slow CPU. Besides, the CPU-side code utilizes only 1
> core. Do you have any ideas to get around it other than MPI?
> Alternatively, we could move all of that code to a second GPU kernel.

I don't understand what you're referring to.  I just took a look at
opencl_wpapsk_fmt.c and I don't see it doing much on the CPU.  Can you
point me at specific places in the code?

Do you think my run on the 7970 was somehow CPU-bound?  I doubt it.

> > user@...l:~/john/magnum-jumbo/run$ ./john -te -fo=wpapsk-opencl -pla=1 -dev=1
> > OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
> > Using device 1: AMD FX(tm)-8120 Eight-Core Processor
> > Optimal Group work Size = 16
> > Benchmarking: wpapsk-opencl [pbkdf2-sha1]... DONE
> > Raw:    2133 c/s real, 267 c/s virtual
> >
> > This is also reasonable, although a CPU-specific implementation using
> > the intrinsics should be much faster.
> 
> My current code is based on OpenSSL (it's not yet in jumbo), and it
> gives ~230 c/s on an i3 2100; in Aircrack-ng I get 510 c/s (it uses 4
> threads).
> As far as I know, OpenSSL is not super optimized, and with
> intrinsics we should get much better results, am I right?

OpenSSL is optimized reasonably well for its interface, but the
interface is not well-suited for password cracking.  We have to make
three calls to compute one SHA-1 hash, and the SHA1_Final() call
probably wastes some time cleaning the "sensitive data" out from memory.
What's more important, it only computes one hash at a time, whereas with
the intrinsics we can do 4 at a time (on SSE2 or better), or even more
with mixed instructions for greater instruction issue rates.

So, yes, we should get much better results with the intrinsics.
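
To illustrate the three-call pattern described above, here is a minimal
sketch.  It is written in Python rather than C, and it assumes that
hashlib's new/update/digest sequence is an acceptable stand-in for
OpenSSL's SHA1_Init()/SHA1_Update()/SHA1_Final(); the helper name and the
candidate list are made up for the example.

```python
import hashlib


def sha1_three_calls(data: bytes) -> bytes:
    """Hash one input using an init/update/final style interface."""
    ctx = hashlib.sha1()   # stand-in for SHA1_Init()
    ctx.update(data)       # stand-in for SHA1_Update()
    return ctx.digest()    # stand-in for SHA1_Final()


# Hypothetical candidate passwords, hashed strictly one at a time.
candidates = [b"password", b"letmein", b"12345678", b"qwertyui"]
digests = [sha1_three_calls(c) for c in candidates]
```

Each candidate costs three calls and yields one hash.  A SIMD
implementation would instead process four such inputs in a single pass,
which this sketch does not attempt to show.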

Am I correct that we compute SHA-1 8192 times per WPA password checked
(4096 iterations, and two SHA-1's per HMAC)?  If so, the 23M+ c/s at raw
SHA-1 that I am getting with the XOP code on one CPU core would
translate to over 2800 c/s at WPA.  That's on one core.  This does not
include overhead to implement HMAC and PBKDF2, though.  So maybe
something between 5000 and 10000 c/s on this FX-8120 CPU is realistic.
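
The per-password SHA-1 count and the estimate above can be checked with a
short sketch.  It is in Python for brevity and is not taken from the John
the Ripper code.  The function names, the sample passphrase and SSID are
made up, and two things are assumed: a 32-byte (256-bit) derived key, as
WPA-PSK uses, and two SHA-1 computations per HMAC, which holds only when
the key-dependent inner and outer states are precomputed.

```python
import hashlib
import hmac

SHA1_OUTPUT_BYTES = 20


def pbkdf2_hmac_sha1_counted(password, salt, iterations, dklen):
    """Plain PBKDF2-HMAC-SHA1 that also counts HMAC invocations."""
    hmac_calls = 0
    out = b""
    n_blocks = -(-dklen // SHA1_OUTPUT_BYTES)  # ceiling division
    for i in range(1, n_blocks + 1):
        u = hmac.new(password, salt + i.to_bytes(4, "big"), hashlib.sha1).digest()
        hmac_calls += 1
        t = int.from_bytes(u, "big")
        for _ in range(iterations - 1):
            u = hmac.new(password, u, hashlib.sha1).digest()
            hmac_calls += 1
            t ^= int.from_bytes(u, "big")
        out += t.to_bytes(SHA1_OUTPUT_BYTES, "big")
    return out[:dklen], hmac_calls


def estimate_wpa_cps(raw_sha1_cps, sha1_per_candidate):
    """Upper bound on WPA c/s, ignoring HMAC and PBKDF2 overhead."""
    return raw_sha1_cps / sha1_per_candidate


# Hypothetical inputs: passphrase "password", SSID "IEEE".
key, hmac_calls = pbkdf2_hmac_sha1_counted(b"password", b"IEEE", 4096, 32)

# The figure used in the message: 8192 SHA-1 computations per password.
cps_8192 = estimate_wpa_cps(23_000_000, 8192)             # about 2807

# Assumption: 2 SHA-1 computations per HMAC call counted above.
cps_counted = estimate_wpa_cps(23_000_000, 2 * hmac_calls)
```

Under these assumptions the 32-byte key needs two 20-byte PBKDF2 output
blocks, so the sketch counts 8192 HMAC calls, or 16384 SHA-1
computations, per password.  If that count applies, the single-core
figure would be about half of the 2800 c/s derived from 8192.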

Alexander