Date: Tue, 19 Mar 2013 12:16:53 -0300
From: Claudio André <claudioandre.br@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Idea to increase plaintext length for GPU based hashes

On 18-03-2013 22:21, magnum wrote:
> Another approach (not necessarily mutually exclusive with yours) would be to split the transfer. Let's say we have a work size of 1M. At, say, the 256K'th call to set_key(), it could initiate a transfer of this first fourth of keys to GPU. This transfer will not stall the host side; it will take place while we continue with the next 256K keys. And so on. If we can balance this properly we should get rid of much of the transfer delay. Maybe we should split it into 8 or 16 parts, maybe fewer.

Well, it was easy to implement (a rough version of) your idea. Good gain.
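
For reference, a minimal sketch of how the split transfer could look inside a format's set_key(); this is not the actual tree code, and the names (saved_plain, cl_saved_plain, queue, global_work_size) are only illustrative stand-ins for whatever the format really uses:

#include <string.h>
#include <CL/cl.h>

/* Externals assumed to exist in the format (illustrative names only): */
extern char *saved_plain;            /* host-side key buffer, fixed-width slots */
extern cl_mem cl_saved_plain;        /* device-side key buffer */
extern cl_command_queue queue;       /* the format's command queue */
extern size_t global_work_size;      /* GWS picked by auto-tune */

#define PLAINTEXT_LENGTH 32
#define XFER_CHUNKS 8                /* split the GWS into 8 partial transfers */

static void set_key(char *key, int index)
{
    /* Copy the candidate into its fixed-width slot on the host. */
    strncpy(saved_plain + index * PLAINTEXT_LENGTH, key, PLAINTEXT_LENGTH);

    /* Every GWS/XFER_CHUNKS keys, start a non-blocking transfer of the
     * chunk we just finished filling; the copy overlaps with the host
     * continuing to run set_key() for the next chunk. */
    if ((index + 1) % (global_work_size / XFER_CHUNKS) == 0) {
        size_t chunk = (global_work_size / XFER_CHUNKS) * PLAINTEXT_LENGTH;
        size_t offset = (size_t)(index + 1) * PLAINTEXT_LENGTH - chunk;

        clEnqueueWriteBuffer(queue, cl_saved_plain, CL_FALSE /* async */,
                             offset, chunk, saved_plain + offset,
                             0, NULL, NULL);
        clFlush(queue);  /* nudge the runtime to start the copy right away */
    }
}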

-- From:
(unstable)$ ../run/john -fo:raw-sha256-opencl -t
OpenCL platform 0: AMD Accelerated Parallel Processing, 2 device(s).
Device 0: Juniper (AMD Radeon HD 6700 Series)
Max local worksize 256, Optimal local worksize 128
(to avoid this test on next run, put "rawsha256_LWS = 128" in john.conf, section [Options:OpenCL])
Local worksize (LWS) 128, global worksize (GWS) 1310720
Benchmarking: Raw SHA-256 (pwlen < 32) [OpenCL (inefficient, development use mostly)]... DONE
Raw:    13653K c/s real, 47953K c/s virtual

-- To:
(bleeding)$ ../run/john -fo:raw-sha256-opencl -t
Device 0: Juniper (AMD Radeon HD 6700 Series)
Local worksize (LWS) 128, global worksize (GWS) 1310720
Benchmarking: Raw SHA-256 (pwlen < 32) [OpenCL (inefficient, development use mostly)]... DONE
Raw:    17096K c/s real, 36408K c/s virtual

$ ../run/john -fo:cisco4-opencl -t
Device 0: Juniper (AMD Radeon HD 6700 Series)
Local worksize (LWS) 128, global worksize (GWS) 1310720
Benchmarking: Cisco "type 4" hashes SHA-256 [OpenCL (inefficient, development use mostly)]... DONE
Raw:    17554K c/s real, 39321K c/s virtual

------
But this is not enough to do the trick. Below, I measure only the GPU 
part during auto-tune. I expect to be at 113M, not 17M.
Is set_key() alone to blame? BTW: auto-tune on unstable runs at 19M.
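
In case it helps reproduce the "crypt:" timings below: they can be obtained with OpenCL event profiling. A generic sketch (not the actual auto-tune code; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE):

#include <CL/cl.h>

/* Hypothetical helper: time one kernel invocation in milliseconds using
 * OpenCL event profiling. */
static double kernel_ms(cl_command_queue queue, cl_kernel kernel,
                        size_t gws, size_t lws)
{
    cl_event ev;
    cl_ulong start, end;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, &lws, 0, NULL, &ev);
    clWaitForEvents(1, &ev);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(ev);

    return (end - start) / 1000000.0;   /* nanoseconds -> milliseconds */
}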

$ GWS=0 STEP= DETAILS= ../run/john -fo:cisco4-opencl -t
Device 0: Juniper (AMD Radeon HD 6700 Series)
Calculating best global worksize (GWS) for LWS=128 and max. 1.0 s duration.

Raw speed figures including buffer transfers:
pass xfer: 0.26 ms, crypt: 0.29 ms, result xfer: 0.66 ms
gws:     65536      54107339 c/s    1.211 ms per crypt_all()+
pass xfer: 0.49 ms, crypt: 0.51 ms, result xfer: 0.79 ms
gws:    131072      73165517 c/s    1.791 ms per crypt_all()+
pass xfer: 0.10 ms, crypt: 1.00 ms, result xfer: 1.40 ms
gws:    262144     104964874 c/s    2.497 ms per crypt_all()+
pass xfer: 0.24 ms, crypt: 1.98 ms, result xfer: 2.60 ms
gws:    524288     108720820 c/s    4.822 ms per crypt_all()+
pass xfer: 0.39 ms, crypt: 3.94 ms, result xfer: 4.96 ms
gws:   1048576     112859326 c/s    9.291 ms per crypt_all()+
pass xfer: 0.76 ms, crypt: 7.87 ms, result xfer: 9.91 ms
gws:   2097152     113157718 c/s   18.533 ms per crypt_all()
Optimal global worksize 1048576
(to avoid this test on next run, put "rawsha256_GWS = 1048576" in john.conf, section [Options:OpenCL])
Local worksize (LWS) 128, global worksize (GWS) 1048576
Benchmarking: Cisco "type 4" hashes SHA-256 [OpenCL (inefficient, development use mostly)]... DONE
Raw:    16930K c/s real, 35720K c/s virtual
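
Sanity check on the numbers: the auto-tune c/s is basically GWS divided by the crypt_all() duration (transfers included), e.g. 1048576 keys / 9.291 ms ~= 112.9M c/s, which matches the table. The 16930K "real" figure additionally pays for the host side (set_key() and friends), so presumably that is where the gap between 113M and 17M comes from.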

Claudio

