Date: Tue, 12 Apr 2011 01:23:10 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: sha256 format patches On Mon, Apr 11, 2011 at 10:47:48PM +0200, ?ukasz Odzioba wrote: > fast hashes: classic sha256 with 64 iterations In the context of MD* and SHA* functions, these 64 are normally called "steps" or "rounds" (although the word "round" is sometimes used to refer to the 16-step groups). I suggest that we use the word "iteration" to refer to iterations of the entire cryptographic primitive (the entire SHA-256 or at least its compression function in this case). > http://sphere.pl/~ukasz/gsoc2011/slow.png > http://sphere.pl/~ukasz/gsoc2011/fast.png That's very nice. Can you possibly start using the Openwall wiki for hosting things like this? Maybe create sub-pages under: http://openwall.info/wiki/john/GPU ...and its current content may be moved to a sub-page, too. > Short analysis leads to following conclusions: > Fast hashes: > -mostly ALU bounded > -optimizing pci-e transfers can be profitable > -requires changes in cpu code to utilize gpu power Yes, but you're not hitting the PCIe bandwidth limit yet, although with synchronous transfers some time is in fact wasted until the transfer is complete and before actual processing starts. > Slow hashes: > -extreme ALU bounded > -pci-e data transfers are negligible > -great scalability Right. > Sha256 is simple function and can be efficently computed with ?few? > registers. Every hash calculation requires 96 bytes to be transfered > through PCI-e bus so: You may transfer less than 96 bytes per hash. Most candidate passwords are a lot shorter and you can weed them out with partial hashes. So you can reduce the amounts of data to transfer in both directions. > Fast sha256 calculates ~2000k c/s - 185 MB/s PCI-e transfer This is 5% of the 16x PCIe 1.x bandwidth, which is 4 GBytes/sec. > Slow sha256 calculates 380 c/s - 35kB/s PCI-e transfer I think you wrongly substituted the on-CPU speed here (not GPU), but that didn't affect your conclusion. > CUDA api delivers page locked memory allocation so data through PCIe > can be transfered in DMA mode. It is great advantage and gives +8% > boost (checked 3205k c/s). It can be connected with asynchronous > transfer and gpu code execution which should provide another 10-15 > percent speedup (on fast hashes). On Fermi architecture data can be > transfered in both sides while gpu code executes, on G80 only one side > background is allowed. Picture below describes idea similar to > pipeline. > http://sphere.pl/~ukasz/gsoc2011/streams.png Looks good. Is this already implemented in your revision -1 of the patch? (I haven't looked at it yet.) > Hovewer there are three problems with this approach: > -pinned memory allocation is more expensive > -we should not allocate a lot of pinned memory because of system stability > -JtR format does not provide mechanism to do cleanup after > computations, so I suppose we would have to alocate and dealocate > memory for every crypt_all call, but it is not healthy. Why not do it in init() and keep the allocation until the "john" process terminates? > Because my major topic is slow hashes, changes i did have affected on > slow sha256 computation. > In term of slow sha256 I got following results: > CPU - 380 c/s > GPU - 2520 c/s > > So GPU version is 6.6x faster that the cpu. Slow hashes are better > scalable so taking 2x faster GPU will can easily get 12x faster code > comparing to cpu. That's nice. Now how about implementing SHA-crypt? You'll also need to implement SHA-512 for that, which is trickier (64-bit integers). > Those results do not include unrolling loops modification. Is this just because you haven't implemented unrolling for the slow hashes case yet? > This is link to newest version sha256patch: > http://sphere.pl/~ukasz/gsoc2011/john-1.7.6-sha256cuda-1.diff > Code is a bit dirty I will do cleanup and link to wiki soon. Thanks. In fact, please upload this to the wiki (not just link). > Solar Designer benchmarked sha256cuda-version 0 patch and got > following results on GeForce 8800 GTS 512: > (...)" > 4456k c/s real 4412k c/s virtual > after a certain change, i got this to: > 5255k c/s real 5308k c/s virtual > running two processes simultaneously gives a combined speed of 9k" That "certain change" I had mentioned was switching to 64-bit partial hashes. So 4 times less data to transfer from the GPU. In my hack of the code, I did not deal with potential false positives in any way, but a proper implementation will need to do it, likely by having cmp_exact() invoke an on-CPU implementation (it doesn't need to be fast). Then even 32-bit partial hashes would work, for further speedup and memory savings. Thanks again, Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.