Openwall GNU/*/Linux - a small security-enhanced Linux distro for servers
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 12 Apr 2011 01:23:10 +0400
From: Solar Designer <>
Subject: Re: sha256 format patches

On Mon, Apr 11, 2011 at 10:47:48PM +0200, ?ukasz Odzioba wrote:
> fast hashes: classic sha256 with 64 iterations

In the context of MD* and SHA* functions, these 64 are normally called
"steps" or "rounds" (although the word "round" is sometimes used to
refer to the 16-step groups).  I suggest that we use the word "iteration"
to refer to iterations of the entire cryptographic primitive (the entire
SHA-256 or at least its compression function in this case).


That's very nice.

Can you possibly start using the Openwall wiki for hosting things like
this?  Maybe create sub-pages under:

...and its current content may be moved to a sub-page, too.

> Short analysis leads to following conclusions:
> Fast hashes:
>    -mostly ALU bounded
>    -optimizing pci-e transfers can be profitable
>    -requires changes in cpu code to utilize gpu power

Yes, but you're not hitting the PCIe bandwidth limit yet, although with
synchronous transfers some time is in fact wasted until the transfer is
complete and before actual processing starts.

> Slow hashes:
>    -extreme ALU bounded
>    -pci-e data transfers are negligible
>    -great scalability


> Sha256 is simple function and can be efficently computed with ?few?
> registers. Every hash calculation requires 96 bytes to be transfered
> through PCI-e bus so:

You may transfer less than 96 bytes per hash.  Most candidate passwords
are a lot shorter and you can weed them out with partial hashes.  So you
can reduce the amounts of data to transfer in both directions.

> Fast sha256 calculates ~2000k c/s - 185 MB/s PCI-e transfer

This is 5% of the 16x PCIe 1.x bandwidth, which is 4 GBytes/sec.

> Slow sha256 calculates 380 c/s - 35kB/s PCI-e transfer

I think you wrongly substituted the on-CPU speed here (not GPU), but
that didn't affect your conclusion.

> CUDA api delivers page locked memory allocation so data through PCIe
> can be transfered in DMA mode. It is great advantage and gives +8%
> boost (checked 3205k c/s). It can be connected with asynchronous
> transfer and gpu code execution which should provide another 10-15
> percent speedup (on fast hashes). On Fermi architecture data can be
> transfered in both sides while gpu code executes, on G80 only one side
> background is allowed. Picture below describes idea similar to
> pipeline.

Looks good.  Is this already implemented in your revision -1 of the patch?
(I haven't looked at it yet.)

> Hovewer there are three problems with this approach:
>    -pinned memory allocation is more expensive
>    -we should not allocate a lot of pinned memory because of system stability
>    -JtR format does not provide mechanism to do cleanup after
> computations, so I suppose we would have to alocate and dealocate
> memory for every crypt_all call, but it is not healthy.

Why not do it in init() and keep the allocation until the "john" process

> Because my major topic is slow hashes, changes i did have affected on
> slow sha256 computation.
> In term of slow sha256 I got following results:
> CPU - 380 c/s
> GPU - 2520 c/s
> So GPU version is 6.6x faster that the cpu. Slow hashes are better
> scalable so taking 2x faster GPU will can easily get 12x faster code
> comparing to cpu.

That's nice.

Now how about implementing SHA-crypt?  You'll also need to implement
SHA-512 for that, which is trickier (64-bit integers).

> Those results do not include unrolling loops modification.

Is this just because you haven't implemented unrolling for the slow
hashes case yet?

> This is link to newest version sha256patch:
> Code is a bit dirty I will do cleanup and link to wiki soon.

Thanks.  In fact, please upload this to the wiki (not just link).

> Solar Designer benchmarked sha256cuda-version 0 patch and got
> following results on GeForce 8800 GTS 512:
> (...)"
> 4456k c/s real 4412k c/s virtual
> after a certain change, i got this  to:
> 5255k c/s real 5308k c/s virtual
> running two processes simultaneously gives a combined speed of 9k"

That "certain change" I had mentioned was switching to 64-bit partial
hashes.  So 4 times less data to transfer from the GPU.  In my hack of
the code, I did not deal with potential false positives in any way, but
a proper implementation will need to do it, likely by having cmp_exact()
invoke an on-CPU implementation (it doesn't need to be fast).  Then even
32-bit partial hashes would work, for further speedup and memory savings.

Thanks again,


Powered by blists - more mailing lists

Your e-mail address:

Powered by Openwall GNU/*/Linux - Powered by OpenVZ