Date: Mon, 11 Apr 2011 22:47:48 +0200
From: Łukasz Odzioba <>
Subject: Re: sha256 format patches

I've updated the sha256cuda patch for JtR and wanted to share the results with you.
Test Environment:
CPU: Intel Core 2 Duo P7350@...z
16 Gflops - theoretical
GPU: nVidia GeForce 9600m - 32 cuda cores
110 Gflops - theoretical

Gflops are not a good metric for comparing a GPU and a CPU, but it's difficult
to find a better-suited one.
I considered two types of hashes:
fast hashes: classic sha256 with 64 iterations
slow hashes: sha256 calculated 5000 times - 320,000 iterations

I used the nVidia Visual Profiler to examine the difference between slow
and fast hashes. The following charts show how time is consumed during
fast and slow hash computation, respectively:

A short analysis leads to the following conclusions:
Fast hashes:
   -mostly ALU-bound
   -optimizing PCI-e transfers can be profitable
   -require changes in CPU code to utilize GPU power
Slow hashes:
   -extremely ALU-bound
   -PCI-e data transfers are negligible
   -great scalability

Sha256 is a simple function and can be computed efficiently with few
registers. Every hash calculation requires 96 bytes to be transferred
over the PCI-e bus, so:
Fast sha256 computes ~2000k c/s - ~185 MB/s PCI-e transfer (2,000,000 x 96 bytes)
Slow sha256 computes 380 c/s - ~35 kB/s PCI-e transfer (380 x 96 bytes)

The first version of my sha256cuda patch was able to compute ~1795k c/s,
compared to the CPU version at ~1134k c/s (which is not optimal).

Sha256 cuda, after the following optimizations:
--use constant memory for initialization values: +12%
--find the best threads-per-block configuration: +18%
--throw away unused table initialization: +5%
--unroll loops: +23%
achieved ~2987k c/s.
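For illustration, here is a minimal sketch of the constant-memory idea (the symbol name H0 and the kernel are illustrative, not the actual patch code): the sha256 initial hash values are placed in __constant__ memory, which is cached on-chip and broadcast efficiently when all threads in a warp read the same word.

```cuda
#include <stdint.h>

// SHA-256 initial hash values (FIPS 180-2) kept in constant memory,
// which is cached and broadcast to all threads reading the same word.
__constant__ uint32_t H0[8] = {
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
    0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};

__global__ void sha256_init_state(uint32_t *state, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int j = 0; j < 8; j++)   // every candidate starts from H0
            state[i * 8 + j] = H0[j];
}
```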

To unroll loops I used nvcc's #pragma unroll N, but that resulted in 56
registers being used per thread, while the rolled version used only 18.
So I created my own unrolling macros, but the huge number of used
registers didn't change, and there was a big performance loss. This
was easily solved by using the nvcc --maxrregcount N option, which
resulted in a big speedup.
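The two knobs involved can be sketched as follows (the unroll factor and register cap here are illustrative values, not the ones used in the patch): #pragma unroll expands the loop body at compile time, and --maxrregcount caps registers per thread so occupancy does not collapse.

```cuda
// Compile with e.g.:  nvcc -O2 --maxrregcount=20 sha256_kernel.cu
// --maxrregcount caps registers per thread; the compiler spills the
// rest to local memory, trading some latency for higher occupancy.

__global__ void rounds_demo(const unsigned int *w, unsigned int *out)
{
    unsigned int a = w[threadIdx.x];
    #pragma unroll 8               // partially expand the round loop
    for (int t = 0; t < 64; t++)
        a = (a >> 2) | (a << 30);  // stand-in for one sha256 round
    out[threadIdx.x] = a;
}
```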

The CUDA API provides page-locked (pinned) memory allocation, so data
can be transferred over PCIe in DMA mode. This is a great advantage and
gives a +8% boost (measured: 3205k c/s). It can be combined with
asynchronous transfers that overlap GPU code execution, which should
provide another 10-15 percent speedup (on fast hashes). On the Fermi
architecture data can be transferred in both directions while GPU code
executes; on G80 only one direction of background transfer is allowed.
The picture below illustrates a similar idea:

We need to divide the computation into smaller parts and utilize the
memory controller's asynchronous data transfer mode. These are the
percentage times on my GPU:
   -Reading data from Host to Device - 17%
   -Calculating hashes - 68%
   -Writing data from Device to Host - 14%
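A minimal sketch of that overlap scheme, assuming a sha256_kernel declared elsewhere (all names, sizes and the stream count are illustrative, not the patch's actual code): pinned host buffers are allocated with cudaHostAlloc, the batch is split across several CUDA streams, and each stream issues copy-in, kernel launch, copy-out, so the transfers of one chunk overlap the computation of another.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

#define NSTREAMS 4
#define CHUNK    (64 * 1024)   /* candidates per chunk (illustrative) */

extern __global__ void sha256_kernel(const uint32_t *in, uint32_t *out, int n);

void crypt_batch(void)
{
    uint32_t *h_in, *h_out, *d_in, *d_out;
    cudaStream_t s[NSTREAMS];

    /* Pinned host memory enables true asynchronous DMA transfers. */
    cudaHostAlloc((void **)&h_in,  NSTREAMS * CHUNK * 16 * sizeof(uint32_t),
                  cudaHostAllocDefault);
    cudaHostAlloc((void **)&h_out, NSTREAMS * CHUNK * 8  * sizeof(uint32_t),
                  cudaHostAllocDefault);
    cudaMalloc((void **)&d_in,  NSTREAMS * CHUNK * 16 * sizeof(uint32_t));
    cudaMalloc((void **)&d_out, NSTREAMS * CHUNK * 8  * sizeof(uint32_t));

    for (int i = 0; i < NSTREAMS; i++) {
        cudaStreamCreate(&s[i]);
        uint32_t *in  = d_in  + i * CHUNK * 16;
        uint32_t *out = d_out + i * CHUNK * 8;
        /* copy-in, compute, copy-out of different chunks overlap */
        cudaMemcpyAsync(in, h_in + i * CHUNK * 16,
                        CHUNK * 16 * sizeof(uint32_t),
                        cudaMemcpyHostToDevice, s[i]);
        sha256_kernel<<<CHUNK / 256, 256, 0, s[i]>>>(in, out, CHUNK);
        cudaMemcpyAsync(h_out + i * CHUNK * 8, out,
                        CHUNK * 8 * sizeof(uint32_t),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    /* real code would fill h_in with candidates beforehand, read the
       results from h_out, and free the buffers and streams */
}
```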

However, there are three problems with this approach:
   -pinned memory allocation is more expensive
   -we should not allocate a lot of pinned memory, for system stability reasons
   -the JtR format interface does not provide a mechanism to do cleanup
after computations, so I suppose we would have to allocate and
deallocate the memory on every crypt_all call, which is not healthy.

I was wondering about my bitwise ror macro, but according to the .asm
and .ptx files, both nvcc and gcc optimize it well.
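For reference, a typical rotate-right macro of the kind in question (the names ROR32 and SIGMA0 are illustrative, not necessarily the patch's): both gcc and nvcc recognize this shift-and-or idiom and emit a single efficient rotate rather than three separate operations.

```cuda
#include <stdint.h>

/* Rotate right by n bits (0 < n < 32); compilers recognize this idiom
   and emit a native rotate instruction instead of two shifts and an OR. */
#define ROR32(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

/* e.g. the sha256 Sigma0 function is built from three such rotations: */
#define SIGMA0(x) (ROR32(x, 2) ^ ROR32(x, 13) ^ ROR32(x, 22))
```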
There are still places where I could gain some performance:
-shared memory
-texture memory (better than global memory because it is cached)
-looking deeper into the ptx files
-developing something similar to Alain's code, searching for the best
thread/block/register settings in the running environment

However, I might not have time to play with it before Friday.

Because my main topic is slow hashes, the changes I made affect slow
sha256 computation.
For slow sha256 I got the following results:
CPU - 380 c/s
GPU - 2520 c/s

So the GPU version is 6.6x faster than the CPU. Slow hashes scale
better, so with a 2x faster GPU we could easily get code 12x faster
than the CPU.
These results do not include the loop unrolling modification.

This is a link to the newest version of the sha256 patch:
The code is a bit dirty; I will do a cleanup and link it on the wiki soon.

Solar Designer benchmarked the sha256cuda version 0 patch and got the
following results on a GeForce 8800 GTS 512:
4456k c/s real, 4412k c/s virtual
After a certain change, he got this up to:
5255k c/s real, 5308k c/s virtual
Running two processes simultaneously gave a combined speed of ~9M c/s.

He also benchmarked a raw MD5 OpenCL patch on the 8800 GTS 512:

Raw: 36296K c/s real, 94371K c/s virtual
running two at once:
Raw: 32263K c/s real, 86037K c/s virtual
Raw: 32577K c/s real, 79891K c/s virtual
so that's ~64M c/s combined

