Date: Sat, 7 Apr 2012 04:55:30 +0200
From: Lukas Odzioba <lukas.odzioba@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: fast hashes on GPU

2012/4/6 myrice <qqlddg@...il.com>:
>
>
> On Tue, Apr 3, 2012 at 8:08 PM, Lukas Odzioba <lukas.odzioba@...il.com>
> wrote:
>>
>> You can try splitting that copy and overlapping it with kernel
>> execution. This is possible on Fermi and newer cards.
>
> Thanks for this. I changed each cudaMemcpy to cudaMemcpyAsync. Is that right?
> However, I see no performance gain. Could you provide an example of how to do this?
>
> Thanks!
> Dongdong Li
>
Here are my sha256cuda results:
GTX 460, PCIe x4
Raw:    13631K c/s real, 13769K c/s virtual
+ page locked memory
Raw:    15335K c/s real, 15335K c/s virtual
+ overlap
Raw:    16515K c/s real, 16515K c/s virtual


I am sending you this kernel. It is not ready for deployment, but you
can look at the code.
Two things I must point out:
1) You must use page-locked memory to use overlapping.
2) This sometimes introduces runtime problems, so you need two
variants (with and without overlap), and you must first detect whether
overlapping is supported. Because it sometimes won't work properly,
test your code 10 times harder :-)
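
To make the two points above concrete, here is a minimal sketch of the
general pattern. It is not the attached kernel. The names dummy_kernel,
overlap_supported and run_chunked are made up for illustration, and the
kernel is a trivial placeholder, not SHA-256. It assumes a CUDA toolkit
that provides cudaDeviceGetAttribute. Host buffers are page-locked, the
work is split into chunks with one stream per chunk, and the chunk count
falls back to 1 (no overlap) when the device reports no async copy engine.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(1); } } while (0)

/* Placeholder standing in for a real hash kernel (hypothetical). */
__global__ void dummy_kernel(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2654435761u + 1u;
}

/* Nonzero if the device can copy and run kernels concurrently. */
static int overlap_supported(int dev)
{
    int engines = 0;
    CHECK(cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, dev));
    return engines > 0;
}

/* Processes n items in nchunks chunks, one stream per chunk.
   nchunks == 1 is the non-overlapping variant.
   Returns the number of results that differ from the host reference. */
static int run_chunked(int n, int nchunks)
{
    size_t bytes = (size_t)n * sizeof(unsigned int);
    unsigned int *h_in, *h_out, *d_in, *d_out;

    /* Page-locked host memory: required for truly asynchronous copies. */
    CHECK(cudaMallocHost((void **)&h_in, bytes));
    CHECK(cudaMallocHost((void **)&h_out, bytes));
    CHECK(cudaMalloc((void **)&d_in, bytes));
    CHECK(cudaMalloc((void **)&d_out, bytes));
    for (int i = 0; i < n; i++)
        h_in[i] = (unsigned int)i;

    cudaStream_t *streams = (cudaStream_t *)malloc(nchunks * sizeof(cudaStream_t));
    for (int s = 0; s < nchunks; s++)
        CHECK(cudaStreamCreate(&streams[s]));

    int chunk = (n + nchunks - 1) / nchunks;
    for (int s = 0; s < nchunks; s++) {
        int off = s * chunk;
        int len = (off + chunk <= n) ? chunk : n - off;
        if (len <= 0)
            break;
        size_t cbytes = (size_t)len * sizeof(unsigned int);
        /* Copy in, compute, copy out: all queued on this chunk's stream,
           so work in different streams is free to overlap. */
        CHECK(cudaMemcpyAsync(d_in + off, h_in + off, cbytes,
                              cudaMemcpyHostToDevice, streams[s]));
        dummy_kernel<<<(len + 255) / 256, 256, 0, streams[s]>>>(d_in + off,
                                                                 d_out + off, len);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_out + off, d_out + off, cbytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());

    int bad = 0;
    for (int i = 0; i < n; i++)
        if (h_out[i] != h_in[i] * 2654435761u + 1u)
            bad++;

    for (int s = 0; s < nchunks; s++)
        CHECK(cudaStreamDestroy(streams[s]));
    free(streams);
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return bad;
}

int main(void)
{
    /* Pick the variant at runtime, as suggested in point 2. */
    int nchunks = overlap_supported(0) ? 4 : 1;
    int bad = run_chunked(1 << 20, nchunks);
    printf("chunks=%d mismatches=%d\n", nchunks, bad);
    return bad != 0;
}
```

Simply replacing cudaMemcpy with cudaMemcpyAsync is not enough: without
page-locked buffers and separate streams there is nothing for the copies
to overlap with. The chunk count of 4 is an arbitrary choice here and
would need tuning per format and card.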

Profiler shots:
http://sphere.pl/~ukasz/sha256_overlap.png
http://sphere.pl/~ukasz/sha256_nonoverlap.png

As you can see, the faster the format is, the worse a bottleneck the CPU becomes.

Lukas

[ CONTENT OF TYPE application/octet-stream SKIPPED ]
