Date: Sat, 7 Apr 2012 04:55:30 +0200
From: Lukas Odzioba <>
Subject: Re: fast hashes on GPU

2012/4/6 myrice <>:
> On Tue, Apr 3, 2012 at 8:08 PM, Lukas Odzioba <>
> wrote:
>> You can try to split that copy and overlap it with kernel execution. It
>> is possible on Fermi and newer cards.
> Thanks for this. I changed each cudaMemcpy to cudaMemcpyAsync. Is that right?
> However, I see no performance gain. Could you provide an example of doing this?
> Thanks!
> Dongdong Li
Here are my sha256cuda results:
GTX 460, PCIe x4
Raw:    13631K c/s real, 13769K c/s virtual
+ page locked memory
Raw:    15335K c/s real, 15335K c/s virtual
+ overlap
Raw:    16515K c/s real, 16515K c/s virtual

I am sending you this kernel. It is not ready for deployment, but you
can look at the code.
Two things I must point out:
1) you must use page-locked memory to use overlapping
2) this sometimes introduces runtime problems, so you need two
variants (with and without overlap) and must first detect whether
overlapping is supported. Because it sometimes won't work properly,
test your code 10 times harder :-)
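
To make the two points above concrete, here is a minimal sketch. It is
not the attached kernel: dummy_hash, overlap_supported and run_pipeline
are hypothetical names made up for illustration, and the "hash" is a
placeholder multiply. It assumes device 0 and that checking
asyncEngineCount is an acceptable way to detect overlap support. When
overlap is not reported, everything is issued into a single stream,
which serializes the copies and the kernel.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(1); } } while (0)

/* Placeholder standing in for a real hash kernel (hypothetical). */
__global__ void dummy_hash(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2654435761u;
}

/* Returns 1 if the device reports at least one async copy engine. */
static int overlap_supported(int dev)
{
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, dev));
    return prop.asyncEngineCount > 0;
}

/* Copies, hashes and copies back n items in two chunks; returns 1 if correct. */
static int run_pipeline(int n)
{
    const int NSTREAMS = 2;
    const int chunk = n / NSTREAMS;            /* n assumed even */
    const size_t chunk_bytes = chunk * sizeof(unsigned int);
    unsigned int *h_in, *h_out, *d_in, *d_out;
    cudaStream_t streams[NSTREAMS];
    int use_overlap = overlap_supported(0);

    /* Point 1: page-locked host buffers, needed for truly async copies. */
    CHECK(cudaMallocHost((void **)&h_in, n * sizeof(unsigned int)));
    CHECK(cudaMallocHost((void **)&h_out, n * sizeof(unsigned int)));
    CHECK(cudaMalloc((void **)&d_in, n * sizeof(unsigned int)));
    CHECK(cudaMalloc((void **)&d_out, n * sizeof(unsigned int)));
    for (int i = 0; i < n; i++)
        h_in[i] = (unsigned int)i;
    for (int s = 0; s < NSTREAMS; s++)
        CHECK(cudaStreamCreate(&streams[s]));

    for (int s = 0; s < NSTREAMS; s++) {
        /* Point 2: fall back to one stream when overlap is not reported. */
        cudaStream_t st = use_overlap ? streams[s] : streams[0];
        int off = s * chunk;
        CHECK(cudaMemcpyAsync(d_in + off, h_in + off, chunk_bytes,
                              cudaMemcpyHostToDevice, st));
        dummy_hash<<<(chunk + 255) / 256, 256, 0, st>>>(d_in + off,
                                                        d_out + off, chunk);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_out + off, d_out + off, chunk_bytes,
                              cudaMemcpyDeviceToHost, st));
    }
    CHECK(cudaDeviceSynchronize());

    int ok = 1;
    for (int i = 0; i < n; i++)
        if (h_out[i] != h_in[i] * 2654435761u)
            ok = 0;

    for (int s = 0; s < NSTREAMS; s++)
        CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return ok;
}

int main(void)
{
    int ok = run_pipeline(1 << 20);
    printf("overlap %s, result %s\n",
           overlap_supported(0) ? "reported" : "not reported",
           ok ? "ok" : "MISMATCH");
    return ok ? 0 : 1;
}
```

Simply replacing cudaMemcpy with cudaMemcpyAsync in the default stream
leaves the work serialized; the copies only overlap with the kernel
when the chunks go into different non-default streams and the host
buffers are page-locked.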

Profiler shots:

As you can see, the faster the format is, the more the CPU becomes the main bottleneck.


Download attachment "" of type "application/octet-stream" (6610 bytes)
