Date: Sat, 7 Apr 2012 04:55:30 +0200
From: Lukas Odzioba <lukas.odzioba@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: fast hashes on GPU

2012/4/6 myrice <qqlddg@...il.com>:
> On Tue, Apr 3, 2012 at 8:08 PM, Lukas Odzioba <lukas.odzioba@...il.com>
> wrote:
>>
>> You can try split that copy and overlap it with kernel execution. It
>> is possible on fermi and newer cards.
>
> Thanks for this. I change each cudaMemcpy to cudaMemcpyAsync. Am I right?
> However, no performance gains. Could you provide some example by doing this?
>
> Thanks!
> Dongdong Li

Here are my sha256cuda results:

GTX460 pcie 4x
Raw:    13631K c/s real, 13769K c/s virtual
+ page locked memory
Raw:    15335K c/s real, 15335K c/s virtual
+ overlap
Raw:    16515K c/s real, 16515K c/s virtual

I am sending you this kernel. It is not ready for deployment, but you
can look at the code.

Two things I must point out:
1) You must use page-locked memory to use overlapping.
2) Overlapping sometimes introduces runtime problems, so you need two
variants (with and without overlap) and must first detect whether
overlapping is supported. Because it sometimes won't work properly,
test your code 10 times harder :-)

Profiler shots:
http://sphere.pl/~ukasz/sha256_overlap.png
http://sphere.pl/~ukasz/sha256_nonoverlap.png

As you can see, the faster the format is, the worse a bottleneck the
CPU becomes.

Lukas

Download attachment "rawsha256.cu" of type "application/octet-stream" (6610 bytes)
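
For readers without the attachment, here is a minimal sketch of the copy/compute overlap described above. It is not the code from rawsha256.cu. The names toy_kernel, toy_host, overlap_supported, run_chunks and the CUDA_TRY macro are hypothetical, and the kernel is a trivial placeholder rather than SHA-256. The sketch assumes a device with at least one async copy engine, allocates the host buffers with cudaMallocHost (page-locked), and keeps a non-overlap variant that queues everything on a single stream. Replacing cudaMemcpy with cudaMemcpyAsync alone is usually not enough: with pageable host memory, or with all work on one stream, copies and kernels still run one after another.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: bail out with -1 on any CUDA error.
// For brevity this sketch does not free resources on the error path.
#define CUDA_TRY(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Placeholder kernel standing in for a real hash kernel (assumption).
__global__ void toy_kernel(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2654435761u + 1u;
}

// Host reference for the same toy computation.
static unsigned int toy_host(unsigned int x) { return x * 2654435761u + 1u; }

// True when the device reports at least one async copy engine,
// i.e. it can copy and execute kernels concurrently.
bool overlap_supported(int device)
{
    int engines = 0;
    if (cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount,
                               device) != cudaSuccess)
        return false;
    return engines > 0;
}

// Processes n words split into `chunks` pieces (chunks must be > 0).
// Returns the number of mismatches against the host reference,
// or -1 on a CUDA error.
int run_chunks(int n, int chunks, bool use_overlap)
{
    unsigned int *h_in, *h_out, *d_in, *d_out;
    size_t bytes = (size_t)n * sizeof(unsigned int);

    // Page-locked host buffers: required for truly asynchronous copies.
    CUDA_TRY(cudaMallocHost((void **)&h_in, bytes));
    CUDA_TRY(cudaMallocHost((void **)&h_out, bytes));
    CUDA_TRY(cudaMalloc((void **)&d_in, bytes));
    CUDA_TRY(cudaMalloc((void **)&d_out, bytes));
    for (int i = 0; i < n; i++)
        h_in[i] = (unsigned int)i;

    cudaStream_t streams[2];
    for (int s = 0; s < 2; s++)
        CUDA_TRY(cudaStreamCreate(&streams[s]));

    int per = (n + chunks - 1) / chunks;
    for (int c = 0; c < chunks; c++) {
        int off = c * per;
        if (off >= n)
            break;
        int len = (n - off < per) ? (n - off) : per;
        // Overlap variant alternates between two streams so one chunk's
        // copies can run while another chunk's kernel executes.
        // Non-overlap variant queues everything on one stream.
        cudaStream_t st = use_overlap ? streams[c % 2] : streams[0];
        CUDA_TRY(cudaMemcpyAsync(d_in + off, h_in + off,
                                 len * sizeof(unsigned int),
                                 cudaMemcpyHostToDevice, st));
        toy_kernel<<<(len + 255) / 256, 256, 0, st>>>(d_in + off,
                                                      d_out + off, len);
        CUDA_TRY(cudaGetLastError());
        CUDA_TRY(cudaMemcpyAsync(h_out + off, d_out + off,
                                 len * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost, st));
    }
    CUDA_TRY(cudaDeviceSynchronize());

    int bad = 0;
    for (int i = 0; i < n; i++)
        if (h_out[i] != toy_host(h_in[i]))
            bad++;

    for (int s = 0; s < 2; s++)
        cudaStreamDestroy(streams[s]);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return bad;
}
```

A caller could pick the variant at runtime with something like run_chunks(1 << 20, 8, overlap_supported(0)) and fall back to run_chunks(1 << 20, 8, false) if the overlapped run misbehaves. Comparing the two variants under a profiler should show whether the copies really overlap the kernels on a given card.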