Date: Tue, 10 Jul 2012 11:16:17 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf_kernel.cl

On Tue, Jul 10, 2012 at 9:49 AM, Solar Designer <solar@...nwall.com> wrote:

> On Tue, Jul 10, 2012 at 09:23:11AM +0530, Sayantan Datta wrote:
> > On Tue, Jul 10, 2012 at 8:16 AM, Solar Designer <solar@...nwall.com> wrote:
> > >
> > > Shouldn't we expect more like a 50% improvement, based on the speeds for
> > > the implementation using global memory that you had before? Compared to
> > > your LDS-using implementation, we're adding uses of computing and memory
> > > resources that would otherwise be completely idle.
> >
> > We are still not capable of utilizing 100% of the hardware.
>
> Of course not. But I don't see what prevents us from achieving the
> combined speed of your global-memory-using and your LDS-using
> implementations. Yes, the former tried to use all SIMDs (if I
> understand correctly), even though it kept them stalled waiting for data
> most of the time, but can't we achieve roughly the same speed with fewer
> SIMDs (such as with just two per CU, which we're not using for LDS),
> since the task is memory speed bound anyway? I think we'll saturate the
> 384-bit bus even with just two SIMDs per CU, or even with just one per
> CU, for that matter (so we may save some electricity and heat
> dissipation by leaving one SIMD per CU completely unused).
>
> Am I missing something?
>
> Alexander
>

I remember that during actual cracking the speed was limited to somewhere
near 1000 c/s on the kernel using global memory, although benchmarking
suggested a much higher 2400 c/s. This suggests that we were incurring
stalls during actual cracking that we weren't incurring during
benchmarking. I think this is the most we can achieve using global memory.

Also, I could achieve nearly the same numbers using global memory alone
despite heavily underutilizing the CU. I limited the global number of work
items to 512 and the work group size to 8, which produced 1019 c/s in
actual cracking. This puts my revised value of x at 4, not 8. So we will
see up to 25% extra using global memory.

One more thing I would like you to know: your Sptr implementation performs
nearly the same as before on NVIDIA after a 4x unroll of the 512-iteration
loop.

Regards,
Sayantan
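
For readers unfamiliar with the transformation, a 4x unroll of a 512-iteration loop can be sketched roughly as below. This is only an illustration in plain C, not the code from bf_kernel.cl: the `step` function is a hypothetical stand-in for the real loop body, and the names `run_rolled`, `run_unrolled4`, and `unroll_matches` are made up for this sketch. It assumes each iteration depends on the state left by the previous one, so unrolling reduces loop overhead without changing the order of the work.

```c
#include <stdint.h>

#define N_ITER 512  /* iteration count mentioned in the message */

/* Hypothetical stand-in for one loop iteration (NOT the real kernel body).
 * Each call consumes the state produced by the previous call. */
static inline uint32_t step(uint32_t state, uint32_t i)
{
    return (state ^ i) * 0x9E3779B1u + (state >> 15);
}

/* Straightforward version: one step per loop iteration. */
static uint32_t run_rolled(uint32_t state)
{
    for (uint32_t i = 0; i < N_ITER; i++)
        state = step(state, i);
    return state;
}

/* 4x unrolled version: four steps per loop iteration, 128 iterations.
 * No remainder loop is needed because 512 is divisible by 4. */
static uint32_t run_unrolled4(uint32_t state)
{
    for (uint32_t i = 0; i < N_ITER; i += 4) {
        state = step(state, i);
        state = step(state, i + 1);
        state = step(state, i + 2);
        state = step(state, i + 3);
    }
    return state;
}

/* Returns 1 if both versions produce the same final state for this seed. */
static int unroll_matches(uint32_t seed)
{
    return run_rolled(seed) == run_unrolled4(seed);
}
```

Calling `unroll_matches` with any seed should return 1, since the unrolled loop performs the same steps in the same order. Whether the unroll helps or hurts speed depends on the compiler and the device, which is consistent with the observation above that results differ between platforms.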