Date: Mon, 9 Jul 2012 13:02:12 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf_kernel.cl (was: Sayantan:Weekly Report #11)

On Mon, Jul 9, 2012 at 7:37 AM, Solar Designer <solar@...nwall.com> wrote:

> On Thu, Jul 05, 2012 at 12:11:14PM +0530, SAYANTAN DATTA wrote:
> > Therefore your calculation of the utilization of one out of four
> > SIMD units per CU on 7970 is valid for the current kernel.
>
> Do you know if we currently utilize one of four SIMD units or maybe
> 1/4th of each SIMD unit (vector width) or another combination (e.g.,
> maybe 8 out of 16 vector elements in two SIMD units, leaving the other
> two completely idle)?
>

One wavefront or workgroup (whichever is smaller) is scheduled on one
SIMD unit. So we are using two SIMD units and only half of the
processing elements on each SIMD.

> > I once tried running two john builds together on the 7970, one using
> > LDS and the other using global memory, but it caused an ASIC hang.
> > Maybe we need to merge them together under one build. But how to
> > merge them remains a question. One of the two possible ways is to
> > enqueue two kernels per crypt (two clEnqueueNDRangeKernel calls).
> > Or we can merge the two kernels. Also, how the two branches get
> > scheduled on the GPU will impact performance.
>
> I'd try setting WORK_GROUP_SIZE to 12, keep the declaration of S_Buffer
> at its current size - introduce some new macro for this, like
> LDS_GROUP_SIZE, which we'd keep at 8 for 7970. If lid is <
> LDS_GROUP_SIZE, then use the current code. If lid is >= LDS_GROUP_SIZE
> (would be 8, 9, 10, or 11 under this example), then use new code that
> would use global memory instead (just modify the supplied BF_current_S
> directly?)
>

I'll try this first.

>
> Meanwhile, attached is a quick hack that uses simpler addressing modes
> (maybe, depending on what the code is compiled into). No significant
> performance change on 7970 from this (but the source code size is
> reduced), yet you might want to benchmark this more carefully. There
> appears to be a 10% slowdown on GTX 570 from this, but that's with
> non-optimal settings (the same as 7970's), so it might not be relevant.
> You might want to play with this too.
>
> Also, the 512-iteration loop in BF_body() could be partially unrolled -
> maybe try a 2x unroll first (256 iterations of the new loop). Maybe
> this would let a few instructions of one iteration of the original loop
> be intermixed with instructions from the other iteration, hiding some
> latencies.
>
> Thanks,
>
> Alexander
>

I'll try them too.

Regards,
Sayantan
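
The WORK_GROUP_SIZE / LDS_GROUP_SIZE split proposed above can be sketched on the host side in plain C. This is only an illustration of the selection logic, not the actual bf_kernel.cl code: the arrays lds_S and global_S, the constant SBOX_WORDS, and the helpers uses_lds() and select_sboxes() are hypothetical names made up for this sketch, and ordinary static arrays stand in for the two OpenCL address spaces. Only WORK_GROUP_SIZE = 12 and LDS_GROUP_SIZE = 8 come from the message.

```c
#include <string.h>

#define WORK_GROUP_SIZE 12   /* value suggested in the message */
#define LDS_GROUP_SIZE   8   /* value suggested in the message for 7970 */
#define SBOX_WORDS    1024   /* assumption: 4 Blowfish S-boxes x 256 words */

/* Hypothetical stand-ins for the two memory spaces. In a real kernel the
 * first would be a __local buffer (sized for LDS_GROUP_SIZE work-items
 * only) and the second a __global buffer supplied by the host. */
static unsigned int lds_S[LDS_GROUP_SIZE][SBOX_WORDS];
static unsigned int global_S[WORK_GROUP_SIZE][SBOX_WORDS];

/* Nonzero if the work-item with local id `lid` takes the LDS code path. */
static int uses_lds(int lid)
{
    return lid < LDS_GROUP_SIZE;
}

/* Return the S-box storage a given work-item would operate on.
 * lid <  LDS_GROUP_SIZE: copy into the fast buffer and work there.
 * lid >= LDS_GROUP_SIZE: modify the global copy in place. */
static unsigned int *select_sboxes(int lid)
{
    if (uses_lds(lid)) {
        memcpy(lds_S[lid], global_S[lid], sizeof(lds_S[lid]));
        return lds_S[lid];
    }
    return global_S[lid];
}
```

With these values, local ids 0 through 7 get an lds_S slot and local ids 8 through 11 work directly on global_S, so the LDS declaration does not have to grow when the work-group size does. How the two branches are actually scheduled within a wavefront is up to the hardware and compiler and is not modelled here.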
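
The 2x unroll suggested above (512 iterations becoming 256 iterations of a doubled body) can be sketched in plain C as follows. The function step() is a hypothetical stand-in for the work of one iteration and is not Blowfish; run_rolled() and run_unrolled2() are likewise made-up names. The sketch only shows that the transformation preserves the result.

```c
#include <stdint.h>

#define ITERATIONS 512   /* loop count mentioned in the message */

/* Hypothetical stand-in for one iteration's work; NOT Blowfish. */
static uint32_t step(uint32_t state, uint32_t i)
{
    state ^= i * 0x9E3779B9u;
    return (state << 5) | (state >> 27);   /* rotate left by 5 */
}

/* Original shape: 512 iterations, one step each. */
static uint32_t run_rolled(uint32_t state)
{
    for (uint32_t i = 0; i < ITERATIONS; i++)
        state = step(state, i);
    return state;
}

/* 2x unrolled shape: 256 iterations, two steps each. */
static uint32_t run_unrolled2(uint32_t state)
{
    for (uint32_t i = 0; i < ITERATIONS; i += 2) {
        state = step(state, i);
        state = step(state, i + 1);
    }
    return state;
}
```

Both functions return the same value for any starting state. In this stand-in every step depends on the previous one, so the unroll by itself exposes little to overlap. Whether it hides latency in the real kernel depends on how many independent instructions each iteration contains and on what the OpenCL compiler does with them, which is why benchmarking on the target GPU is needed.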