Date: Mon, 9 Jul 2012 13:02:12 +0530
From: Sayantan Datta <>
Subject: Re: (was: Sayantan:Weekly Report #11)

On Mon, Jul 9, 2012 at 7:37 AM, Solar Designer <> wrote:

> On Thu, Jul 05, 2012 at 12:11:14PM +0530, SAYANTAN DATTA wrote:
> > Therefore
> > your calculation of the utilization of one out of four SIMD units per CU on
> > 7970 is valid for the current kernel.
> Do you know if we currently utilize one of four SIMD units or maybe
> 1/4th of each SIMD unit (vector width) or another combination (e.g.,
> maybe 8 out of 16 vector elements in two SIMD units, leaving the other
> two completely idle)?

One wavefront or work-group (whichever is smaller) is scheduled on one SIMD
unit. So we are using two SIMD units, and only half of the processing elements
on each SIMD.
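
To make the arithmetic explicit, here is a rough host-side model in Python (not kernel code). It assumes 4 SIMD units per CU and 16 lanes per SIMD unit on the 7970, a work-group size of 8, and two work-groups resident per CU. The function name and constants are only for illustration, and the model ignores how a wavefront is issued over several cycles.

```python
from fractions import Fraction

# Assumptions for this sketch (not taken from the kernel source):
SIMDS_PER_CU = 4     # SIMD units per compute unit on the 7970
LANES_PER_SIMD = 16  # processing elements per SIMD unit


def cu_utilization(work_group_size, resident_groups):
    """Fraction of a CU's lanes kept busy, with one work-group per SIMD unit."""
    busy_simds = min(resident_groups, SIMDS_PER_CU)
    busy_lanes = busy_simds * min(work_group_size, LANES_PER_SIMD)
    return Fraction(busy_lanes, SIMDS_PER_CU * LANES_PER_SIMD)


# Two work-groups of 8 work-items: 2 SIMD units, half the lanes on each.
print(cu_utilization(8, 2))
```

Under these assumptions, two half-filled SIMD units give the same 1/4 figure as one fully used SIMD unit out of four.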

> > I once tried two john builds together on 7970. One running on LDS and the
> > other on global memory, but it caused an ASIC hang. Maybe we need to merge
> > them together under one build. But how to merge them remains a question.
> > One of the two possible ways is to make two clEnqueueNDRangeKernel calls
> > per crypt. Or we can merge the two kernels. Also, how the two branches get
> > scheduled on the GPU will impact performance.
> I'd try setting WORK_GROUP_SIZE to 12, keep the declaration of S_Buffer
> at its current size - introduce some new macro for this, like
> LDS_GROUP_SIZE, which we'd keep at 8 for 7970.  If lid is <
> LDS_GROUP_SIZE, then use the current code.  If lid is >= LDS_GROUP_SIZE
> (would be 8, 9, 10, or 11 under this example), then use new code that
> would use global memory instead (just modify the supplied BF_current_S
> directly?)

I'll try this first.
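
As I understand the suggestion, the split by local ID would look roughly like the Python model below (again not kernel code). WORK_GROUP_SIZE and LDS_GROUP_SIZE are the macros named above; s_box_location and SBOX_BYTES are hypothetical names, and the footprint assumes S_Buffer holds four 256-entry tables of 32-bit words per work-item.

```python
WORK_GROUP_SIZE = 12  # proposed work-group size
LDS_GROUP_SIZE = 8    # work-items that keep their S-boxes in LDS

# Assumed per-work-item footprint: four 256-entry tables of 32-bit words.
SBOX_BYTES = 4 * 256 * 4


def s_box_location(lid):
    """Return where the work-item with local ID `lid` keeps its S-boxes."""
    if not 0 <= lid < WORK_GROUP_SIZE:
        raise ValueError("lid out of range for this work-group")
    return "lds" if lid < LDS_GROUP_SIZE else "global"


layout = [s_box_location(lid) for lid in range(WORK_GROUP_SIZE)]
lds_bytes = LDS_GROUP_SIZE * SBOX_BYTES  # LDS needed per work-group

print(layout)
print(lds_bytes)
```

The LDS footprint per work-group stays the same as with 8 work-items, since the extra 4 work-items (lid 8 to 11) would go to global memory.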

> Meanwhile, attached is a quick hack that uses simpler addressing modes
> (maybe, depending on what the code is compiled into).  No significant
> performance change on 7970 from this (but the source code size is
> reduced), yet you could want to benchmark this more carefully.  There
> appears to be a 10% slowdown on GTX 570 from this, but that's with
> non-optimal settings (the same as 7970's), so it might not be relevant.
> You could want to play with this too.
> Also, the 512-iterations loop in BF_body() could be partially unrolled -
> maybe try a 2x unroll first (256 iterations of the new loop).  Maybe
> this would let a few instructions of one iteration of the original loop
> be intermixed with instructions from the other iteration, hiding some
> latencies.
> Thanks,
> Alexander

I'll try them too.
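
For the 2x unroll, the shape of the change would be something like the Python sketch below. The step function is a made-up stand-in, not the real body of the loop in BF_body(); the sketch only shows that 256 iterations of a doubled body do the same work as the 512 original iterations.

```python
MASK = 0xFFFFFFFF


def step(state, i):
    """Hypothetical stand-in for one iteration of the original loop body."""
    return ((state * 0x9E3779B1) ^ i) & MASK


def rolled(state):
    # Original shape: 512 iterations, one step each.
    for i in range(512):
        state = step(state, i)
    return state


def unrolled_2x(state):
    # 2x unrolled shape: 256 iterations, two steps each.
    for i in range(0, 512, 2):
        state = step(state, i)
        state = step(state, i + 1)
    return state


print(rolled(1) == unrolled_2x(1))
```

Whether this hides any latency on the GPU depends on what the compiler does with the two copies of the body, so it needs to be benchmarked.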

