Date: Tue, 10 Jul 2012 00:12:14 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf_kernel.cl (was: Sayantan:Weekly Report #11)

On Mon, Jul 9, 2012 at 10:55 PM, Sayantan Datta <std2048@...il.com> wrote:

> Hi Alexander,
>
>
> On Mon, Jul 9, 2012 at 1:02 PM, Sayantan Datta <std2048@...il.com> wrote:
>
>>> I'd try setting WORK_GROUP_SIZE to 12, keep the declaration of S_Buffer
>>> at its current size - introduce some new macro for this, like
>>> LDS_GROUP_SIZE, which we'd keep at 8 for 7970.  If lid is <
>>> LDS_GROUP_SIZE, then use the current code.  If lid is >= LDS_GROUP_SIZE
>>> (would be 8, 9, 10, or 11 under this example), then use new code that
>>> would use global memory instead (just modify the supplied BF_current_S
>>> directly?)
>>
>>
> I think this is not a good idea, because all of the 12 work items in a
> work group would be executed on a single SIMD and we might see a
> slowdown instead of any speedup.  Here's the reason:
> 1. Due to branching within the work group, the execution of the two
> branches would get serialized.
> 2. You still cannot increase the number of SIMD units used per CU,
> because each work group still eats 32KB of LDS.  So we are again limited
> to 2 SIMD units per CU.
>
> Here's my plan:
> Let us assume that each work item using LDS takes T seconds.
> Also assume each work item using global memory takes x*T seconds, where x
> is unknown and is to be determined by experiment.
> Keep the work group size at 8 as before.
> Schedule the work items like this:
> [(16 LDS work items + 16 GDS work items) X 32 + (x-1)(16 LDS work items)
> X 32] X MULTIPLIER
>
> In this case there is no branching within a work group, which means each
> work group can execute independently on separate SIMDs.  Also, all SIMD
> units will be utilized, even though occupancy within the SIMDs still
> remains halved.  Depending upon x, we would get the following speedup:
>
> Here's the calculation:
>
> When using only LDS:         speed = 512 hashes / T
> When using both LDS and GDS: speed = {1024 + (x-1)*512} hashes / (x*T)
>
> speedup factor = (both LDS and GDS) / (only LDS)
>                = {1024 + (x-1)*512} / (x*512)
>                = (x+1)/x
>
> So for x = 5 we have a speedup factor of 1.2, i.e. 20%;
> for x = 10 the speedup would be 10%.
>
> Regards,
> Sayantan
>
>
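To make the quoted plan a bit more concrete, here is a rough sketch of what
the two kinds of work items could look like in OpenCL C.  This is only an
illustration under my assumptions, not the actual bf_kernel.cl code:
S_Buffer and BF_current_S are the names from the discussion above, while
the kernel names, the prototypes and the exact S-box layout (4 KB of
S-boxes per work item, i.e. 32 KB of LDS per 8-item work group, matching
the 32 KB figure above) are assumed.

#define WORK_GROUP_SIZE 8
#define S_BOX_WORDS     1024        /* 4 x 256 uints = 4 KB per work item */

/* "LDS work items": the whole work group keeps its S-boxes in LDS. */
__kernel void bf_lds(__global uint *BF_current_S)
{
    __local uint S_Buffer[WORK_GROUP_SIZE * S_BOX_WORDS];      /* 32 KB */
    __local uint *S = &S_Buffer[get_local_id(0) * S_BOX_WORDS];
    /* ... copy this work item's S-boxes in from BF_current_S, run the
       Blowfish rounds against S, write the result back ... */
}

/* "GDS work items": same algorithm, but no LDS is allocated at all, so
   these work groups can occupy the SIMD units the LDS variant leaves idle. */
__kernel void bf_gds(__global uint *BF_current_S)
{
    __global uint *S = &BF_current_S[get_global_id(0) * S_BOX_WORDS];
    /* ... run the same Blowfish rounds directly against global memory;
       each pass is expected to take roughly x times longer ... */
}

Since no work group ever mixes the two paths, there is no divergence to get
serialized, and the host side only has to enqueue the two kernels in the
proportions given by the schedule above.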
I also made an attempt to find out the value of x experimentally. It comes
out to be around 8, so we would see around a 10% performance improvement
over the pure LDS implementation.
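For reference, a quick sanity check of that estimate, using the formula from
the quoted calculation (plain C; the measured x is the only input, and the
512/1024 hash counts are the ones from the calculation above):

#include <stdio.h>

int main(void)
{
    double x = 8.0;                      /* measured GDS/LDS cost ratio      */
    double lds_only = 512.0;             /* hashes per T when using LDS only */
    double mixed = (1024.0 + (x - 1.0) * 512.0) / x;  /* hashes per T, mixed */

    /* prints 1.125, i.e. (x+1)/x for x = 8 */
    printf("speedup factor = %.3f\n", mixed / lds_only);
    return 0;
}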

Regards,
Sayantan

