john-dev - Re: GCN: indexed access to VGPRs



Message-ID: <20121213014949.GA11207@openwall.com>
Date: Thu, 13 Dec 2012 05:49:49 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: GCN: indexed access to VGPRs

On Mon, Dec 10, 2012 at 03:44:36AM +0400, Solar Designer wrote:
> On Sun, Dec 09, 2012 at 02:38:18PM +0200, Milen Rangelov wrote:
> > Perhaps though, smaller chunk of the sbox in VGPRs would be beneficial, I
> > just did not try that possibility.
> 
> We'd have an if/else then - and if it's implemented with eager
> execution, then we incur the LDS access latency even when the data is in
> fact in a register.  What we gain is a slightly higher number of
> concurrent bcrypt instances per CU (18 instead of 16 if we put one half
> of one S-box into registers?)

I've experimented with this a bit, based on Sayantan's code.  With a
"? ... :" conditional, I got a local maximum at 176 elements in the
private array.  However, in absolute terms the speed is poor (much worse
than LDS-only).

When trying to use 128 elements (one half of S-box 4) and bitselect(),
self-test fails on 7970 - but works fine on GTX 570 (slow).  I tried
re-arranging the code in various ways, but no luck.  Seems like we're
hitting some AMD bug.  Will need to try again after upgrading Catalyst
on bull (is there a newer version of the SDK too?)  I find it unlikely
that we'll see any performance gain from this, though.

The important lines are:

	tmp1 = L & 0x7f; \
	tmp1 = bitselect(Sptr4[tmp1], S4_2[tmp1], (uint) -(int) !!(L & 0x80)); \

and there are alternatives to them in the #if 0 ... #endif block, e.g.:

	tmp1 = L & 0x7f; \
	tmp1 = bitselect(Sptr4[tmp1], S4_2[tmp1], (uint)((int)(L & 0x80) << 24 >> 31)); \

(also tested on GTX 570, works fine there).

(Oh, I just realized that in the version with shifts, I don't need the
"& 0x80": the low 7 bits would be shifted out by the right shift anyway,
and any bits above bit 7 by the left shift.)

On the 7970, the versions with "? ... :" work, but all of those that use
a bitmask mysteriously fail.
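
For reference, here is a minimal host-side sketch in plain C of what the
two mask constructions and the "? ... :" form are expected to compute.
It is not the kernel code: bitselect_u32() is a hypothetical stand-in
for OpenCL's bitselect(), the lo/hi tables stand in for Sptr4 and S4_2,
and the shift variant is written with unsigned arithmetic (an assumption
made here to stay clear of signed-shift pitfalls in host C).

```c
#include <stdint.h>

/* Hypothetical stand-in for OpenCL bitselect(a, b, c):
 * each result bit comes from b where c has a 1 bit, from a otherwise. */
uint32_t bitselect_u32(uint32_t a, uint32_t b, uint32_t c)
{
	return (a & ~c) | (b & c);
}

/* All-ones mask if bit 7 of L is set, else zero (negation variant). */
uint32_t mask_neg(uint32_t L)
{
	return (uint32_t)-(int32_t)!!(L & 0x80);
}

/* Same mask via shifts, without the "& 0x80": the left shift drops
 * bits above bit 7, the right shift drops the low 7 bits.
 * Unsigned equivalent of the kernel's arithmetic shift. */
uint32_t mask_shift(uint32_t L)
{
	return 0u - ((L << 24) >> 31);
}

/* Lookup split across two 128-entry halves, conditional variant. */
uint32_t lookup_ternary(const uint32_t *lo, const uint32_t *hi, uint32_t L)
{
	uint32_t i = L & 0x7f;
	return (L & 0x80) ? hi[i] : lo[i];
}

/* Same lookup, bitmask variant: both halves are read unconditionally. */
uint32_t lookup_mask(const uint32_t *lo, const uint32_t *hi, uint32_t L)
{
	uint32_t i = L & 0x7f;
	return bitselect_u32(lo[i], hi[i], mask_neg(L));
}

/* Returns 1 if all variants agree for every 16-bit value of L. */
int masks_agree(void)
{
	uint32_t lo[128], hi[128], L;

	for (L = 0; L < 128; L++) {
		lo[L] = 0x10000 + L;	/* dummy data, distinct per half */
		hi[L] = 0x20000 + L;
	}
	for (L = 0; L < 0x10000; L++) {
		if (mask_neg(L) != mask_shift(L))
			return 0;
		if (lookup_mask(lo, hi, L) != lookup_ternary(lo, hi, L))
			return 0;
	}
	return 1;
}
```

On a host CPU all three forms should agree, so this only documents the
intended semantics; it does not reproduce or explain the failure seen on
the 7970.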

Alexander

View attachment "bf_kernel.cl.diff" of type "text/plain" (4921 bytes)

View attachment "bf_kernel.cl" of type "text/plain" (9367 bytes)
