john-dev - Re: bf

Message-ID: <CA+TsHUAHmJ3VBbxmtsckm0qCkX0hbeLLG-z74c4BGyVxCKC2vg@mail.gmail.com>
Date: Wed, 11 Jul 2012 16:41:22 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf_kernel.cl

On Wed, Jul 11, 2012 at 1:33 PM, Solar Designer <solar@...nwall.com> wrote:

> On Tue, Jul 10, 2012 at 08:13:22PM +0530, Sayantan Datta wrote:
> > I was looking at the IL generated on 7970 using LDS only.  Each Encrypt
> > call has approximately 540 instructions at the IL level.  However, according to
> > your previous estimates each Encrypt call has 16*16+4+5 = 275
>
> Any idea why we have so many more instructions?  Is it possibly because
> they're for two SIMD units?
>

At the IL level I can see that each LDS load or store is associated with 5 extra
instructions that compute the address of the data to be fetched or stored.
Here's an example.
This is the start of a BF_ENCRYPTION.  Other than the first two instructions,
all instructions are for loading the data.

    ixor r79.x___, r79.z, r79.x       ; r79.x == L
    iand r80.x___, r79.x, r69.x       ; r69.x == 255
    ior  r80.x___, r80.x, r69.z       ; r69.z == 768, i.e. add 768
    ishl r80.x___, r80.x, r72.y       ; r72.y == 2, i.e. multiply by 4 to address 4-byte uints
    iadd r80.x___, r74.z, r80.x       ; r74.z == 0, maybe some loop management etc., don't know for sure
    mov r1010.x___, r80.x             ; r1010.x stores the address
    lds_load_id(1) r1011.x, r1010.x   ; result stored in r1011.x
    mov r80.x___, r1011.x             ; result moved back to working register
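
In C terms, the address arithmetic in that sequence can be sketched roughly as
below.  This is only my reading of the IL: the function name sbox3_lds_addr and
the parameter lds_base (standing in for r74.z, which is 0 here) are made up for
illustration, and treating offset 768 as the start of the fourth S-box in a flat
array of 1024 uints is an assumption about the layout.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the iand/ior/ishl/iadd steps of the IL.
 * Returns the LDS byte address of the uint selected by the low byte of L.
 * lds_base stands in for r74.z; its real purpose is not known for sure. */
static uint32_t sbox3_lds_addr(uint32_t L, uint32_t lds_base)
{
    uint32_t idx = L & 255u;  /* iand: keep the low byte of L */
    idx |= 768u;              /* ior: equals adding 768, which has no bits in the low byte */
    idx <<= 2;                /* ishl by 2: scale the index by 4 for 4-byte uints */
    return lds_base + idx;    /* iadd: add the base offset */
}
```

For example, with L = 0x12345678 and a base of 0 the index is 768 + 0x78 = 888,
so the byte address is 3552.  The two mov instructions around lds_load_id have
no counterpart here; they only shuffle the address and the result between
registers.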


Everything else seems to be OK

> resulting in an estimated speed of 52K c/s.
>
> That was a theoretical/optimistic estimate, assuming that results of
> each instruction are available for use the very next cycle and that
> scatter/gather addressing is used.
>
> > Since the number of instructions is doubled we
> > should expect at least half of your previous estimates, say roughly 26K c/s.
>
> You probably meant "at most".  However, this is not necessarily right -
> we need to find the cause of the doubled instruction count first.
> If it's because of explicit separate instructions for use of two SIMD
> units, then this does not halve the estimated speed.
>
>
So it is not because of the 2 SIMD units; the extra instructions come from the
LDS address computation shown above.


> > But we are nowhere near that.  I guess your previous estimates were based
> > on the fact that each instruction takes 1 clock cycle to execute, is it?
> > But it looks like not all instructions require the same number of clock cycles on
> > the GPU.
>
> I think all relevant instructions can execute in 1 cycle in terms of
> throughput when there's sufficient parallelism available, but we do not
> have sufficient parallelism here (because of limited LDS size), so we
> incur stalls because of instruction latencies greater than 1 cycle.
> This is no surprise.  My estimates were in fact theoretical/optimistic
> rather than realistic.
>
> However, one thing to double-check is whether we're using gather
> addressing for the S-box lookups or maybe not (which would mean that we
> use at most one vector element per SIMD unit).  In the latter case, a
> speedup should be possible from using 4 instead of 2 SIMD units.  If you
> see no such speedup, then this suggests that we're probably using gather
> just fine now, and not reaching the theoretical speeds (by far) is
> merely/primarily because of instruction latencies.
>
>
Looking at the IL, I can't say for sure whether it is using gather addressing.  It
is the same CL code written in assembly.  However, judging by the benchmarks, it
seems like we are using gather addressing.

Using 4 SIMDs, i.e. setting work group size = 4:

OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
****Please see 'opencl_bf_std.h' for device specific optimizations****
Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
Raw:    4241 c/s real, 238933 c/s virtual

Using 2 SIMDs, i.e. setting work group size = 8:

std2048@...l:~/bin/run$ ./john -te -fo=bf-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
****Please see 'opencl_bf_std.h' for device specific optimizations****
Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
Raw:    4216 c/s real, 238933 c/s virtual

There is a small speedup from using 4 SIMDs, but that may be due to
reduced bank conflicts.


> BTW, in the wiki page at http://openwall.info/wiki/john/GPU/bcrypt you
> mention Bulldozer's L2 cache.  JFYI, this is irrelevant, since on CPUs
> the S-boxes fit in L1 cache.  We only run 4 concurrent instances of
> bcrypt per Bulldozer module (two per thread, and there's no need to run
> more), so that's only 16 KB for the S-boxes.
>
> Alexander
>

Thanks for clearing that up.  I will change that statement.
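
Just to spell out the 16 KB figure: Blowfish uses four S-boxes of 256 32-bit
entries each, i.e. 4 KB per bcrypt instance.  A small sketch of that arithmetic
(the names BF_SBOXES, BF_SBOX_ENTRIES and sbox_bytes are made up for
illustration and are not taken from the John the Ripper sources):

```c
#include <stddef.h>
#include <stdint.h>

/* Blowfish S-box geometry: 4 S-boxes, 256 entries each, 32 bits per entry. */
enum { BF_SBOXES = 4, BF_SBOX_ENTRIES = 256 };

/* Total S-box storage in bytes for the given number of concurrent
 * bcrypt instances. */
static size_t sbox_bytes(size_t instances)
{
    return instances * BF_SBOXES * BF_SBOX_ENTRIES * sizeof(uint32_t);
}
```

sbox_bytes(1) is 4096 and sbox_bytes(4) is 16384, which matches the 16 KB for
4 concurrent instances per Bulldozer module.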

Regards,
Sayantan
