john-dev - Re: cmp_all on gpu for descrypt-opencl

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130522102213.GB6483@openwall.com>
Date: Wed, 22 May 2013 14:22:13 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: cmp_all on gpu for descrypt-opencl

On Wed, May 22, 2013 at 11:08:43AM +0530, Sayantan Datta wrote:
> On Wed, May 22, 2013 at 12:21 AM, Solar Designer <solar@...nwall.com> wrote:
> > In my testing, the optimal numbers were 20 iterations in the first loop
> > and 44 in the second.  This was not in our descrypt-opencl code, but it
> > was in another bitslice DES implementation on GPU, which is similar in
> > this respect.  I think the same or similar numbers will be optimal for
> > us as well.  From the reddit comment:
> 
> In our cmp_all() code we only have 32 iterations split into 4(full unroll)
> and 28(2x unroll) iterations. Which loop are we talking about here ?

This one same loop.  On CPU, we currently split it into four pieces, of
the following iteration counts: 2, 2, 28, and 32 - these last 32 are in
cmp_one().  You're referring to the unroll counts, but these are a
separate thing.  Note that there are "if ... goto next_depth;" lines
inside the unrolled loops.

As to the unrolls, we should tune them as well.

> > "[...] split this loop in two, only proceeding with the second half if
> > the first indicates there still might be a match.  These specific speed
> > numbers are for 20 iterations in the first loop (and 44 in the second).
> > The first loop's iteration count needs to be high enough that a branch
> > will actually be made most of the time, letting all SIMDs in a CU fully
> > skip the second loop."
> 
> I understand what you mean. But if a branch is made in the first 20
> iterations , we skip the next loop with 44 iterations but we still have a
> branch!!

Yes, but this one branch acts as if it were not data-dependent most of
the time.

> How is that different from branching out of a 64 iter loop ?

When you include the possibility of branching out of the loop early on,
the resulting code (at ISA level) includes overhead associated with that
to manage the execution masks and eventually actually branch out (when
the condition is finally met for all vector elements in the current SIMD
unit).  Each loop's iteration for a given work-item becomes conditional
upon completion of the previous iterations, so the compiler might have
difficulty inter-mixing instructions for multiple iterations of the
unrolled loop as would be desirable to hide their latencies.  It'd think
it needs to adjust the execution masks before proceeding with each
iteration, thereby turning potentially parallel execution into sequential.

(This aspect is seen on CPU as well, but to a much lesser extent.  This
is why I grouped the first 2 and the second 2 loop iterations together,
with no check for possible (unrolled) loop exit between first and second,
and between third and fourth iterations.)

Additionally, even after a SIMD unit has actually branched out of the
loop (and reached the end of kernel?), there's presumably some waiting
time incurred if computation for other work-items does not complete as
quickly, so there's no gain from exiting too early.  This may be related
to the OpenCL implementation (might be avoidable with lower-level
programming), and I guess it's a per-kernel rather than per-loop
synchronization issue.  I am not familiar with this.  I make this guess
from the symptoms: I think the optimal first loop iteration count would
be a lot lower than 20 otherwise.

> How
> is skipping the second loop going to improve performance (the next 44 iters
> would be skipped even if the loop has 64 iters ) ?

Yes, they would be skipped in the same cases anyway, but early loop
iterations' cost would be higher if the loop could be exited on any
iteration rather than only after 20 iterations.

> If both the loops have
> same instructions we are not going to see any improvement in terms of
> i-cache perspective, right ?

I was not focusing on I-cache usage, but since you brought this up:
actually, the version of the first (unrolled) loop without early exit
possibility probably has smaller code size (than similarly unrolled loop
with possible exits after each iteration).

Think of it this way: what's the probability that a SIMD unit will
actually branch out after the first iteration? after the second
iteration? after the third?  It's quite small.  Is it worth the overhead
of checking and the (higher) overhead of avoiding parallel (or rather
pipelined) computation of these early iterations?  No.  It takes quite
some iterations for the probability to become high enough for the
overhead to be worth it.

>    Unless what you mean is that there is no branch in first 20 iter loop
> but somehow we are able to figure out if we need to skip the second loop
> based on the first loop. We might add a jump statement after the first loop
> and before second loop.

This is what I mean.  However, we can also have this check inside the
second loop, because the probability of (not so) early exit is already
high enough by the time we reach it and the probability of not exiting
the loop by a given iteration number is halved with each iteration.

We may also try some hybrid schemes, such as checking for (not so) early
exit at every N iterations of the second loop (unrolled), testing values
of N from 1 to 22 (if the second loop's iteration count is 44).

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.