john-dev - Re: cmp_all on gpu for descrypt-opencl

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+TsHUALu9EfUzEkWsjdr9P7W6zD7WcKF90yX1nicejeUA39KA@mail.gmail.com>
Date: Wed, 22 May 2013 11:08:43 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: cmp_all on gpu for descrypt-opencl

Hello,

On Wed, May 22, 2013 at 12:21 AM, Solar Designer <solar@...nwall.com> wrote:

> On Tue, May 21, 2013 at 10:55:04PM +0530, Sayantan Datta wrote:
> > I have been able to run cmp_all() on GPU but the results aren't that
> good.
> > In fact we have little performance penalty with cmp_all() on gpu.
>
> I guess you mean "a little", not "little".
>
> > I think the overhead involved is too much and the kernel has lot of
> > branching(that depends on user data).
>
> Yes, in cmp_all() the branching in fact depends on user data, unlike in
> the current crypt_all().  To reduce the impact of this, the up to 64
> iterations loop may be split in two, as I mentioned in a comment here:
>
>
> http://www.reddit.com/r/crypto/comments/162ufx/research_project_opencl_bitslice_des_bruteforce/
>
> In my testing, the optimal numbers were 20 iterations in the first loop
> and 44 in the second.  This was not in our descrypt-opencl code, but it
> was in another bitslice DES implementation on GPU, which is similar in
> this respect.  I think the same or similar numbers will be optimal for
> us as well.  From the reddit comment:
>

In our cmp_all() code we only have 32 iterations split into 4(full unroll)
and 28(2x unroll) iterations. Which loop are we talking about here ?


>
> "[...] split this loop in two, only proceeding with the second half if
> the first indicates there still might be a match.  These specific speed
> numbers are for 20 iterations in the first loop (and 44 in the second).
> The first loop's iteration count needs to be high enough that a branch
> will actually be made most of the time, letting all SIMDs in a CU fully
> skip the second loop."
>

I understand what you mean. But if a branch is made in the first 20
iterations , we skip the next loop with 44 iterations but we still have a
branch!! How is that different from branching out of a 64 iter loop ? How
is skipping the second loop going to improve performance (the next 44 iters
would be skipped even if the loop has 64 iters ) ? If both the loops have
same instructions we are not going to see any improvement in terms of
i-cache perspective, right ?
   Unless what you mean is that there is no branch in first 20 iter loop
but somehow we are able to figure out if we need to skip the second loop
based on the first loop. We might add a jump statement after the first loop
and before second loop.


>
> > Also the amount of work is very less
> > compared to overhead. One possible way to improve performance might be
>  to
> > load all hashes instead of just one per cmp_all() call. But that doesn't
> > seems possible with current format.
>
> It is possible with the current (new) format: you implement the
> comparisons in crypt_all() instead of in cmp_all().  This is why I added
> the "struct db_salt *salt" argument to crypt_all().  That way, you also
> avoid a second kernel invocation.
>
> I posted an example of this, for LM hashes on CPU, to john-dev last summer:
>
> http://www.openwall.com/lists/john-dev/2012/07/21/1
>
> You need to look at the changes for DES_bs_crypt_LM() - you may search
> that posting for "This integer division may be slow - need to revise" to
> find the right place.  (BTW, this comment gives the reason why I haven't
> merged this portion of the changes into the core tree... but I think
> magnum did merge it into bleeding anyway.)
>
> Alexander
>

I found that post and downloaded the patches( BTW DES_bs_crypt_LM() patch
is not in bleeding). I applied the patch to current bleeding and some of
the variables seems to magically appear during compilation e.g retval.

            for (index = start; index < end; index++) {
                unsigned int hash = salt->index(index);
                if (salt->bitmap[hash / (sizeof(*salt->bitmap) * 8)] &
                    (1U << (hash % (sizeof(*salt->bitmap) * 8)))) {
                    struct db_password *pw;
                    pw = salt->hash[hash >> PASSWORD_HASH_SHR];
                    do {
                        if (DES_bs_cmp_one(pw->binary, 64, index)) {
#pragma omp critical
                            retval = keys_count;
                            goto done;
                        }
                    } while ((pw = pw->next_hash));
                }
            }
done:
            ;
        }
#endif
    }

#if DES_bs_mt
    return retval;

Focusing on the main loop  it seems like we should return
keys_count(updated keys count if generating passwords) on finding a match
else we should return 0, right ? Also I didn't understand the bitmap stuff
above.
I saw loader.h for bitmap but it is still not clear.

Regards,
Sayantan

Content of type "text/html" skipped
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.