john-dev - Re: des-opencl

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130107175644.GA20530@openwall.com>
Date: Mon, 7 Jan 2013 21:56:44 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: des-opencl

Hi Sayantan -

On Mon, Jan 07, 2013 at 02:51:04PM +0530, Sayantan Datta wrote:
> On Mon, Jan 7, 2013 at 9:01 AM, Solar Designer <solar@...nwall.com> wrote:
> 
> > Another curious detail about mysterymath's implementation is that it
> > obviously exceeds instruction cache, yet is reasonably fast on GCN.
> > We should try that too as it avoids pointer indirection and reduces
> > register pressure due to hard-coding key bit indices (no key schedule
> > array needed anymore).  Lukas reported very poor performance on 5850,
> > though.
> 
> Yes I tried that too, but speed was far worse due to hardcoding. Although I
> tried this by hard-coding only k=0 and k=96 , there was a huge performance
> drop, probably due iCache overrun. To fully hard-code the kernel we need to
> do that for k=0 to 720 with a gap of 48 which should be around 16 times
> more code(instructions) than we currently have in the kernel.

There's some confusion here.  What exactly did you try?  Fully unroll
the DES rounds loop (all 16 rounds are sequential pieces of code, with
only the 25-iterations loop around them), but then only substituted some
fixed key bit indices, continuing to use the key schedule array for the
rest (even though you no longer had to)?  I am not sure we understand
each other correctly here...

Why 0 to 720?  I think it's 0 to 767.  Why gap?

Why 16 times more code?  We currently have 2 DES rounds unrolled
(looping over them 8 times).  Fully unrolled would be at most 8 times
larger, and in practice less than that because the reduced pointer
indirection also reduces code size - so maybe ~6x bigger code overall.

Yes, it wouldn't fit in the instruction cache on current devices, but
maybe that's acceptable when the number of waves is large enough (so
each portion of code fetched into the cache is made use of multiple times
by the different waves).

> > The speed seen during actual cracking is a lot lower - about 1M c/s,
> > not 44M c/s as --test would suggest (this is for many salts).  Why is
> > that?  How to avoid this problem?
> 
> If you check the speed just after it starts, it should be quite low. It
> should increase rapidly after the compilation of all kernels with different
> salt is finished.

Oh, you're right.  Somehow I thought this version of code didn't have
that.  We need a version of code that uses neither multiple kernels nor
runtime binary patching - just with the E[] array as we had before.
Yes, it'd be slower, but I think we should keep that as well anyway - to
be used when the approach with specializing kernels for a salt value
somehow fails or when the startup time needs to be reduced.

I'm not sure how to keep both (or all three?) approaches in the same
source tree best, though.  3 formats?  Or a format with compile-time
fallbacks (e.g., use binary patching when the target GPU type is one of
those where we've tested this and it works, and fallback to E[] for
other devices?)  Perhaps we'll make a "final" determination on that at a
later time, but for now we simply need to have these available.

Meanwhile, more comments were posted to the reddit thread - including a
comment by atom.  He managed to get 6 billion keys/s with a revision of
the OP's code - looks like we need to upgrade Catalyst and experiment
with this some more, too.

Thanks,

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.