Date: Sat, 29 Sep 2012 01:35:32 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Benchmarking Milen's RAR kernel in JtR (was: RAR early reject)

On 28 Sep, 2012, at 23:26, Milen Rangelov <gat3way@...il.com> wrote:

> > My code is not at all optimal and with a better SET_AB these idiotic Endian_Reverses can possibly be skipped. That's something I will investigate soon.
> 
> That is very easy and gives a noticeable speedup. Just rewrite SET_AB for big endian. Do the endian swap when initializing d0, d1, ... and then skip all other endian swaps except when writing the serial.
> 
> Then adding the counter values becomes more complex. However, this shouldn't be a stopper.

That's just a trivial macro. I tried this and it gave a 10-15% gain on a crappy 9600GT. Not sure how it would end up on a powerful AMD though.
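
Just to show the general idea (this is not your actual SET_AB - the macro name and layout below are made up), the trick is to keep the w[] words big-endian from the start and poke individual bytes with something like:

/* Sketch only: write byte `val` at byte offset `idx` into an array of
 * big-endian 32-bit words, so no later Endian_Reverse pass is needed. */
#define PUTCHAR_BE(buf, idx, val) \
    ((buf)[(idx) >> 2] = ((buf)[(idx) >> 2] & \
        ~(0xffU << ((3 - ((idx) & 3)) << 3))) | \
        ((uint)(val) << ((3 - ((idx) & 3)) << 3)))

With something like that, the password and salt words are swapped once at init and only the serial bytes need special handling in the inner loop.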

> I see lots of other minor things that can be skipped (maybe you already have). For example, this:
> 
> w[GLI][0]=w[GLI][1]=w[GLI][2]=w[GLI][3]=w[GLI][4]=w[GLI][5]=w[GLI][6]=w[GLI][7]=w[GLI][8]=w[GLI][9]=w[GLI][10]=w[GLI][11]=w[GLI][12]=w[GLI][13]=w[GLI][14]=w[GLI][15]=0;
> LOOP_BODY(16384*12);
> 
> can be replaced by just
> 
> w[GLI][0]=0;
> LOOP_BODY(16384*12);
> 
> Because all the others are nulled in LOOP_BODY anyway. Not much of a boost though.
> 
> 
> I think the compiler would eliminate those anyway. This looks like an optimization that can easily be done at compile time.

For non-OpenCL code I would agree, but even you said we can't trust a simple thing like #pragma unroll. If we can't trust that, I will not trust the compiler to do *any* decent optimizations. I have, though, confirmed that all your uses of SHA-1 constants (as ints) do end up as constants and don't consume registers. OTOH & BTW, you use a define for get_global_id(0), which might be OK for AMD but it sure kills performance for nvidia - nvidia benefits from storing gid (and lid) in registers.
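
Something like this at the top of the kernel (the signature here is made up, the point is just to read the IDs once):

__kernel void rar_kernel(__global uint *in, __global uint *out)
{
    /* Read the work-item IDs into private variables once, instead of
     * hiding get_global_id(0) behind a define that re-evaluates it in
     * every place it is used. */
    uint gid = get_global_id(0);
    uint lid = get_local_id(0);

    /* ... the rest of the kernel uses gid and lid ... */
}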

> BTW I have this idea:
> 
> At init, create a buffer that holds "password.salt.000" four times in a row in local memory (already endian-swapped, of course). Regardless of password length, this buffer can be used in the inner loop for 32-bit aligned copies to the sha1 buffer. No bitshifts, no char macros. I just need to come up with some macros for finding the offset to copy from and where to update the serials.
> 
> 
> I've spent some time thinking about that actually... the biggest problem is the serial number updates. They make things complicated :(
> 
>  
> Then in the inner loop, just build a whole 64-byte block at a time (i.e. think "blocks" instead of "iterations" - but it's tricky!), update the serials and call sha1_update(). If this can be cleverly implemented I think it should be very fast.
> 
> 
> Yes, that would spare you one branch at least. The bad thing is that I can't think of a way to update the counter values (serials) without branching, so no big win :(

I think it can be a win in the end. The serial updates can be fixed with a fairly trivial (although not yet defined) macro, and everything else should be a win. Just as a thought, we could store that quad buffer a couple more times and not copy it at all - just run a non-destroying sha1_block() (the one I currently use thrashes the input buffer) on the proper offset. This starts to get hairy, but I will try it.
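
To be concrete about "non-destroying", I mean something along these lines (an untested sketch with made-up names and address-space qualifiers left out - the real thing would of course share the round code with the existing sha1_block):

/* Sketch only: expand the message schedule into a private W[] copy so the
 * caller's buffer is never written and can be reused at another offset. */
inline void sha1_block_const(const uint *in, uint *h)
{
    uint W[80];
    uint a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];
    uint t;

    for (t = 0; t < 16; t++)
        W[t] = in[t];
    for (t = 16; t < 80; t++)
        W[t] = rotate(W[t - 3] ^ W[t - 8] ^ W[t - 14] ^ W[t - 16], 1U);

    for (t = 0; t < 80; t++) {
        uint f, k;

        if (t < 20)      { f = (b & c) | (~b & d);          k = 0x5a827999; }
        else if (t < 40) { f = b ^ c ^ d;                   k = 0x6ed9eba1; }
        else if (t < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8f1bbcdc; }
        else             { f = b ^ c ^ d;                   k = 0xca62c1d6; }

        uint tmp = rotate(a, 5U) + f + e + k + W[t];
        e = d; d = c; c = rotate(b, 30U); b = a; a = tmp;
    }

    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}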

> I also feel an absolute need for splitting the kernel so each invocation is 100-200 ms (probably an inner loop kernel with 512 iterations). But this format has a lot of data needing to be kept in global memory, especially if implementing that quad buffer idea.
> 
> You could at least avoid keeping the w[] state if you split it into chunks of the "right" iteration count, depending on plain len. This would make the last kernel a bit more complex though. Anyway I don't like the idea of reading and writing to global memory all the time...

I'm still trying to resist fixed-length kernels. If I see a 25% boost I will give in... but I presume you are right. The last kernel should not be an issue though; it will be a walk in the park compared to the inner-loop one.
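
For the record, the shape I have in mind is roughly this (names and fields are made up, just to show what would travel through global memory between invocations):

/* Sketch only: per-candidate state parked in global memory, so each enqueue
 * of the inner-loop kernel runs e.g. 512 iterations and stays in the
 * 100-200 ms range. */
typedef struct {
    uint h[5];      /* SHA-1 state so far */
    uint block[16]; /* partially filled 64-byte block */
    uint iter;      /* iterations done so far, which also gives the serial */
} rar_state;

__kernel void rar_inner(__global rar_state *state)
{
    uint gid = get_global_id(0);
    rar_state s = state[gid];   /* one load from global memory per call... */
    uint i;

    for (i = 0; i < 512; i++) {
        /* ... one iteration of the key-setup loop, updating s ... */
    }

    state[gid] = s;             /* ...and one store per call */
}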

Anyway, when I did that split-kernel thing for office-opencl I was surprised it did not harm performance at all - despite it storing and loading global memory every 128 iterations out of 100,000.

magnum
