Date: Fri, 28 Sep 2012 21:24:55 +0200
From: magnum <>
Subject: Re: Benchmarking Milen's RAR kernel in JtR (was: RAR early reject)

> On Fri, Sep 28, 2012 at 12:58 AM, magnum <> wrote:
> I used the version for GCN and hoped it would be fairly good for nvidia too. But to my surprise it's 7-8% slower than my kernel on a GTX 570, with 3952 c/s at a GWS of 16384 and a kernel duration of 4.2 seconds (my kernel does 4250 c/s in 3.8 seconds). But then again your kernel is optimised for AMD; I do see details that should be changed for nvidia, so it might end up faster if tweaked.
> On the 7970 though, it's more than three times faster than mine, at over 13000 c/s (mine only does 4172 c/s). For some reason it currently fails self-test, but that's probably trivial and not important now. Disregarding that, the raw speed is 13162 c/s at a GWS of 16384, with a kernel duration of "only" 1.2 seconds.
> magnum

On 28 Sep, 2012, at 0:10 , Milen Rangelov <> wrote:
> Yes, I know exactly why it behaves badly on nvidia (the SET_AB macro is particularly to blame, though not only there). Anyway, very recently I did some testing, shifting w[] values back to private memory, and to my surprise it was faster on AMD (and not on my NVidia 640GT, which was expected; Kepler is very easily GPR-starved :( ).

It is? I thought it had plenty.

> My code is not at all optimal, and with a better SET_AB these idiotic Endian_Reverse calls can possibly be skipped. That's something I will investigate soon.

That is very easy and gives a noticeable speedup. Just rewrite SET_AB for big endian: do the endian swap once when initializing d0, d1, ... and then skip all other endian swaps except when writing the serial.

I see lots of other minor things that can be skipped (maybe you already have). For example, this:


can be replaced by just


Because all the others are nulled in LOOP_BODY anyway. Not much of a boost though.

It took me half a day to understand how the h3ck you can have all IV blocks always aligned just like the initial IV block, and the last final() always empty (just the 0x80 and the length). But that is, of course, how it ends up regardless of plaintext length, due to the x16384 and the % 64. I never realised that. Too bad that optimisation is outside the inner loop.

BTW I have this idea:

At init, create a buffer in local memory that holds "password.salt.000" four times in a row (already endian swapped, of course). Regardless of password length, this buffer can be used in the inner loop for 32-bit aligned copies to the sha1 buffer. No bit shifts, no char macros. I just need to come up with some macros for finding the offset to copy from, and where to update the serials.

Then in the inner loop, just build a whole 64-byte block at a time (i.e. think "blocks" instead of "iterations" - but it's tricky!), update the serials and call sha1_update(). If this can be cleverly implemented I think it should be very fast.

I also feel an absolute need for splitting the kernel so each invocation is 100-200 ms (probably an inner loop kernel with 512 iterations). But this format has a lot of data needing to be kept in global memory, especially if implementing that quad buffer idea.

