john-dev - Re: Benchmarking Milen's RAR kernel in JtR (was: RAR early reject)

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABh=JREHCHWvZ9x2LiMsOwcKDF6LUbTuAtuWGCeHcgzBT=DbLQ@mail.gmail.com>
Date: Sat, 29 Sep 2012 00:26:10 +0300
From: Milen Rangelov <gat3way@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Benchmarking Milen's RAR kernel in JtR (was: RAR early reject)

Hello,


> It is? I thought it had plenty.
>
>
Uh...no. Kepler is anti-GPGPU architecture :( You've probably read that
register file per CU was doubled as compared to Fermi. However, Fermi
(sm_20) packed just 48 cores into a CU while Kepler's compute unit has 192.
The warp size is the same, differences are in the way instructions are
scheduled. With sm21, instruction scheduling was changed in a way that 4
instructions were issued in 2 warps instead of 2 intructions in 2 warps,
which in turn made vectorization necessary to achieve better speeds. This
still holds true for Kepler (though it's 8/4 now, not 4/2). However while
Kepler has 2x the register file per CU, it also has 4x the cores per CU.
That was interesting so I finally I got a Kepler GPU (cheap GT640) to see
how it behaves. Well, now vectorization is useless cause it leads to slower
code, unless the kernel is very lightweight (such as md4) and does not
utilize a lot of GPRs. Together with that, "fatter" kernels are GPR-starved
enough so that occupancy begins to suffer a lot. With rar you can't notice
that anyway since you can't run it with high GWS (I run mine with GWS=4096)
because it is damn slow (kernel execution time of several seconds, desktop
unresponsive, etc). Anyway there are kernels (such as wpa for example) that
tolerate higher GWS and there I  noticed that my previous GT430 is
faster...that's because unlike GT430, the GT640 cannot allow good occupancy
so that latencies are "hidden".



> > My code is not at all optimal and with better SET_AB this idiotic
> Endian_Reverses can possibly be skipped. That's something I will
> investigate soon.
>
> That is very easy and gives a noticable speedup. Just rewrite SET_AB for
> big endian. Do the endian swap when initializing d0, d1, ... and then skip
> all other endian swaps except when writing the serial.
>

Then adding the counter values becomes more complex. However this shouldn't
be a stopper.



>
> I see lots of other minor things than can be skipped (maybe you already
> have): For example, this:
>
>
> w[GLI][0]=w[GLI][1]=w[GLI][2]=w[GLI][3]=w[GLI][4]=w[GLI][5]=w[GLI][6]=w[GLI][7]=w[GLI][8]=w[GLI][9]=w[GLI][10]=w[GLI][11]=w[GLI][12]=w[GLI][13]=w[GLI][14]=w[GLI][15]=0;
> LOOP_BODY(16384*12);
>
> can be replaced by just
>
> w[GLI][0]=0;
> LOOP_BODY(16384*12);
>
> Because all the others are nulled in LOOP_BODY anyway. Not much of a boost
> though.
>
>
I think the compiler would eliminate those anyway. This looks like an
optimization that can be easily done in compile-time.



> It took me half day to understand how the h3ck you can do all IV blocks
> always aligned just like the initial IV block, and the last final() always
> empty (just 0x80 and length). But that is, of course, how it ends up
> regardless of plaintext length due to the x16384 and % 64. I never realised
> that. Too bad that optimisation is out of the inner loop.
>
>
yes, that's trying to play smart :) However it does not help much since
it's nothing as compared to the big loop..



> BTW I have this idea:
>
> At init, create a buffer that holds "password.salt.000" four times in a
> row in local memory (already endian swapped of course). Regardless of
> password length, this buffer can be used  in the inner loop for 32-bit
> aligned copy to the sha1 buffer. No bitshifts, no char macros. I just need
> to come up with some macros for finding the offset to copy and where to
> update the serials.
>
>
I've spent some time thinking about that actually...the biggest problem is
that serial number updates. It makes things complicated :(



> Then in the inner loop, just build a whole 64-byte block at a time (i.e.
> think "blocks" instead of "iterations" - but it's tricky!), update the
> serials and call sha1_update(). If this can be cleverly implemented I think
> it should be very fast.
>
>
yes, that would spare you one branch at least. The bad thing is that I
can't think of a way to update the counter values (serials) without
branching, so no big win :(



> I also feel an absolute need for splitting the kernel so each invocation
> is 100-200 ms (probably an inner loop kernel with 512 iterations). But this
> format has a lot of data needing to be kept in global memory, especially if
> implementing that quad buffer idea.


You could at least avoid keeping the w[] state if you split it into chunks
of the "right" iteration count, depending on plain len. This would make the
last kernel a bit more complex though. Anyway I don't like the idea of
reading and writing to global memory all the time...

Regards,
Milen

Content of type "text/html" skipped
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.