Date: Fri, 28 Sep 2012 01:10:27 +0300
From: Milen Rangelov <>
Subject: Re: Benchmarking Milen's RAR kernel in JtR (was: RAR early reject)

On Fri, Sep 28, 2012 at 12:58 AM, magnum <> wrote:

> On 17 Aug, 2012, at 20:02 , magnum <> wrote:
> > On 2012-08-17 19:54, Milen Rangelov wrote:
> >> On Fri, Aug 17, 2012 at 3:26 PM, magnum <> wrote:
> >>> On 2012-08-17 09:19, Milen Rangelov wrote:
> >>>> May I borrow it for my project?
> >>>
> >>> But of course! If you like you can send me a rar kernel I can get hints
> >>> from, as a courtesy ;-)  Doesn't need to be complete runnable code, just
> >>> a kernel. I think my key stretching loop is the bottleneck.
> >>
> >> Yeah, here it is, but I warn you, it's scary :)
> >>
> >
> > Thanks! I'll have a look. It just can't be any more scary than mine :)
> OK it was more scary, LOL. Took me a while to figure out. I could not even
> understand the code until I got the idea to auto-indent it. Then things got
> more clear. It's good code, only one single branch - and that one's simply
> not avoidable.
> Just as a benchmark, I tucked the 6-char version of it into our RAR format
> and made some slight adjustments to the argument list and output layout to
> make it work. To my utter surprise it worked on the first try: it passes
> self-test on nvidia. I did not really expect everything to be correct without
> changing some data layout in host code. This won't be useful as-is, since I
> can't use fixed-length except when testing (using test vectors of that
> length :) but it gives me a benchmark - and if possible I will try to make my
> kernel better by stealing ideas from it.
> I used the version for GCN and hoped it would be fairly good for nvidia
> too. But to my surprise it's 7-8% slower than my kernel on GTX 570, with
> 3952 c/s @16384 and a duration of 4.2 seconds (my kernel does 4250 c/s in
> 3.8 seconds). But then again your kernel is optimised for AMD, I do see
> details that should be changed for nvidia so it might end up faster if
> tweaked.
> On the 7970 though, it's more than three times faster than mine at over
> 13000 c/s (mine only does 4172 c/s). For some reason it currently fails
> self-test but that's probably trivial and not important now so disregarding
> that, the raw speed is 13162 c/s at a GWS of 16384, and a kernel duration
> of "only" 1.2 seconds.
> magnum
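
As a rough sanity check on the figures quoted above, the sketch below recomputes the relative numbers in Python. The function names are hypothetical, and the last helper assumes one candidate per work-item with kernel time dominating the run, which the post does not state.

```python
def relative_slowdown(candidate_cps, baseline_cps):
    """Fraction by which candidate is slower than baseline (0.07 means 7%)."""
    return 1.0 - candidate_cps / baseline_cps


def speedup(candidate_cps, baseline_cps):
    """How many times faster candidate is than baseline."""
    return candidate_cps / baseline_cps


def cps_from_duration(gws, seconds):
    """Estimated c/s, ASSUMING one candidate per work-item and that
    the kernel duration accounts for essentially all of the run time."""
    return gws / seconds


# GTX 570: Milen's kernel (3952 c/s) vs magnum's kernel (4250 c/s)
gtx570_slowdown = relative_slowdown(3952, 4250)

# HD 7970: Milen's kernel (13162 c/s) vs magnum's kernel (4172 c/s)
hd7970_speedup = speedup(13162, 4172)

# Cross-check: GWS 16384 over the reported (rounded) 1.2 s kernel duration
hd7970_estimate = cps_from_duration(16384, 1.2)
```

With the reported values this gives a slowdown of about 7% on the GTX 570 and a speedup of a little over 3.1x on the 7970, in line with the "7-8%" and "more than three times" quoted above. The duration-based estimate lands near the reported c/s but not exactly on it, presumably because the durations are rounded.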

Yes, I know exactly why it behaves badly on nvidia (the SET_AB macro is
particularly to blame, though not the only culprit). Anyway, very recently I
did some testing, shifting the w[] values back to private memory, and to my
surprise it was faster on AMD (but not on my NVidia GT 640, which was expected:
Kepler is very easily GPR-starved :( ). My code is not at all optimal, and
with a better SET_AB these idiotic Endian_Reverses can possibly be skipped.
That's something I will investigate soon. Another problem I have is with
sha512unix, where I decided to do sbytes and pbytes on the CPU. That was a bad,
bad, bad decision I regret now :(
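
For readers unfamiliar with the Endian_Reverse step mentioned above: SHA-1, which RAR3 key stretching is built on, treats its message block as big-endian 32-bit words, while the GPUs discussed here are little-endian, so words assembled from raw bytes usually need a byte swap. The following is only an illustrative sketch in Python, not Milen's macro; the name `endian_reverse32` is hypothetical, and a real kernel would do this with shifts and masks (or a hardware byte-permute) on 32-bit registers.

```python
def endian_reverse32(x):
    """Reverse the byte order of a 32-bit word.

    Hypothetical helper for illustration; not the kernel's actual macro.
    """
    x &= 0xFFFFFFFF  # keep only the low 32 bits
    return (((x & 0x000000FF) << 24) |
            ((x & 0x0000FF00) << 8) |
            ((x & 0x00FF0000) >> 8) |
            ((x & 0xFF000000) >> 24))


# Four message bytes as a little-endian load would see them...
little = int.from_bytes(b"abcd", "little")
# ...and as SHA-1 expects them (big-endian).
big = int.from_bytes(b"abcd", "big")

# One swap converts between the two views.
swapped = endian_reverse32(little)
```

If the buffer-filling code can place bytes in big-endian order to begin with, the swap becomes unnecessary, which seems to be the kind of saving hinted at with "a better SET_AB".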
