Date: Sat, 29 Sep 2012 03:02:46 +0300
From: Milen Rangelov <gat3way@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Benchmarking Milen's RAR kernel in JtR (was: RAR early reject)

>
> That's just a trivial macro. I tried this, it was 10-15% gain on a crappy
> 9600GT. Not sure how it would end up on a powerful AMD though.
>
>
Hmm, I definitely need to try that. My guess is that the gain on AMD would be
smaller in percentage terms (since the reversals use bitselect, and thus
bfi_int, on AMD), but of course larger in absolute c/s.



> For non-OpenCL code I would agree but even you said we can't trust a
> simple thing like #pragma unroll. If we can't trust that, I will not trust
> the compiler to do *any* decent optimizations. I have, though, confirmed
> that all your use of SHA-1 constants (using ints) indeed end up as
> constants not using registers. OTOH&BTW you use a define for
> get_global_id(0) which might be OK for AMD but it sure kills performance
> for nvidia - it benefits from storing gid (and lid) in a register.
>
>
Well, that's quite a basic optimization IMO - dead code elimination. The
OpenCL compiler sometimes does quite well at optimizing code. For the nvidia
part though... well, the nvidia compiler is just stupid. get_global_id()
works by multiplying the group id by the work-group size and then adding the
local id (all of which are kept in hardware registers). AMD's OpenCL
compiler does that multiplication once at the start and reuses the result
for every subsequent get_global_id() call. Apparently nvidia's does not. BTW,
an obvious optimization for AMD is declaring your own get_global_id()
function using mad24, in case your GWS is known to be <2^24. This is a
single instruction on VLIW hardware and, unlike the MULLO_INT that OpenCL
uses, it has no restrictions regarding the x/y/z/w/t units. You often end up
one clause shorter, which can be a moderate win :) This optimization has a
small but noticeable effect on some fast plugins like md5 or ntlm on VLIW
hardware :)


> I think it can be a win in the end. The serial updates are fixed with a
> fairly trivial (although not yet defined) macro. Everything else should be
> a win. Just as a thought, we could save that quad buffer a couple times
> more and not copy it at all - just run a non-destroying (the one I
> currently use thrashes the input buffer) sha1_block() on the proper offset.
> This starts to get hairy but I will try it.
>
>
I hope you have more success with this than I did...



> I'm still trying to resist fixed-length kernels. If I see a 25% boost I
> will give in... but I presume you are right. The last kernel should not be
> an issue though, it will be a walk in the park compared to the inner loop
> one.
>
>
I hate fixed-length kernels; RAR is more of an exception. They are hard to
maintain and slow to compile. In the RAR case, though, I did it for
performance reasons.



> Anyways, when I did that split-kernel thing to office-opencl I was
> surprised it did not harm performance at all - despite it storing and
> loading global memory every 128 iterations out of a 100,000.
>
>
If you can spawn enough wavefronts per CU, the latency is hidden by other
wavefronts executing on the CU while memory is being fetched (much like
hyperthreading on Intel CPUs). __global accesses OTOH hurt when you are
either GPR-starved or using a lot of __local memory...



> magnum
>
