john-dev - Re: New RAR OpenCL kernel

Message-ID: <a09eea9ecd01f055858ecab2682aa7dd@smtp.hushmail.com>
Date: Wed, 25 Apr 2012 00:31:44 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: New RAR OpenCL kernel

I can't find any documentation on "VGPRS" (vs. just GPRS), but I am
assuming it's the (non-scratch) GPR use (perhaps SGPRS are shader
registers?). The new code does not use local memory anymore (unless you
change a define), and apparently I now use too many registers (for this
card). Do you get rid of the register spills with a lower LWS? Anyway, I
have a feeling this is not the biggest problem; I'm not sure what is.
Your card should perform better. BTW, have you ever tried cRARk on this
same card? I bet it's 10x faster than this.

Anyway, if you like, you could try re-enabling local memory with the
LMEM_PER_THREAD defines in both source files and see whether a run with
the same LWS & KPC gets better or worse.

Today's commit should fix the find_best_() functions so they are faster,
(hopefully) more accurate, and pick the smallest workgroup of several
with the same c/s (so in your case it would settle for 2560 and not
7680). I also optimised the kernel for various nvidia cards, but that may
just as well be worse for AMD, who knows. And I'm 20% behind cRARk on
nvidia anyway.
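
As an illustration of the tie-breaking rule described above (this is
not the actual find_best_() code), here is a minimal sketch in C. The
struct, the function name pick_smallest_best() and the tolerance
parameter are all hypothetical, and it assumes the benchmark results
have already been collected into an array:

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark sample (hypothetical layout, for illustration only). */
struct sample {
    size_t work_size;   /* candidate work size that was benchmarked */
    double cps;         /* measured speed in candidates per second */
};

/*
 * Return the index of the smallest work size whose speed is within
 * `tolerance` (a fraction, e.g. 0.01 for 1%) of the best speed seen.
 */
static size_t pick_smallest_best(const struct sample *s, size_t n,
                                 double tolerance)
{
    size_t i, best = 0;
    double top = 0.0;
    int found = 0;

    assert(s != NULL && n > 0);

    /* Pass 1: find the highest measured speed. */
    for (i = 0; i < n; i++)
        if (s[i].cps > top)
            top = s[i].cps;

    /* Pass 2: among near-ties with the top speed, keep the smallest size. */
    for (i = 0; i < n; i++) {
        if (s[i].cps >= top * (1.0 - tolerance)) {
            if (!found || s[i].work_size < s[best].work_size) {
                best = i;
                found = 1;
            }
        }
    }
    return best;
}
```

Given samples in which 2560 and 7680 measure the same c/s, this returns
the index of the 2560 entry. A non-zero tolerance treats timing noise as
a tie, which is one possible reading of "same c/s".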

magnum


On 04/24/2012 04:04 PM, Claudio André wrote:
> Since it is complaining about the message size, I'll break it up.
> -----------------
> 
> Hi, see the attached files. Please note that 2560 seems to be a "magic
> number".
> 
> - TXT: raw results (no profiler)
> - The same CSV file.
> - And some more summary information.
> 
> Profiler using:
> Local worksize (LWS) 256, Global worksize (KPC) 2560
> 
> ----
>   src/opencl/rar_kernel.cl |   34 ++++++++------
>   src/rar_fmt.c            |  116 ++++++++++++++++++++++++++++++++++++++++-----
>   2 files changed, 122 insertions(+), 28 deletions(-)
> ----
> 
> 
> 
> On 22-04-2012 22:07, magnum wrote:
>> On 04/23/2012 12:02 AM, Claudio André wrote:
>>>> Would both these figures be closer to 100 in a dream scenario, or what?
>>>>
>>>> By the way my previous version of rar got an "occupancy" of 0.01 or so
>>>> (lol) in nvidia profiler. We'll see if there is any change now.
>>>>
>>>> magnum
>>>>
>>> I like the "dream scenario". Valid explanation. And 100 is the target.
>>>
>>> Alu packing has a ">  70" expectation.
>>> Alubusy is where 100% is optimal.
>>>
>>> I agree that sprofile is not very useful, but it is better than nothing
>>> (or simple guessing). Since you have NVIDIA tools, it is not that important.
>> I think sprofile is useful, it's just that my laptop GPU is so weak I
>> can't draw any conclusions.
>>
>> Your profiling info was with LWS=GWS. Please try this if you have the
>> time:
>>
>> 1. Pull latest git
>> 2. Run with KPC=0 (I expect it to pick 4096 or higher as best)
>> 3. Do another profiling run with the best KPC
>>
>> The ALU figures (and speed) should go up a lot (I hope). If they do
>> not, the profiling info should tell why.
>>
>> thanks,
>> magnum
>>
>