Date: Wed, 25 Apr 2012 00:31:44 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: New RAR OpenCL kernel

I can't find any documentation on "VGPRS" (vs just GPRS) but I am
assuming it's the (non-scratch) GPR use (perhaps SGPRS are shader
registers?). The new code does not use local memory anymore (unless you
change a define) and apparently I now use too many registers (for this
card). Do you get rid of the register spills with a lower LWS?

Anyway, I have a feeling this is not the biggest problem, but I'm not
sure what is. Your card should perform better. BTW, have you ever tried
cRARk on this same card? I bet it's 10x faster than this.

If you like, you could try re-enabling local memory with the
LMEM_PER_THREAD defines in both source files and see whether a run with
the same LWS & KPC gets better or worse.

Today's commit should fix the find_best_() functions so they are faster,
more accurate (hopefully) and will pick the smallest workgroup of several
with the same c/s (so in your case it would settle for 2560 and not 7680).

I also optimised the kernel for various nvidia cards, but that may just
as well be worse for AMD, who knows. And I'm 20% behind cRARk on nvidia
anyway.

magnum

On 04/24/2012 04:04 PM, Claudio André wrote:
> Since it is complaining about the message size, I'll break it up.
> -----------------
>
> Hi, see the attached files. Please note that 2560 seems to be a "magic
> number".
>
> - TXT: raw results (no profiler)
> - The same CSV file.
> - And some more summary information.
>
> Profiler using:
> Local worksize (LWS) 256, Global worksize (KPC) 2560
>
> ----
> src/opencl/rar_kernel.cl | 34 ++++++++------
> src/rar_fmt.c | 116 ++++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 122 insertions(+), 28 deletions(-)
> ----
>
>
>
> On 22-04-2012 22:07, magnum wrote:
>> On 04/23/2012 12:02 AM, Claudio André wrote:
>>>> Would both these figures be closer to 100 in a dream scenario, or what?
>>>>
>>>> By the way my previous version of rar got an "occupancy" of 0.01 or so
>>>> (lol) in nvidia profiler. We'll see if there is any change now.
>>>>
>>>> magnum
>>>>
>>> I like the "dream scenario". Valid explanation. And 100 is the target.
>>>
>>> Alu packing has a "> 70" expectation.
>>> Alubusy is where 100% is optimal.
>>>
>>> I agree that sprofile is not very useful, but it is better than nothing
>>> (or simple guessing). Since you have NVIDIA tools, it is not that
>>> important.
>> I think sprofile is useful, it's just that my laptop GPU is so weak I
>> can't draw any conclusions.
>>
>> Your profiling info was with LWS=GWS. Please try this if you have the
>> time:
>>
>> 1. Pull latest git
>> 2. Run with KPC=0 (I expect it to pick 4096 or higher as best)
>> 3. Do another profiling run with the best KPC
>>
>> The ALU figures (and speed) should go up a lot (I hope). If they do
>> not, the profiling info should tell why.
>>
>> thanks,
>> magnum
>>
>
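
The tie-breaking rule mentioned in the reply above (pick the smallest workgroup among several with the same c/s) could be sketched roughly as follows. This is only an illustration and not the code from the commit: the names bench_result and pick_best, and the tolerance parameter, are made up for this sketch and do not come from rar_fmt.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical benchmark record: one candidate size and its measured speed. */
struct bench_result {
    size_t size;   /* candidate work size, e.g. a KPC value */
    double cps;    /* measured candidates per second at that size */
};

/*
 * Return the index of the preferred candidate: the highest c/s wins, and
 * among candidates whose speed is within `tolerance` (a fraction such as
 * 0.01) of the highest, the smallest size wins.
 * Assumes n > 0 and non-negative speeds.
 */
size_t pick_best(const struct bench_result *r, size_t n, double tolerance)
{
    size_t i, best;
    double top;

    assert(r != NULL && n > 0 && tolerance >= 0.0);

    /* Pass 1: find the highest measured speed. */
    top = r[0].cps;
    for (i = 1; i < n; i++)
        if (r[i].cps > top)
            top = r[i].cps;

    /* Pass 2: among near-ties with the top speed, keep the smallest size. */
    best = n; /* sentinel meaning "nothing chosen yet" */
    for (i = 0; i < n; i++) {
        if (r[i].cps < top * (1.0 - tolerance))
            continue; /* too slow to count as a tie */
        if (best == n || r[i].size < r[best].size)
            best = i;
    }
    return best;
}
```

Given two entries with equal c/s at sizes 7680 and 2560, this returns the index of the 2560 entry. A non-zero tolerance treats results that differ only by measurement noise as ties, which is an assumption of this sketch rather than something stated in the thread.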