Date: Tue, 24 Apr 2012 20:44:15 -0300
From: Claudio André <claudioandre.br@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: New RAR OpenCL kernel

On 24-04-2012 19:31, magnum wrote:
> I can't find any documentation on "VGPRS" (vs just GPRS) but I am
> assuming it's the (non-scratch) GPR use (perhaps SGPRS are shader
> registers?).

Me neither. As part of a long learning process, I found some snippets (pages 4-77 and 4-79) in [1] and VGPR in [2]. But it is a very slow process and you need to really want to be an AMD expert.

> The new code do not use local memory anymore (unless you
> change a define) and apparently I use too many registers (for this card)
> now. Do you get rid of the register spills with lower LWS?

No, I tried that with your old code. But your *newer* code does not have register spilling anymore. And a 40% gain.

claudio@...udioandre-desktop:~/bin/john/to_commit/src$ LWS=64 KPC=2560 ../run/john -test -fo:rar
OpenCL platform 0: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Juniper
Compilation log: LOOP UNROLL: pragma unroll (line 264)
Unrolled as requested!

Local worksize (LWS) 64, Global worksize (KPC) 2560
Benchmarking: RAR3 (6 characters) [OpenCL]... DONE
Raw: 409 c/s real, 256000 c/s virtual

> Anyway I have
> a feeling this is not the biggest problem,

I agree. As the hardware owner/user, I would say that nothing different from 32 or 64 for LWS is going to be the silver bullet. And the spill is not 10x slower.

> I'm not sure what is. Your
> card should perform better. BTW, have you ever tried cRARk on this same
> card? I bet it's 10x faster than this.

I will. Can you give me the command line you recommend? Or should I experiment? I have never used it.

>
> Anyway if you like, you could try re-enabling local memory with the
> defines of LMEM_PER_THREAD in both source files and see if it changes a
> run with same LWS& KPC to the better or worse.

It fails. I double-checked that I made the change correctly and in both places. Check your code again.
But it could be compiler silliness (I have seen it do crazy things here).

Local worksize (LWS) 64, Global worksize (KPC) 2560
Benchmarking: RAR3 (6 characters) [OpenCL]... FAILED (cmp_all(95))

> Today's commit should fix the find_best_() functions so they are faster,
> more accurate (hopefully) and will pick smallest workgroup of several
> with same c/s (so in your case it would settle for 2560 and not 7680).

As a user of your software, I liked seeing the output, the tests and the results. If I had a real problem to solve, I would want to hide the memory transfers, so I would prefer KPC=7680 or, better, KPC=10240.

Looking at the output, I realized that 2560 and its multiples are the solution. So, the stuff is not bad. I did a quick test here and discovered that 2560 is a good number for me (SHA-512). Probably this is (more) hardware related. I'll try to create a similar find_best_kpc here.

Claudio

> I
> also optimised the kernel for various nvidia cards but that may just as
> well be worse for AMD, who knows. And I'm 20% behind cRARk on nvidia anyway.
>
> magnum
>
>
> On 04/24/2012 04:04 PM, Claudio André wrote:
>> Since it is complaining about the message size, I'll break it.
>> -----------------
>>
>> Hi, see attached files. Please, try to see that 2560 seems to be a "magic
>> number".
>>
>> - TXT: raw results (no profiler)
>> - The same CSV file.
>> - And some more summary information.
>>
>> Profiler using:
>> Local worksize (LWS) 256, Global worksize (KPC) 2560
>>
>> ----
>> src/opencl/rar_kernel.cl | 34 ++++++++------
>> src/rar_fmt.c | 116 ++++++++++++++++++++++++++++++++++++++++-----
>> 2 files changed, 122 insertions(+), 28 deletions(-)
>> ----
>>
>>
>>
>> On 22-04-2012 22:07, magnum wrote:
>>> On 04/23/2012 12:02 AM, Claudio André wrote:
>>>>> Would both these figures be closer to 100 in a dream scenario, or what?
>>>>>
>>>>> By the way my previous version of rar got an "occupancy" of 0.01 or so
>>>>> (lol) in nvidia profiler. We'll see if there is any change now.
>>>>>
>>>>> magnum
>>>>>
>>>> I like the "dream scenario". Valid explanation. And 100 is the target.
>>>>
>>>> Alu packing has a "> 70" expectation.
>>>> Alubusy is where 100% is optimal.
>>>>
>>>> I agree that sprofile is not very useful, but is better than nothing (or
>>>> simple guessing). Since you have NVIDIA tools, it is not that important.
>>> I think sprofile is useful, it's just that my laptop GPU is so weak I
>>> can't draw any conclusions.
>>>
>>> Your profiling info was with LWS=GWS. Please try this if you have the
>>> time:
>>>
>>> 1. Pull latest git
>>> 2. Run with KPC=0 (I expect it to pick 4096 or higher as best)
>>> 3. Do another profiling run with the best KPC
>>>
>>> The ALU figures (and speed) should go up a lot (I hope). If they are
>>> not, the profiling info should tell why.
>>>
>>> thanks,
>>> magnum
>>>
>

[1] http://developer.amd.com/sdks/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
[2] http://developer.amd.com/afds/assets/presentations/2620_final.pdf
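
The selection rule discussed in the thread (among several global work sizes with the same c/s, settle for the smallest) can be sketched roughly as below. This is only an illustration under stated assumptions: `struct kpc_sample`, `pick_smallest_best_kpc()` and the `tolerance` parameter are hypothetical names, not the find_best_() code from rar_fmt.c, and treating "same c/s" as "within a small relative tolerance of the best" is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark sample: a global work size (KPC) and the speed measured
 * with it. Hypothetical type, introduced only for this sketch. */
struct kpc_sample {
    size_t kpc;   /* keys per crypt (global work size), expected > 0 */
    double cps;   /* measured candidates per second */
};

/*
 * Return the smallest KPC whose speed is within `tolerance` (for example
 * 0.01 for 1%) of the best speed among the samples.
 * Returns 0 when there are no samples.
 * Hypothetical helper, not the actual find_best_() implementation.
 */
size_t pick_smallest_best_kpc(const struct kpc_sample *samples, size_t n,
                              double tolerance)
{
    assert(samples != NULL || n == 0);
    assert(tolerance >= 0.0 && tolerance < 1.0);

    /* First pass: find the best speed seen. */
    double best = 0.0;
    for (size_t i = 0; i < n; i++)
        if (samples[i].cps > best)
            best = samples[i].cps;

    /* Second pass: among samples close enough to the best speed,
     * keep the one with the smallest work size. */
    size_t choice = 0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i].cps >= best * (1.0 - tolerance) &&
            (choice == 0 || samples[i].kpc < choice))
            choice = samples[i].kpc;
    }
    return choice;
}
```

A caller who prefers larger batches in order to hide memory transfers, as argued above, could apply the same idea with the comparison reversed so that the largest qualifying KPC is kept.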