john-dev - Re: New RAR OpenCL kernel

Message-ID: <4F973ACF.60409@gmail.com>
Date: Tue, 24 Apr 2012 20:44:15 -0300
From: Claudio André <claudioandre.br@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: New RAR OpenCL kernel

On 24-04-2012 19:31, magnum wrote:
> I can't find any documentation on "VGPRS" (vs just GPRS) but I am
> assuming it's the (non-scratch) GPR use (perhaps SGPRS are shader
> registers?).
Me neither. Over a long learning process, I found some snippets
(pages 4-77 and 4-79) in [1], and VGPR is mentioned in [2].
But it is a very slow process, and you really have to want to become an AMD expert.
> The new code does not use local memory anymore (unless you
> change a define) and apparently I use too many registers (for this card)
> now. Do you get rid of the register spills with lower LWS?
No, I tried that with your old code.
But your *newer* code no longer spills registers, and it shows a 40% gain.

claudio@...udioandre-desktop:~/bin/john/to_commit/src$ LWS=64 KPC=2560 ../run/john -test -fo:rar
OpenCL platform 0: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Juniper
Compilation log: LOOP UNROLL: pragma unroll (line 264)
     Unrolled as requested!

Local worksize (LWS) 64, Global worksize (KPC) 2560
Benchmarking: RAR3 (6 characters) [OpenCL]... DONE
Raw:    409 c/s real, 256000 c/s virtual

> Anyway I have
> a feeling this is not the biggest problem,
I agree. As the owner and user of this hardware, I would say that no LWS
other than 32 or 64 is going to be the silver bullet. And the spilling
does not account for a 10x slowdown.
> I'm not sure what is. Your
> card should perform better. BTW, have you ever tried cRARk on this same
> card? I bet it's 10x faster than this.
I will. Can you give me the command line you recommend, or should I
experiment? I have never used it.
>
> Anyway if you like, you could try re-enabling local memory with the
> defines of LMEM_PER_THREAD in both source files and see if it changes a
> run with same LWS & KPC to the better or worse.
It fails. I double-checked that I made the change correctly, in both places.
Check your code again. But it could be compiler silliness (I have seen it
do crazy things here).

Local worksize (LWS) 64, Global worksize (KPC) 2560
Benchmarking: RAR3 (6 characters) [OpenCL]... FAILED (cmp_all(95))

> Today's commit should fix the find_best_() functions so they are faster,
> more accurate (hopefully) and will pick smallest workgroup of several
> with same c/s (so in your case it would settle for 2560 and not 7680).
As a user of your software, I liked seeing the output, the tests and the
results.
If I had a real problem to solve, I would want to hide the memory
transfers, so I would prefer KPC=7680 or, better, KPC=10240.
Looking at the output, I realized that 2560 and its multiples are the solution.

So, this is not bad. I did a quick test here and discovered that
2560 is a good number for me (SHA-512). This is probably more
hardware related.
I'll try to create a similar find_best_kpc here.
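
The selection rule magnum describes above (among several sizes with the same c/s, settle for the smallest) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual find_best_() code: the struct kpc_sample type, the pick_smallest_best_kpc() name and the tolerance parameter are all hypothetical, and the benchmarking that would fill in the samples is left out.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark result: a candidate KPC and the speed measured with it. */
struct kpc_sample {
    size_t kpc;   /* global work size that was tried */
    double cps;   /* measured c/s for that size */
};

/*
 * Hypothetical helper (not the real find_best_kpc): return the smallest
 * KPC whose speed is within `tolerance` of the fastest sample.
 * `tolerance` is a fraction, e.g. 0.01 treats speeds within 1% as equal.
 */
size_t pick_smallest_best_kpc(const struct kpc_sample *samples, size_t n,
                              double tolerance)
{
    double best_cps, threshold;
    size_t i, pick = 0;
    int found = 0;

    assert(samples != NULL && n > 0);
    assert(tolerance >= 0.0 && tolerance < 1.0);

    /* First pass: find the highest measured speed. */
    best_cps = samples[0].cps;
    for (i = 1; i < n; i++)
        if (samples[i].cps > best_cps)
            best_cps = samples[i].cps;

    /* Second pass: among samples close enough to the best, keep the
     * one with the smallest KPC. */
    threshold = best_cps * (1.0 - tolerance);
    for (i = 0; i < n; i++) {
        if (samples[i].cps < threshold)
            continue;
        if (!found || samples[i].kpc < pick) {
            pick = samples[i].kpc;
            found = 1;
        }
    }
    return pick;
}
```

A non-zero tolerance makes the choice less sensitive to benchmark noise. A tolerance of 0.0 only treats exactly equal speeds as ties. Someone who wants to hide the memory transfers could still override the result with a larger multiple.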

Claudio

>   I
> also optimised the kernel for various nvidia cards but that may just as
> well be worse for AMD, who knows. And I'm 20% behind cRARk on nvidia anyway.
>
> magnum
>
>
> On 04/24/2012 04:04 PM, Claudio André wrote:
>> Since it is complaining about the message size, I'll split it up.
>> -----------------
>>
>> Hi, see the attached files. Please note that 2560 seems to be a "magic
>> number".
>>
>> - TXT: raw results (no profiler)
>> - The same CSV file.
>> - And some more summary information.
>>
>> Profiler using:
>> Local worksize (LWS) 256, Global worksize (KPC) 2560
>>
>> ----
>>    src/opencl/rar_kernel.cl |   34 ++++++++------
>>    src/rar_fmt.c            |  116 ++++++++++++++++++++++++++++++++++++++++-----
>>    2 files changed, 122 insertions(+), 28 deletions(-)
>> ----
>>
>>
>>
>> On 22-04-2012 22:07, magnum wrote:
>>> On 04/23/2012 12:02 AM, Claudio André wrote:
>>>>> Would both these figures be closer to 100 in a dream scenario, or what?
>>>>>
>>>>> By the way my previous version of rar got an "occupancy" of 0.01 or so
>>>>> (lol) in nvidia profiler. We'll see if there is any change now.
>>>>>
>>>>> magnum
>>>>>
>>>> I like the "dream scenario". Valid explanation. And 100 is the target.
>>>>
>>>> Alu packing has a "> 70" expectation.
>>>> Alubusy is where 100% is optimal.
>>>>
>>>> I agree that sprofile is not very useful, but it is better than nothing (or
>>>> simple guessing). Since you have NVIDIA tools, it is not that important.
>>> I think sprofile is useful, it's just that my laptop GPU is so weak I
>>> can't draw any conclusions.
>>>
>>> Your profiling info was with LWS=GWS. Please try this if you have the
>>> time:
>>>
>>> 1. Pull latest git
>>> 2. Run with KPC=0 (I expect it to pick 4096 or higher as best)
>>> 3. Do another profiling run with the best KPC
>>>
>>> The ALU figures (and speed) should go up a lot (I hope). If they are
>>> not, the profiling info should tell why.
>>>
>>> thanks,
>>> magnum
>>>
>
[1] http://developer.amd.com/sdks/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
[2] http://developer.amd.com/afds/assets/presentations/2620_final.pdf
