john-dev - Re: New RAR OpenCL kernel

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d46cb96b84a9acc4b0c8cb3a636030de@smtp.hushmail.com>
Date: Wed, 25 Apr 2012 02:54:21 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: New RAR OpenCL kernel

On 04/25/2012 01:44 AM, Claudio André wrote:
> Em 24-04-2012 19:31, magnum escreveu:
>> The new code do not use local memory anymore (unless you
>> change a define) and apparently I use too many registers (for this card)
>> now. Do you get rid of the register spills with lower LWS?
> No, i tried it in your old code.

That is odd. I saw similar things when experimenting with Loveland and
Cedar cards. Even using really low figures like 16, I never got rid of
register spill. I must have hit a per-thread max rather than a total max.

> But your *newer* code does not have register spilling anymore. And 40%
> gain.

Great! I did not really dare to hope for that. BTW there are some AMD
alternatives commented out and/or defined out in the kernel, that in the
future should be enabled when applicable, things like this:

#if 0 // slower on GTX580
#define F(x,y,z)       bitselect(z, y, x)
#else
#define F(x,y,z)       (z ^ (x & (y ^ z)))
#endif

I think we should come up with a couple of -Ddefines that are
automagically added by common-opencl at (JIT-)build time, depending on
device. I think we could use these or more:

-DAMD or -DNVIDIA for starters.
And perhaps -DGCN, -DFERMI, I'm not sure. I know Milen use -DOLD_ATI for
4xxx (btw I just re-read everything he ever wrote to this list and it
was well worth the time)

and maybe even things like

-DLOC_MEM_SIZE=xxxx

But we should not go overboard with this, just the minimum stuff needed
for decent adoption to GPU. I'm not sure exactly how to pick what
defines to send.

Another related thing is I'd like to send defines like PLAINTEXT_LENGTH,
ROUNDS and LMEM_PER_THREAD from the host code to the kernel when
building. Maybe another argument to opencl_init()? Or maybe I should
just start using a rar.h file that's included by both host and kernel code.

> claudio@...udioandre-desktop:~/bin/john/to_commit/src$ LWS=64 KPC=2560
> ../run/john -test -fo:rar
> OpenCL platform 0: AMD Accelerated Parallel Processing, 2 device(s).
> Using device 0: Juniper
> Compilation log: LOOP UNROLL: pragma unroll (line 264)
>     Unrolled as requested!
> 
> Local worksize (LWS) 64, Global worksize (KPC) 2560
> Benchmarking: RAR3 (6 characters) [OpenCL]... DONE
> Raw:    409 c/s real, 256000 c/s virtual

If you don't give any LWS and KPC, will it pick decent figures
automatically? No matter how I do it, find_workgroup_size is suboptimal
on some cards. The current code works fine on 9600GT and GTX580 but
tends to pick a low LWS for GTX680 because I use number of SP's as a
parameter but there's no way to tell how many cores each SP have.

>> BTW, have you ever tried cRARk on this same
>> card? I bet it's 10x faster than this.
> I'll. Can you give me the command line you recommend? Or should i
> experiment? I never used it.

Create a -hp mode test archive:

$ rar a -hppassword test.rar README

Then benchmark just like this, using the OpenCL version of cRARk:

$ ./crark-hp -b test.rar

It takes half a minute. I'm fairly sure it will benchmark 6 characters.
If it doesn't, add -l6 -g6

>> Anyway if you like, you could try re-enabling local memory with the
>> defines of LMEM_PER_THREAD in both source files and see if it changes a
>> run with same LWS&  KPC to the better or worse.
> It fails. I double checked if i did the change right and in two places.
> Check your code again. But it can be the compiler silliness (i saw it
> doing crazy things here).
> 
> Local worksize (LWS) 64, Global worksize (KPC) 2560
> Benchmarking: RAR3 (6 characters) [OpenCL]... FAILED (cmp_all(95))

That's odd, it works fine on nvidias and on CPU. I need to shake life in
that poor Cedar again.

>> Today's commit should fix the find_best_() functions so they are faster,
>> more accurate (hopefully) and will pick smallest workgroup of several
>> with same c/s (so in your case it would settle for 2560 and not 7680).
> As a user of your software, i liked to see the output, the tests and the
> results.
> If i have a real problem to solve, i would like to hide the memory
> transfers, so i 'll prefer KPC=7680 or better KPC=10240.

Maybe, but transfers doesn't seem to be much of an issue yet since my
code (and the format itself) is so slow. I'm thinking if I ever get some
real speed I could start juggling with two buffers so the GPU can start
working on next batch while CPU is testing the first.

magnum
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.