Date: Wed, 25 Apr 2012 02:54:21 +0200 From: magnum <john.magnum@...hmail.com> To: john-dev@...ts.openwall.com Subject: Re: New RAR OpenCL kernel On 04/25/2012 01:44 AM, Claudio André wrote: > Em 24-04-2012 19:31, magnum escreveu: >> The new code do not use local memory anymore (unless you >> change a define) and apparently I use too many registers (for this card) >> now. Do you get rid of the register spills with lower LWS? > No, i tried it in your old code. That is odd. I saw similar things when experimenting with Loveland and Cedar cards. Even using really low figures like 16, I never got rid of register spill. I must have hit a per-thread max rather than a total max. > But your *newer* code does not have register spilling anymore. And 40% > gain. Great! I did not really dare to hope for that. BTW there are some AMD alternatives commented out and/or defined out in the kernel, that in the future should be enabled when applicable, things like this: #if 0 // slower on GTX580 #define F(x,y,z) bitselect(z, y, x) #else #define F(x,y,z) (z ^ (x & (y ^ z))) #endif I think we should come up with a couple of -Ddefines that are automagically added by common-opencl at (JIT-)build time, depending on device. I think we could use these or more: -DAMD or -DNVIDIA for starters. And perhaps -DGCN, -DFERMI, I'm not sure. I know Milen use -DOLD_ATI for 4xxx (btw I just re-read everything he ever wrote to this list and it was well worth the time) and maybe even things like -DLOC_MEM_SIZE=xxxx But we should not go overboard with this, just the minimum stuff needed for decent adoption to GPU. I'm not sure exactly how to pick what defines to send. Another related thing is I'd like to send defines like PLAINTEXT_LENGTH, ROUNDS and LMEM_PER_THREAD from the host code to the kernel when building. Maybe another argument to opencl_init()? Or maybe I should just start using a rar.h file that's included by both host and kernel code. > claudio@...udioandre-desktop:~/bin/john/to_commit/src$ LWS=64 KPC=2560 > ../run/john -test -fo:rar > OpenCL platform 0: AMD Accelerated Parallel Processing, 2 device(s). > Using device 0: Juniper > Compilation log: LOOP UNROLL: pragma unroll (line 264) > Unrolled as requested! > > Local worksize (LWS) 64, Global worksize (KPC) 2560 > Benchmarking: RAR3 (6 characters) [OpenCL]... DONE > Raw: 409 c/s real, 256000 c/s virtual If you don't give any LWS and KPC, will it pick decent figures automatically? No matter how I do it, find_workgroup_size is suboptimal on some cards. The current code works fine on 9600GT and GTX580 but tends to pick a low LWS for GTX680 because I use number of SP's as a parameter but there's no way to tell how many cores each SP have. >> BTW, have you ever tried cRARk on this same >> card? I bet it's 10x faster than this. > I'll. Can you give me the command line you recommend? Or should i > experiment? I never used it. Create a -hp mode test archive: $ rar a -hppassword test.rar README Then benchmark just like this, using the OpenCL version of cRARk: $ ./crark-hp -b test.rar It takes half a minute. I'm fairly sure it will benchmark 6 characters. If it doesn't, add -l6 -g6 >> Anyway if you like, you could try re-enabling local memory with the >> defines of LMEM_PER_THREAD in both source files and see if it changes a >> run with same LWS& KPC to the better or worse. > It fails. I double checked if i did the change right and in two places. > Check your code again. But it can be the compiler silliness (i saw it > doing crazy things here). > > Local worksize (LWS) 64, Global worksize (KPC) 2560 > Benchmarking: RAR3 (6 characters) [OpenCL]... FAILED (cmp_all(95)) That's odd, it works fine on nvidias and on CPU. I need to shake life in that poor Cedar again. >> Today's commit should fix the find_best_() functions so they are faster, >> more accurate (hopefully) and will pick smallest workgroup of several >> with same c/s (so in your case it would settle for 2560 and not 7680). > As a user of your software, i liked to see the output, the tests and the > results. > If i have a real problem to solve, i would like to hide the memory > transfers, so i 'll prefer KPC=7680 or better KPC=10240. Maybe, but transfers doesn't seem to be much of an issue yet since my code (and the format itself) is so slow. I'm thinking if I ever get some real speed I could start juggling with two buffers so the GPU can start working on next batch while CPU is testing the first. magnum
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.