Date: Wed, 11 Jul 2012 22:30:56 +0530 From: Sayantan Datta <std2048@...il.com> To: john-dev@...ts.openwall.com Subject: Re: bf_kernel.cl On Wed, Jul 11, 2012 at 6:30 PM, Solar Designer <solar@...nwall.com> wrote: > On Wed, Jul 11, 2012 at 04:41:22PM +0530, Sayantan Datta wrote: > > At IL level I can see that each lds load or store is associted with 5 > extra > > instructions to specify the address of the data to be fetched or stored. > > I think it'd help to look at GCN instructions to see whether these IL > instructions are compacted into fewer GCN instructions (and how many), > and what addressing modes are used by different revisions of OpenCL code. > > > ;r79.x == L > > iand r80.x___, r79.x, > > r69.x > > ;r69.x == 255 > > ior r80.x___, r80.x, > > r69.z > > ;r69.z == 768 ,add 768 > > ishl r80.x___, r80.x, > > r72.y > > ;r72.y== 2 , multiply 4 i.e for addresing 4byte uint > > iadd r80.x___, r74.z, > > r80.x > > First, this appears to be for a pre-Sptr revision of the code, right? > Do you get fewer IL instructions for the Sptr code revision? > No this is the sptr version. > > Second, it appears that my alternative version of BF_ROUND for > "architectures with no complicated addressing modes supported" would > result in fewer IL instructions (the shift by two bits would be > avoided for 3 out of 4 S-boxes). (Whether it would also reduce the GCN > instruction count or not is another question.) Can you try it? > Yes I did try it but it caused hash fails FAILED (get_hash(4)) . I used the fooling code in bf_rounds: #define BF_ROUND_LDS(ctx_S,ctx_P, L, R, N, tmp1, tmp2, tmp3, tmp4) \ tmp1 = L & 0xFF; \ tmp1 <<= 2; \ tmp2 = L >> 6; \ tmp2 &= 0x3FC; \ tmp3 = L >> 14; \ tmp3 &= 0x3FC; \ tmp4 = L >> 22; \ tmp4 &= 0x3FC; \ tmp1 = (*((__local uint *)(((__local unsigned char *)Sptr4) + (tmp1)))); \ tmp2 = (*((__local uint *)(((__local unsigned char *)Sptr3) + (tmp2)))); \ tmp3 = (*((__local uint *)(((__local unsigned char *)Sptr2) + (tmp3)))); \ tmp3 +=(*((__local uint *)(((__local unsigned char *)Sptr) + (tmp4)))); \ tmp3 ^= tmp2; \ R ^= ctx_P[N + 1]; \ tmp3 += tmp1; \ R ^= tmp3; Hope I'm doing it all right. Also this is the only change I've made in the source. > > > Looking at IL I can't say for sure whether it using gather addressing. It > > is the same cl code written in assembly. Although looking at benchmarks > it > > seems like we are using gather addressing. > > Yes, but only through the lack of performance change between work group > size 8 and 4, assuming that this translates to using 2 vs. 4 SIMDs. > I am not 100% sure that this assumption is correct. It'd be nice to > validate it explicitly, such as through review of GCN code or through > running some analyzer/profiler that would report which execution units > were in use by a given kernel. > command line profiler or kernel analyzer(windows only) does not report the details of execution unit. Maybe gdebugger could provide some more details but it is GUI toolkit, so not possible to use. I would have to analyze the GCN code then but I would need some time. > > > > BTW, in the wiki page at http://openwall.info/wiki/john/GPU/bcrypt you > > > mention Bulldozer's L2 cache. JFYI, this is irrelevant, since on CPUs > > > the S-boxes fit in L1 cache. We only run 4 concurrent instances of > > > bcrypt per Bulldozer module (two per thread, and there's no need to run > > > more), so that's only 16 KB for the S-boxes. > > > > Thanks for clearing that. I would change that statemant. > > Thanks. > > Now you're saying that Bulldozer has 16 KB L1 cache per module, but > actually I think it has 16 KB L1 data cache per core, or 32 KB (in two > separate L1 data caches) per module. This makes a difference since with > the P arrays we exceed 16 KB per module a little bit. Since the caches > are actually larger than what you write, both S and P do fit in them. > > > http://en.wikipedia.org/wiki/File:AMD_Bulldozer_block_diagram_(8_core_CPU).PNG > > As to your guess that APUs would be good for bcrypt due to their L3 > cache, I doubt that they'd soon exceeds speeds of CPUs where we use L1 > cache. L3 cache is usually quite slow, and data is read from it (into > faster caches) in entire cache lines, whereas we only use 32 bits per > lookup. It's the same problem we're seeing with global memory on GPU. > L3 cache might be a little bit faster (maybe even a few times faster), > but otherwise similar. > > What will in fact enable much faster bcrypt cracking is AVX2 (will be > generally available in CPUs next year) and Intel MIC (already available > as a coprocessor in supercomputers, including one entry on the recent > top 500 list, but not available for purchase by mere mortals), assuming > that it inherited scatter/gather addressing from Larrabee. The latter > might provide a 25x speedup per-chip over what we currently have > (assuming that it will be limited by 32 KB L1 data cache per core; > otherwise 50x speedup seems possible). > > Alexander > I would add this extra details. I tried various combinations of global and local memory in same kernel but couldn't get any meaningful results. I'm attaching the codes here. The codes are currently using LDS only but can modified very easily. So take a look when you can. Also how random are the S-box memory access ? I mean if 90% or more access are concentrated to a specific portion of S-box then we could use this fact to cahche the data effectively. Regards, Sayantan Content of type "text/html" skipped Download attachment "bf_opencl_src.tar.gz" of type "application/x-gzip" (3219 bytes)
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.