Date: Wed, 11 Jul 2012 17:00:00 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: bf_kernel.cl On Wed, Jul 11, 2012 at 04:41:22PM +0530, Sayantan Datta wrote: > At IL level I can see that each lds load or store is associted with 5 extra > instructions to specify the address of the data to be fetched or stored. I think it'd help to look at GCN instructions to see whether these IL instructions are compacted into fewer GCN instructions (and how many), and what addressing modes are used by different revisions of OpenCL code. > ;r79.x == L > iand r80.x___, r79.x, > r69.x > ;r69.x == 255 > ior r80.x___, r80.x, > r69.z > ;r69.z == 768 ,add 768 > ishl r80.x___, r80.x, > r72.y > ;r72.y== 2 , multiply 4 i.e for addresing 4byte uint > iadd r80.x___, r74.z, > r80.x First, this appears to be for a pre-Sptr revision of the code, right? Do you get fewer IL instructions for the Sptr code revision? Second, it appears that my alternative version of BF_ROUND for "architectures with no complicated addressing modes supported" would result in fewer IL instructions (the shift by two bits would be avoided for 3 out of 4 S-boxes). (Whether it would also reduce the GCN instruction count or not is another question.) Can you try it? > Looking at IL I can't say for sure whether it using gather addressing. It > is the same cl code written in assembly. Although looking at benchmarks it > seems like we are using gather addressing. Yes, but only through the lack of performance change between work group size 8 and 4, assuming that this translates to using 2 vs. 4 SIMDs. I am not 100% sure that this assumption is correct. It'd be nice to validate it explicitly, such as through review of GCN code or through running some analyzer/profiler that would report which execution units were in use by a given kernel. > > BTW, in the wiki page at http://openwall.info/wiki/john/GPU/bcrypt you > > mention Bulldozer's L2 cache. JFYI, this is irrelevant, since on CPUs > > the S-boxes fit in L1 cache. We only run 4 concurrent instances of > > bcrypt per Bulldozer module (two per thread, and there's no need to run > > more), so that's only 16 KB for the S-boxes. > > Thanks for clearing that. I would change that statemant. Thanks. Now you're saying that Bulldozer has 16 KB L1 cache per module, but actually I think it has 16 KB L1 data cache per core, or 32 KB (in two separate L1 data caches) per module. This makes a difference since with the P arrays we exceed 16 KB per module a little bit. Since the caches are actually larger than what you write, both S and P do fit in them. http://en.wikipedia.org/wiki/File:AMD_Bulldozer_block_diagram_(8_core_CPU).PNG As to your guess that APUs would be good for bcrypt due to their L3 cache, I doubt that they'd soon exceeds speeds of CPUs where we use L1 cache. L3 cache is usually quite slow, and data is read from it (into faster caches) in entire cache lines, whereas we only use 32 bits per lookup. It's the same problem we're seeing with global memory on GPU. L3 cache might be a little bit faster (maybe even a few times faster), but otherwise similar. What will in fact enable much faster bcrypt cracking is AVX2 (will be generally available in CPUs next year) and Intel MIC (already available as a coprocessor in supercomputers, including one entry on the recent top 500 list, but not available for purchase by mere mortals), assuming that it inherited scatter/gather addressing from Larrabee. The latter might provide a 25x speedup per-chip over what we currently have (assuming that it will be limited by 32 KB L1 data cache per core; otherwise 50x speedup seems possible). Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.