Date: Wed, 11 Jul 2012 12:03:53 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf_kernel.cl

On Tue, Jul 10, 2012 at 08:13:22PM +0530, Sayantan Datta wrote:
> I was looking at the IL generated on 7970 using LDS only. Each Encrypt
> call has approximately 540 instructions at IL level. However according to
> your previous estimates each Encrypt call has 16*16+4+5 = 275

Any idea why we have so many more instructions?  Is it possibly because
they're for two SIMD units?

> resulting in an estimated speed of 52K c/s.

That was a theoretical/optimistic estimate, assuming that results of
each instruction are available for use the very next cycle and that
scatter/gather addressing is used.

> Since the number of instructions is doubled we
> should expect at least half of your previous estimates say roughly 26K c/s.

You probably meant "at most".  However, this is not necessarily right -
we need to find the cause of the doubled instruction count first.  If
it's because of explicit separate instructions for use of two SIMD
units, then this does not halve the estimated speed.

> But we are nowhere near that. I guess your previous estimates were based
> on the fact that each instruction takes 1 clock cycle to execute, is it?
> But it looks like not all instructions require the same number of clock
> cycles on gpu.

I think all relevant instructions can execute in 1 cycle in terms of
throughput when there's sufficient parallelism available, but we do not
have sufficient parallelism here (because of limited LDS size), so we
incur stalls because of instruction latencies greater than 1 cycle.

This is no surprise.  My estimates were in fact theoretical/optimistic
rather than realistic.

However, one thing to double-check is whether we're using gather
addressing for the S-box lookups or not (if not, that would mean that we
use at most one vector element per SIMD unit).  In the latter case, a
speedup should be possible from using 4 instead of 2 SIMD units.  If you
see no such speedup, then this suggests that we're probably using gather
just fine now, and that falling far short of the theoretical speeds is
merely/primarily because of instruction latencies.

BTW, in the wiki page at http://openwall.info/wiki/john/GPU/bcrypt you
mention Bulldozer's L2 cache.  JFYI, this is irrelevant, since on CPUs
the S-boxes fit in L1 cache.  We only run 4 concurrent instances of
bcrypt per Bulldozer module (two per thread, and there's no need to run
more), so that's only 16 KB for the S-boxes.

Alexander