Date: Wed, 11 Jul 2012 16:41:22 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf_kernel.cl

On Wed, Jul 11, 2012 at 1:33 PM, Solar Designer <solar@...nwall.com> wrote:

> On Tue, Jul 10, 2012 at 08:13:22PM +0530, Sayantan Datta wrote:
> > I was looking at the IL generated on 7970 using LDS only.  Each Encrypt
> > call has approximately 540 instructions at IL level.  However, according
> > to your previous estimates each Encrypt call has 16*16+4+5 = 275
>
> Any idea why we have so many more instructions?  Is it possibly because
> they're for two SIMD units?

At IL level I can see that each LDS load or store is associated with 5
extra instructions to specify the address of the data to be fetched or
stored.  Here's an example.  This is the start of a BF_ENCRYPT; other
than the first two instructions, all instructions are for loading the
data.

ixor r79.x___, r79.z, r79.x     ;r79.x == L
iand r80.x___, r79.x, r69.x     ;r69.x == 255
ior  r80.x___, r80.x, r69.z     ;r69.z == 768, add 768
ishl r80.x___, r80.x, r72.y     ;r72.y == 2, multiply by 4, i.e. for addressing 4-byte uints
iadd r80.x___, r74.z, r80.x     ;r74.z == 0, maybe some loop management etc., don't know for sure
mov  r1010.x___, r80.x          ;r1010.x stores the address
lds_load_id(1) r1011.x, r1010.x ;result stored in r1011.x
mov  r80.x___, r1011.x          ;result moved back to working register

Everything else seems to be OK.

> > resulting in an estimated speed of 52K c/s.
>
> That was a theoretical/optimistic estimate, assuming that results of
> each instruction are available for use the very next cycle and that
> scatter/gather addressing is used.
>
> > Since the number of instructions is doubled we should expect at least
> > half of your previous estimates, say roughly 26K c/s.
>
> You probably meant "at most".  However, this is not necessarily right -
> we need to find the cause of the doubled instruction count first.
> If it's because of explicit separate instructions for use of two SIMD
> units, then this does not halve the estimated speed.

So it is not because of the two SIMD units.

> > But we are nowhere near that.  I guess your previous estimates were
> > based on the fact that each instruction takes 1 clock cycle to execute,
> > is that right?  But it looks like not all instructions require the same
> > number of clock cycles on the GPU.
>
> I think all relevant instructions can execute in 1 cycle in terms of
> throughput when there's sufficient parallelism available, but we do not
> have sufficient parallelism here (because of limited LDS size), so we
> incur stalls because of instruction latencies greater than 1 cycle.
> This is no surprise.  My estimates were in fact theoretical/optimistic
> rather than realistic.
>
> However, one thing to double-check is whether we're using gather
> addressing for the S-box lookups or maybe not (which would mean that we
> use at most one vector element per SIMD unit).  In the latter case, a
> speedup should be possible from using 4 instead of 2 SIMD units.  If you
> see no such speedup, then this suggests that we're probably using gather
> just fine now, and not reaching the theoretical speeds (by far) is
> merely/primarily because of instruction latencies.

Looking at the IL I can't say for sure whether it is using gather
addressing - it is the same CL code written in assembly.  Judging by the
benchmarks, though, it seems like we are using gather addressing.

Using 4 SIMD units, i.e. setting work group size = 4:

OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
****Please see 'opencl_bf_std.h' for device specific optimizations****
Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
Raw:    4241 c/s real, 238933 c/s virtual

Using 2 SIMD units, i.e. setting work group size = 8:

std2048@...l:~/bin/run$ ./john -te -fo=bf-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
****Please see 'opencl_bf_std.h' for device specific optimizations****
Benchmarking: OpenBSD Blowfish (x32) [OpenCL]... DONE
Raw:    4216 c/s real, 238933 c/s virtual

There is a small speedup from using 4 SIMD units, but that may be due to
reduced bank conflicts.

> BTW, in the wiki page at http://openwall.info/wiki/john/GPU/bcrypt you
> mention Bulldozer's L2 cache.  JFYI, this is irrelevant, since on CPUs
> the S-boxes fit in L1 cache.  We only run 4 concurrent instances of
> bcrypt per Bulldozer module (two per thread, and there's no need to run
> more), so that's only 16 KB for the S-boxes.
>
> Alexander

Thanks for clearing that up.  I will change that statement.

Regards,
Sayantan