Date: Mon, 14 May 2012 05:25:02 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Blowfish (bcrypt) on CPU vs. GPU (FX-8120 vs. HD 7970) Hi, I tried to estimate possible speeds for bcrypt (Eksblowfish) on GCN (HD 7970), using the known speeds on Bulldozer (FX-8120) as reference (to verify my math, as well as to see the possible speedup over CPU, if any). Let's assume the $2a$05 cost setting (32 iterations), which is what we historically use for benchmarking (even though modern systems tend to use higher settings, such as $2a$08 or even $2a$10). The number of Blowfish encryptions performed per bcrypt hash computed is: (9+512)*2*32 + 9+512+64*3 = 34057 or if we move some final encryptions into cmp_exact(), which is normally not called, then: (9+512)*2*32 + 9+512+64 = 33929 So let's just use 34000 as our estimate. FX-8120 running at stock clocks does 955 c/s on one core. Presumably, this is at 4.0 GHz (half load turbo). In terms of cycles per byte encrypted, this translates to: 4*10^9 / (34000*8*955) = 15.4 For comparison (albeit somewhat unrelated to the purpose of this posting), on Core i7-2600K at stock clocks (3.8 GHz max turbo) I am getting: 3.8*10^9 / (34000*8*940) = 14.9 Core 2 Duo E6550 o/c to 3.15 GHz: 3.15*10^9 / (34000*8*884) = 13.1 All of these compare favorably to 18 cycles/byte (not including higher-level overhead) for Blowfish on the original Pentium: http://www.schneier.com/blowfish-speed.html Yet they're pretty close, thus realistic. Of course, in these builds JtR actually computes two bcrypt hashes in parallel with mixed instructions, but the reported c/s rate takes into account the fact that two hashes were computed at a time. So 34000 Blowfish encryptions is the correct number to use here (along with the reported c/s rate) even though twice more are done per crypt_all() call (non-OpenMP). BF_ROUND has 14 operations in it. How these translate to x86 instructions and micro-ops is not entirely obvious. For example, x86.S (which is not used in the x86-64 builds used in the benchmarks above, though) has two versions of BF_ROUND: in the version for the original Pentium (yes, this old) it uses single micro-op instructions only (to avoid imperfect pairing on that processor), and it has 18 instructions as a result. In the version for most newer processors, it has 11 instructions, many of which are complex (I think these produce at least 15 micro-ops on current CPUs). Assuming 15 micro-ops (although in practice there are probably more), we get issue rates of: 34000*(15*16+4+5)*955 / (4*10^9) = 2.02 34000*(15*16+4+5)*884 / (3.15*10^9) = 2.38 for Bulldozer and Core 2, respectively. The actual micro-op issue rates are probably slightly higher (since there are probably more micro-ops). FX-8120 has 4 modules, each capable of decoding 4 x86 instructions per cycle - thus producing at least 4 micro-ops per cycle per module. Assuming that we have 15 micro-ops per Blowfish round, and that 4 micro-ops are executed per cycle (we're doing 4 separate Blowfish encryptions per Bulldozer module, so this is realistic), we get: 4*4*3.4*10^9/((15*16+4+5)*34000) = 6426 c/s as a speed estimate for an OpenMP-enabled build of the current source code running on FX-8120 at its full load turbo frequency of 3.4 GHz. In practice, we're getting around 5500 c/s on such build. Now to GPU stuff. Disclaimer: I am still unfamiliar with GPUs, I am just learning. I'll be using info on GCN from: http://developer.amd.com/afds/assets/presentations/2620_final.pdf The GPU on an HD 7970 card contains 32 compute units (CUs), each of them having one scalar unit, four 16-lane SIMD units, 64 KB of 32-bank Local Data Share (LDS) memory, 16 KB read-write L1 data cache, and other important stuff. Also relevant to what I'll be talking about below is the 16 KB 4-way set associative (and 4-port?) scalar read-only L1 cache shared by 4 CUs (so we have only 8 of these total in a 7970's GPU?) There's more memory in registers (64 KB per SIMD unit, 8 KB per scalar unit), but I guess those are not indirectly addressable? We can issue up to 5 instructions per cycle per CU - apparently, this maximum is reached with 1 scalar and 4 SIMD instructions. With four 16-lane SIMD units, we'd normally have 64 work-items per wavefront, and we'd have at least 10 waves/SIMD, 40 waves/CU, 2560 work-items per CU (as per slide 11). However, the 64 KB of LDS only lets us keep up to 16 sets of Blowfish S-boxes in it (we need 4 KB per set). So maybe we can/should only run 16 work-items per CU, thus making use of only 1/4 of total available lanes and incurring stalls on data dependencies after high-latency instructions. And this is assuming that we can do gather loads into VGPRs from LDS - which the slides suggest that we can (I am surprised!), although I am not 100% sure of that crucial detail. Maybe we can also make use of the scalar unit for our actual computation, even though it's normally intended for control. We also happen to have exactly 4 KB of read-only L1 cache per scalar unit. (It is not clear to me whether the scalar unit can access the LDS memory or not.) Blowfish S-boxes are not constant, but writes into them are relatively rare (as compared to reads). In Eksblowfish, we do one 64-bit write per Blowfish block encrypted. The 16 KB read-write L1 data cache is probably unusable for the S-boxes as it seems very unlikely that there's gather support for such loads, and even if we abused vector loads for scalar operations (using only one vector element for real), we'd be thrashing this small cache even with just one SIMD unit doing that (it'd need 64 KB then), unless unaligned vector loads are supported (maybe they are?) Now to speed estimates. Assuming that we can use the LDS and the scalar read-only L1 cache fully, don't use anything else, somehow are able to issue an instruction every cycle with no stalls (what's the latency for LDS gather loads?), and have no other overhead, the theoretical speed may be: 32*(16+1)*925*10^6/((16*16+4+5)*34000) = 55849 c/s Note that I assume 16 instructions per round here, because per the slides loads and ALU ops appear to be separate instructions on GCN. If the number of instructions is actually lower for some reason (e.g., if there are byte extract instructions similar to Alpha's EXTBL), then the theoretical speed may be slightly higher. In practice, we will incur stalls, we will have overhead (more instructions), and we might not be able to use the scalar unit like that. Without the scalar unit, we'll get: 32*16*925*10^6/((16*16+4+5)*34000) = 52564 c/s Without gather loads, but using the scalar unit, we might have up to: 32*(4+1)*925*10^6/((16*16+4+5)*34000) = 16426 c/s if we're somehow able to use the 4 SIMDs and LDS anyway (effectively doing only one Blowfish encryption per SIMD unit), although from the slides it appears like gather loads are supported. Finally, using the scalar unit alone, we'd get up to: 32*1*925*10^6/((16*16+4+5)*34000) = 3285 c/s (or less because of the occasional writes, which will be very slow). Thus, the possible speedup over CPU should be at most 10x for HD 7970 vs. FX-8120 at stock clocks. So far I've ignored the possibility of using L2 caches and global memory, though. Maybe adding that to make at least some (poor) use of the otherwise idle SIMD units or lanes would increase the total speed a little bit. Maybe it would partially compensate the overall speed decrease from stalls and overhead, which I did not make an attempt at estimating above. Overall, I think that we should give this a try. BTW, on CPU some speedup might be possible from mixing in MMX instructions - just to make use of the otherwise-idle vector unit. I did not include this in any of the estimates above. Yet we may try implementing it too. I'd appreciate any comments. Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.