Date: Wed, 24 Jun 2015 07:10:07 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: optimizing bcrypt cracking on x86

Hi,

This is to share some of my thoughts and results from a while ago.

As a side-effect of Katja's project, we realized that with our current
bcrypt code we're still way below the peak L1 data cache read rate on
x86 CPUs:

http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-51.html
http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-52.html

As you can see, if L1 data cache reads were the bottleneck, i7-4770K
could theoretically do up to ~11.1k c/s, but it only achieves ~6.6k c/s
with our current code.  That's about 60%.  Not bad, but there might be
room for improvement.

The 11.1k figure assumes the 3.7 GHz max turbo clock rate with all
cores in use (I think that's realistic), but it also assumes that P is
read from cache (not from registers) and that P reads are not combined.
Since P is read sequentially, it may be possible to combine its 32-bit
reads into fewer, larger reads (whether for one bcrypt instance or
across instances), and then we'd need fewer than 5 reads from cache per
Blowfish round on average: 4.5 with 64-bit reads, or even closer to 4
with reads wider than 64 bits.  With this, the 11.1k figure might even
be an under-estimate.  At 4.5 reads/round, it'd be 12.3k.

Now, the bottleneck does indeed appear to be different.  It appears to
take too many instructions and uops to extract all of the S-box indices
and compute the effective addresses to keep the L1 data cache read
ports 100% busy at all times.  (We only keep them about 60% busy.)  Can
we optimize the instruction stream such that fewer uops would be needed
per S-box lookup on average?  My gut feeling, especially after the
experiments that I'll describe below, is that we can't do much on
current CPUs.  But nevertheless we can probably do a little bit better.
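[Editor's sketch, not JtR code: the reads-per-round arithmetic above made
concrete in C, with dummy zero-filled S-boxes and my own names.  Two
unrolled Blowfish rounds share one 64-bit P load, so two rounds cost
9 cache reads (8 S-box + 1 P) instead of 10, i.e. 4.5 reads/round.]

```c
#include <stdint.h>

/* Illustrative only: dummy zero-filled S-boxes standing in for the
 * real key-dependent ones. */
static uint32_t S0[256], S1[256], S2[256], S3[256];

/* The standard Blowfish F function: 4 S-box reads per call. */
static uint32_t F(uint32_t x)
{
	return ((S0[x >> 24] + S1[(x >> 16) & 0xFF]) ^
	    S2[(x >> 8) & 0xFF]) + S3[x & 0xFF];
}

/* Two unrolled rounds sharing one 64-bit P load: 8 S-box reads plus
 * 1 P read = 9 reads, instead of 10 with separate 32-bit P reads.
 * Assumes P[i] sits in the low half (little-endian P layout). */
static void two_rounds(uint32_t *L, uint32_t *R, const uint64_t *P64)
{
	uint64_t p = *P64;	/* one read covers P[i] and P[i+1] */

	*L ^= (uint32_t)p;
	*R ^= F(*L);
	*R ^= (uint32_t)(p >> 32);
	*L ^= F(*R);
}
```

[With reads wider than 64 bits, or with P reads shared across
interleaved instances, the average drops closer to 4 reads/round, as
the text above notes.]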
We've previously tried AVX2 with no luck (on Haswell; future CPUs might
or might not do better):

http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-42.html

Alain has recently mentioned similarly poor results for his attempts to
use AVX2, and he also got similar speeds with SSE4.1 (which means doing
the gather loads manually).  It is not surprising that there's little
difference between Haswell's microcoded AVX2 gather loads and manually
performed ones at the x86 instruction stream level.  By "similar poor
results" I mean speeds of around 4k c/s on quad-core Haswell, whereas
our interleaving of scalar instructions, even when implemented in C,
achieves 6k+ c/s (such as the ~6.6k figure I mentioned above).

Alain also mentioned trying BMI (the extra scalar instructions that
were introduced at the same time as AVX2), with a slight speed
improvement.  I had previously tried BMI (in Nov 2013, per file
timestamps) at the C source code level, and IIRC this didn't help (but
didn't hurt much either, leaving the chance that a pure asm
implementation would be good).  The relevant BMI instructions are
BEXTR, PEXT, and SHRX.  BEXTR and PEXT are available via the
_bextr_u32() and _pext_u64() intrinsics, respectively.  For SHRX, I
tried inline asm, which hurts gcc's instruction scheduling.  I've now
tried BEXTR in pure asm code as well, see below.  I haven't yet
re-tried SHRX in pure asm; this is worth trying.

Anyway, my recent idea was to try MMX2, in particular making use of the
PEXTRW instruction introduced with Pentium 3 (along with the first
SSE).  I must admit this didn't appear to allow us to use Haswell's
ports any more optimally, but I decided to give it a try anyway.

http://www.realworldtech.com/haswell-cpu/4/

Even with use of MMX2, there are still plenty of scalar operations for
extracting the S-box indices, and if that's not enough to use Haswell's
port 6 fully, we can mix in 1 or 2 scalar implementations as well.
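[Editor's sketch of the gather approach discussed above, with my own
names, not Alain's actual code: a single vpgatherdd fetches one S-box
entry for each of eight interleaved instances.  On Haswell it is
microcoded into per-lane loads anyway, which is consistent with it
being no faster than the manual SSE4.1-style loads.]

```c
#include <stdint.h>
#include <immintrin.h>

/* Gather eight 32-bit S-box entries, one per interleaved bcrypt
 * instance, with a single AVX2 gather.  Requires AVX2 at runtime. */
__attribute__((target("avx2")))
void sbox_gather8(uint32_t out[8], const uint32_t *sbox,
    const uint32_t idx[8])
{
	__m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
	/* scale 4 = sizeof(uint32_t); this one instruction performs
	 * out[i] = sbox[idx[i]] for all eight lanes */
	__m256i v = _mm256_i32gather_epi32((const int *)sbox, vidx, 4);
	_mm256_storeu_si256((__m256i *)out, v);
}
```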
Here are the speeds I got on Pentium 3, 1.0 GHz:

Original Pentium optimized asm code - 144 c/s
Pentium Pro/2/3 optimized asm code  - 171 c/s
Pure C, 2x interleaving, gcc 4.6.3  - 211 c/s
New 2x2 (4 instances) MMX2 asm code - 244 c/s

(The "Pure C, 2x interleaving" version is what has actually been in JtR
for a while now, when building for 32-bit x86 with a not-too-old gcc.)

Wow.  That's a 15%+ improvement over the previous best result.  I wish
speeds on this CPU were still practically relevant.

Moving to i7-4770K, the same 32-bit binary still runs a bit faster than
a 32-bit build of JtR as publicly released, compiled with "gcc version
4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)".

One instance:

Old: 1049 c/s
New: 1347 c/s

8 concurrent instances (separate processes in both cases):

Old: 760*8 = 6080 c/s
New: 772*8 = 6176 c/s

This is only a 1.6% improvement, but perhaps for CPUs without HT the
improvement is much greater (as the single instance speeds suggest).

On 64-bit builds, though, I only got this to run at cumulative speeds
like 780*8 = 6240 c/s, which is worse than the 6595 c/s previously seen
with OpenMP (and even worse than the slightly better speeds that can be
seen with separate independent processes).  Replacing one of the MMX2
instances with one or two scalar instances (thus, 3 or 4 instances
total per thread) somehow didn't help.  (Maybe I just didn't try hard
enough.)

So we appear to have an improvement for crippled systems - ancient CPUs
like Pentium 3 or, more importantly, 32-bit builds running on
lower-cost HT-less versions of recent CPUs.  Unfortunately, since the
improvement is not universal and since data layout changes are needed
(the 64-bit P reads that I mentioned above), I can't just commit this
code as-is.
The 2x2 MMX2 code looks like:

#define P(N)	BF_current+16*N
#define Sa(N)	P(18)+0x400*N
#define Sb(N)	P(18)+0x1000+0x400*N
#define Sc(N)	P(18)+0x2000+0x400*N
#define Sd(N)	P(18)+0x3000+0x400*N

#define BF_ROUND(Lab, Lcd, Rab, Rcd, N) \
	movd Lab,tmp1; \
	shldl $16,tmp1,tmp2; \
	movzbl tmp1_hi,tmp5; \
	pextrw $3,Lab,tmp4; \
	movzbl tmp2_hi,tmp6; \
	pextrw $2,Lab,tmp3; \
	andl $0xFF,tmp1; \
	movd Sa(0)(,tmp6,4),tmpm1; \
	andl $0xFF,tmp2; \
	movd Sa(1)(,tmp2,4),tmpm2; \
	movzbl tmp4_hi,tmp6; \
	movd Sa(2)(,tmp5,4),tmpm3; \
	andl $0xFF,tmp4; \
	movd Sa(3)(,tmp1,4),tmpm4; \
	movd Lcd,tmp1; \
	movzbl tmp3_hi,tmp5; \
	punpckldq Sb(0)(,tmp6,4),tmpm1; \
	andl $0xFF,tmp3; \
	punpckldq Sb(1)(,tmp4,4),tmpm2; \
	shldl $16,tmp1,tmp2; \
	punpckldq Sb(2)(,tmp5,4),tmpm3; \
	movzbl tmp1_hi,tmp5; \
	paddd tmpm2,tmpm1; \
	pextrw $3,Lcd,tmp4; \
	punpckldq Sb(3)(,tmp3,4),tmpm4; \
	movzbl tmp2_hi,tmp6; \
	pxor tmpm3,tmpm1; \
	pextrw $2,Lcd,tmp3; \
	pxor P(N)+16,Rab; \
	andl $0xFF,tmp1; \
	paddd tmpm4,tmpm1; \
	andl $0xFF,tmp2; \
	pxor tmpm1,Rab; \
	movd Sc(0)(,tmp6,4),tmpm1; \
	movd Sc(1)(,tmp2,4),tmpm2; \
	movzbl tmp4_hi,tmp6; \
	movd Sc(2)(,tmp5,4),tmpm3; \
	andl $0xFF,tmp4; \
	movd Sc(3)(,tmp1,4),tmpm4; \
	movzbl tmp3_hi,tmp5; \
	punpckldq Sd(0)(,tmp6,4),tmpm1; \
	andl $0xFF,tmp3; \
	punpckldq Sd(1)(,tmp4,4),tmpm2; \
	punpckldq Sd(2)(,tmp5,4),tmpm3; \
	paddd tmpm2,tmpm1; \
	punpckldq Sd(3)(,tmp3,4),tmpm4; \
	pxor tmpm3,tmpm1; \
	pxor P(N)+24,Rcd; \
	paddd tmpm4,tmpm1; \
	pxor tmpm1,Rcd

This is 50 instructions per round for 4 instances combined, so 12.5
instructions per round per instance.  I think that's pretty good.

I've also tried BMI in several ways, including taking it to the extreme
in terms of minimizing x86 instruction count (likely not the optimal
choice, since what we actually need to optimize is uops and latency
hiding).
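[Editor's note: in C terms, the PEXTRW/SHLD/MOVZBL sequence in BF_ROUND
above is just pulling the four byte-wide S-box indices out of each
packed L word.  A sketch of what it computes for one MMX register
holding the L words of instances a (low half) and b (high half), with
my own names, not JtR's:]

```c
#include <stdint.h>

/* Index extraction performed by the asm above, expressed in C:
 * "Lab" packs the Blowfish L words of two instances into 64 bits;
 * each instance needs its four bytes as S-box indices, most
 * significant byte first. */
void extract_indices(uint64_t Lab, uint32_t ia[4], uint32_t ib[4])
{
	uint32_t La = (uint32_t)Lab;		/* instance a: low half */
	uint32_t Lb = (uint32_t)(Lab >> 32);	/* instance b: high half */

	ia[0] = La >> 24;
	ia[1] = (La >> 16) & 0xFF;
	ia[2] = (La >> 8) & 0xFF;
	ia[3] = La & 0xFF;

	ib[0] = Lb >> 24;
	ib[1] = (Lb >> 16) & 0xFF;
	ib[2] = (Lb >> 8) & 0xFF;
	ib[3] = Lb & 0xFF;
}
```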
That version is:

#define P(N)	BF_current+8*N
#define Sa(N)	P(18)+0x400*N
#define Sb(N)	P(18)+0x1000+0x400*N

#define BF_ROUND(La, La_lo, La_hi, Lb, Lb_lo, Lb_hi, Ra, Rb, N) \
	bextr %r14d,La,tmp1; \
	bextr %r14d,Lb,tmp5; \
	xorl P(N)+8,Ra; \
	xorl P(N)+12,Rb; \
	bextr %r15d,La,tmp2; \
	bextr %r15d,Lb,tmp6; \
	movl Sa(0)(,tmp1_64,4),tmp1; \
	movl Sb(0)(,tmp5_64,4),tmp5; \
	movzbl La_hi,tmp3; \
	movzbl Lb_hi,tmp7; \
	addl Sa(1)(,tmp2_64,4),tmp1; \
	addl Sb(1)(,tmp6_64,4),tmp5; \
	movzbl La_lo,tmp4; \
	movzbl Lb_lo,tmp8; \
	xorl Sa(2)(,tmp3_64,4),tmp1; \
	xorl Sb(2)(,tmp7_64,4),tmp5; \
	addl Sa(3)(,tmp4_64,4),tmp1; \
	addl Sb(3)(,tmp8_64,4),tmp5; \
	xorl tmp1,Ra; \
	xorl tmp5,Rb

where the out-of-loop initialization code included:

	movl $0x0818,%r14d
	movl $0x0810,%r15d

This is 20 instructions per round for two interleaved instances, so 10
instructions per round per instance.  Thus, BMI can beat MMX2 in x86
instruction count, but like I said, that is likely suboptimal in other
aspects.

So many experiments, and no luck yet on current non-crippled CPUs and
systems (64-bit with HT).  If I find more time for this, I'll probably
proceed with high-level optimizations first, just to improve things a
little bit, achieving something like 6.8k c/s on the i7-4770K with
OpenMP:

http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-50.html

by making the uses of Blowfish outside of the variable cost loop use
the interleaved implementation as well.  And maybe 7k with independent
instances (no thread safety gives an extra register).

Alexander
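[Editor's note on the out-of-loop constants above: BEXTR takes a
control operand whose low byte is the start bit and whose second byte
is the field length, so 0x0818 extracts bits 24..31 (the top byte of L)
and 0x0810 extracts bits 16..23 (byte 2).  In C, via the intrinsic (a
sketch, not JtR code):]

```c
#include <stdint.h>
#include <x86intrin.h>

/* What "bextr %r14d,La,tmp1" computes with %r14d = 0x0818:
 * start = 0x18 = 24, length = 0x08, i.e. the top byte of La. */
__attribute__((target("bmi")))
uint32_t top_byte(uint32_t x)
{
	return _bextr_u32(x, 24, 8);
}

/* And with %r15d = 0x0810: start = 16, length = 8, i.e. byte 2. */
__attribute__((target("bmi")))
uint32_t byte2(uint32_t x)
{
	return _bextr_u32(x, 16, 8);
}
```

[The remaining two indices come from MOVZBL on the low 16 bits of the
register (La_lo, La_hi), which needs no BEXTR at all.]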