Date: Thu, 15 Mar 2012 03:37:25 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: AMD Bulldozer and XOP (was: RAR format finally proper)

On Thu, Mar 15, 2012 at 12:34:16AM +0200, Milen Rangelov wrote:
> Actually (I know that's offtopic) this CPU demonstrates some weird
> behavior.

Why, this is on topic for JtR development, since you're looking into and
talking about issues that are relevant to code optimizations in JtR as
well.  Thank you.

> With my SSE2 code, a 4-core Phenom II @3.2GHz is almost as fast
> as the 6-core FX-6100 @3.3 GHz.

That's as expected.  In fact, you're lucky that the FX is not slower in
your case.  The 6-core (and 8-core for FX-81x0) is actually 3-module (or
4-module, respectively), where each module has two sets of register
files (like with Intel's Hyperthreading) and two sets of integer ALUs
and AGUs (that's a new thing compared to Intel's Hyperthreading), but
only a shared set of other execution units (including vector).  So when
you primarily use the vector ops, the CPU is effectively 3-core (or
4-core for FX-81x0) with SMT.

http://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)#Bulldozer_core_.28module.29

> At first that seemed strange, then I
> implemented the XOP codepaths for MD5/MD4/SHA1 and then things look better
> now, the _mm_roti_epi32/_mm_cmov_si128 optimizations lead to ~ 40%
> improvement as compared to the SSE2 code (still worse than what I expected
> though). Same for SHA1 and MD4.

Yeah, I was planning to try that in JtR as well, but didn't get around
to it yet.  It's good news that this worked well for you.

> Then came the DES stuff. I decided to test Alexander's s-boxes.

They're Roman's, not mine; my role was to choose the versions producing
more optimal code (considering register pressure and parallelism).

> Well that was surprising. I used the ones for architectures supporting
> bitselect instructions that should have much less gates than the original
> Matthew Kwan s-boxes I used until now. Yet, I got the same speeds.

That's puzzling indeed.  XOP does provide some decent speedup over AVX
for bitslice DES in JtR when benchmarked on the same Bulldozer CPU.
You can see the numbers here:

http://openwall.info/wiki/john/benchmarks

18527K / 14247K 128/128 BS XOP-16 vs. 16442K / 12792K 128/128 BS AVX-16
4700K / 4418K 128/128 BS XOP-16 vs. 3951K / 3786K 128/128 BS AVX-16

These are for "FX-8120 o/c 3.6 GHz + turbo".  These were
user-contributed numbers.  Somehow I am getting numbers similar to the
above on my FX-8120 without any overclocking (well, maybe only slightly
lower numbers).  I'll do more benchmarking with different clock rates
and update the wiki later.

BTW, Core i7-2600 at stock clock rate (3.4 GHz + turbo) is faster at
these despite only having AVX:

22773K / 18284K 128/128 BS AVX-16 (8 threads)
5802K / 5491K 128/128 BS AVX-16 (non-OpenMP)

FX-8120 has a slight advantage over Core i7-2600 at ALU-heavy code such
as bcrypt, though: approx. 5500 c/s vs. 4800 c/s for 8 threads.  These
correspond to approx. 5.8x vs. 5.1x increase over single-thread speed.

> hashkill
> and jtr are similar in design as far as the bitslice DES part is concerned,
> the biggest difference being the way keys are set up. I guess I'd spend
> some more days investigating that

What specific speeds did you get, and for what hash types?  For LM
hashes, the key setup may easily eat up more time than the DES
encryption step does.

Alexander