john-dev - AMD Bulldozer and XOP (was: RAR format finally proper)

Message-ID: <20120314233725.GA6015@openwall.com>
Date: Thu, 15 Mar 2012 03:37:25 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: AMD Bulldozer and XOP (was: RAR format finally proper)

On Thu, Mar 15, 2012 at 12:34:16AM +0200, Milen Rangelov wrote:
> Actually (I know that's offtopic) this CPU demonstrates some weird
> behavior.

Why, this is on topic for JtR development since you're looking into and
talking about issues that are relevant to code optimizations in JtR as
well.  Thank you.

> With my SSE2 code, a 4-core Phenom II @3.2GHz is almost as fast
> as the 6-core FX-6100 @3.3 GHz.

That's as expected.  In fact, you're lucky that the FX is not slower in
your case.

The 6-core FX-6100 (and the 8-core FX-81x0) is actually a 3-module (or
4-module, respectively) CPU, where each module has two sets of register
files (as with Intel's Hyperthreading) and two sets of integer ALUs
and AGUs (that's new compared to Intel's Hyperthreading), but only one
shared set of the other execution units (including vector).  So when
you primarily use vector ops, the CPU is effectively 3-core (or
4-core for FX-81x0) with SMT.
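
A rough way to see why the two CPUs end up close for vector code is to
count vector units rather than integer cores.  The sketch below is only
a crude model: the function names are made up for illustration, and it
assumes equal per-clock vector throughput per unit, which the message
does not claim.

```c
#include <stdint.h>

/* Bulldozer: two integer cores per module share one set of vector units,
 * so the number of vector-capable "cores" is the number of modules. */
static unsigned bulldozer_modules(unsigned integer_cores)
{
    return integer_cores / 2;
}

/* Crude "vector units x clock" figure in MHz.  Assumption: every unit
 * delivers the same work per clock, which real CPUs do not guarantee. */
static unsigned vector_unit_mhz(unsigned vector_units, unsigned clock_mhz)
{
    return vector_units * clock_mhz;
}
```

With these assumptions, the 4-core Phenom II at 3.2 GHz comes out at
vector_unit_mhz(4, 3200) = 12800, and the FX-6100 at 3.3 GHz at
vector_unit_mhz(bulldozer_modules(6), 3300) = 9900.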

http://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)#Bulldozer_core_.28module.29

> At first that seemed strange, then I
> implemented the XOP codepaths for MD5/MD4/SHA1 and then things look better
> now, the _mm_roti_epi32/_mm_cmov_si128 optimizations lead to ~ 40%
> improvement as compared to the SSE2 code (still worse than what I expected
> though). Same for SHA1 and MD4.

Yeah, I was planning to try that in JtR as well, but didn't get around
to it yet.  It's good news that this worked well for you.
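
For reference, here is a minimal sketch of where the two XOP intrinsics
mentioned above fit in.  It is not hashkill's or JtR's code; the names
ROTL32x4, md5_f and lane0 are made up for illustration.  With plain
SSE2, a 32-bit rotate takes two shifts and an OR, and MD5's F function
takes three logical ops; with XOP each becomes one instruction.  The
XOP path is only compiled when the compiler defines __XOP__ (e.g. with
gcc -mxop).

```c
#include <stdint.h>
#include <emmintrin.h>          /* SSE2 */
#ifdef __XOP__
#include <x86intrin.h>          /* XOP: _mm_roti_epi32, _mm_cmov_si128 */
#endif

/* Rotate each 32-bit lane left by n bits (n is a constant in 1..31). */
#ifdef __XOP__
#define ROTL32x4(x, n) _mm_roti_epi32((x), (n))
#else
#define ROTL32x4(x, n) \
    _mm_or_si128(_mm_slli_epi32((x), (n)), _mm_srli_epi32((x), 32 - (n)))
#endif

/* MD5 round 1 function F(x,y,z) = (x & y) | (~x & z):
 * per bit, take y where x is 1 and z where x is 0. */
static inline __m128i md5_f(__m128i x, __m128i y, __m128i z)
{
#ifdef __XOP__
    return _mm_cmov_si128(y, z, x);     /* single bitselect instruction */
#else
    return _mm_or_si128(_mm_and_si128(x, y), _mm_andnot_si128(x, z));
#endif
}

/* Helper for inspecting results: extract the lowest 32-bit lane. */
static inline uint32_t lane0(__m128i v)
{
    return (uint32_t)_mm_cvtsi128_si32(v);
}
```

Both paths compute the same values, so the SSE2 fallback can serve as a
reference when checking an XOP build.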

> Then came the DES stuff. I decided to test Alexander's s-boxes.

They're Roman's, not mine; my role was to choose the versions producing
more optimal code (considering register pressure and parallelism).

> Well that was surprising. I used the ones for architectures supporting
> bitselect instructions that should have much less gates than the original
> Matthew Kwan s-boxes I used until now. Yet, I got the same speeds.

That's puzzling indeed.  XOP does provide some decent speedup over AVX
for bitslice DES in JtR when benchmarked on the same Bulldozer CPU.  You
can see the numbers here: http://openwall.info/wiki/john/benchmarks

18527K / 14247K 128/128 BS XOP-16
vs.
16442K / 12792K 128/128 BS AVX-16

4700K / 4418K 128/128 BS XOP-16
vs.
3951K / 3786K 128/128 BS AVX-16

These are for "FX-8120 o/c 3.6 GHz + turbo".  These were
user-contributed numbers.  Somehow I am getting numbers similar to the
above on my FX-8120 without any overclocking (well, maybe only slightly
lower numbers).  I'll do more benchmarking with different clock rates
and update the wiki later.  BTW, a Core i7-2600 at its stock clock rate
(3.4 GHz + turbo) is faster at these despite having only AVX:

22773K / 18284K 128/128 BS AVX-16 (8 threads)
5802K / 5491K 128/128 BS AVX-16 (non-OpenMP)
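
As background on the bitselect s-boxes discussed above: in bitslice
code, each machine word carries the same logical bit for many DES
instances, and a "bitselect" gate picks, per bit position, one of two
inputs.  The sketch below uses plain 64-bit words and made-up names
(bs_word, bs_select, bs_select_xor); it only illustrates the gate, not
the actual s-box expressions.  On a CPU with a native select (such as
XOP's VPCMOV) the gate is one instruction, otherwise it costs up to
three logical ops.

```c
#include <stdint.h>

/* One word = the same logical bit for 64 independent DES instances. */
typedef uint64_t bs_word;

/* Bitselect gate: per bit position, take a where sel is 1, else b.
 * Without a native select this is AND, AND-NOT and OR. */
static inline bs_word bs_select(bs_word a, bs_word b, bs_word sel)
{
    return (a & sel) | (b & ~sel);
}

/* Equivalent form built from XOR and AND, also three operations. */
static inline bs_word bs_select_xor(bs_word a, bs_word b, bs_word sel)
{
    return b ^ ((a ^ b) & sel);
}
```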

The FX-8120 has a slight advantage over the Core i7-2600 at ALU-heavy
code such as bcrypt, though: approx. 5500 c/s vs. 4800 c/s for 8 threads.
These correspond to approx. 5.8x vs. 5.1x increase over single thread
speed.
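
Dividing the 8-thread rates by the scaling factors gives the implied
single-thread rates.  A trivial sketch (the function name is made up,
and the inputs are the rounded figures quoted above, so the results are
only approximate):

```c
/* Implied single-thread rate from an aggregate rate and the reported
 * speedup over one thread. */
static double single_thread_cps(double total_cps, double scaling)
{
    return total_cps / scaling;
}
```

single_thread_cps(5500, 5.8) and single_thread_cps(4800, 5.1) come out
at roughly 948 and 941 c/s, so on these rounded figures the two CPUs
are about equal per thread and the FX-8120's edge comes from scaling
better across 8 threads.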

> hashkill
> and jtr are similar in design as far as the bitslice DES part is concerned,
> the biggest difference being the way keys are set up. I guess I'd spend
> some more days investigating that

What specific speeds did you get, and for what hash types?

For LM hashes, the key setup may easily eat up more time than the DES
encryption step does.
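
To illustrate why key setup can be costly in bitslice code, here is a
deliberately naive sketch.  It is not JtR's or hashkill's key setup; the
name bs_set_key, the 64-candidate word size and the bit ordering are
all assumptions made for the example.  Each candidate key has to be
scattered one bit at a time across 56 separate words, so a single
candidate costs 56 read-modify-write operations before any DES work is
done.

```c
#include <stdint.h>

#define BS_KEY_BITS 56          /* DES key bits, parity bits excluded */

/* K[b] holds key bit b for 64 candidates, one candidate per bit
 * position.  Store the 7-byte key of candidate idx (0..63) into K. */
static void bs_set_key(uint64_t K[BS_KEY_BITS], unsigned idx,
                       const unsigned char key[7])
{
    uint64_t mask = (uint64_t)1 << idx;
    unsigned b;

    for (b = 0; b < BS_KEY_BITS; b++) {
        /* Hypothetical bit order: bit (b % 8) of byte (b / 8). */
        unsigned bit = (key[b / 8] >> (b % 8)) & 1;
        if (bit)
            K[b] |= mask;
        else
            K[b] &= ~mask;
    }
}
```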

Alexander