Date: Thu, 15 Mar 2012 11:46:09 +0200
From: Milen Rangelov <>
Subject: Re: AMD Bulldozer and XOP (was: RAR format finally proper)

Hello Alexander,

> That's as expected.  In fact, you're lucky that the FX is not slower in
> your case.
> The 6-core (and 8-core for FX-81x0) is actually 3-module (or
> 4-module, respectively), where each module has two sets of register
> files (like with Intel's Hyperthreading) and two sets of integer ALUs
> and AGUs (that's a new thing compared to Intel's Hyperthreading), but
> only a shared set of other execution units (including vector).  So when
> you primarily use the vector ops, the CPU is effectively 3-core (or
> 4-core for FX-81x0) with SMT.
I should have read that before I spent EUR 250 on a new CPU and motherboard.
I feel pissed now, eheh.

In fact it was a bit faster for MD5 and MD4, a bit slower for SHA1 and
almost the same speed for DES-based hashes.

> Yeah, I was planning to try that in JtR as well, but didn't get around
> to it yet.  It's good news that this worked well for you.

Actually the quoted improvement percentage was not correct. I have made some
more improvements since (e.g. using the SSSE3 byte shuffle to speed up the
byte order reversals in SHA1, and optimizing the early checks a bit). What I
got after that:

MD5 single hash:   128M c/s with SSE2 -> 181M c/s with XOP
NTLM single hash:  114M c/s with SSE2 -> 162M c/s with XOP
SHA1 single hash:   39M c/s with SSE2 ->  63M c/s with XOP
(on Phenom II X4 it used to be around 42M c/s)
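
For reference, the relative gains implied by the three figures above can be worked out directly. This is only arithmetic on the quoted numbers; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Relative gain of `after` over `before`, in percent.
 * Both arguments are throughputs in the same unit (here: millions of c/s). */
double speedup_percent(double before, double after)
{
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}

/* Print the SSE2 -> XOP gain for the three single-hash figures above. */
void print_xop_gains(void)
{
    printf("MD5:  %.1f%%\n", speedup_percent(128.0, 181.0));
    printf("NTLM: %.1f%%\n", speedup_percent(114.0, 162.0));
    printf("SHA1: %.1f%%\n", speedup_percent(39.0, 63.0));
}
```

Calling print_xop_gains() gives roughly 41% for MD5, 42% for NTLM and 62% for SHA1.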

All of these numbers are for a single hash. That path skips the slow bitmap
lookups, and I also do an early check several steps in advance (but no
MD5/MD4 step reversals, as those are incompatible with my design).
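
To illustrate the byte order reversal mentioned earlier: SHA1 reads its message words as big-endian, so on x86 every 32-bit word has to be byte-swapped, and a single byte shuffle (PSHUFB, an SSSE3 instruction) can do that for four words at once. Below is a minimal portable sketch that models the shuffle in plain C. The names shuffle16, BSWAP32_CTL and bswap32x4 are hypothetical and are not taken from the actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable model of a 16-byte shuffle: out[i] = in[ctl[i] & 15].
 * The real PSHUFB also zeroes a lane when bit 7 of ctl[i] is set;
 * that case is not needed here and is rejected by the assert. */
void shuffle16(uint8_t out[16], const uint8_t in[16], const uint8_t ctl[16])
{
    uint8_t tmp[16];   /* temporary, so that out may alias in */
    for (int i = 0; i < 16; i++) {
        assert((ctl[i] & 0x80) == 0);
        tmp[i] = in[ctl[i] & 15];
    }
    memcpy(out, tmp, 16);
}

/* Control bytes that reverse the byte order inside each 32-bit lane. */
static const uint8_t BSWAP32_CTL[16] = {
    3, 2, 1, 0,  7, 6, 5, 4,  11, 10, 9, 8,  15, 14, 13, 12
};

/* Byte-swap four 32-bit words stored back to back in block[]. */
void bswap32x4(uint8_t block[16])
{
    shuffle16(block, block, BSWAP32_CTL);
}
```

With intrinsics, the same permutation would presumably be one _mm_shuffle_epi8(v, mask) from <tmmintrin.h>, with mask built as _mm_set_epi8(12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3), in place of the several shifts and ORs that plain SSE2 needs.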

I noticed multihash MD5 improved a lot with the new CPU - I get about 120M
c/s as compared to ~65M c/s on PhenomII X4.

> They're Roman's, not mine; my role was to choose the versions producing
> more optimal code (considering register pressure and parallelism).
> That's puzzling indeed.  XOP does provide some decent speedup over AVX
> for bitslice DES in JtR when benchmarked on the same Bulldozer CPU.  You
> can see the numbers here:
> 18527K / 14247K 128/128 BS XOP-16
> vs.
> 16442K / 12792K 128/128 BS AVX-16
> 4700K / 4418K 128/128 BS XOP-16
> vs.
> 3951K / 3786K 128/128 BS AVX-16
> These are for "FX-8120 o/c 3.6 GHz + turbo".  These were
> user-contributed numbers.  Somehow I am getting numbers similar to the
> above on my FX-8120 without any overclocking (well, maybe only slightly
> lower numbers).  I'll do more benchmarking with different clock rates
> and update the wiki later.  BTW, Core i7-2600 at stock clock rate
> (3.4 GHz + turbo) is faster at these despite of only having AVX:
> 22773K / 18284K 128/128 BS AVX-16 (8 threads)
> 5802K / 5491K 128/128 BS AVX-16 (non-OpenMP)

Well, I tried using the 128-bit xmm instructions, not the ymm versions (which
would require a lot of rework). Speed with both Kwan's sboxes (SSE2) and
Roman's (XOP) is about 5M c/s (6 threads). I believe the problem is
somewhere else (perhaps my key/block setup is to blame).

> FX-8120 has slight advantage over Core i7-2600 at ALU-heavy code such as
> bcrypt, though: approx. 5500 c/s vs. 4800 c/s for 8 threads.  These
> correspond to approx. 5.8x vs. 5.1x increase over single thread speed.

I don't support bcrypt yet :(

> What specific speeds did you get, and for what hash types?
> For LM hashes, the key setup may easily eat up more time than the DES
> encryption step does.

For LM hashes I get some improvement over Kwan's sboxes (42M c/s vs 35M
c/s). I think my bitslice key/block setup is to blame for this. I use some
loops with _mm_movemask_epi8, which at first glance look neat and fast, but
PMOVMSKB turns out not to be that fast. I also suspect I get a lot of
register spills because my code is not quite optimal.
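
For readers unfamiliar with the idea, a movemask-based bitslice setup loop might look roughly like the sketch below. It takes one byte position from 16 candidate keys and splits it into 8 bit planes, one PMOVMSKB per plane. This is an assumed reconstruction of the general technique, not the actual code; the function name and data layout are hypothetical, and a plain C fallback is included for builds without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* bytes[i] is the same byte position taken from candidate key i.
 * On return, bit i of planes[b] equals bit b of bytes[i]. */
void byte_column_to_planes(const uint8_t bytes[16], uint16_t planes[8])
{
    assert(bytes != NULL && planes != NULL);
#ifdef __SSE2__
    __m128i v = _mm_loadu_si128((const __m128i *)bytes);
    for (int b = 7; b >= 0; b--) {
        /* PMOVMSKB: collect the top bit of each of the 16 bytes */
        planes[b] = (uint16_t)_mm_movemask_epi8(v);
        /* v + v shifts every byte left by one, exposing the next bit */
        v = _mm_add_epi8(v, v);
    }
#else
    /* Portable fallback computing the same result bit by bit */
    for (int b = 0; b < 8; b++) {
        uint16_t m = 0;
        for (int i = 0; i < 16; i++)
            m |= (uint16_t)(((bytes[i] >> b) & 1u) << i);
        planes[b] = m;
    }
#endif
}
```

Each loop iteration moves a result from a vector register to a general-purpose register, and 8 such moves are needed per 16 input bytes, which may be part of why this approach is slower than it looks.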


