Date: Fri, 28 Apr 2006 04:07:40 +0400
From: Solar Designer <>
Subject: Re: Performance tuning

I wrote, regarding compile-time benchmarking:

> > This makes sense - and John has been doing that kind of thing in
> > "generic" and 32-bit SPARC builds.  But this approach also has several
> > disadvantages.  Most likely, I'll leave it for "generic" only in the
> > next official release.

On Fri, Apr 28, 2006 at 01:29:11AM +0200, wrote:
> Why?


- it is desirable to be able to produce binary packages consistently;

- those binary packages need to have decent performance on many/all
CPUs - which can't be said of builds that happen to require large caches
just because those caches were available in the build system;

- the particular 32-bit SPARC assembly code is largely obsolete - I may
just drop it;

- cross-compiling (from another platform that can't run the target
platform's binaries at all) makes build-time benchmarking impossible;

- it just might not be worth the complexity (and the associated bugs and
build failures in unusual environments).

Those are some reasons to not do compile-time benchmarking.  But I
haven't made a final determination on this yet.
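
To make the trade-off concrete, here is a minimal sketch of what
compile-time benchmarking amounts to.  It is an illustration only and is
not taken from John's build system: the names time_candidate(),
pick_fastest(), candidate_a() and candidate_b() are hypothetical, and the
two candidates are trivial stand-ins for real alternative code versions.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*candidate_fn)(void);

/* Hypothetical stand-ins for two alternative implementations. */
static volatile unsigned sink;
void candidate_a(void) { sink += 1; }
void candidate_b(void) { for (int i = 0; i < 4; i++) sink += 1; }

/* Run fn `iters` times and return the elapsed CPU time in seconds. */
double time_candidate(candidate_fn fn, long iters)
{
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/*
 * Return the index of the candidate that ran fastest on THIS machine.
 * The answer reflects the build host's CPU and caches, not the
 * machines the resulting binary will later run on.
 */
size_t pick_fastest(const candidate_fn *fns, size_t n, long iters)
{
    size_t best = 0;
    double best_time = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        double t = time_candidate(fns[i], iters);
        if (i == 0 || t < best_time) {
            best = i;
            best_time = t;
        }
    }
    return best;
}
```

A build script would call pick_fastest() once and record the winning
index (for example, as a #define).  Two builds from identical sources can
then differ depending on where they were built, and nothing of this kind
can run at all when cross-compiling.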

> > I've benchmarked the 32-bit-with-SSE code (effectively 128-bit) on
> > 32-bit and 64-bit AMD CPUs.  They behaved almost the same.
> What about 64-bit SSE?

There's no such thing.  There's SSE-as-available-in-64-bit-mode.  I have
not benchmarked it yet, but I really expect no performance difference as
long as I use the same 8 registers.  For the other 8, I expect the
performance to be either the same or even worse.  The latter may have to
do with the overhead of decoding x86-64 instructions into micro-ops - the
x86-64/SSE instructions are one byte longer when they reference any of
the added 8 registers:

sse64.o:     file format elf64-x86-64

   0:   0f 56 fa                orps   %xmm2,%xmm7
   3:   45 0f 56 ec             orps   %xmm12,%xmm13

The first instruction is encoded exactly the same as it would be on
plain x86/SSE - while the second one has an added prefix.
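
The two encodings above can be reproduced with a short sketch.  The
helper encode_orps() is hypothetical and exists only for illustration; it
handles just the register-to-register form of ORPS and assumes the usual
x86-64 rule that a REX prefix (0100WRXB) is needed whenever xmm8 through
xmm15 is named.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode "orps %xmm<src>,%xmm<dst>" (AT&T operand order) for 64-bit mode.
 * Writes up to 4 bytes to out[] and returns the instruction length.
 */
size_t encode_orps(unsigned src, unsigned dst, uint8_t out[4])
{
    size_t n = 0;
    uint8_t rex;

    assert(src < 16 && dst < 16);

    /* REX.R extends ModRM.reg (dst), REX.B extends ModRM.rm (src) */
    rex = (uint8_t)(0x40 | ((dst >> 3) << 2) | (src >> 3));
    if (rex != 0x40)
        out[n++] = rex;   /* emitted only for xmm8..xmm15 */

    out[n++] = 0x0f;      /* two-byte opcode escape */
    out[n++] = 0x56;      /* ORPS */
    /* ModRM: mod=11 (register-direct), reg=dst, rm=src */
    out[n++] = (uint8_t)(0xc0 | ((dst & 7) << 3) | (src & 7));
    return n;
}
```

With src=2, dst=7 this yields the 3 bytes 0f 56 fa; with src=12, dst=13
it yields the 4 bytes 45 0f 56 ec, matching the disassembly.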

As it relates to the slowdown with SSE on an AMD64 processor:

Benchmarking: Traditional DES [64/64 BS MMX]... DONE
Many salts:     785664 c/s real, 785664 c/s virtual
Only one salt:  721472 c/s real, 721472 c/s virtual

Benchmarking: Traditional DES [128/128 BS SSE]... DONE
Many salts:     573516 c/s real, 573516 c/s virtual
Only one salt:  537164 c/s real, 537164 c/s virtual

vendor_id       : AuthenticAMD
cpu family      : 15
model           : 47
model name      : AMD Athlon(tm) 64 Processor 3200+
stepping        : 2
cpu MHz         : 2009.201
cache size      : 512 KB

(Running an x86-64 build of Owl.)
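
To put a number on the slowdown, the relative difference can be computed
from the c/s figures above.  The function name slowdown_percent() is
hypothetical and used only for this sketch.

```c
#include <assert.h>

/* Percentage by which `slow` falls short of `fast` (both in c/s). */
double slowdown_percent(double fast, double slow)
{
    assert(fast > 0.0);
    return 100.0 * (1.0 - slow / fast);
}
```

For the figures quoted, slowdown_percent(785664, 573516) comes to roughly
27% (many salts) and slowdown_percent(721472, 537164) to roughly 25.5%
(one salt).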

Honestly, I don't really care about this stuff as much as you seem to.

If I somehow allocate a substantial amount of my time to further work on
John, which is not the case currently, these architecture-specific
optimizations would not be a priority.  Rather, the high-level
architecture would need to be made even more extensible first, to
accommodate more generic parallelized implementations of cryptographic
primitives.  By that time, CPUs on the market and relative performance
of different low-level implementations could be different.

Let's end this discussion for now.  I feel that we're merely annoying
most of the john-users subscribers with it.

Alexander Peslyak <solar at>
GPG key ID: B35D3598  fp: 6429 0D7E F130 C13E C929  6447 73C3 A290 B35D 3598 - bringing security into open computing environments
