Date: Thu, 11 May 2006 09:42:48 +0400
From: Solar Designer <>
Subject: Re: Performance tuning

A couple of weeks ago, I wrote:
> There's SSE-as-available-in-64-bit-mode.  I have
> not benchmarked it yet, but I really expect no performance difference as
> long as I use the same 8 registers.

I've benchmarked it now.  Confirmed - it behaves exactly the same in
32-bit and 64-bit modes.

> For the other 8, I expect the performance to be either the same or
> even worse.

I've also benchmarked this.  There's no significant performance
difference between the first 8 and the other 8 SSE registers, but the
x86-64 code size is indeed different (the number of micro-ops to cache
might be the same).
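The code size difference comes from the REX prefix: in 64-bit mode, any
instruction touching xmm8-xmm15 needs an extra prefix byte.  A sketch in
AT&T syntax (encodings per the Intel/AMD manuals):

```asm
pxor  %xmm1, %xmm0     # 66 0F EF C1      - 4 bytes, low registers only
pxor  %xmm9, %xmm8     # 66 45 0F EF C1   - 5 bytes, REX.RB prefix added
```

So code using the upper 8 registers grows by one byte per such
instruction, even though the registers themselves are no slower.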

> As it relates to the slowdown with SSE on an AMD64 processor:
> Benchmarking: Traditional DES [64/64 BS MMX]... DONE
> Many salts:     785664 c/s real, 785664 c/s virtual
> Only one salt:  721472 c/s real, 721472 c/s virtual
> Benchmarking: Traditional DES [128/128 BS SSE]... DONE
> Many salts:     573516 c/s real, 573516 c/s virtual
> Only one salt:  537164 c/s real, 537164 c/s virtual

This mystery is now solved, sort of.  There's no such slowdown with SSE2
instructions.  While SSE and SSE2 bitwise ops have exactly the same
performance on Intel P4s (many different ones I've tried), SSE is a lot
slower than SSE2 on AMD.  My _guess_ is that this has to do with AMD
processors maintaining some floating-point state for the "single
precision floats" that form the vectors with SSE.  I don't know whether
Intel P4s don't do that (after all, it doesn't make sense to do bitwise
ops on actual floats) or whether they manage to do it with no slowdown -
or my guess might be entirely wrong, after all.
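For the bitwise ops themselves, the SSE and SSE2 forms are
interchangeable: `xorps` (single-precision) and `pxor` (integer) both
XOR all 128 bits.  A minimal C sketch using the standard intrinsic
headers - the function names here are my own, only the intrinsics are
real - showing the two forms produce identical bit patterns, so only
the encoding (and, apparently, how the core treats the data) differs:

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE:  _mm_xor_ps   compiles to xorps */
#include <emmintrin.h>  /* SSE2: _mm_xor_si128 compiles to pxor */

/* XOR two 128-bit blocks using the SSE single-precision form (xorps). */
static void xor128_sse(const void *a, const void *b, void *out)
{
	__m128 va = _mm_loadu_ps((const float *)a);
	__m128 vb = _mm_loadu_ps((const float *)b);
	_mm_storeu_ps((float *)out, _mm_xor_ps(va, vb));
}

/* XOR the same blocks using the SSE2 integer form (pxor). */
static void xor128_sse2(const void *a, const void *b, void *out)
{
	__m128i va = _mm_loadu_si128((const __m128i *)a);
	__m128i vb = _mm_loadu_si128((const __m128i *)b);
	_mm_storeu_si128((__m128i *)out, _mm_xor_si128(va, vb));
}

/* The two forms must agree bit-for-bit on any input. */
static int xor128_forms_agree(const uint8_t a[16], const uint8_t b[16])
{
	uint8_t r1[16], r2[16];

	xor128_sse(a, b, r1);
	xor128_sse2(a, b, r2);
	return memcmp(r1, r2, 16) == 0;
}
```

The results are always identical; the measurements above suggest the
AMD core nevertheless handles the two encodings differently internally.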

The SSE2 benchmark for the same system is:

Benchmarking: Traditional DES [128/128 BS SSE2]... DONE
Many salts:     951193 c/s real, 951193 c/s virtual
Only one salt:  827776 c/s real, 827776 c/s virtual

> If I somehow allocate a substantial amount of my time to further work on
> John, which is not the case currently, these architecture-specific
> optimizations would not be a priority.

Well, I managed to find some hours over the last 3 days - and I intended
to spend those on pushing the SSE code that I already had into a version
I could release.  However, I ended up doing more than that...

I also experimented with 16-register SSE2 code as generated by a Perl
script I wrote and as generated by gcc 4.1.0 (out of a specially modified
C source file).  This does look promising, but so far the performance is
a little bit worse than that of the MMX-code-derived 8-register SSE2
code that is currently in 1.7.1.  However, my Perl script does
absolutely no instruction scheduling (rather, I concentrated on optimal
register allocation, on reducing the instruction count, and on avoiding
operand combinations that are not supported).  With a proper instruction
scheduler, this should slightly outperform the current 8-register SSE2
code.  The number of instructions generated per S-box is about 10%
smaller with 16 registers.
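For readers unfamiliar with why S-boxes reduce to instruction counts at
all: bitslice DES represents each S-box as a network of bitwise gates,
where each machine word holds one bit position of many independent
encryptions, so one bitwise instruction evaluates that gate for all of
them at once.  A toy C sketch of the idea - this is a made-up 3-input
gate network, not one of John's actual S-box expressions (those run to
dozens of gates each):

```c
#include <stdint.h>

/* Toy bitslice gate network.  Each uint64_t lane holds one input bit of
 * 64 independent computations, so each bitwise op below evaluates that
 * gate 64 times in parallel.  Hypothetical gates for illustration only. */
static void toy_sbox_slice(uint64_t a, uint64_t b, uint64_t c,
	uint64_t *out0, uint64_t *out1)
{
	uint64_t t1 = a ^ b;	/* one XOR = this gate for all 64 instances */
	uint64_t t2 = t1 & c;

	*out0 = t2 ^ a;
	*out1 = t1 | ~c;
}

/* Reference check: evaluate the same gates one instance at a time. */
static int toy_sbox_matches_scalar(uint64_t a, uint64_t b, uint64_t c)
{
	uint64_t o0, o1;
	int i;

	toy_sbox_slice(a, b, c, &o0, &o1);
	for (i = 0; i < 64; i++) {
		uint64_t ai = (a >> i) & 1, bi = (b >> i) & 1, ci = (c >> i) & 1;
		uint64_t t1 = ai ^ bi, t2 = t1 & ci;

		if (((o0 >> i) & 1) != (t2 ^ ai))
			return 0;
		if (((o1 >> i) & 1) != ((t1 | (ci ^ 1)) & 1))
			return 0;
	}
	return 1;
}
```

With 16 registers, more of these intermediate values (t1, t2, ...) stay
in registers across the gate network, which is where the roughly 10%
instruction-count reduction comes from.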

As for the SSE2 code generated by gcc 4.1.0, it is surprisingly good.
gcc has improved a lot in this area.  But my special-purpose Perl
script is better. ;-)

Alexander Peslyak <solar at>
GPG key ID: B35D3598  fp: 6429 0D7E F130 C13E C929  6447 73C3 A290 B35D 3598
- bringing security into open computing environments
