announce - John the Ripper 1.7.2

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20060522015640.GA12202@openwall.com>
Date: Mon, 22 May 2006 05:56:40 +0400
From: Solar Designer <solar@...nwall.com>
To: announce@...ts.openwall.com
Subject: John the Ripper 1.7.2

Hi,

John the Ripper 1.7.2 (a "development" version) adds bitslice DES
assembly code for x86-64 making use of the 64-bit mode extended SSE2
with 16 XMM registers.  You can download it at the usual location:

	http://www.openwall.com/john/

The SSE2 code introduced in 1.7.1 was only available with 32-bit x86
builds of John.  Now John 1.7.2 is able to use SSE2 with 64-bit builds
as well, which in practice brings their performance on current 64-bit
capable AMD processors to the same level that 1.7.1 had achieved with
32-bit builds running on those same processors.  So if your operating
system is 64-bit only (no 32-bit compatibility) and thus you were unable
to make use of the SSE2 acceleration in 1.7.1, you should be able to do
so now. :-)

There's no performance improvement for current 64-bit capable Intel
processors with this version.  Those were happy running native x86-64
code, achieving the same level of performance that they do with SSE2
code.  There's, however, one advantage to using this version on Intel
processors as well: the performance should be more stable, regardless of
C compiler version and options.

Some technical detail for the curious:

Yes, the new 64-bit mode SSE2 code is also different.  The DES S-box
implementations are not derived from the 32-bit mode MMX/SSE2 ones, but
rather they have been generated from Matthew Kwan's optimized S-box
expressions anew, making use of the 16 XMM registers.  A Perl script was
written and used to compile the expressions into a virtual 3-operand
architecture, convert to 2-operand and allocate virtual registers,
allocate real registers and spill whatever didn't fit into registers to
memory locations, and do some instruction scheduling based on published
SSE2 instruction latencies for the Pentium 4 (which are larger than
those for current AMD64 processors, so should be good for both camps).
Then the script's output required some manual work to resolve the few
occurrences of unsupported operand combinations.  Finally, the
individual S-boxes were subjected to "brute-force instruction
scheduling", which tested thousands of different ways in which
instructions could be re-ordered, measuring the actual execution times
in clock cycles (yes, code was actually being run on the CPU multiple
times for each of the thousands of code versions being tested).

Now, you would expect excellent performance from the above, wouldn't
you?  Well, it's not so great in practice.  The 8-registers MMX code
from which the non-64-bit-mode SSE2 code was derived is excellent,
making it very hard to do things better - even with more registers.
Yes, the availability of 16 registers helps save some moves and reduces
the instruction count in S-boxes by 10%, but it turns out that accessing
the added 8 XMM registers may be slower than accessing memory (actually,
L1 cache) in some cases.  I think that this has to do with limitations
of the instruction decoder.  For example, blindly converting the "old"
8-registers SSE2 code to use the extra 8 registers instead of
temporaries in memory results in an 8% slowdown on an Athlon 64, but
makes no difference on a Xeon - which is consistent with the "decoder
theory".  Brute-re-scheduling that code does get back some of the lost
8%, but it does not restore the original 8-registers code performance.

So for the time being, I am satisfied with the brand new 16-registers
code being about as fast as the old 8-registers one is.  In future
versions, I might further improve the Perl script to do smarter register
allocation, taking into account the fact that consecutive instructions
which access registers XMM8-XMM15 have a higher cost on AMD64
processors.  (In my testing, the slowdown is seen with 4+ consecutive
instructions like that, but in real-world code there are more
constraints.)  I might also generate some hybrid native x86-64 + SSE2
code, which might make this issue irrelevant... or it might not. :-)

-- 
Alexander Peslyak <solar at openwall.com>
GPG key ID: B35D3598  fp: 6429 0D7E F130 C13E C929  6447 73C3 A290 B35D 3598
http://www.openwall.com - bringing security into open computing environments
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.