Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Sat, 25 Jun 2011 03:48:46 +0400
From: Solar Designer <>
Subject: Re: John the Ripper 1.7.8: DES speedup

On Thu, Jun 23, 2011 at 09:59:27PM +0100, Larry Bonner wrote:
> For anyone interested, Brian published his algorithm on the following
> page:


On Thu, Jun 23, 2011 at 8:14 AM, Simon Marechal <> wrote:
> This sound pretty interesting. Is there a writeup coming

I'm not sure.  We have no specific plans on this.

> or some description of the approach ?

I'll try to explain it briefly:

Roman's primary idea was to start with breadth-first search - that is,
find Boolean functions for each possible truth table.  Doing it for the
full 6 inputs is impractical (we'd need 2**64 memory), but for 5 inputs
it is practical.  So we allocate a 2**32 entry array, indexed by truth
table of 5-to-1 functions.  Each filled entry holds the number of logic
operations needed to reach that truth table from a certain initial state
(typically from the 5 inputs supplied as-is).

Since our actual S-boxes are 6-to-4 rather than 5-to-1, this is followed
by depth-first search to choose 8 1-bit outputs (and their corresponding
32-bit truth tables, considering function complexities) that can be
combined to produce our desired 4 output bits (with pre-defined target
64-bit truth tables), with one of the 6 input bits being the selector bit.

There's also the task of choosing which bit to use as the selector and
which two operations (OR and XOR, AND and XOR, or AND-NOT and XOR) to
use to apply it to a pair of 5-to-1 outputs (the selector bit is chosen
for all 4 outputs at once, but the two operations may be chosen
differently per output).  Luckily, that's a small enough number of
alternatives to try.

Additionally, in some cases Roman chose to provide other than 5 direct
inputs of the S-box as the initial state to measure function
complexities against.

Finally, there's the task of reconstructing the chosen 5-to-1 functions
and generating the proper expressions with intermediate results reuse
(for the 8 partial outputs).

Much of this requires a lot of manual tuning and re-tuning per S-box.

I'm sure I have missed many details, some of them for simplicity and
some because Roman and not I was the one to work on this.  (So I don't
recall or even don't know some of the details, even though I do have a
copy of the program code.)

As I had mentioned, my involvement was code generation and feedback to
Roman to make sure we're able to generate decent code.  Roman was
running my Perl scripts to estimate the number of registers and 2-op
instructions needed for different S-box expressions.  Then I also ran
them, and improved them further, to choose the most suitable ones of the
remaining S-box expressions (thousands of them).  My final revisions
also estimated available parallelism, separately for 3-op and 2-op
architectures.  You can see some selection criteria in somewhat cryptic
comments in the S-box files in JtR 1.7.8, like this:

/* s8-019630, 41 gates, 14 regs, 11 andn, 4/21/60/101/143 stalls, 62 biop */

Not surprisingly, the goal to require fewer registers sometimes
conflicted with the desire to have more parallelism.  There were other
conflicting criteria as well.  Thus, the final S-box files in 1.7.8 have
several versions of almost every S-box, with #if's.

I also suggested a way to halve the big array (to 2**31 entries), but we
never implemented that.  This would be desirable if we decided to get a
community involved in searching for even more optimal S-box expressions -
such that people with 32-bit systems could participate too.
(As currently implemented, Roman's program requires a 64-bit system with
at least 5 GB RAM, and this might not always be enough.)


Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.