john-users - Re: Breaking UNIX crypt() on the PlayStation 3

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20081012161508.GA21821@openwall.com>
Date: Sun, 12 Oct 2008 20:15:08 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Breaking UNIX crypt() on the PlayStation 3

Hi Marc,

Thank you for posting this - it's good stuff!

On Tue, Oct 07, 2008 at 12:45:26AM +0000, Marc Bevand wrote:
> The implementation I came up with averages 45.5 gates per S-box and is
> capable of testing 11.5 million password/sec on a PS3 (1.9 Mpw/sec
> per SPU core).
> 
> Slides and source code:
>   http://www.epitech.eu/v4/perso/~bevand_m/talks/breaking-unix-crypt.pdf
>   http://www.epitech.eu/v4/perso/~bevand_m/cell-bf/cell-bf-1.tar.gz

This is quite impressive.  Here are some comments on your slides:

Your implementation is specialized for a given target salt at compile
time, and it tries candidate passwords "sequentially".  Both of these
things allow for greater candidate passwords per second rate, but they
limit usability and reduce the success rate (defined as successful
guesses per unit of time when cracking multiple target hashes, which, by
the way, your code does not even support).  This is fine, but it means
that your comparison of cell-bf on Cell/PS3 vs. JtR on Q6700 is not
exactly "fair" if you use it to illustrate relative performance of these
CPUs and these programs - because the two programs differ in what they
actually do.

If we similarly modify the code in JtR to only allow for trying
candidate passwords in an algorithm-dependent order, then its
performance at "only one salt" will approach that at "many salts".
That's roughly a 10% improvement.  If we "hard-code" the salt, this
might provide another 10% improvement at traditional crypt(3).
Together, these changes bring us almost to the speed you've achieved on
a Cell/PS3.  (Obviously, "hard-coding" of the salt may be done at
runtime with self-modifying code.  Then it won't reduce usability on a
given supported platform, but it will reduce portability and/or require
separate implementations for different platforms.  I had this as a
questionable item on my to-do for years.)

On the other hand, you're not making use of the PPU yet, and you might
have under-estimated the performance improvement it will provide.

As to your price comparison, it does not hold for many parts of the
world.  I've just checked the prices for PlayStation 3 (retail, new) in
Moscow, Russia - they appear to start at $560.  The prices for Q6700
(just the CPU) start at $250 (and Q6600 starts at $200), and it should
be possible to have a reasonable cut-down computer for less than $500.

Overall, I'd say that these are similar in terms of both performance per
machine and performance per dollar.

> The source code also includes an "S-box circuit bruteforcer" I wrote
> based on Matthew Kwan's ideas (he made the S-boxes used in John the
> Ripper itself). I ran it for 2-3 hours to find good S-box circuits
> leveraging all the logical instructions available to the SPU.

Thank you for releasing this code!  I think that Matthew never released
his code (he only released the resulting S-box C files, then released
the paper describing the approach some years later - but never the
actual optimizer code), nor am I aware of anyone else doing that (some
people claimed impressive results, but never released anything).

I think that your results are almost directly reusable for "traditional"
PowerPC with AltiVec.  The "selb" instruction appears to be equivalent
to AltiVec's "vsel" and the vec_sel() C intrinsic.  The only real
difference appears to be in the number of registers (128 vs. 32), so
maybe some of the S-boxes will need extra instructions on AltiVec.
Something I am tempted to try (but don't have time right now) is to
convert your "best.c" to AltiVec C intrinsics and put it into JtR,
letting the compiler deal with register allocation - then see if it
provides a performance improvement compared to the code currently in JtR
(which is essentially Matthew Kwan's nonstd.c).  Have you looked into
this yet?

Then, it'd be curious/useful to consider reusing/enhancing your
optimizer for other target architectures.  The major obstacle here
appears to be your reliance on a mux instruction, which is usually
missing.  Can you suggest whether and how your current approach and code
can reasonably be modified to avoid this dependency?  (Of course, it is
trivial to "emulate" mux with several other instructions, but the
resulting instruction counts would likely be unacceptable.  For example,
in your best "circuit" for S4, as many as 15 out of 34 "gates" are mux.)

The "next level" could be making the "circuit" optimizer aware of more
subtle properties of the target instruction set architecture (e.g.,
extra "moves" needed on 2-operand architectures, register count, invalid
operand combinations) and specific CPU (instruction latencies and
multiple-issue rules) - optimize the cycle count, not the "gate count".
On "non-trivial" architectures/CPUs, it is almost certain that the most
optimal "circuit", resulting in the best cycle count, is not the
smallest one.

Thanks again,

Alexander

-- 
To unsubscribe, e-mail john-users-unsubscribe@...ts.openwall.com and reply
to the automated confirmation request that will be sent to you.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.