Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Mon, 13 Oct 2008 02:18:38 +0000 (UTC)
From:  Marc Bevand <>
Subject:  Re: Breaking UNIX crypt() on the PlayStation 3

Solar Designer <solar@...> writes:
> Your implementation is specialized for a given target salt at compile
> time, and it tries candidate passwords "sequentially".  Both of these
> things allow for greater candidate passwords per second rate, but they
> limit usability and reduce the success rate (defined as successful
> guesses per unit of time when cracking multiple target hashes, which, by
> the way, your code does not even support).  This is fine, but it means
> that your comparison of cell-bf on Cell/PS3 vs. JtR on Q6700 is not
> exactly "fair" if you use it to illustrate relative performance of these
> CPUs and these programs - because the two programs differ in what they
> actually do.

Yes I totally agree. It is virtually impossible to do fair comparisons between 
bruteforcing tools, because the feature sets are never the same.

> If we similarly modify the code in JtR to only allow for trying
> candidate passwords in an algorithm-dependent order, then its
> performance at "only one salt" will approach that at "many salts".
> That's roughly a 10% improvement.  If we "hard-code" the salt, this
> might provide another 10% improvement at traditional crypt(3).
> Together, these changes bring us almost to the speed you've achieved on
> a Cell/PS3.

...almost the same absolute speed, yes. But still not quite the same perf/$ or 
perf/W (curently 1.6x and 1.5x better than JtR). This is why I really 
emphasize these 2 metrics in my slides: ultimately, absolute speed never 
matters; what only matters is how much can be done with a given amount of 
money or energy.

As you points out in another paragraph, the price depends on where you live. 
So, yes the area where cell-bf beats JtR is only this very narrow 
need-to-crack-only-1-unix-crypt-hash- for-X-dollars-and-Y-watts-in- 
incremental-mode-with-hw-bought-in- countries-with-cheap-PS3s area.

> On the other hand, you're not making use of the PPU yet, and you might
> have under-estimated the performance improvement it will provide.

I talked about that in a private discussion with Simon Marechal. The PPU 
indeed implements the AltiVec SIMD instruction set, which includes about the 
same logical instructions as the SPU ones (except nand, or-not, xnor), and 32 
128-bit registers. An S-box would almost fit in the 32 registers and may need 
about 5-10 more gates because of the lack of nand/or-not/xnor. So I think the 
PPU could provide 70-90% the perf of one SPU.

> [...sbox-gen...]
> Thank you for releasing this code!  I think that Matthew never released
> his code (he only released the resulting S-box C files, then released
> the paper describing the approach some years later - but never the
> actual optimizer code), nor am I aware of anyone else doing that (some
> people claimed impressive results, but never released anything).

Neither am I aware of anybody having released similar code before.

> I think that your results are almost directly reusable for "traditional"
> PowerPC with AltiVec.  The "selb" instruction appears to be equivalent
> to AltiVec's "vsel" and the vec_sel() C intrinsic.  The only real
> difference appears to be in the number of registers (128 vs. 32), so
> maybe some of the S-boxes will need extra instructions on AltiVec.
> Something I am tempted to try (but don't have time right now) is to
> convert your "best.c" to AltiVec C intrinsics and put it into JtR,
> letting the compiler deal with register allocation - then see if it
> provides a performance improvement compared to the code currently in JtR
> (which is essentially Matthew Kwan's nonstd.c).  Have you looked into
> this yet?

A little bit. If someday I can motivate myself to add PPU support to cell-bf 
(I recently dived into GPGPU programming), I may look into. Converting my 
best.c to AltiVec intrinsics for JtR would be pretty easy and probably boost 
its perf by 10-15%.

> Then, it'd be curious/useful to consider reusing/enhancing your
> optimizer for other target architectures.  The major obstacle here
> appears to be your reliance on a mux instruction, which is usually
> missing.  Can you suggest whether and how your current approach and code
> can reasonably be modified to avoid this dependency?  (Of course, it is
> trivial to "emulate" mux with several other instructions, but the
> resulting instruction counts would likely be unacceptable.  For example,
> in your best "circuit" for S4, as many as 15 out of 34 "gates" are mux.)

If mux doesn't exist, there is no other option than to replace with other 
instructions. To regain the lost speed, the bruteforcing space of sbox-gen can 
simply be extended to search through more circuits. The core function of 
sbox-gen, get_crackin(), tries various unary, binary, and ternary "operators". 
By extending it to 4-ary, or 5-ary (or more) operators, it should produce even 
better circuits. According to my estimations, bruteforcing up to 5-ary 
operators should take about 50-100 days on a 2 GHz AMD64 CPU.

However, as I explained to Simon Marechal in the same discussion, I have 
noticed that dumb bruteforcing like this doesn't always produce the best 
result. By that I mean finding the smallest number of gates outputting bit 0, 
then finding the smallest number of gates outputing bit 1 (based on the 
circuit built so far for bit 0), etc doesn't necessarily end up with the best 
circuit. Sometimes a solution with more gates to output bit 0 can lead to a 
better circuit because more gates can be shared between bit 0 and bit 1.

So a simple solution is to increase the bruteforcing space by also considering 
the 2nd, 3rd, etc best solutions for each bits. But it quickly increases the 
bruteforcing time... This is not surprising. After all, what we are trying to 
achive, generating an optimal bitslice circuit, is an unsolved pb in computer 

> The "next level" could be making the "circuit" optimizer aware of more
> subtle properties of the target instruction set architecture (e.g.,
> extra "moves" needed on 2-operand architectures, register count, invalid
> operand combinations) and specific CPU (instruction latencies and
> multiple-issue rules) - optimize the cycle count, not the "gate count".
> On "non-trivial" architectures/CPUs, it is almost certain that the most
> optimal "circuit", resulting in the best cycle count, is not the
> smallest one.

Yeah these are good ideas. For the Cell, we would only need to make sbox-gen 
aware of the instruction latencies (always 2 cycles for logical ones). For 
other archs, it seems a very hard task to implement all of this. And maybe not 
worth it because a human can do these optimizations pretty easily, after you 
give him or her a good, low-gate count output of sbox-gen.

Marc Bevand

To unsubscribe, e-mail and reply
to the automated confirmation request that will be sent to you.

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.