john-users - Re: Breaking UNIX crypt() on the PlayStation 3

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20081017002539.GA16615@openwall.com>
Date: Fri, 17 Oct 2008 04:25:39 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Breaking UNIX crypt() on the PlayStation 3

On Mon, Oct 13, 2008 at 02:18:38AM +0000, Marc Bevand wrote:
> I talked about that in a private discussion with Simon Marechal. The PPU 
> indeed implements the AltiVec SIMD instruction set, which includes about the 
> same logical instructions as the SPU ones (except nand, or-not, xnor), and 32 
> 128-bit registers. An S-box would almost fit in the 32 registers and may need 
> about 5-10 more gates because of the lack of nand/or-not/xnor. So I think the 
> PPU could provide 70-90% the perf of one SPU.

Isn't the PPU capable of issuing more instructions per cycle (than the
SPUs)?  If not more AltiVec instructions, then maybe more PowerPC
instructions intermixed with AltiVec ones?  BTW, this additional
optimization - using native 64-bit instructions intermixed with vector
ones - might work on all of: SSE2, AltiVec, SPU.

> If mux doesn't exist, there is no other option than to replace with other 
> instructions. To regain the lost speed, the bruteforcing space of sbox-gen can 
> simply be extended to search through more circuits. The core function of 
> sbox-gen, get_crackin(), tries various unary, binary, and ternary "operators". 
> By extending it to 4-ary, or 5-ary (or more) operators, it should produce even 
> better circuits. According to my estimations, bruteforcing up to 5-ary 
> operators should take about 50-100 days on a 2 GHz AMD64 CPU.

This should probably be done, including for architectures that do have a
mux instruction.

> However, as I explained to Simon Marechal in the same discussion, I have 
> noticed that dumb bruteforcing like this doesn't always produce the best 
> result. By that I mean finding the smallest number of gates outputting bit 0, 
> then finding the smallest number of gates outputing bit 1 (based on the 
> circuit built so far for bit 0), etc doesn't necessarily end up with the best 
> circuit. Sometimes a solution with more gates to output bit 0 can lead to a 
> better circuit because more gates can be shared between bit 0 and bit 1.
> 
> So a simple solution is to increase the bruteforcing space by also considering 
> the 2nd, 3rd, etc best solutions for each bits. But it quickly increases the 
> bruteforcing time... This is not surprising. After all, what we are trying to 
> achive, generating an optimal bitslice circuit, is an unsolved pb in computer 
> science.

Yes.  It is also possible to start with any bit number, not necessarily
bit 1, and proceed to add gates for other bits in any order - then
choose the best solution.  This will make the search roughly 4! = 24
times slower, which is two days instead of two hours.  Of course, this
will likely not find the most optimal solution, so this approach can be
combined with what you describe above.

> > The "next level" could be making the "circuit" optimizer aware of more
> > subtle properties of the target instruction set architecture (e.g.,
> > extra "moves" needed on 2-operand architectures, register count, invalid
> > operand combinations) and specific CPU (instruction latencies and
> > multiple-issue rules) - optimize the cycle count, not the "gate count".
> > On "non-trivial" architectures/CPUs, it is almost certain that the most
> > optimal "circuit", resulting in the best cycle count, is not the
> > smallest one.
> 
> Yeah these are good ideas. For the Cell, we would only need to make sbox-gen 
> aware of the instruction latencies (always 2 cycles for logical ones).

BTW, another way to "satisfy" the latencies for the Cell would be by
using 256-bit or longer "virtual" vectors, with repeated instructions.
You have enough registers for that.

> For other archs, it seems a very hard task to implement all of this.

Yes, some of this is hard to implement, but if it is implemented for
code generation (based on S-box "circuits") anyway, then rolling it into
the same brute-forcing loop rather than having it as a separate step
might be pretty easy.  Of course, code generation becomes performance
critical then, so it'd need to be in C and not, say, in Perl, and it may
be cut down to calculating the number of instructions, maybe code size
(if instructions are not of fixed size), and expected number of cycles
(on a given CPU) - without actually emitting any code.  A difficulty
here is that some optimization approaches involve actually running the
code - e.g., I've been brute-forcing instruction scheduling for x86-64
on Athlon64 and Xeon Nocona (need to repeat that on Core 2) - and the
final cycle count is not known until that is done.

> And maybe not 
> worth it because a human can do these optimizations pretty easily, after you 
> give him or her a good, low-gate count output of sbox-gen.

Have you tried doing that for MMX or 8-register SSE2? ;-)  It's not easy
for a human to do at all.  In fact, it is not easy even for 16-register
SSE2 - and least for me it was easier to write a Perl script than to do
it all by hand.  Conversion to 2-op and register allocation should
definitely be done by a computer, at least initially (then manual
adjustments are possible).  Dealing with invalid operand combinations
may be done manually, although it is also pretty hard to do optimally
(there's simply no spare register for the extra temporaries, and having
the script leave a register unused "just in case" kills performance).

More importantly, the lowest cycle count may likely be achieved with a
"circuit" that is not the smallest.

Alexander

-- 
To unsubscribe, e-mail john-users-unsubscribe@...ts.openwall.com and reply
to the automated confirmation request that will be sent to you.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.