Openwall GNU/*/Linux - a small security-enhanced Linux distro for servers
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 13 Nov 2007 05:35:20 +0300
From: Solar Designer <>
Subject: Re: bitslice MD5

On Tue, Nov 13, 2007 at 01:44:51AM +0000, Larry Bonner wrote:
> thank you for sharing, i'm sure many, including myself will
> study/experiment, and attempt to optimize it further - test concept
> with other algorithms such as md4/sha-1

OK.  Let me emphasize that this is almost certainly less efficient than
straightforward vectorized implementations on CPUs that support 32-bit
vector elements (and a reasonable set of operations on such elements,
including 32-bit addition).

BTW, I just found this presentation:

On page 23, it has a very brief mention that bitslice MD5 "is slow" on
Cell SPUs: "128 adds require 94 instructions."  To me, 94 instructions
is not a lot, but perhaps a straightforward 4-way SIMD implementation is
more efficient.

> you say that it would be possible to omit the bit rotations,

Yes, and I've almost omitted them in the second revision of the code
that I've posted - now it's up to the compiler to omit them in the
resulting machine code (but it'd have to unroll and inline everything,
which might result in code not fitting in L1 I-cache).

> would it also be possible to pre-compute constant additions and store
> these in memory? rather than process at runtime.

You must be confused.  There are no constant additions.  Rather, there
are additions of 32-bit "scalar" constants to "vector" variables.
Clearly, you can't fully pre-compute them, but you (or the compiler) can
omit the branches and dead code.  Specifically, in add32c() we have:

		if (c & 1) {
		} else {
		c >>= 1;

Once this is fully inlined and unrolled, you'd leave one of the two code
paths for every iteration, dropping the other (not taken) code path.
You'd also drop the right shift (in fact, you'd drop "c" entirely).

For a simpler optimization (that won't achieve a lot), you can notice
that there's never a carry on the very first iteration of add32*(), so
you can implement that iteration separately (with slightly simpler
code).  Maybe the compiler already does that.

Similarly, you can split the loop in add32c() in two, exiting from the
first loop when "c" is zero (instead of when "i" is zero) and not
bothering with "c" in the second loop (for the last few remaining bits
of other operands).

I've already suggested combining all three add32*() into one function.
However, a simpler optimization to start with could be combining the
four MD5 functions with add32c().  You'd get four instances of add32c()
(can use cpp macros to keep the source file small), each of which will
run slightly faster than f() followed by add32c() does.

I'm sure there are many other possible optimizations, including
architecture-specific ones (e.g., use bitwise operations other than
those directly available from C code).

Finally, you would not need to implement the entire MD5 compression
function.  For example, the code in JtR's MD5_std.c: MD5_body() assumes
that the last 32-bit word in the input block is always zero.  When
cracking a single instance of MD5 rather than MD5-based crypt(3), even
more assumptions can be made (for reasonably short passwords), and not
all rounds of the MD5 compression function need to be computed (some can
be pre-computed and some reversed).  But these optimizations are not in
any way specific to bitslice implementations.

Good luck!

Alexander Peslyak <solar at>
GPG key ID: 5B341F15  fp: B3FB 63F4 D7A3 BCCC 6F6E  FC55 A2FC 027C 5B34 1F15 - bringing security into open computing environments

To unsubscribe, e-mail and reply
to the automated confirmation request that will be sent to you.

Powered by blists - more mailing lists

Your e-mail address:

Powered by Openwall GNU/*/Linux - Powered by OpenVZ