Openwall GNU/*/Linux - a small security-enhanced Linux distro for servers
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Thu, 3 Sep 2015 18:15:29 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: MD5 I() (was: SHA-1 H())

On Wed, Sep 02, 2015 at 06:52:25PM +0300, Solar Designer wrote:
> Attached to this message is a program I used to search for possible
> optimized expressions like this.  [...]  I was hoping
> it might find a 2 operation expression for MD5's I(), but no luck.

I've enhanced the program, and had better luck today:

$ ./search3 y n n n 2>&1 | cut -d' ' -f7- | sort -u
165
sel(ff, 55, 0f) = f5; 33 ^ f5 = c6;
sel(ff, 55, 0f) = f5; f5 ^ 33 = c6;

#define I(x, y, z)	(bitselect(0xffffffffU, (x), (z)) ^ (y))

To remind, the original was:

#define I(x, y, z)	((y) ^ ((x) | ~(z)))

I think it's the first time MD5 I() has been shown to be implementable
with only 2 operations on architectures without OR-NOT (and without
XNOR, and of course also without "ternary" logic instructions such as
Maxwell's or AVX-512's, which would turn this into 1 operation with no
effort).

Now that I think of it, the expression is actually very simple and I
should have been able to arrive at it without a program.  bitselect()
with the all-ones constant is directly usable to implement OR-NOT. :-)

> It doesn't yet test two bitselect()'s per expression, though - this is
> worth adding and trying again (many possibilities to test there).

It does now, and it also tests constants as you can see.  New version
attached.  And here's a table produced with it:

$ ./search3.sh
SEL     XNOR    ORN     ANDN    COUNT   MD5_I
yes     yes     yes     yes     190     yes
yes     yes     yes     no      190     yes
yes     yes     no      yes     190     yes
yes     yes     no      no      178     yes
yes     no      yes     yes     177     yes
yes     no      yes     no      177     yes
yes     no      no      yes     177     yes
yes     no      no      no      165     yes
no      yes     yes     yes     144     yes
no      yes     yes     no      114     yes
no      yes     no      yes     114     yes
no      yes     no      no      72      no
no      no      yes     yes     131     yes
no      no      yes     no      95      yes
no      no      no      yes     95      no
no      no      no      no      59      no

This shows the number of different truth tables possible for 3 inputs
with at most 2 operations on different architectures.  (It is assumed
that AND, OR, NOT and constants are always available, so only possible
instruction set extensions are listed in the table.)  The theoretical
maximum (actually achieved with Maxwell's or AVX-512's "ternary" logic
instructions) is 256.

The last column shows whether MD5's I() is implementable with 2
operations or not.  Similar tables can be generated for other
expressions, such as those found in other MD4/MD5, SHA-1, and SHA-2
basic functions - but for those we readily knew seemingly optimal
expressions using bitselect(), so I focused on MD5's I() for now.

Unfortunately, the program became rather long.  I'm sure it can be
greatly simplified - and perhaps it should in fact be rewritten and
tested against this version in order for us to have greater confidence
it actually finds all possible truth tables rather than only a (very
large) subset.

As to machine code improvements from this change to I(), here's GCN
assembly from our md5crypt kernel.  Before the change:

  v_not_b32     v5, v7                                      // 00003DA8: 7E0A6F07
  v_or_b32      v5, v1, v5                                  // 00003DAC: 380A0B01
  v_xor_b32     v5, v2, v5                                  // 00003DB0: 3A0A0B02
  v_add_i32     v4, vcc, v13, v4                            // 00003DB4: 4A08090D
  v_add_i32     v4, vcc, v5, v4                             // 00003DB8: 4A080905
  v_add_i32     v4, vcc, 0xbd3af235, v4                     // 00003DBC: 4A0808FF BD3AF235
  v_alignbit_b32  v4, v4, v4, 22                            // 00003DC4: D29C0004 025A0904
  v_add_i32     v4, vcc, v1, v4                             // 00003DCC: 4A080901

After the change (surprisingly, register allocation and offsets look
unchanged):

  v_bfi_b32     v5, v7, v1, -1                              // 00003DA8: D2940005 03060307
  v_xor_b32     v5, v2, v5                                  // 00003DB0: 3A0A0B02
  v_add_i32     v4, vcc, v13, v4                            // 00003DB4: 4A08090D
  v_add_i32     v4, vcc, v5, v4                             // 00003DB8: 4A080905
  v_add_i32     v4, vcc, 0xbd3af235, v4                     // 00003DBC: 4A0808FF BD3AF235
  v_alignbit_b32  v4, v4, v4, 22                            // 00003DC4: D29C0004 025A0904
  v_add_i32     v4, vcc, v1, v4                             // 00003DCC: 4A080901

That's 7 instructions instead of 8.  However, there's no code size
reduction in bytes, since v_bfi_b32 occupies 8 bytes (and it does so
even when no immediate value operand is used).

The effect of this will need to be tested across multiple OpenCL kernels
and pieces of C+intrinsics code across multiple architectures.  Besides
GPUs, also on AMD CPUs with XOP.

Alexander

Download attachment "search3.sh" of type "application/x-sh" (455 bytes)

View attachment "search3.c" of type "text/x-c" (6132 bytes)

Powered by blists - more mailing lists

Your e-mail address:

Powered by Openwall GNU/*/Linux - Powered by OpenVZ