Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Thu, 3 Sep 2015 23:06:59 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: MD5 I() (was: SHA-1 H())

On Thu, Sep 03, 2015 at 06:15:29PM +0300, Solar Designer wrote:
> I've enhanced the program, and had better luck today:
> 
> $ ./search3 y n n n 2>&1 | cut -d' ' -f7- | sort -u
> 165
> sel(ff, 55, 0f) = f5; 33 ^ f5 = c6;
> sel(ff, 55, 0f) = f5; f5 ^ 33 = c6;
> 
> #define I(x, y, z)	(bitselect(0xffffffffU, (x), (z)) ^ (y))
> 
> To remind, the original was:
> 
> #define I(x, y, z)	((y) ^ ((x) | ~(z)))
> 
> I think it's the first time MD5 I() has been shown to be implementable
> with only 2 operations on architectures without OR-NOT

atom confirmed that this wasn't in oclHashcat yet, but I guess now it
will be:

<solardiz> MD5_I(x, y, z) = y ^ (x | ~z) = y ^ bitselect(-1, x, z); Was this known? @hashcat Do you use it? 2 ops without OR-NOT http://www.openwall.com/lists/john-dev/2015/09/03/6
<@hashcat> @solardiz It's new to me. Surprisingly it does not have any effect on cracking speed on AMD GPU even if the instruction count is reduced
<@hashcat> @solardiz It can be used for RIPEMD's J() as well
<@hashcat> @solardiz OK, found the reason! oclHashcat reverses the MD5 I() instruction for single hashes and therefore it almost never reached I()
<@hashcat> @solardiz For situations where we can not reverse I() there's an increase. For example with phpass 2932k->3038k @ 290x
<@solardiz> @hashcat Thanks for confirming this. I'm getting 2233k->2310k for JtR's phpass-opencl on Tahiti 1050 MHz.

> $ ./search3.sh
> SEL     XNOR    ORN     ANDN    COUNT   MD5_I
> yes     yes     yes     yes     190     yes
> yes     yes     yes     no      190     yes
> yes     yes     no      yes     190     yes
> yes     yes     no      no      178     yes
> yes     no      yes     yes     177     yes
> yes     no      yes     no      177     yes
> yes     no      no      yes     177     yes
> yes     no      no      no      165     yes
> no      yes     yes     yes     144     yes
> no      yes     yes     no      114     yes
> no      yes     no      yes     114     yes
> no      yes     no      no      72      no
> no      no      yes     yes     131     yes
> no      no      yes     no      95      yes
> no      no      no      yes     95      no
> no      no      no      no      59      no

Note that MD5 I() is also implementable in 2 ops on archs with XNOR and
ANDN, but no ORN and no SEL.  Do these exist?  (Usually if XNOR is
available, then ORN is also available.)

#define I(x, y, z)	((~(x) & (z)) ^ ~(y))

> This shows the number of different truth tables possible for 3 inputs
> with at most 2 operations on different architectures.  (It is assumed
> that AND, OR, NOT and constants are always available, so only possible
> instruction set extensions are listed in the table.)  The theoretical
> maximum (actually achieved with Maxwell's or AVX-512's "ternary" logic
> instructions) is 256.

Of course, I meant AND, OR, XOR, NOT there.  (Forgot to list XOR.)

> As to machine code improvements from this change to I(), here's GCN
> assembly from our md5crypt kernel.  Before the change:
> 
>   v_not_b32     v5, v7                                      // 00003DA8: 7E0A6F07
>   v_or_b32      v5, v1, v5                                  // 00003DAC: 380A0B01
>   v_xor_b32     v5, v2, v5                                  // 00003DB0: 3A0A0B02
>   v_add_i32     v4, vcc, v13, v4                            // 00003DB4: 4A08090D
>   v_add_i32     v4, vcc, v5, v4                             // 00003DB8: 4A080905
>   v_add_i32     v4, vcc, 0xbd3af235, v4                     // 00003DBC: 4A0808FF BD3AF235
>   v_alignbit_b32  v4, v4, v4, 22                            // 00003DC4: D29C0004 025A0904
>   v_add_i32     v4, vcc, v1, v4                             // 00003DCC: 4A080901
> 
> After the change (surprisingly, register allocation and offsets look
> unchanged):
> 
>   v_bfi_b32     v5, v7, v1, -1                              // 00003DA8: D2940005 03060307
>   v_xor_b32     v5, v2, v5                                  // 00003DB0: 3A0A0B02
>   v_add_i32     v4, vcc, v13, v4                            // 00003DB4: 4A08090D
>   v_add_i32     v4, vcc, v5, v4                             // 00003DB8: 4A080905
>   v_add_i32     v4, vcc, 0xbd3af235, v4                     // 00003DBC: 4A0808FF BD3AF235
>   v_alignbit_b32  v4, v4, v4, 22                            // 00003DC4: D29C0004 025A0904
>   v_add_i32     v4, vcc, v1, v4                             // 00003DCC: 4A080901
> 
> That's 7 instructions instead of 8.  However, there's no code size
> reduction in bytes, since v_bfi_b32 occupies 8 bytes (and it does so
> even when no immediate value operand is used).
> 
> The effect of this will need to be tested across multiple OpenCL kernels
> and pieces of C+intrinsics code across multiple architectures.  Besides
> GPUs, also on AMD CPUs with XOP.

For md5crypt-opencl, we're getting some slowdown - but that kernel is
weird (as discussed elsewhere, it currently uses global memory in its
inner loop).  We need to optimize it some more, then re-test this change.

For phpass-opencl, there's decent speedup as mentioned above.

I didn't try any others yet.

Alexander

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.