john-users - Re: DIGEST-MD5, dominosec optimization

Message-ID: <20060208234107.GA16809@openwall.com>
Date: Thu, 9 Feb 2006 02:41:07 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: DIGEST-MD5, dominosec optimization

On the "dominosec" patch:

On Wed, Feb 08, 2006 at 07:43:47PM +0100, Michal Luczaj wrote:
> Still, 23% speed up just because of -march=pentium4 - is it natural?

It's excessive, but given that your innermost loop is just a few
assembly instructions, I am not surprised.  Of course, the loop is
likely unrolled by the compiler, but since there's no parallelism
available the CPU gets stalled on loads and possibly also on effective
address computations for each loop iteration anyway.  If you said that
this option made your code 23% slower, I would not be surprised either.

I think that compiler options like this will make less of a difference
once you optimize the source code.

> I even tried SSE intrinsics as a replacement for 4x4 bytes memcpy,
> memcmp and xor-ing, but it turned out to be... slower.

I don't think you should be optimizing the memcpy() and memcmp() now
(although you can get back to them later).  If you do, you might find
MMX faster than SSE - especially on CPUs other than the Intel P4.  More
importantly, you must make sure that your arrays are naturally aligned -
simply declaring them as arrays of "char" won't do - and note that the
stack might not be sufficiently aligned for that either.

gcc 2.95+ attempts to align the stack for SSE by default, though - and
this even has a (small) performance and size impact on code which does
not need that, so I usually disable this feature.  The OS must
cooperate, too.  The best thing you can do is simply not place variables
requiring an alignment larger than the architecture's natural word size -
which is 4 bytes for x86 - on the stack.  (On some systems - DJGPP and
older Cygwin - even .bss and .data sections are not sufficiently aligned
for MMX.)
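
To illustrate the alignment point, here is a minimal sketch, not taken
from the patch.  All names (aligned_block, word_buf, simd_buf,
is_aligned) are hypothetical, and the "aligned" attribute is a GCC/Clang
extension.  It keeps the buffers in static storage rather than on the
stack and provides a run-time check, since the toolchain or OS may not
honor the request:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte block.  The uint32_t member forces word alignment,
 * which a plain "unsigned char buf[16]" would not guarantee. */
typedef union {
    unsigned char b[16];
    uint32_t w[4];
} aligned_block;

/* Static storage, not the stack. */
static aligned_block word_buf;

/* GCC/Clang extension: request 16-byte alignment for SSE-sized accesses. */
static unsigned char simd_buf[16] __attribute__((aligned(16)));

/* Returns 1 if p is a multiple of n; n must be a power of two. */
static int is_aligned(const void *p, size_t n)
{
    return ((uintptr_t)p & (n - 1)) == 0;
}
```

A self-test could call is_aligned() once at startup and fall back to the
byte-wise code path if the check fails.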

As for the XOR'ing, you definitely want to apply it to
quantities larger than "char" whenever possible.  (But this does not
appear to be possible in your inner loop.)
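
As a rough sketch of what XOR'ing larger quantities means for a 4x4
byte block (the names xor_block, xor_bytes and xor_words are made up for
this example and are not from the patch), compare the two functions
below.  The union keeps the block word-aligned, so the second version
needs 4 XORs instead of 16:

```c
#include <stdint.h>

/* Hypothetical 16-byte block, viewable as bytes or as 32-bit words. */
typedef union {
    unsigned char b[16];
    uint32_t w[4];
} xor_block;

/* Byte-at-a-time XOR: 16 loads, 16 XORs, 16 stores. */
static void xor_bytes(xor_block *dst, const xor_block *src)
{
    int i;
    for (i = 0; i < 16; i++)
        dst->b[i] ^= src->b[i];
}

/* Word-at-a-time XOR: same result with 4 operations of each kind. */
static void xor_words(xor_block *dst, const xor_block *src)
{
    dst->w[0] ^= src->w[0];
    dst->w[1] ^= src->w[1];
    dst->w[2] ^= src->w[2];
    dst->w[3] ^= src->w[3];
}
```

Since XOR works on each bit independently, the result does not depend on
byte order, so both functions produce identical blocks.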

> Now I'll try some SSE to parallel multiple passwords.

I don't think you can do that.  There are no vectorized loads.

In this case, you need to be trying multiple passwords in parallel not
to use SIMD operations (you can't), but rather to relax data
dependencies and thus to let the CPU issue more instructions
simultaneously and avoid stalls after instructions which have large
latencies (such as loads).
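
A minimal sketch of this kind of interleaving follows.  The table and
the mixing step are hypothetical stand-ins (this is not the real Domino
table or algorithm), as are the names sbox, init_sbox, mix_one and
mix_two.  In mix_one() every table lookup depends on the previous one,
so the CPU has to wait for each load.  In mix_two() the two chains are
independent of each other, so a load for one candidate can be in flight
while the other is being processed:

```c
#include <stddef.h>

/* Hypothetical 256-entry lookup table, filled with arbitrary values. */
static unsigned char sbox[256];

static void init_sbox(void)
{
    int i;
    for (i = 0; i < 256; i++)
        sbox[i] = (unsigned char)(i * 167 + 13);
}

/* One candidate: each lookup depends on the result of the previous one. */
static unsigned char mix_one(const unsigned char *in, size_t len)
{
    unsigned char x = 0;
    size_t i;
    for (i = 0; i < len; i++)
        x = sbox[(unsigned char)(x ^ in[i])];
    return x;
}

/* Two candidates interleaved: the x0 and x1 chains do not depend on
 * each other, so their loads may overlap. */
static void mix_two(const unsigned char *in0, const unsigned char *in1,
                    size_t len, unsigned char *out0, unsigned char *out1)
{
    unsigned char x0 = 0, x1 = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        x0 = sbox[(unsigned char)(x0 ^ in0[i])];
        x1 = sbox[(unsigned char)(x1 ^ in1[i])];
    }
    *out0 = x0;
    *out1 = x1;
}
```

Whether two, three or more interleaved candidates work best depends on
the CPU and on register pressure, so this would have to be benchmarked.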

I wrote:
> > ... need to pass weird options to recent
> > versions of gcc to get decent performance of DES-based hashes on Alpha
> 
> Aha, I see. Well, good to know. From what I see in 1.7's Makefile you
> are not embedding inline-limit nor inline-unit-growth. Do I understand
> correctly that this makes Alpha binaries build from default Makefile
> little bit crippled/slow?

Yes - for those built with recent gcc.  There's no problem if you use
gcc 2.95.3, for example.

-- 
Alexander Peslyak <solar at openwall.com>
GPG key ID: B35D3598  fp: 6429 0D7E F130 C13E C929  6447 73C3 A290 B35D 3598
http://www.openwall.com - bringing security into open computing environments
