Date: Thu, 9 Feb 2006 02:41:07 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: DIGEST-MD5, dominosec optimization

On the "dominosec" patch:

On Wed, Feb 08, 2006 at 07:43:47PM +0100, Michal Luczaj wrote:
> Still, 23% speed up just because of -march=pentium4 - is it natural?

It's excessive, but given that your innermost loop is just a few
assembly instructions, I am not surprised. Of course, the loop is
likely unrolled by the compiler, but since there's no parallelism
available the CPU gets stalled on loads and possibly also on effective
address computations for each loop iteration anyway.

If you said that this option made your code 23% slower, I would not be
surprised either.

I think that compiler options like this will make less of a difference
once you optimize the source code.

> I even tried SSE intrinsics as a replacement for 4x4 bytes memcpy,
> memcmp and xor-ing, but it turned out to be... slower.

I don't think you should be optimizing the memcpy() and memcmp() now
(although you can get back to them later). If you do, you might find
MMX faster than SSE - especially on other than Intel P4. More
importantly, you must make sure that your arrays are naturally aligned -
so simply declaring them as arrays of "char" won't do - and also the
stack might not be sufficiently aligned for that. gcc 2.95+ attempts
to align the stack for SSE by default, though - and this even has a
(small) performance and size impact for code which does not need that,
so I am usually disabling this feature. The OS must cooperate, too.
The best thing you can do is simply not place variables requiring an
alignment larger than the architecture's natural word size - which is
4 bytes for x86 - on the stack. (On some systems - DJGPP and older
Cygwin - even .bss and .data sections are not sufficiently aligned for
MMX.)

As it relates to the XOR'ing, you definitely want to apply it to
quantities larger than "char" whenever possible.
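
A minimal sketch of the point about XOR'ing wider quantities, not taken from the dominosec patch: the names block16, xor_bytes and xor_words are hypothetical. It assumes a 16-byte buffer and uses a union so that the buffer gets the alignment of a 32-bit word, which a plain array of "char" would not guarantee. The same XOR can then be done in 4 word operations instead of 16 byte operations:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte block. The union makes the compiler align the
 * buffer for uint32_t, so word-sized accesses to it are safe. */
typedef union {
    unsigned char b[16];
    uint32_t w[4];
} block16;

/* Byte-at-a-time XOR: 16 loads, 16 XORs, 16 stores per block. */
static void xor_bytes(block16 *dst, const block16 *src)
{
    int i;

    for (i = 0; i < 16; i++)
        dst->b[i] ^= src->b[i];
}

/* Word-at-a-time XOR: same resulting bytes, 4 operations of each kind. */
static void xor_words(block16 *dst, const block16 *src)
{
    int i;

    for (i = 0; i < 4; i++)
        dst->w[i] ^= src->w[i];
}
```

Both functions leave identical bytes in dst regardless of byte order, since XOR acts on each bit independently; only the number of memory accesses differs.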
(But this does not appear to be possible in your inner loop.)

> Now I'll try some SSE to parallel multiple passwords.

I don't think you can do that. There are no vectorized loads.

In this case, you need to be trying multiple passwords in parallel not
to use SIMD operations (you can't), but rather to relax data
dependencies and thus to let the CPU issue more instructions
simultaneously and avoid stalls after instructions which have large
latencies (such as loads).

I wrote:
> > ... need to pass weird options to recent
> > versions of gcc to get decent performance of DES-based hashes on Alpha
>
> Aha, I see. Well, good to know. From what I see in 1.7's Makefile you
> are not embedding inline-limit nor inline-unit-growth. Do I understand
> correctly that this makes Alpha binaries build from default Makefile
> little bit crippled/slow?

Yes, -- for those built with recent gcc. There's no problem if you use
gcc 2.95.3, for example.

-- 
Alexander Peslyak <solar at openwall.com>
GPG key ID: B35D3598  fp: 6429 0D7E F130 C13E C929  6447 73C3 A290 B35D 3598
http://www.openwall.com - bringing security into open computing environments

Was I helpful?  Please give your feedback here: http://rate.affero.net/solar