Date: Fri, 17 Feb 2006 23:14:01 +0100
From: Michal Luczaj <regenrecht@...pl>
To: john-users@...ts.openwall.com
Subject: Re: DIGEST-MD5, dominosec optimization
Solar Designer wrote:
> I don't think you should be optimizing the memcpy() and memcmp() now
> (although you can get back to them later). If you do, you might
> find MMX faster than SSE - especially on other than Intel P4.
You are probably right. I did that because, as a simple-minded human
being, I thought: "for a whole 16-byte memcpy - one load, one store - that
*must* be faster than anything". Hell yeah, it wasn't :]
> More importantly, you must make sure that your arrays are naturally
> aligned - so simply declaring them as arrays of "char" won't do - and
> also the stack might not be sufficiently aligned for that.
Sure. I was using MOVAPS, not MOVUPS, and I did take care of that.
> gcc 2.95+ attempts to align the stack for SSE by default, though -
> and this even has a (small) performance and size impact for code
> which does not need that, so I am usually disabling this feature. The
> OS must cooperate, too. The best thing you can do is simply not
> place variables requiring an alignment larger than the architecture's
> natural word size - which is 4 bytes for x86 - on the stack.
I had no problems with stack alignment. And I'm pretty sure it was
16-byte aligned, otherwise I would have gotten a segfault. From what I
recall I only changed ARCH_ALLOWS_UNALIGNED to 0 and set
-mpreferred-stack-boundary=4. And maybe some
__attribute__ ((aligned (16)))...? I just don't remember now.
Let me explain: I had no SSE specific problems (segfaults and all that
memory related stuff). Everything worked fine. I just wonder why it
turned out to be sooo slooow.
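
For reference, a minimal sketch of the kind of alignment setup described
above, assuming GCC's __attribute__ ((aligned (16))) syntax. The buffer
names and the is_aligned16() helper are hypothetical and are not taken
from the actual patch. MOVAPS faults on an address that is not a multiple
of 16, so a check like this can serve as a sanity test while debugging.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffers; the attribute asks the compiler for 16-byte alignment. */
static unsigned char state_buf[16] __attribute__ ((aligned (16)));
static unsigned char block_buf[16] __attribute__ ((aligned (16)));

/* Returns 1 if p is a multiple of 16 bytes, 0 otherwise. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Static storage is used here on purpose: it sidesteps the question of
whether the stack itself is sufficiently aligned.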
> As it relates to the XOR'ing, you definitely want to apply it to
> quantities larger than "char" whenever possible. (But this does not
> appear to be possible in your inner loop.)
There is something like this:
    for (i = 0; i < 16; ++i)
        x[i] = state[i] ^ block[i];
And I switched it to:
    _mm_store_ps(
        (float*)x,
        _mm_xor_ps(
            _mm_load_ps((float*)state),
            _mm_load_ps((float*)block)
        )
    );
Maybe this piece of code is just not important enough... But still, even
if it doesn't give a boost, why-oh-why does it slow the whole thing down?
And the answer might be: because I'm a lousy programmer ;)
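
A self-contained sketch of the three XOR variants discussed in this
thread, for side-by-side comparison: the byte loop, a 32-bit word version
(XOR on quantities larger than "char"), and a single 128-bit operation.
The function names are hypothetical. The wide version uses the SSE2
integer intrinsics (_mm_load_si128, _mm_xor_si128, _mm_store_si128)
instead of the float-typed ones above, which is an assumption made here
to avoid the float* casts, not a claim that it runs faster. It falls back
to the byte loop when SSE2 is not available.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Byte-at-a-time XOR, equivalent to the original loop. */
static void xor16_bytes(unsigned char *x, const unsigned char *state,
                        const unsigned char *block)
{
    int i;
    for (i = 0; i < 16; ++i)
        x[i] = state[i] ^ block[i];
}

/* Four 32-bit XORs; memcpy avoids alignment and aliasing assumptions. */
static void xor16_words(unsigned char *x, const unsigned char *state,
                        const unsigned char *block)
{
    uint32_t s[4], b[4];
    int i;
    memcpy(s, state, 16);
    memcpy(b, block, 16);
    for (i = 0; i < 4; ++i)
        s[i] ^= b[i];
    memcpy(x, s, 16);
}

/* One 128-bit XOR; with SSE2, all three pointers must be 16-byte aligned. */
static void xor16_wide(unsigned char *x, const unsigned char *state,
                       const unsigned char *block)
{
#if defined(__SSE2__)
    __m128i s = _mm_load_si128((const __m128i *)state);
    __m128i b = _mm_load_si128((const __m128i *)block);
    _mm_store_si128((__m128i *)x, _mm_xor_si128(s, b));
#else
    xor16_bytes(x, state, block);
#endif
}
```

All three produce the same 16 output bytes, so they can be swapped in and
out of a benchmark to see which one the surrounding code actually prefers.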
Cheers!
Michal