Date: Wed, 16 Nov 2011 15:14:52 -0600
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: RE: hmacMD5 and sse-intrisics.c  (Bartavelle, please look at this).

>From: magnum [mailto:john.magnum@...hmail.com]
>
>Great! I will re-build the .S files and post a patch.

We may want to look at changing the logic within SSEmd5body() even more.

Right now we have this:  

void SSEmd5body(__m128i* data, unsigned int * out, int init)

If init is nonzero, the 'base' MD5 vector is loaded. If init==0, the state is reloaded from out, which is an array of 16-byte interleaved outputs. All writing is also done to out, in that same 16-byte interleaved output format.

I think we should change that logic to be:

If init==1 or init==0, keep the same logic. This keeps all existing code working exactly as it does today.

If init==2, then init the base vector from 'defaults', but write the data to out in input format (64-byte interleaved buffers).
If init==3, then init the base vector from the out pointer, where out is in input format. In this mode we also write to out in input format (64-byte interleaved buffers).
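
A rough sketch of how the four init values could be decoded, just to make the proposal concrete. The enum names and helper functions are hypothetical (nothing like them exists in the current source), and MMX_COEF is assumed to be 4:

```c
#include <assert.h>
#include <stddef.h>

#define MMX_COEF 4   /* assumed: four 32-bit lanes per SSE2 register */

/* Hypothetical names for the four init values described above. */
enum {
    INIT_RELOAD_OUTFMT  = 0,  /* state from out, out in output format   */
    INIT_DEFAULT_OUTFMT = 1,  /* state from defaults, output format     */
    INIT_DEFAULT_INFMT  = 2,  /* state from defaults, out in input fmt  */
    INIT_RELOAD_INFMT   = 3   /* state from out, out in input format    */
};

/* Nonzero when the MD5 base vector comes from the default constants. */
static int loads_defaults(int init)
{
    assert(init >= 0 && init <= 3);
    return init == INIT_DEFAULT_OUTFMT || init == INIT_DEFAULT_INFMT;
}

/* Nonzero when out is read and written in input (64-byte block) format. */
static int uses_input_format(int init)
{
    assert(init >= 0 && init <= 3);
    return (init & 2) != 0;
}

/* Distance, in 32-bit words, between the state of one SSE_PARA block
 * and the state of the next one inside out. */
static size_t state_stride_words(int init)
{
    return uses_input_format(init) ? 16 * MMX_COEF : 4 * MMX_COEF;
}
```

With this encoding, 0 and 1 behave as they do now, and only the new values 2 and 3 change where the state lands in out.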

Currently, for hmacMD5, I have this code:

int i; 
SSEmd5body(ipad, ((unsigned int *)dump), 1);
SSEmd5body(cursalt, ((unsigned int *)dump), 0);
for (i = 0; i < MD5_SSE_PARA; ++i)
	memcpy(&crypt_key[64*4*i], &dump[64*i], 64);
SSEmd5body(opad, ((unsigned int *)dump), 1);
SSEmd5body(crypt_key, ((unsigned int *)dump), 0);
for (i = 0; i < MD5_SSE_PARA; ++i)
	memcpy(&crypt_key[64*4*i], &dump[64*i], 64);

Note that there are 2 buffer copies of overhead here. These could be avoided by using the values 2 and 3 for init, with the logic I list above:

SSEmd5body(ipad, ((unsigned int *)crypt_key), 2);
SSEmd5body(cursalt, ((unsigned int *)crypt_key), 3);
SSEmd5body(opad, ((unsigned int *)dump), 1);
SSEmd5body(crypt_key, ((unsigned int *)crypt_key), 3);

As long as this can be done without adding any noticeable slowdown from the additional if() tests, I think it would benefit the code. The reason for this type of logic is that in SSE_PARA mode, the start of an input buffer does not line up with the start of an output buffer. They do line up for the .S MMX_COEF code (or if SSE_PARA were only 1). So hash(hash(val).something) type logic requires additional memory movement in SSE_PARA mode. I did some of this for MSCASH2, and it made a pretty decent improvement in throughput.
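
To illustrate the alignment problem, here is a small sketch of the byte offsets involved. It assumes MMX_COEF is 4 and picks an arbitrary MD5_SSE_PARA of 3; the helper names are made up for this example. The offsets mirror the 64*4*i and 64*i indices in the memcpy loops above:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MMX_COEF      4   /* assumed: four 32-bit lanes per register */
#define MD5_SSE_PARA  3   /* assumed value, for illustration only    */

#define IN_BLOCK_BYTES   (64 * MMX_COEF)  /* one interleaved input block  */
#define OUT_BLOCK_BYTES  (16 * MMX_COEF)  /* one interleaved digest block */

/* Byte offset of SSE_PARA block 'para' in an input-format buffer. */
static size_t in_block_offset(int para)
{
    assert(para >= 0 && para < MD5_SSE_PARA);
    return (size_t)para * IN_BLOCK_BYTES;
}

/* Byte offset of SSE_PARA block 'para' in an output-format buffer. */
static size_t out_block_offset(int para)
{
    assert(para >= 0 && para < MD5_SSE_PARA);
    return (size_t)para * OUT_BLOCK_BYTES;
}

/* The extra memory movement: place each block's interleaved digests
 * at the start of the matching input block. */
static void digests_to_input_layout(unsigned char *in_buf,
                                    const unsigned char *out_buf)
{
    for (int i = 0; i < MD5_SSE_PARA; ++i)
        memcpy(in_buf + in_block_offset(i),
               out_buf + out_block_offset(i),
               OUT_BLOCK_BYTES);
}
```

Only block 0 starts at the same offset in both layouts, which is why the copy is not needed when SSE_PARA is 1.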

However, for now, let's get sse-intrinsics-32.S and sse-intrinsics-64.S updated to match the current .c file.

Jim.

