john-dev - Re: John core change patch (and md5-gen, etc)

Message-ID: <07FC1EE7269740FDA0A1CA6D5B489DAC@D9VGLK61>
Date: Mon, 9 May 2011 07:28:10 -0500
From: "JimF" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: Re: John core change patch (and md5-gen, etc)

----- Original Message ----- 
From: "bartavelle" Sent: Monday, May 09, 2011 5:07 AM
> How do you want to work from here ? Is that enough for you or do you
> want me to patch some more ? I believe a linux-x86-64-icc target would
> be handy.

I started in on this last night also, and did almost exactly the same things 
you did (commenting out blocks in md5_gen which simply are not ready, etc.) 
to get things running, except for the new change you made to add 'init' or 
non-init. I think your init/non-init will be all that is required to get SSE 
and PARA SSE working for 2 block data.

The ugliness in 2 block data comes when you are 'close', and have longer 
passwords that push you over the limit. If a couple of buffers are under 
56 bytes and a couple are at or just over, it gets ugly.  In that case, SSE 
is pretty much out for that block unless you can pull in data from other 
blocks, which becomes very tricky, and you end up losing ALL speed benefits. 
It 'may' be possible to run these mixed blocks, harvesting the results for 
the COEF values that are shorter than 56 bytes, then running the 2nd loop of 
SSE to get the results of the other COEF values.  That still takes 2 loops 
if any are over 55 bytes, but in the end all residues are correct, which is 
what matters.

The easiest approach is to use SSE only where all items in the block are 
55 bytes or less, or all items in the block are 56 to 119 bytes, and to 
process all other cases (mixed sizes) using MD5_go2 or openssl.  I will have 
to see how much of a change it is.  There are only a few places within 
md5_gen where the actual crypt functions are called, and they are all 
exactly the same (except they take different input and output values). 
Once it is right in one of them, it becomes almost a cut and paste to get 
them all working properly.
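
The "easiest" dispatch rule described above can be sketched in C roughly as 
follows.  This is only an illustration: the enum, the function name and the 
constants are hypothetical and are not taken from md5_gen.  The 55 and 119 
byte limits follow from MD5 padding, which needs 1 marker byte plus 8 length 
bytes at the end of the last 64 byte block (64 - 9 = 55, 128 - 9 = 119).

```c
#include <stddef.h>

/* Hypothetical names for illustration only; not md5_gen identifiers. */
#define ONE_BLOCK_MAX  55   /* longest input that fits one MD5 block  */
#define TWO_BLOCK_MAX 119   /* longest input that fits two MD5 blocks */

enum block_path {
    PATH_SSE_ONE_BLOCK,   /* every input is 55 bytes or less      */
    PATH_SSE_TWO_BLOCKS,  /* every input is 56 to 119 bytes       */
    PATH_GENERIC          /* mixed sizes or too long: scalar code */
};

/* Decide how one SIMD group of `count` inputs should be processed,
 * given the byte length of each input in the group. */
enum block_path choose_block_path(const size_t *lens, size_t count)
{
    size_t short_count = 0;
    size_t long_count = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (lens[i] <= ONE_BLOCK_MAX)
            short_count++;
        else if (lens[i] <= TWO_BLOCK_MAX)
            long_count++;
        else
            return PATH_GENERIC;   /* needs three or more blocks */
    }

    if (long_count == 0)
        return PATH_SSE_ONE_BLOCK;
    if (short_count == 0)
        return PATH_SSE_TWO_BLOCKS;
    return PATH_GENERIC;           /* mixed one and two block inputs */
}
```

A caller would check the result once per group and only hand the group to 
the SSE code for the two uniform cases.  The two-pass "harvest" idea for 
mixed groups is not shown here.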

The reason there are so many failures right now for PARA builds in md5_gen 
is that I made changes to the .S sse/mmx blocks, so that processing would 
fall through to the md5_go code if the data was too long.  Prior versions 
simply bailed out for these formats totally if built for MMX instructions. 
The new version will do what work it can using mmx, and then fall back to 
generic code if the data is too long.  However, that code was not put into 
the PARA blocks, and neither were some md5_gen processing instructions that 
allow the format writer to tell the runtime to switch data back and forth 
between Any and SSE.  Thus the PARA code does not know how to do this, and 
still has the old behavior of aborting for formats that go over 55 bytes 
long.

As for the speed difference between PARA and non PARA on 64 bit gcc, PARA is 
much less beneficial now (at least in md5_gen), since I switched over to 
using the code from md5_std.c and use the MD5_X2 logic also.  64 bit gcc 
'generic' is running as fast as 32 bit SSE2 builds (or close).  Yes, PARA=2 
for gcc does get a little bump over that, but not a significant amount.  But 
I will continue forward getting the port right, and if PARA is the right 
choice for all x86 64 bit compilers, then it will be the choice provided.

As for x86-64.h, for the *_PARA_SSE selection, we should have that in 
#ifdef's.   For gcc it is 2, but for icc it is 3 IIRC.  What was clang set at?
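
A rough sketch of what such a per-compiler selection could look like is 
below.  EXAMPLE_PARA_SSE is a made-up stand-in for the real *_PARA_SSE 
macros.  The gcc and icc values are the ones recalled above, while the clang 
value and the final fallback are placeholders only, since the right clang 
setting is still an open question.  The order of the checks matters because 
icc and clang normally also define __GNUC__.

```c
/* EXAMPLE_PARA_SSE is a hypothetical name, not a real x86-64.h macro. */
#if defined(__INTEL_COMPILER)
#define EXAMPLE_PARA_SSE 3   /* icc: 3, as recalled in this message (IIRC) */
#elif defined(__clang__)
#define EXAMPLE_PARA_SSE 2   /* clang: PLACEHOLDER, real value not known */
#elif defined(__GNUC__)
#define EXAMPLE_PARA_SSE 2   /* gcc: 2, as stated in this message */
#else
#define EXAMPLE_PARA_SSE 1   /* assumption: no interleaving elsewhere */
#endif
```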

Jim.