Date: Wed, 19 May 2010 02:38:45 +0400
From: Solar Designer <>
Subject: Re: C compiler generated SSE2 code

On Tue, May 18, 2010 at 01:49:46PM +0200, wrote:
> On 18/05/2010 01:32, Solar Designer wrote:
> >Can you upload it to the wiki, please?
> >
> >
> Done, but I made a quick git patch.

Thank you!  This works for now.  When you update this code/patch, please
specify the copyright status and license for sse-intrinsics.c (new
source file added with the patch) and for your changes to MD5_fmt.c,
preferably like I have suggested here:

> This does speak for itself :) The icc does disentangle the whole stuff, 
> but is still faster with 3 loops (only 2 in the sample).

I think you need to disentangle the source code rather than leave that
for the compiler.  Specifically, I'd remove the "unneeded" MD5_PARA_DO
loops.  Instead, I'd define macros around primitives such as xor, which
would perform the required number of instances of the operation.  They
would use constants for the array indices - or, if that does not work
well enough, even use individual local variables instead of array
elements.  This is more similar to what I have in MD5_std.c, where I use
separate local variables for the two instances of MD5:

	MD5_word a0, b0 = Cb, c0 = Cc, d0;
	MD5_word a1, b1, c1, d1;
	MD5_word u, v;

I understand that you like to be able to easily adjust the number of
instances that you mix, but you'll have to achieve that by defining your
xor, etc. macros differently for common instance counts (say, 2 vs. 3).
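
A minimal sketch of the macro approach described above, not code from MD5_fmt.c or MD5_std.c. Every name in it (PARA, DECL, LOAD, STORE, XOR, AND, md5_f_para) is hypothetical. Each primitive is defined once per instance count and expands to straight-line operations on individual local variables, so no per-instance loop is left for the compiler to unroll:

```c
#include <stdint.h>

/* All names below are hypothetical; none are taken from JtR sources. */
#ifndef PARA
#define PARA 2			/* number of interleaved instances: 2 or 3 */
#endif

#if PARA == 2
#define DECL(v)		uint32_t v##0, v##1
#define LOAD(v, p)	do { v##0 = (p)[0]; v##1 = (p)[1]; } while (0)
#define STORE(p, v)	do { (p)[0] = v##0; (p)[1] = v##1; } while (0)
#define XOR(d, a, b)	do { d##0 = a##0 ^ b##0; d##1 = a##1 ^ b##1; } while (0)
#define AND(d, a, b)	do { d##0 = a##0 & b##0; d##1 = a##1 & b##1; } while (0)
#elif PARA == 3
#define DECL(v)		uint32_t v##0, v##1, v##2
#define LOAD(v, p)	do { v##0 = (p)[0]; v##1 = (p)[1]; v##2 = (p)[2]; } while (0)
#define STORE(p, v)	do { (p)[0] = v##0; (p)[1] = v##1; (p)[2] = v##2; } while (0)
#define XOR(d, a, b)	do { d##0 = a##0 ^ b##0; d##1 = a##1 ^ b##1; \
			     d##2 = a##2 ^ b##2; } while (0)
#define AND(d, a, b)	do { d##0 = a##0 & b##0; d##1 = a##1 & b##1; \
			     d##2 = a##2 & b##2; } while (0)
#else
#error "define the primitives for this instance count"
#endif

/*
 * MD5's F(x, y, z) in the form z ^ (x & (y ^ z)), computed for PARA
 * independent instances.  Each pointer refers to PARA words.
 */
void md5_f_para(const uint32_t *px, const uint32_t *py,
    const uint32_t *pz, uint32_t *out)
{
	DECL(x); DECL(y); DECL(z); DECL(t);

	LOAD(x, px); LOAD(y, py); LOAD(z, pz);
	XOR(t, y, z);		/* t = y ^ z */
	AND(t, t, x);		/* t = x & (y ^ z) */
	XOR(t, t, z);		/* t = z ^ (x & (y ^ z)) */
	STORE(out, t);
}
```

Changing the instance count then means building with a different PARA (for example -DPARA=3) rather than editing the body of the function. With SSE2, the same macros would presumably wrap intrinsics instead of scalar operators.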

> Do you mind giving bench of your SSE code with ICC ?

Sorry, I have no time to dive into this now.  I have too many other
tasks in my queue.

> Or just share it so that I could try it :)

I've attached a dirty patch.  IIRC, this code is in a state suitable for
the Sun Studio compiler.  You'll likely need to change the initializer
for "ones" (a trivial change) to get this to compile with gcc again.
DES_BS_VECTOR34 enables 192-bit vectors with 256-bit alignment (there
are two kinds of them - SSE2+MMX or SSE2+native).  With other settings,
you can do pure 128-bit SSE2 vectors and two kinds of 256-bit vectors
(dual SSE2 or SSE2+MMX+native).

In my experiments, I was using new DES S-box expressions (these are in
the works), but I reverted to "plain" nonstd.c for generating this
patch.  You may choose to have the code use sboxes.c instead (change
DES_BS from 1 to 2 in x86-64.h), which might better match the target
instruction set (mostly if you use x86-64 native instructions).

A known problem is that the code violates C strict aliasing rules with
its use of typecasts.  Yet this did not cause anything worse than
compiler warnings in my testing.  A fix for this may be to use unions.
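
A minimal sketch of the union approach, not code from the attached patch. The names vtype, vec_u, vxor and vec_set_bit are hypothetical, and vtype is a portable stand-in for a real vector type such as SSE2's __m128i. The point is that the same object is declared once as a union, then accessed through its vector member for wide operations and through its word member for scalar access, with no pointer typecast between the two:

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit vector type such as __m128i. */
typedef struct {
	uint64_t q[2];
} vtype;

/* One object, two views: whole-vector access via .v, words via .w. */
typedef union {
	vtype v;
	uint32_t w[4];
} vec_u;

/* Whole-vector XOR; real SSE2 code would use an intrinsic here. */
vtype vxor(vtype a, vtype b)
{
	vtype r;

	r.q[0] = a.q[0] ^ b.q[0];
	r.q[1] = a.q[1] ^ b.q[1];
	return r;
}

/*
 * Set one bit through the word view.  No vtype * is cast to uint32_t *,
 * so there is no access through an incompatible pointer type.
 */
void vec_set_bit(vec_u *x, unsigned int bit)
{
	x->w[(bit >> 5) & 3] |= (uint32_t)1 << (bit & 31);
}
```

Reading a union member other than the one last written is permitted in C99 and later (the bytes are reinterpreted), and gcc documents this as supported even with -fstrict-aliasing, provided the access goes through the union type.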

I think I should roll similar changes into the official JtR even if
they'd be no-ops in the default build - to make it easier to conduct
experiments like this.


View attachment "john-1.7.5-des-intrinsics-1.diff" of type "text/plain" (29409 bytes)
