Date: Thu, 10 Nov 2011 22:32:10 +0100
From: magnum <>
Subject: Re: more targets using sse-intrinsics.S

Looks good. While the intrinsics file is a little slower than md5-mmx.S
for SHA-1, it has the great advantage of being thread safe.

I suppose gcc/icc et al will accept the modified .S format (they have no
problems with the old x86*.S files), so maybe we can unify this into one
sse-intrinsics-32.S again, which is built and gets your mods once and for
all when doing "make intrinsics". I'll experiment with this. But this is
pretty cool already. Did you try any OMP builds on Windows?


2011-11-10 16:48, jfoug wrote:
> Windows is not a straightforward build.  I have gotten it built, and it passes -test=0.  I am building a 'pre .S' sse-intrinsics.c build, with cygwin (x86-32), and will compare speeds.
> Here are the changes needed for cygwin building (I am thinking about a perl script to run during the make, for the win32-cygwin-x86-sse2i build):
> 1. all .type lines need to be commented out.
> 2. all .size lines need to be commented out.
> 3. a couple of .section lines at the end of the file need to be commented out.
> 4. a #ifdef UNDERSCORES section needs to be added at the top of the file:
> 	.file "sse-intrinsics.c"
> #ifdef UNDERSCORES
> #define memcpy        _memcpy
> #define memset        _memset
> #define strlen        _strlen
> #define MD5_Init      _MD5_Init
> #define MD5_Update    _MD5_Update
> #define MD5_Final     _MD5_Final
> #define SSEmd5body    _SSEmd5body
> #define SSESHA1body   _SSESHA1body
> #define SSEmd4body    _SSEmd4body
> #define md5cryptsse   _md5cryptsse
> #endif
> 	.text
> ..TXTST0:
> # -- Begin  sse_debug
> ....
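
The four steps listed above could be automated along these lines. This is only a rough sketch in Python rather than the perl script jfoug mentions; the function names are hypothetical, the symbol list is simply copied from the #define lines quoted above, and it comments out every .section line rather than just the couple at the end of the file, which is an assumption that would need checking against the real icc output.

```python
import re

# Symbols that need a leading underscore on Win32/cygwin
# (copied from the #define lines in the quoted message).
UNDERSCORE_SYMBOLS = [
    "memcpy", "memset", "strlen",
    "MD5_Init", "MD5_Update", "MD5_Final",
    "SSEmd5body", "SSESHA1body", "SSEmd4body", "md5cryptsse",
]

# ELF-only directives to comment out. ASSUMPTION: this matches every
# .section line, not only the trailing ones mentioned in step 3.
ELF_ONLY = re.compile(r"^\s*\.(type|size|section)\b")


def underscore_block(symbols):
    """Build the #ifdef UNDERSCORES ... #endif block as a list of lines."""
    lines = ["#ifdef UNDERSCORES"]
    lines += ["#define {0} _{0}".format(name) for name in symbols]
    lines.append("#endif")
    return lines


def patch_for_cygwin(asm_text, symbols=UNDERSCORE_SYMBOLS):
    """Return asm_text with steps 1-4 applied (hypothetical helper)."""
    out = []
    inserted = False
    for line in asm_text.splitlines():
        if ELF_ONLY.match(line):
            out.append("# " + line)  # steps 1-3: comment the directive out
        else:
            out.append(line)
        # Step 4: put the rename block right after the first .file line.
        if not inserted and line.lstrip().startswith(".file"):
            out.extend(underscore_block(symbols))
            inserted = True
    if not inserted:
        # No .file directive found: fall back to the very top of the file.
        out = underscore_block(symbols) + out
    return "\n".join(out) + "\n"
```

Feeding the contents of the icc-generated file through patch_for_cygwin() and writing the result back out would be the whole build step; whether the make target should do this in place or into a separate file is left open here.
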
> Here are some timings, testing the original md5-mmx.S, an intrinsic build using sse-intrinsics.c, and one using a patched sse-intrinsics-32.S.
> ***Timings md5-mmx.S  (original Bartavelle .S code).
> dynamic_0: md5($p)  (raw-md5)    [32x4 .S]  9925K
> dynamic_1: md5($p.$s)  (joomla)  [32x4 .S]  Many:  8285K 1salt:  5552K
> dynamic_2: md5(md5($p))  (e107)  [32x4 .S]  5502K
> dynamic_3: md5(md5(md5($p)))     [32x4 .S]  3792K
> dynamic_4: md5($s.$p)  (OSC)     [32x4 .S]  Many:  9108K 1salt:  5620K
> dynamic_5: md5($s.$p.$s)         [32x4 .S]  Many:  7348K 1salt:  4966K
> FreeBSD MD5 [32/32]                          6796 c/s
> Raw SHA-1 [4x]                              7568K c/s
> ***Timings sse-intrinsics.c
> dynamic_0: md5($p)  (raw-md5)    [16x4x2]  9412K
> dynamic_1: md5($p.$s)  (joomla)  [16x4x2]  Many:  8243K 1salt:  5594K
> dynamic_2: md5(md5($p))  (e107)  [16x4x2]  5363K
> dynamic_3: md5(md5(md5($p)))     [16x4x2]  3727K
> dynamic_4: md5($s.$p)  (OSC)     [16x4x2]  Many:  9311K 1salt:  5739K
> dynamic_5: md5($s.$p.$s)         [16x4x2]  Many:  7655K 1salt:  5097K
> FreeBSD MD5 [8x]                           14480 c/s
> Raw SHA-1 [8x]                             5495K c/s
> ***Timings sse-intrinsics-32.S  (patched for Win32)
> dynamic_0: md5($p)  (raw-md5)    [10x4x3]  11449K
> dynamic_1: md5($p.$s)  (joomla)  [10x4x3]  Many:  9953K 1salt:  6270K
> dynamic_2: md5(md5($p))  (e107)  [10x4x3]  6763K
> dynamic_3: md5(md5(md5($p)))     [10x4x3]  4785K
> dynamic_4: md5($s.$p)  (OSC)     [10x4x3]  Many: 11447K 1salt:  6455K
> dynamic_5: md5($s.$p.$s)         [10x4x3]  Many:  9123K 1salt:  5677K
> FreeBSD MD5 [12x]                          21042 c/s
> Raw SHA-1 [8x]                             7079K c/s
> So, compared with the sse-intrinsics.c build, the intrinsic-32.S is about 22% faster for dynamic, about 45% faster for the Crypt(3) MD5, and about 29% faster for raw-SHA1.
> However, compared with the original md5-mmx.S, the intrinsic-32.S is about 20-22% faster for dynamic, about 3 times as fast for the Crypt(3) MD5 (only supports SSE2i), and about 6-7% slower for raw-SHA1.
> All in all, this is not a bad way to proceed.  These are some very prelim tests.  I have not had time to really dig in too deep.
> Jim.
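
For anyone wanting to re-check the percentages against the timing tables above, here is a small Python sketch. The numbers are copied from three rows of the quoted tables; the dictionary keys ("mmx", "c", "s32") and the function name are just labels made up for this example.

```python
def percent_faster(new, old):
    """Relative speed gain of `new` over `old`, in percent (negative = slower)."""
    return (new / old - 1.0) * 100.0


# Figures from the quoted tables. Units differ per row (K, c/s, K c/s),
# but each ratio is taken within a single row, so that does not matter.
timings = {
    "dynamic_0 raw-md5": {"mmx": 9925, "c": 9412, "s32": 11449},
    "FreeBSD MD5": {"mmx": 6796, "c": 14480, "s32": 21042},
    "Raw SHA-1": {"mmx": 7568, "c": 5495, "s32": 7079},
}

for name, t in timings.items():
    vs_c = percent_faster(t["s32"], t["c"])      # patched .S vs sse-intrinsics.c
    vs_mmx = percent_faster(t["s32"], t["mmx"])  # patched .S vs md5-mmx.S
    print("%-20s vs .c: %+6.1f%%   vs mmx: %+6.1f%%" % (name, vs_c, vs_mmx))
```

Only one of the six dynamic rows is included; the others can be added to the dictionary the same way.
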
>> From: magnum []
>> 2011-11-09 22:57, magnum wrote:
>>> 2011-11-09 18:47, magnum wrote:
>>>> 2. Should we support this for 32-bit at all? I suppose I can cross
>>>> compile a 32-bit .S file with icc (haven't tried it) but I have no idea
>>>> if it will perform better or worse than gcc on a 32-bit machine. I
>>>> suppose this should be verified on a machine that is really 32-bit.
>> A somewhat experimental patch 0004 is now uploaded. It adds a 32-bit
>> version of the intrinsics asm file generated by icc, and modifies all
>> sse2i targets to utilize it.
>> Works fine (and does give a performance boost) on linux-x86-64-32-sse2i
>> test target. I'd be interested to hear results from testing on Windows.
>> magnum
