Date: Thu, 25 Apr 2013 09:35:12 -0500
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: RE: ICC performance regression

I have just used gcc to build the sse-intrinsics-32.S file, and the speed
was almost identical to the older version made with icc.  I simply used the
exact same command line, but added  -o sse-intrinsics-32.S -S  to produce a
.S file, and things worked.

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
4.7.2-2ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs
--enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-4.7 --enable-shared --enable-linker-build-id
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext
--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7
--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--enable-gnu-unique-object --enable-plugin --enable-objc-gc --disable-werror
--with-arch-32=i686 --with-tune=generic --enable-checking=release
--build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.2 (Ubuntu/Linaro 4.7.2-2ubuntu1)


Speed of the gcc built .S file:

$ ../run/john -test=5 -form=dynamic
Benchmarking: dynamic_0: md5($p) (raw-md5) [128/128 SSE2 intrinsics
10x4x3]... DONE
Raw:    27669K c/s real, 27715K c/s virtual
Benchmarking: dynamic_1: md5($p.$s) (joomla) [128/128 SSE2 intrinsics
10x4x3]... DONE
Many salts:     16383K c/s real, 16394K c/s virtual
Only one salt:  12244K c/s real, 12259K c/s virtual
Benchmarking: dynamic_2: md5(md5($p)) (e107) [128/128 SSE2 intrinsics
10x4x3]... DONE
Raw:    14142K c/s real, 14150K c/s virtual
$ ../run/john -test=5 -form=md5
Benchmarking: crypt-MD5 [128/128 SSE2 intrinsics 12x]... DONE
Raw:    31925 c/s real, 31987 c/s virtual


Speed of the icc (older version) .S file:

$ ../run/john -test=5 -form=dynamic
Benchmarking: dynamic_0: md5($p) (raw-md5) [128/128 SSE2 intrinsics
10x4x3]... DONE
Raw:    27212K c/s real, 27294K c/s virtual
Benchmarking: dynamic_1: md5($p.$s) (joomla) [128/128 SSE2 intrinsics
10x4x3]... DONE
Many salts:     16273K c/s real, 16263K c/s virtual
Only one salt:  12295K c/s real, 12295K c/s virtual
Benchmarking: dynamic_2: md5(md5($p)) (e107) [128/128 SSE2 intrinsics
10x4x3]... DONE
Raw:    13753K c/s real, 14002K c/s virtual
Benchmarking: dynamic_3: md5(md5(md5($p))) [128/128 SSE2 intrinsics
10x4x3]... Wait...

Speed from unstable (where the md5 format still works, using the older ICC
*-32.S file):
$ ../run/john -test=5 -form=md5
Benchmarking: FreeBSD MD5 [128/128 SSE2 intrinsics 12x]... DONE
Raw:    31637 c/s real, 31744 c/s virtual

So instead of fighting to get an older ICC working properly, we might
simply look at a 'current' gcc version.  One bad side effect is size.  The
older icc files were 359K (64-bit) and 394K (32-bit).  The .S file I built
(only the 32-bit version) required some hand patching (the perl file
helped, but more code cutting was needed, and some UNDERSCORES defines
needed to be added).  That file, however, is 1156K, so it is much larger.

But this 'may' be an option (using a newer gcc version).  NOTE, on this same
system (cygwin), if I do a make win32-cygwin-x86-sse2 (not the 'i' build), I
get these timings (only about 70% as fast as the prebuilt .S file code):

$ ../run/john -test=5 -form=dynamic
Benchmarking: dynamic_0: md5($p) (raw-md5) [128/128 SSE2 intrinsics
10x4x3]... DONE
Raw:    20052K c/s real, 20014K c/s virtual
Benchmarking: dynamic_1: md5($p.$s) (joomla) [128/128 SSE2 intrinsics
10x4x3]... DONE
Many salts:     13214K c/s real, 13222K c/s virtual
Only one salt:  10312K c/s real, 10313K c/s virtual
Benchmarking: dynamic_2: md5(md5($p)) (e107) [128/128 SSE2 intrinsics
10x4x3]... DONE
Raw:    10121K c/s real, 10132K c/s virtual
$ ../run/john -test=5 -form=md5
Benchmarking: crypt-MD5 [128/128 SSE2 intrinsics 12x]... DONE
Raw:    21839 c/s real, 21841 c/s virtual
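
As a rough check of the "about 70%" figure, the per-test ratios can be
computed from the "real" numbers pasted in this message.  This is only a
minimal sketch: the dictionary labels and the speed_ratios helper are names
made up for illustration, not anything john itself provides.

```python
# "real" speeds copied from the benchmark output in this message.
# dynamic_* values are in K c/s, crypt-MD5 in c/s; only ratios are computed,
# so the unit just has to match between the two builds for each test.
gcc_prebuilt_s = {
    "dynamic_0 raw": 27669,
    "dynamic_1 many salts": 16383,
    "dynamic_1 one salt": 12244,
    "dynamic_2 raw": 14142,
    "crypt-MD5 raw": 31925,
}
cygwin_sse2_build = {
    "dynamic_0 raw": 20052,
    "dynamic_1 many salts": 13214,
    "dynamic_1 one salt": 10312,
    "dynamic_2 raw": 10121,
    "crypt-MD5 raw": 21839,
}


def speed_ratios(candidate, baseline):
    """Return candidate/baseline speed for every test present in baseline."""
    return {name: candidate[name] / baseline[name] for name in baseline}


ratios = speed_ratios(cygwin_sse2_build, gcc_prebuilt_s)
for name, ratio in ratios.items():
    print(f"{name:22s} {ratio:.0%}")
```

With these inputs the ratios are not uniform: the raw MD5 and crypt-MD5 tests
land near 70%, while the salted dynamic_1 tests come out above 80%.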


As for ICC, I did try a few other things.  I could not recover the speed
loss.  But as magnum mentioned, it 'could' simply be that the PARA values
need updating.  However, at almost an hour of build time for each change, it
is not easy to do a lot of testing.
 
Jim.

From: magnum Sent: Thursday, April 25, 2013 2:10
>
>On 25 Apr, 2013, at 1:30 , Solar Designer <solar@...nwall.com> wrote:
>> On Thu, Apr 25, 2013 at 01:12:19AM +0200, magnum wrote:
>>> Old pre-built files, icc 12.1.4:
>> [...]
>>> Benchmarking: FreeBSD MD5 [128/128 SSE2 intrinsics 12x]... DONE
>>> Raw:	39204 c/s real, 39204 c/s virtual
>> [...]
>>> gcc 4.7.2, -native target:
>> [...]
>>> Benchmarking: crypt-MD5 [128/128 AVX intrinsics 12x]... DONE
>>> Raw:	36936 c/s real, 36936 c/s virtual
>> 
>> This is pretty significant difference in favor of old icc, and not all 
>> CPUs have AVX, so I think we should simply continue to use old icc to 
>> prebuild the files.
>
> This requires someone having an older version. I haven't found one yet.
>
> Until now we have compared icc using -O3 (25 *minutes* compile time per
> file), to gcc using just -O2 (compiling in 3 seconds). I will try some
> different versions of icc as well as MD5_PARA values (very time consuming),
> but also different sets of options to gcc and see where we end up.
