Date: Tue, 22 Nov 2011 12:58:03 -0600
From: "jfoug" <>
To: <>
Subject: SHA1  SSE2i R&D work

I have done some R&D on the SSE2i code for SHA1.  Here is the existing SSE2i
data layout for input buffers:



uint32 crypt [80]


Of this, the first 16 uints are read-only and contain the data to hash,
the 0x80 byte, and the bit count of the buffer (standard SHA input layout).
uints 17 to 80 (counting from 1) are used as the expansion buffer.  This data
is written to, and then later read from.
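
To make the "standard SHA input layout" concrete, here is a minimal scalar
sketch of how the 16 read-only words are filled for one short message.  The
helper name sha1_single_block is hypothetical (it is not from john), and the
sketch handles a single message only; it ignores how the SSE code packs
several candidates side by side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, for illustration only: build the single 16-word
 * SHA-1 input block for a message of at most 55 bytes. */
static void sha1_single_block(const unsigned char *msg, size_t len,
                              uint32_t block[16])
{
    unsigned char bytes[64];
    size_t i;

    assert(len <= 55);   /* message + 0x80 + 8-byte bit count must fit */
    memset(bytes, 0, sizeof bytes);
    memcpy(bytes, msg, len);
    bytes[len] = 0x80;   /* the terminating 0x80 byte */

    /* SHA-1 reads the block as 16 big-endian 32-bit words. */
    for (i = 0; i < 16; i++)
        block[i] = ((uint32_t)bytes[4 * i]     << 24) |
                   ((uint32_t)bytes[4 * i + 1] << 16) |
                   ((uint32_t)bytes[4 * i + 2] <<  8) |
                    (uint32_t)bytes[4 * i + 3];

    /* 64-bit bit count in the last two words; the high word stays zero
     * because len <= 55. */
    block[15] = (uint32_t)(len * 8);
}
```

For the 3-byte message "abc", word 0 holds the data followed by 0x80, and
word 15 holds the bit count 24.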


I have never really liked this layout; I consider it to be pretty wasteful.
It is easy to see that this expansion buffer could be done with only
16 uints, treated in a circular manner.  I had always thought this would
give an improvement in speed, due to a much smaller working set, along with
other things, such as improved L1 caching.
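
The circular idea can be sketched in plain scalar C.  This is not the
sse-intrinsics.c code; the function names are made up for illustration, and
the ring here starts out as a copy of the 16 input words.  Each SHA-1
expansion word depends only on words 3, 8, 14 and 16 positions back, so 16
slots indexed modulo 16 are enough.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotl1(uint32_t x) { return (x << 1) | (x >> 31); }

/* Legacy-style expansion: W[0..15] hold the input, W[16..79] are filled
 * in one pass before any round is processed. */
static void expand_full(uint32_t W[80])
{
    int t;
    for (t = 16; t < 80; t++)
        W[t] = rotl1(W[t - 3] ^ W[t - 8] ^ W[t - 14] ^ W[t - 16]);
}

/* Circular expansion: ring[] holds the 16 most recent words.  Must be
 * called with t = 16, 17, ..., 79 in order.  Word t-16 lives in the same
 * slot (t & 15) that word t is written to. */
static uint32_t expand_next(uint32_t ring[16], int t)
{
    uint32_t w;

    assert(t >= 16 && t < 80);
    w = rotl1(ring[(t - 3) & 15] ^ ring[(t - 8) & 15] ^
              ring[(t - 14) & 15] ^ ring[t & 15]);
    ring[t & 15] = w;
    return w;
}
```

Both functions produce the same 64 expansion words; the circular one touches
16 words of storage instead of 80.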


This same layout is used within sha1-mmx.S (32 bit SSE) and within
sse-intrinsics.c.  sha1-mmx.S was complex enough that I was not going to
make a stab at changing its data layout.  But sse-intrinsics-64.S and
sse-intrinsics-32.S now allow the intrinsic code to work as fast as the
32 bit hand built asm, and on 64 bits the intrinsics are the ONLY option.
So I worked at getting sha1 to use uint32 crypt[16].


Within sse-intrinsics.c, I made changes and added a new buffer, __m128i
tmpR[SHA1_SSE_PARA*16].  This buffer is loaded one element at a time and
used the same way that the 'longer' buffer version would be.  The prior code
had a loop that assigned values to all expansion buffer items (from 17 to 80)
prior to processing.  With the circular buffer, I load each array element
just prior to using it.


This required several matched macros.  Because 4 different 'prior' variables
within the data are used to generate the 'next' variable, the only way I
could see to do this was to have multiple SHA_ROUND macros.  I built a, b, c
and d, then the 'normal' one, then an x.  The 'a' ROUND is used until we hit
array element 3; it pulls all of the items from the data array.  The 'b'
ROUND is used from element 3 to 8; it pulls the first element out of the temp
circular buffer, but pulls the other 3 out of the data.  In the 'normal'
ROUND, I have to copy the current tmp buffer item into another tmp value,
because this exact same variable will be overwritten before it is processed:
it sits at our 'current' location and also at the +16 location.  The final
ROUND macro (SHA_ROUNDx) does not need to compute the expansion, so that
part is left out.
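
Here is a scalar sketch of how I read the phases described above; it is not
the actual macro code.  The function name round_word is hypothetical, the
branches stand in for the separate SHA_ROUND macros, and the element ranges
given for 'c' and 'd' are inferred from the SHA-1 recurrence rather than
stated in the description.  Round t consumes word t and, while doing so,
generates word t+16 into the circular buffer.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotl1(uint32_t x) { return (x << 1) | (x >> 31); }

/* Return the message word consumed by round t (0..79).  As a side effect,
 * rounds 0..63 store word t+16 in tmp[t & 15].  data[] is never written.
 * Must be called with t = 0, 1, ..., 79 in order. */
static uint32_t round_word(const uint32_t data[16], uint32_t tmp[16], int t)
{
    uint32_t cur, w3, w8, w14;

    assert(t >= 0 && t < 80);

    if (t >= 64)                  /* 'x': nothing left to expand */
        return tmp[t & 15];

    if (t < 16) {
        /* 'a' (t 0..2):   all four sources come from data
         * 'b' (t 3..7):   one source from tmp, three from data
         * 'c' (t 8..13):  two from tmp (inferred)
         * 'd' (t 14..15): three from tmp (inferred) */
        w3  = (t < 3)  ? data[t + 13] : tmp[(t + 13) & 15];
        w8  = (t < 8)  ? data[t + 8]  : tmp[(t + 8) & 15];
        w14 = (t < 14) ? data[t + 2]  : tmp[(t + 2) & 15];
        cur = data[t];
        tmp[t & 15] = rotl1(w3 ^ w8 ^ w14 ^ cur);
        return cur;
    }

    /* 'normal' (t 16..63): the current word and word t+16 share a slot,
     * so save the current word before overwriting it. */
    cur = tmp[t & 15];
    tmp[t & 15] = rotl1(tmp[(t + 13) & 15] ^ tmp[(t + 8) & 15] ^
                        tmp[(t + 2) & 15] ^ cur);
    return cur;
}
```

Unrolling the branches into separate macros, as described, removes the
per-round tests from the hot path; the sketch keeps them only to show which
buffer each source word comes from.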


Now, I do not have the Intel compiler, so I am using cygwin GCC to build
sse-intrinsics.c.  It is pretty slow.  Also, I cannot use SHA1_PARA 3, as it
slows things down a LOT, but with the improved memory usage, PARA=3 may now
be useful in the Intel produced code.


However, I made changes to rawSHA1_fmt.c so that I can comment 1 line out
(or leave it in) and build for either the legacy SHA1 layout or the
new layout.  Both functions are in sse-intrinsics.c, so it can be built
either way.  Here are the timings.  Little work has been done on
optimization; I simply 'got it working' and wanted to see some speeds,
more as a proof of concept at this moment.


$ ../run/john -test=5 -form=raw-sha1
Benchmarking: Raw SHA-1 [SSE2i 8x]... DONE
Raw:    5207K c/s


$ ../run/john -test=5 -form=raw-sha1
Benchmarking: Raw SHA-1 [SSE2i type-2 8x]... DONE
Raw:    6909K c/s



So by this quick proof of concept, I am getting about a 1/3 speedup (5207K
to 6909K c/s).  Not bad.


I have not put code up on the wiki yet.  I would like to look a little
deeper, and make sure I have not overcomplicated things.  I may get the
changes off to Simon and Magnum, to get them to look, and to also get a .S
file, possibly several with different SHA_PARA values being set.



