Date: Tue, 22 Nov 2011 20:09:43 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: SHA1  SSE2i R&D work

Good stuff. What's the gain for dcc2?

magnum


2011-11-22 19:58, jfoug wrote:
> I have done some R&D on the SSE2i code for SHA1. Here is the existing
> SSE2i data layout for the input buffers:
>
> uint32 crypt[80];
>
> Of this, the first 16 uints are read-only and contain the data to
> hash, the 0x80 padding byte, and the bit count of the buffer (the
> standard SHA input layout). uints 17 to 80 are used as the expansion
> buffer; this data is written to, and then later read from.
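For reference, here is a minimal scalar sketch of that legacy layout for a single candidate, assuming a message of at most 55 bytes so it fits in one 64-byte block. `load_and_expand80` is a hypothetical name, not a function in sse-intrinsics.c, and the real SSE2 code works on several candidates at once in `__m128i` vectors; the sketch only shows words 0 to 15 as the read-only padded input and words 16 to 79 as the expansion buffer, all filled before any round runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Hypothetical helper illustrating the 80-word layout:
 * crypt[0..15]  = padded input block (big-endian message bytes, the 0x80
 *                 padding byte, bit length in the last word),
 * crypt[16..79] = expanded message schedule, computed up front.
 * Assumes a single-block message (len <= 55). */
static void load_and_expand80(uint32_t crypt[80], const uint8_t *msg, size_t len)
{
    assert(len <= 55);                       /* must fit in one block */
    for (int i = 0; i < 16; i++)
        crypt[i] = 0;
    for (size_t i = 0; i < len; i++)         /* big-endian packing */
        crypt[i >> 2] |= (uint32_t)msg[i] << (24 - 8 * (i & 3));
    crypt[len >> 2] |= 0x80u << (24 - 8 * (len & 3));  /* padding byte */
    crypt[15] = (uint32_t)(len * 8);         /* bit count; word 14 stays 0 */

    /* Expansion: 64 extra words written once, read later by the rounds. */
    for (int t = 16; t < 80; t++)
        crypt[t] = ROTL32(crypt[t - 3] ^ crypt[t - 8] ^
                          crypt[t - 14] ^ crypt[t - 16], 1);
}
```

For the message "abc", word 0 comes out as 0x61626380 (the three bytes plus the 0x80 padding byte) and word 15 as 24, the bit count.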
> 
> I have never really liked this layout; I consider it pretty wasteful.
> It is easy to see that this expansion buffer ‘could’ be done with only
> 16 uints, treated in a circular manner. I had always thought this would
> improve speed, due to a much smaller working set, along with other
> benefits such as better L1 cache usage.
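The circular idea can be sketched in plain C as follows. This is a scalar, single-block illustration under the same 55-byte assumption, with a hypothetical function name. It uses the standard trick (also given as an alternate method in FIPS 180-4) of storing W[t] in slot t & 15: W[t-16] is the oldest word the expansion ever needs, so its slot can be reused for W[t].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Hypothetical sketch: SHA-1 of a message of at most 55 bytes using a
 * 16-word circular message schedule instead of an 80-word one. */
static void sha1_short_circular(const uint8_t *msg, size_t len, uint32_t out[5])
{
    uint32_t w[16] = {0};
    uint32_t a = 0x67452301u, b = 0xEFCDAB89u, c = 0x98BADCFEu,
             d = 0x10325476u, e = 0xC3D2E1F0u;

    assert(len <= 55);                       /* single block only */
    for (size_t i = 0; i < len; i++)
        w[i >> 2] |= (uint32_t)msg[i] << (24 - 8 * (i & 3));
    w[len >> 2] |= 0x80u << (24 - 8 * (len & 3));
    w[15] = (uint32_t)(len * 8);

    for (int t = 0; t < 80; t++) {
        uint32_t f, k, temp;
        if (t >= 16)  /* slot t & 15 holds W[t-16]; replace it with W[t] */
            w[t & 15] = ROTL32(w[(t - 3) & 15] ^ w[(t - 8) & 15] ^
                               w[(t - 14) & 15] ^ w[t & 15], 1);
        if (t < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999u; }
        else if (t < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1u; }
        else if (t < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6u; }
        temp = ROTL32(a, 5) + f + e + k + w[t & 15];
        e = d; d = c; c = ROTL32(b, 30); b = a; a = temp;
    }
    out[0] = 0x67452301u + a; out[1] = 0xEFCDAB89u + b;
    out[2] = 0x98BADCFEu + c; out[3] = 0x10325476u + d;
    out[4] = 0xC3D2E1F0u + e;
}
```

Note that this version overwrites the input words as it goes; the tmpR approach in sse-intrinsics.c instead keeps the 16 input words read-only and puts the expansion in a separate buffer.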
> 
> This same layout is used within sha1-mmx.S (32-bit SSE) and within
> sse-intrinsics.c. sha1-mmx.S was complex enough that I was not going
> to take a stab at reworking its data layout. But now sse-intrinsics-64.S
> and sse-intrinsics-32.S let the intrinsic code run as fast as the 32-bit
> hand-built asm, and on 64 bits the intrinsics are the ONLY option, so I
> worked at getting sha1 to use uint32 crypt[16]. Then, within
> sse-intrinsics.c, I made changes and added a new buffer,
> __m128i tmpR[SHA1_SSE_PARA*16]; this buffer is loaded one element at a
> time and used the same way the ‘longer’ buffer version would be.
>
> This required several matched macros. The prior code had a loop that
> assigned values to all expansion buffer items (from 17 to 80) prior to
> processing. With the circular buffer, I load each array element just
> before using it. However, because 4 different ‘prior’ words (W[t-3],
> W[t-8], W[t-14] and W[t-16]) are used to generate the ‘next’ one, the
> only way I could see to do this is to have multiple SHA_ROUND macros. I
> built a, b, c, d, then the ‘normal’ one, then an x.
>
> The ‘a’ ROUND is used until we hit array element 3; it pulls all of the
> items from the data array. The ‘b’ ROUND is used from element 3 to 8; it
> pulls the first element out of the temp circular buffer, but pulls the
> other 3 out of the data. In the ‘normal’ ROUND, I have to copy the
> current tmp buffer item into another temporary value, because this exact
> variable will be overwritten before processing: it sits at both our
> ‘current’ location and the +16 location. The final ROUND macro
> (SHA_ROUNDx) does not need to compute the expansion, so that part is
> left out.
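A scalar C sketch of the staged rounds may help. It assumes that each round t consumes W[t] and also produces W[t+16], which fits the description of the 'b' ROUND (one input from the temp buffer, three from the data) and of the 'normal' ROUND (the current slot is also the +16 slot). The boundaries used for 'c' (elements 8 to 13) and 'd' (elements 14 and 15) are inferred from the SHA-1 expansion rather than taken from the actual macros, and the function names are hypothetical. Separate loops stand in for the unrolled SHA_ROUND macros, so each stage reads its inputs from fixed places, as the macros would.

```c
#include <assert.h>
#include <stdint.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* One SHA-1 round t on state s[0..4] = a..e, consuming schedule word wt. */
static void sha1_round(uint32_t s[5], int t, uint32_t wt)
{
    uint32_t b = s[1], c = s[2], d = s[3], f, k;
    if (t < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999u; }
    else if (t < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1u; }
    else if (t < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu; }
    else             { f = b ^ c ^ d;                   k = 0xCA62C1D6u; }
    uint32_t temp = ROTL32(s[0], 5) + f + s[4] + k + wt;
    s[4] = d; s[3] = c; s[2] = ROTL32(b, 30); s[1] = s[0]; s[0] = temp;
}

/* Hypothetical sketch: data[] holds W[0..15] and is never written.
 * Round t also computes W[t+16] = ROTL1(W[t+13] ^ W[t+8] ^ W[t+2] ^ W[t])
 * into tmp[(t+16) & 15], which is the same slot as tmp[t & 15]. */
static void sha1_block_staged(uint32_t h[5], const uint32_t data[16])
{
    uint32_t tmp[16];                        /* circular buffer for W[16..79] */
    uint32_t s[5] = { h[0], h[1], h[2], h[3], h[4] };
    int t = 0;

    assert(h && data);
    for (; t < 3; t++) {   /* "a": all four inputs still in data[] */
        tmp[t] = ROTL32(data[t + 13] ^ data[t + 8] ^ data[t + 2] ^ data[t], 1);
        sha1_round(s, t, data[t]);
    }
    for (; t < 8; t++) {   /* "b": W[t+13] now comes from tmp[] */
        tmp[t] = ROTL32(tmp[(t + 13) & 15] ^ data[t + 8] ^ data[t + 2] ^ data[t], 1);
        sha1_round(s, t, data[t]);
    }
    for (; t < 14; t++) {  /* "c" (inferred): W[t+8] also from tmp[] */
        tmp[t] = ROTL32(tmp[(t + 13) & 15] ^ tmp[(t + 8) & 15] ^
                        data[t + 2] ^ data[t], 1);
        sha1_round(s, t, data[t]);
    }
    for (; t < 16; t++) {  /* "d" (inferred): only W[t] still in data[] */
        tmp[t] = ROTL32(tmp[(t + 13) & 15] ^ tmp[(t + 8) & 15] ^
                        tmp[(t + 2) & 15] ^ data[t], 1);
        sha1_round(s, t, data[t]);
    }
    for (; t < 64; t++) {  /* "normal": save W[t] before its slot is reused */
        uint32_t cur = tmp[t & 15];
        tmp[t & 15] = ROTL32(tmp[(t + 13) & 15] ^ tmp[(t + 8) & 15] ^
                             tmp[(t + 2) & 15] ^ cur, 1);
        sha1_round(s, t, cur);
    }
    for (; t < 80; t++)    /* "x": W[64..79] already built, no expansion */
        sha1_round(s, t, tmp[t & 15]);

    for (int i = 0; i < 5; i++)
        h[i] += s[i];
}
```

Because `data` is `const`, the compiler enforces that the 16 input words stay read-only, and the whole schedule fits in 16 + 16 words instead of 80.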
> 
> Now, I do not have the Intel compiler, so I am using Cygwin GCC to
> build sse-intrinsics.c, and the result is pretty slow. Also, I cannot
> use SHA1_PARA 3 (it slows things down a LOT), but now, with the improved
> memory usage, PARA=3 may be useful in the Intel-compiled code.
> 
> However, I made changes to rawSHA1_fmt.c so that, by commenting 1 line
> out (or leaving it in), I can build for either the legacy SHA1 layout
> or the new layout. Both functions are in sse-intrinsics.c, so it can be
> built either way. Here are the timings. Little optimization work has
> been done; I simply ‘got it working’ and wanted to see some speeds,
> mostly as a proof of concept at this moment.
> 
> $ ../run/john -test=5 -form=raw-sha1
> Benchmarking: Raw SHA-1 [SSE2i 8x]... DONE
> Raw:    5207K c/s
>
> $ ../run/john -test=5 -form=raw-sha1
> Benchmarking: Raw SHA-1 [SSE2i type-2 8x]... DONE
> Raw:    6909K c/s
> 
> So in this quick proof of concept, I am getting about a 1/3 speedup
> (6909K vs. 5207K c/s). Not bad.
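Worked out from the two benchmark lines, the gain is about 33% more candidates per second, which is the same as roughly a 25% cut in time per candidate:

```latex
\frac{6909\ \mathrm{K\,c/s}}{5207\ \mathrm{K\,c/s}} \approx 1.327,
\qquad
1 - \frac{5207}{6909} = \frac{1702}{6909} \approx 0.246
```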
> 
> I have not put the code up on the wiki yet. I would like to look a
> little deeper and make sure I have not overcomplicated things. I may
> send the changes to Simon and Magnum, so they can look them over and
> also produce a .S file (possibly several, with different SHA_PARA
> values set).

