Date: Tue, 22 Nov 2011 12:58:03 -0600
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: SHA1 SSE2i R&D work

I have done some R&D on the SSE2i code for SHA1.

Here is the existing SSE2i data layout for input buffers:

  uint32 crypt

Of this, the first 16 uints are read-only and contain the data to hash, the 0x80 bit, and the bit count of the buffer (standard SHA input layout). uints 17 to 80 are used as the expansion buffer. This data is written to, and then later read from.

I have never really liked this layout; I consider it pretty wasteful. It is easy to see that this expansion buffer 'could' be done with only 16 uints, treated in a circular manner. I had always thought this would give an improvement in speed, due to a much smaller working set and, with it, better L1 cache behavior.

This same layout is used within sha1-mmx.S (32 bit SSE) and within sse-intrinsics.c. The sha1-mmx.S code was complex enough that I was not going to take a stab at changing its data layout. But now sse-intrinsics-64.S and sse-intrinsics-32.S allow the intrinsic code to run as fast as the 32-bit hand-built asm, and on 64-bit builds the intrinsics are the ONLY option.

I worked at getting sha1 to use

  uint32 crypt

Then, within sse-intrinsics.c, I made changes and added a new buffer:

  __m128i tmpR[SHA1_SSE_PARA*16];

This buffer is loaded one element at a time, and used the same way that the 'longer' buffer version would be. This required several matched macros. The prior code had a loop that assigned values to all expansion buffer items (from 17 to 80) prior to processing. With the circular buffer, I load each array element just prior to using it. However, because 4 different 'prior' values within the data are used to generate the 'next' value, the only way I could see to do this was to have multiple SHA_ROUND macros. I built a, b, c, d, then the 'normal' one, then an x.

The 'a' ROUND is used until we hit array element 3. We pull all of the items from the data array. The 'b' ROUND is used from element 3 to 8. It pulls the first element out of the temp circular buffer, but pulls the other 3 out of the data. In the 'normal' ROUND, I have to pull the current tmp buffer item into another tmp value, because that exact same slot is overwritten before we are done with it. The slot is both our 'current' location and the +16 location. The final ROUND macro (SHA_ROUNDx) does not need to compute the expansion, so that part is left out.

Now, I do not have the Intel compiler, so I am using cygwin GCC to build sse-intrinsics.c. It is pretty slow. Also, I can not use SHA1_PARA 3; it slows things down a LOT. With the improved memory usage, though, PARA=3 may be useful in the Intel-produced code.

However, I made changes to rawSHA1_fmt.c so that I can comment 1 line out (or leave it in) and build for either the legacy SHA1 layout or the new layout. Both functions are in sse-intrinsics.c, so it can be built either way.

Here are the timings. Little work has been done on optimization; I simply 'got it working' and wanted to see some speeds, more as a proof of concept at this moment.

  $ ../run/john -test=5 -form=raw-sha1
  Benchmarking: Raw SHA-1 [SSE2i 8x]... DONE
  Raw:    5207K c/s

  $ ../run/john -test=5 -form=raw-sha1
  Benchmarking: Raw SHA-1 [SSE2i type-2 8x]... DONE
  Raw:    6909K c/s

So by this quick proof of concept, I am getting about a 1/3 speedup (5207K to 6909K c/s). Not bad.

I have not put code up on the wiki yet. I would like to look a little deeper and make sure I have not overcomplicated things. I may get the changes off to Simon and Magnum, to get them to look, and also to get a .S file, possibly several with different SHA_PARA values set.
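
For anyone who wants to play with the circular-buffer idea described above, here is a minimal scalar sketch. It is only an illustration under stated assumptions, not the sse-intrinsics.c code: it handles a single hash with plain uint32_t instead of __m128i, it shows only the message expansion (not the full rounds), and the names rol32, expand_full, fetch, circular_w, ring_operands, schedule_mismatches and check_block are made up for this example. The 16 input words stay read-only in data[], and expanded words live in a separate 16-entry ring, roughly the role tmpR plays.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate left; n must be in 1..31. */
static uint32_t rol32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Legacy-style layout: expand all 80 words before any round runs. */
static void expand_full(const uint32_t data[16], uint32_t w[80])
{
    int t;
    memcpy(w, data, 16 * sizeof(uint32_t));
    for (t = 16; t < 80; t++)
        w[t] = rol32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1);
}

/* Word k comes from the read-only input if k < 16, else from the ring. */
static uint32_t fetch(const uint32_t data[16], const uint32_t ring[16], int k)
{
    return (k < 16) ? data[k] : ring[k & 15];
}

/*
 * Circular layout: produce word t just before the round that needs it.
 * Must be called with t = 0, 1, ..., 79 in order.  For t >= 32 the slot
 * being written (t & 15) is the same slot that holds word t-16, so the
 * old value is read into w before the store overwrites it.
 */
static uint32_t circular_w(const uint32_t data[16], uint32_t ring[16], int t)
{
    uint32_t w;
    if (t < 16)
        return data[t];
    w = rol32(fetch(data, ring, t - 3) ^ fetch(data, ring, t - 8) ^
              fetch(data, ring, t - 14) ^ fetch(data, ring, t - 16), 1);
    ring[t & 15] = w;
    return w;
}

/* How many of the four source words for step t live in the ring. */
static int ring_operands(int t)
{
    static const int back[4] = { 3, 8, 14, 16 };
    int i, n = 0;
    for (i = 0; i < 4; i++)
        if (t - back[i] >= 16)
            n++;
    return n;
}

/* Count positions where the circular layout disagrees with the full one. */
static int schedule_mismatches(const uint32_t data[16])
{
    uint32_t w[80];
    uint32_t ring[16] = { 0 };
    int t, bad = 0;
    expand_full(data, w);
    for (t = 0; t < 80; t++)
        if (circular_w(data, ring, t) != w[t])
            bad++;
    return bad;
}

/* Usage: aborts via assert() if the two layouts ever disagree. */
static void check_block(const uint32_t data[16])
{
    assert(schedule_mismatches(data) == 0);
}
```

In this sketch, ring_operands() returns 0 for steps 16 to 18, 1 for 19 to 23, 2 for 24 to 29, 3 for 30 and 31, and 4 from step 32 on. That appears to line up with the a, b, c, d and 'normal' ROUND variants (ring elements 0 to 2, 3 to 7, and so on), though the real macros may be organized differently.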