john-dev - Major rework within dynamic

Message-ID: <20130602200825.NFLYB.224602.imail@eastrmwml214>
Date: Sun, 2 Jun 2013 20:08:25 -0400
From:  <jfoug@....net>
To: john-dev@...ts.openwall.com
Subject: Major rework within dynamic

This is posted to john-dev rather than john-users, mostly because there is little or no change to the interface of dynamic (other than a few formats that now use a different loader flag).

I have fully rewritten all of the SHA1 code within dyna.  The 'interface' used by SHA1 is the standard dyna interface used by all large formats (XX stands for the hash name):

DynamicFunc__XX_crypt_input1_append_input2
DynamicFunc__XX_crypt_input2_append_input1
DynamicFunc__XX_crypt_input1_overwrite_input1
DynamicFunc__XX_crypt_input1_overwrite_input2
DynamicFunc__XX_crypt_input2_overwrite_input1
DynamicFunc__XX_crypt_input2_overwrite_input2
DynamicFunc__XX_crypt_input1_to_output1_FINAL
DynamicFunc__XX_crypt_input2_to_output1_FINAL

SHA1 was already using this interface.  However, when built for SSE, the code would bounce back and forth into 1-limb 'mixed' SSE buffers.  This rewrite does all of the work using the flat buffer method.  Within that method, to do an SSE crypt, we first detect how long each input buffer is, 'clean' it fully, and add the 0x80 byte and the bit count at the proper locations (with null bytes wherever they are needed).  The flat buffers (MMX_COEF * SSE_SHA_PARA of them) are passed to the SHA1SSEbody function, along with a flag indicating that we are using 4x width flat buffers (dyna uses 256 byte flat buffers).  The buffer intermixing and byte swapping are done within the SSE code, just before the data is used.  The prior code used a lot of status flag variables to know when and how to switch into and out of 'mixed' buffer space, since that was the only type of buffer space our SIMD code worked with.
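To make the 'clean and pad' step concrete, here is a minimal sketch of padding one flat buffer for SHA1.  It is not the dyna code: the function name sha1_pad_flat and the constant names are made up for illustration, and only the 256 byte buffer size is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FLAT_BUF_SIZE 256   /* flat buffer size mentioned in the post */
#define SHA1_BLOCK     64   /* SHA1 processes 64 byte blocks */

/*
 * Hypothetical helper (not a dyna function): pad a message of `len` bytes
 * that sits at the start of a flat buffer.  Returns the number of 64 byte
 * blocks that the hash body would have to process.
 */
size_t sha1_pad_flat(unsigned char buf[FLAT_BUF_SIZE], size_t len)
{
    size_t blocks, total;
    uint64_t bits = (uint64_t)len * 8;
    int i;

    /* need room for the 0x80 byte plus the 8 byte bit count */
    assert(len + 9 <= FLAT_BUF_SIZE);

    blocks = (len + 9 + SHA1_BLOCK - 1) / SHA1_BLOCK;
    total = blocks * SHA1_BLOCK;

    /* 'clean' everything after the message, then add the 0x80 marker */
    memset(buf + len, 0, FLAT_BUF_SIZE - len);
    buf[len] = 0x80;

    /* big-endian bit count in the last 8 bytes of the final block */
    for (i = 0; i < 8; i++)
        buf[total - 1 - i] = (unsigned char)(bits >> (8 * i));

    return blocks;
}
```

A 3 byte input ends up as one block with 0x80 at offset 3 and the value 24 (the bit count) in the last byte of that block; anything longer than 55 bytes spills into a second block.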

The cons of the new method are:
1. dyna_26 (sha1($p), i.e. raw-sha1) is noticeably slower.  The old code used a VERY optimized key loading method, which can no longer be used.
2. MMX (the sha1-mmx.S SSE code) is no longer used.  The only SIMD code is from sse-intrinsics.c.

Pros of this change:
1. All existing dyna formats using SHA1 (except the sha1($p) types) are as fast or faster.  Some are MUCH faster (2 or 3x faster).
2. The 55 byte limitation on SSE inputs no longer applies.  Right now, dynamic limits inputs to 80 bytes, but I plan to open that up to 119 bytes (111 bytes for SHA512 when I get there).
3. SHA1 was an 'oddball' among the large formats, and had many nuances (mostly very bad ones), making it much harder to design formats that operate optimally.
4. The changes make writing a script in an optimal way very straightforward.  I had earlier reworked some of the original implementations of the "Waffle" hashes and got a 20-40% improvement in speed.  With the new layout I had to partly undo some of those changes, and again got a 15-20% improvement in speed, over and above the improvements I obtained the first time.
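The byte limits in item 2 fall out of the hash padding rules: each message needs one 0x80 byte plus a length field (8 bytes for SHA1, 16 bytes for SHA512) after the data.  The helper below is only an illustration of that arithmetic; max_input_len is a made-up name, not a dyna function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: largest message that still fits, together with the
 * 0x80 byte and the length field, into `blocks` hash blocks.
 */
size_t max_input_len(size_t block_size, size_t len_field, size_t blocks)
{
    assert(blocks > 0);
    assert(blocks * block_size > len_field);
    return blocks * block_size - 1 - len_field;
}
```

With 64 byte blocks and an 8 byte length field this gives 55 bytes for one block and 119 bytes for two; with SHA512's 128 byte block and 16 byte length field it gives 111 bytes for one block.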

My next 'plan' is to add Large Hash layout functions for the MD4/MD5 hashes, and then try to port the existing formats to use those.  Once I get to 'that' level, and if it works well enough that I can remove the intermixed SSE buffer logic FULLY, it would reduce the complexity of dynamic by 30-40% (a huge reduction).  A large part of the most complex logic is in handling when to go into and out of the MMX_COEF format buffers, and in tracking whether we are currently in or out of them.  If I can get md4/5 to perform well when done fully in flat buffers, then ALL of that logic simply goes away.

I am sure that some formats can NOT be done as optimally with the flat buffer logic as they can by building and using the intermixed buffers directly.  These will hopefully only be the 'raw' formats (md5-raw, md4-raw, md4-unicode-raw, i.e. NT).  I had originally designed dyna to focus on those timings, getting them as fast, or close to as fast, as they can be done in an optimal standalone project.  However, I am seeing that this has caused the other, more complex formats to suffer in speed, due to the additional overhead of buffer manipulation, which is FULLY done in CPU.  Switching back to flat buffers, with the new 'helper' functions I have built to do the actual crypt calls, takes a lot of that overhead away.  This is where the performance improvements are coming from.
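For readers who have not looked at the intermixed layout, here is a rough sketch of what 'intermixing' flat buffers means for a big-endian hash such as SHA1.  The names are made up, LANES is a stand-in for MMX_COEF and is assumed to be 4, and the real index math in John (including the SSE_SHA_PARA dimension) is more involved than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES            4   /* stand-in for MMX_COEF, assumed to be 4 */
#define WORDS_PER_BLOCK 16   /* one 64 byte block = 16 32-bit words */

/* Read 4 bytes as one big-endian 32-bit word (the 'byte swapping' step). */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Hypothetical helper: interleave one 64 byte block from each of LANES
 * flat buffers, so that word w of every lane sits side by side and one
 * SIMD register can hold word w of all lanes at once.
 */
void interleave_blocks(const unsigned char *flat[LANES],
                       uint32_t mixed[WORDS_PER_BLOCK * LANES])
{
    size_t w, lane;

    assert(flat != NULL && mixed != NULL);
    for (w = 0; w < WORDS_PER_BLOCK; w++)
        for (lane = 0; lane < LANES; lane++)
            mixed[w * LANES + lane] = load_be32(flat[lane] + 4 * w);
}
```

Doing this once, inside the crypt call, is the flat buffer approach.  Keeping data permanently in the mixed layout means every append or overwrite has to compute positions like w * LANES + lane for each byte, which is the buffer manipulation overhead described above.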


I am not sure that MD5/MD4 will ever be converted wholly to the current 'large-crypt' method, but I will certainly spend a little time seeing whether it is possible.

Jim.