Openwall GNU/*/Linux - a small security-enhanced Linux distro for servers
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 29 Jun 2011 23:33:01 -0500
From: "JimF" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: Re: Lukas's Status Report - #7 of 15

----- Original Message ----- 
From: "Lukasz Odzioba" <lukas.odzioba@...il.com>
> I'm not familiar with mscash2, so can't understand all you said, but
> after a week or two I'll contact you in that case.

I was not familiar with cash2 (DCC2) until recently.  The orignal code 
(current 'released' code), was pretty opaque, and I found it very hard to 
figure out what was wrong.  We had some problems with utf8->unicode 
conversions compared with iso8851->unicode, and we had some issues with the 
salt length (user name). So I spent a little time trying to figure things 
out.  Magnum and I nailed the issues in the code (after quite a bit of 
work).

Well, after I understood some of the code, I assumed it would be easy and 
nice to port to SSE, so I started again researching the algorithm, and not 
using the code as implemented.  I wanted to simplify as much as possible the 
code, so I chose to do it all using OpenSSL function calls. When doing this, 
I found that even though in the inner loop, there are 2 84 byte SHA1 
encryptions (which require 2 64 byte SHA block encryptions each), that the 
first block in each of these encryptions was exactly the same EVERY time. 
Only the trailing 20 bytes varied with each encryptions.  Thus, I found that 
if you perform the 2 1st limb encrptions, keeping the state of the SHA, and 
then restore each of these states (within the loop), and perform the 2nd 
half of the SHA crypt.

Since I have spent the time, and have a decent handle on the algorithm, I 
can explain it in simple terms.

first there are 2 variable peices of information.  The password (key), and 
the username (salt).  Both of these are converted to UTF16 unicode.

Then there are 2 primative encryption algorthms used, MD4 and SHA1.  There 
are also some intermediate steps that are the 'same' as other password 
keying algorithms.  These are the NTLM, the DCC1 (mscash1), and then 
finally, the final DCC2 (mscash2) is produced.

So the steps are:

1. convert key into unicode.
2. convert salt into unicode.
3. Perform MD4 on the unicode key.  This produces an NTLM hash value.  (16 
bytes)
4. Append the unicode salt (user name) to this NTLM hash.  This will produce 
a variable sized 'key'.   It appears the user name is maxed out at 22 
unicode characters (possibly 21).  The original code only allowed 19, but I 
believe that was too short.
5. Perform an MD4 on the data buffer from step 4.  This produces a DCC1 
(mscash) hash value.   (16 bytes).
6. Perform this logic PBKDF2(HMAC_SHA1, DCC1, salt (the unicode user name), 
10240, 16).

A good place to see what is meant by the PBKDF2(...) stuff is at: 
http://en.wikipedia.org/wiki/PBKDF2

HMAC_SHA1 means to use SHA1 internally within the hmac function.  The 
password is the 16 bytes of DCC1.  The salt is the unicode user name.  10240 
is the number of hmac iterations.  16 is the number of bytes to use (only 
the lower 16 bytes, not the full 20 bytes of the SHA1).

Here is step 6 in 'english':

produce ipad and opad. These are each 64 bytes.   They are the password ^ 
0x36363636.....  for the ipad, and the password ^ 0x5c5c5c5c.... for the 
opad.
   a. temp_hash is 20 bytes.   return_hash is 16 byte buffer.
The first iteration is different.
   a. perform SHA1 of ipad . salt . 0001 -> tmp_hash.  The 0001 is a 4 byte 
integer 1, in big endian format.
   b. perform SHA1 of opad . tmp_hash -> tmp_hash
   c. return_hash = tmp_hash    it is actually return_hash ^= tmp_hash with 
in step 0, return_hash starting out as 0.


For the remaining 10239 steps, perform this:
    a. SHA1 of ipad . temp_hash -> tmp_hash
    b. SHA1 of opad . temp_hash -> tmp_hash
    c. return_hash ^= tmp_hash

When all iterations are done, the final 'hash', is the value of return_hash.

What i noticed was that all 'first' SHA's of ipad and opad values sipmly 
sets up the SHA environment.  It is always 'constant', since it is an even 
64 bytes long.  Thus, I precompute these one time only, and use that 
information to setup the SHA and simply encrypt the 2nd buffer each time. 
Works great, and only does 1/2 the work.

Now, for SSE, I am ONLY going to add SSE code, to the internal loop 
(actually, to the 10239 iterations of the very inside 'loop'.   This should 
gain 99.99% of the 'possible' gain, but this 'intrusion' and change to get 
the SSE2 is as trivial as I can get it.   However, do to differences between 
the SSE and other runs (like 'generic' or OMP), I will likely create a 
special pbkdf2 function, that is called one time to perform work on all 
items at once (after loading them), while the existing pbkdf2 works on one 
item at a time (for allowing multi-threading to work).

To show what was done (you can also look at the code I posted earlier today, 
and see the memcpy's to see what is happening).

  prepare SSE variables
  for (loop=1; loop < 10240; ++loop) {
     restore_ipad_state
     SSE_SHA1 tmp_hash -> tmp_hash
     restore_opad_stat
     SSE_SHA1 tmp_hash -> tmp_hash
     xor results
  }

I am wondering if for the GPU code, this might also be the 'right' place to 
send the data over to the multiple GPU's to do their magic.  I am not sure 
exactly how the 'best' way to integrate into john, but remember, this 1 loop 
is 99.99% of the work being done.  The other IMHO is simply setup code for 
this loop.  Even if you speed up this loop 100x, then there is still only 1% 
of the time in the entire rest of the 'setup' code.  That is why I made a 
choice to ONLY optimize this inner loop, and pretty much ignore the rest. 
The other other optimization I can see (other than pinhole speedup's), is 
that the NTLM only needs to be performed if a new password is loaded.  The 
way john works, is it loads X number of passwords, then sets a salt, calls 
crypt_all, checks for hits, then sets the next salt, calls crypt_all, 
checks, etc, etc, until all salts have been run.  Thus, if there are many 
salts, the NTLM hash (the first MD4 of the unicode password), only needs to 
be done when new password(s) is(are) loaded.

Hopefully, this is a little more information for you, and will allow you to 
get a better feel for just what is happening under the hood of mscash2.

I do hope to get the low down on how to get SSE SHA1 working for 
multi-buffer encryptions.  Once I get that, it should not be long, until I 
have a faster working cash2 format, which has been trimmed down to about 600 
lines (which about only 150 are the 'core' encryption stuff).  I really did 
not plan on making any announcement about the code changes, until I was done 
with the format (generic, OMP and SSE), but since you posted that you may be 
starting on GPU code, I wanted to make sure it was known that we have a much 
more easy to follow format, even if only working for generic/omp.

Jim.

Powered by blists - more mailing lists

Your e-mail address:

Powered by Openwall GNU/*/Linux - Powered by OpenVZ