john-dev - was: RE: [offlist] sunmd5

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <007201cd7250$013f7ae0$03be70a0$@net>
Date: Sat, 4 Aug 2012 09:47:00 -0500
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: was: RE: [offlist] sunmd5

That is very surprising. I expected more of a gain, since we are doing 25 SSEmd5body() calls back to back, with no key setup at all, 1/2 of the time.  I really thought this would get better performance.  The entire key setup for that 25 block work, was simply copying in the first 16 bytes (the prior crypt digest), writing the 1 to 4 byte 'round' (in text), appending the 0x80, and writing 4 byte length.  That is pretty trivial on keysetup.

Also, each round, there is copying from a flat buffer, back into a MMX_COEF buffer, for each candidate. I would love to avoid this, but since you have no idea if a candidate will use the 1 block, or the 25 block crypt on any given round, I do not see any way around that issue.

The 'algorithm' I use is this:   For non-sse, I simply perform the coin flip, and then do the MD5 of the 1 buffer (last crypt + round-text-number), or the 25 buffer (last crypt + 1517 byte constant + round-text-number), and do that right away.  For SSE builds, I have an array of 1024 passwords (and working crypt buffers).  Here, I walk the entire list (all 1024), doing the coin flip for each.  I build 2 arrays, that are lists of which indexes are requiring a small (1 buffer), or large (25 buffer), md5 hash.  Then when done collecting all of that information, I have 2 loops. The first is for doing all of the small, 1 buffer crypts.  Using the prior gained knowledge, I pack a 1 limb mmx_para * mmx_coef buffer with data, and perform the MD5, and then copy back the results to the proper long term storage unit (!!!note, I spotted an optimization when writing this part of the email!!!!).  I then continue forward doing all of the rest of the 1 limb md5's.  When done with that, I move on, and do the same things for the larger (25 limb) blocks.  Picking out which candidates require this, packing them into the working array, performing the work, and putting the results back into the proper slot in the long term full set of candidates.

Possibly some of the hoped for gains, are lost, in having such a large working set for SSE.  There is an array of 25 unique MMX_PARA*MMX_COEF*64 byte buffer blocks.  That's 19.2k for MD5_PARA==5.  I wonder if there is some way to set a larger prefetch?  Possibly within the intrinsic code, it prefetches, but not 'large enough', like we need it here.

On my 'better' system, doing a real run it did improve from 102/s to  442/s (~430%).

I guess it is what it is.  Shoot for the moon.  Be happy landing on the space shuttle ;^)

I think the only gain to be found, will be in improving the coin flip logic.

As for the whole algorithm, since Solar seemed to have some inside knowledge of Moffet's work on this, I might add my analyses of this password hash, it's good points, and where it fails, and how to properly build it to make it much harder, and much more resistant to password cracking programs, and virtually eliminate ability for GPU's to be used.   In its current state, sunmd5 would not be too hard to do in GPU, and should be pretty efficient (normal expected GPU gains).

Jim.

>From: magnum [mailto:john.magnum@...hmail.com]
>
>I did some profiling:
>
>On a 5-minute run, it spends 76.56% in SSEmd5body() and 23.12% more in
>other parts of crypt_all(). The modulus stuff is ~5%. I see nothing
>obvious that can be made better.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.