john-dev - Re: fast hash early exit vs. large hash list

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150919135848.GA28734@openwall.com>
Date: Sat, 19 Sep 2015 16:58:48 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: fast hash early exit vs. large hash list

On Sat, Sep 19, 2015 at 02:33:21AM +0200, magnum wrote:
> On 18/09/15 17:25, Solar Designer wrote:
> >For raw-md5, we currently have early exit before the last 3 steps.
> >Aside from this being extremely far from what state of the art fast
> >hash crackers do in terms of steps reversal,
> 
> Can we reverse any more without considering the actual candidate (or its 
> length)?

I haven't looked into that.  I suggest you take a look at
BarsWFopensource.zip (I did not) and see what the requirements are for
what it does.  I know from other crackers' authors that it is possible
to fully reverse all the I()'s (16 steps) and a bit more at least in
some cases.  At least when cracking a single raw MD5 hash, we're looking
at 40-something steps performed per candidate (IIRC, 43 was mentioned,
although optimized crackers vary a bit in this respect).  Not our
current 61.

> I doubt the shared functions are suitable for hard-core 
> reversal. It might be better to do so in formats like the -ng ones that 
> doesn't use shared code. Or at least do them first.

Doing this in an -ng format first, then seeing what/how we can move to
shared code makes sense to me.

> >as currently implemented it
> >also hurts performance when cracking large hash lists.  Only 32 bits of
> >the hash result are preserved and the rest are recomputed with scalar
> >code in cmp_exact(), and when the target hash list is large this happens
> >quite often.  In the 29M testcase, cmp_exact() is called a few million
> >times on wrong passwords (as well as 1.7+ million on correct passwords).
> 
> >I think we should support a mode where the SIMD code would exit early,
> >but would record the entire state (all four vectors), so that cmp_*()
> >wouldn't have to recompute from scratch.  Ideally, we'd only enable this
> >mode when the number of loaded hashes is large, although this can be
> >tricky in terms of (avoiding) code complexity.
> 
> For MD4/5, I was already considering improvements like cmp_one() of 
> Alain's NT format (and introduce a shared function for doing those last 
> steps). Maybe we should have a separate function for single-round, with 
> the few branches that would be needed for that (reversed or not, full 
> output state stored or not). This would benefit iterated formats too 
> because of less branches.

BTW, here's a compiler trick to produce specialized functions without
cluttering the source code:

static MAYBE_INLINE void generic(int flag1, int flag2, int choice1)
{
	...
	if (flag1) {
		...
	}
	...
	switch (choice1) {
		...
	}
	...
	if (flag2) {
		...
	}
	...
}

static void special1(void)
{
	generic(1, 0, 5);
}

static void special2(void)
{
	generic(0, 1, 4);
}

and so on.  Where MAYBE_INLINE is our macro that tries to force function
inlining regardless of limits on function size.  Optimizing compilers
are able to substitute the constants and eliminate the branches and dead
code (that would have resulted from naive inlining with no optimization).

> For SHA-1/SHA-2 though, would we not need a good part of the 80 word 
> internal state to be able to continue? Should we store that too? That 
> ought to be slower than not reversing at all. In self-contained formats 
> like eg. SHA-256-ng we (can) have the whole state available, but with 
> the shared code we can't. I hope I'm wrong though?

I haven't looked into reversal of fast functions myself, and most of
what I heard from others was about MD4/5.  It does appear to me that
some of W[] would be needed for SHA, but then it similarly appears that
MD5's last 16 steps depend on the input being hashed, yet folks are
reversing all of those.

Regardless, there are certainly cases where we do face this tradeoff:
store more in case we have to complete computation later, or store less
and risk having to recompute from scratch (or from an earlier point).
This isn't even fast hash specific: e.g., we also face this with the
OpenMP code for bcrypt, where the 4+ KB state is optimally kept on
each thread's stack, and copying it from there in case we have to
complete computation (the 128 Blowfish encryptions that we can usually
skip) results in a measurable performance penalty at low (yet supported)
bcrypt costs.  (The current OpenMP code computes those extra 128
Blowfish encryptions all the time, to avoid the need for copying.
This is probably sub-optimal as 128*16 = 2048 Blowfish rounds got to
cost more than copying 4 KB.  I need to improve this.)

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.