Date: Sun, 21 Jun 2015 01:14:14 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: continuous feeding (for data driven branching)

On Fri, Jun 19, 2015 at 05:33:11PM +0300, Aleksey Cherepanov wrote:
> Just to document a discussion behind an idea...
> 
> When there is data-driven branching, we may want to pack intermediate
> results into several queues. For example, sunmd5 does it after its
> "coin flip".
> 
> Currently we get a lot of candidates and separate them into 2 packs,
> and, oops, the packs may be sub-optimal: for instance, we get 8
> candidates and, based on the "coin flip" (or another factor), we get
> packs with intermediate results for 3 and 5 candidates, while we would
> like to have 4 and 4 for our SSE code.
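
For illustration, here's a minimal standalone sketch of that uneven
split; coin_flip() below is a trivial stand-in, not sunmd5's actual
branch (which depends on bits of the intermediate MD5 state):

	#include <stdio.h>
	#include <string.h>

	#define NCANDS 8

	/* Trivial stand-in for the data-dependent "coin flip". */
	static int coin_flip(const char *cand)
	{
		return strlen(cand) & 1;
	}

	int main(void)
	{
		const char *cands[NCANDS] = {
			"magnum", "sunny", "opus", "molly",
			"admin", "12345", "tiger", "monkey"
		};
		const char *queue[2][NCANDS];
		int count[2] = { 0, 0 };
		int i;

		/* Separate one batch of candidates into two packs on
		 * the data-driven branch. */
		for (i = 0; i < NCANDS; i++) {
			int q = coin_flip(cands[i]);
			queue[q][count[q]++] = cands[i];
		}

		/* With 8 candidates we'd like 4 and 4 for 4-wide SIMD
		 * code, but the split is data-dependent: this input
		 * yields 3 and 5, so one SIMD call runs with idle
		 * lanes. */
		printf("pack 0: %d (first: %s)\n", count[0], queue[0][0]);
		printf("pack 1: %d (first: %s)\n", count[1], queue[1][0]);
		return 0;
	}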

This problem also exists in cases where the data-driven branching is not
in the original algorithm, but is acquired along with our optimizations
such as bitslicing.  This happens e.g. when we bitslice an algorithm
where processing varies by password length.  Once we do, we have
data-driven (actually password-length-driven) branching in our
attack-specific code.  Another example is tripcode, where descrypt salts are
generated from candidate passwords, whereas our bitslice DES
implementation assumes a fixed salt for all candidates.  Buffering and
grouping by password length or by tripcode descrypt salt or whatever is
relevant works around this issue, but with the drawback you describe.
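
A sketch of that buffering-and-grouping workaround, with made-up names
(the real set_key() path is more involved):

	#include <string.h>

	#define MAX_LEN 16
	#define BATCH 32

	/* One bucket per password length; hypothetical, not actual
	 * JtR structures. */
	static char bucket[MAX_LEN + 1][BATCH][MAX_LEN + 1];
	static int bucket_count[MAX_LEN + 1];

	/* Hypothetical bitsliced kernel handling one fixed length
	 * (or, for tripcode, one fixed descrypt salt) per call. */
	void bitslice_crypt_fixed_len(char (*keys)[MAX_LEN + 1],
	    int count, int len);

	/* Buffer a candidate and only process a bucket once it is
	 * full, so the bitsliced code always sees uniform input - at
	 * the cost of holding some candidates back. */
	static void buffer_key(const char *key)
	{
		size_t len = strlen(key);

		if (len > MAX_LEN)
			len = MAX_LEN;
		memcpy(bucket[len][bucket_count[len]], key, len);
		bucket[len][bucket_count[len]][len] = 0;

		if (++bucket_count[len] == BATCH) {
			bitslice_crypt_fixed_len(bucket[len], BATCH,
			    (int)len);
			bucket_count[len] = 0;
		}
	}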

> Continuous feeding: the format reads N candidates, separates them, and
> if it needs more input to fill up its buffers optimally, it asks for
> more.

Yes, that's an obvious idea I had some years back... possibly as far
back as 1998, when thinking of whether to bitslice bsdicrypt's key
setup.  (I ended up not bitslicing it at the time, and it still is not
bitsliced - only the main processing is.  I am still considering
changing this, though, so that we could drop the non-bitslice DES code
from the JtR tree.  We'd also need to deal with what would become a
similar issue in AFS_fmt.c - grouping by length.  Right now, that
format is not bitsliced at all for that reason, and it feels wasteful
to spend time on it when few people, if anyone at all, still need it.)

Another way to do the same is to make the multiple queues visible to the
cracking modes, which would then knowingly work on the multiple queues.
This complicates the cracking modes' implementations, but may help solve
some of the new issues such as progress and speed reporting, which could
then be 100% correct, even if too complicated for many users ("what are
all those queues I get separate reporting for?"), and crash recovery
(can have separate checkpoints for each queue, much like we do for
separate --fork'ed processes, or alternatively can record the fully
processed point only, like "single crack" mode does for its rules now).
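
Roughly, the queues a cracking mode would see could look like this
(hypothetical names; nothing like this exists in the tree):

	/* Per-queue state a cracking mode could work with directly. */
	struct cand_queue {
		int count;	/* candidates currently buffered */
		unsigned long long ckpt; /* per-queue checkpoint: last
					  * candidate index fully
					  * processed, recorded much
					  * like per-fork checkpoints */
	};

	struct queued_format {
		int nqueues;
		struct cand_queue *queues;
		/* The mode picks a queue knowingly (by coin flip
		 * outcome, by length, etc.) instead of the format
		 * hiding the split behind crypt_all(): */
		int (*enqueue)(struct queued_format *self, int q,
		    const char *key);
	};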

> Problem 1: the format's interface needs changes. The changes can be
> minimal: currently crypt_all() can hold back some candidates, use them
> on the next call, and report them as newly generated. The only change
> needed: call crypt_all() at the end with count=0 as a signal that it
> needs to flush all candidates at any cost because john is finishing.
> (As a modification: call with count=0 only if the last call had
> count == MAX_KEYS_PER_CRYPT.)
> 
> Such change was proposed by Solar Designer

You and I were discussing multiple related issues at once at the time,
and having crypt_all() hold back and then seemingly generate new
candidates was your idea.  At the time, I only pointed out one problem,
which you list above: handling the very last crypt_all() call in a
session, where nothing must be held back and the queue must be flushed.
And I proposed that workaround for it.
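
For reference, a minimal sketch of that trick with the proposed flush
signal; the queue helpers here are hypothetical:

	struct db_salt;	/* real JtR type; opaque here */

	/* Hypothetical helpers, not real JtR functions: */
	void enqueue_candidates(int count);
	int process_full_queues(struct db_salt *salt);
	int flush_all_queues(struct db_salt *salt);

	/* count == 0 is the proposed "session is ending, process
	 * everything queued, however sub-optimal the pack sizes"
	 * signal. */
	static int crypt_all(int *pcount, struct db_salt *salt)
	{
		if (*pcount == 0) {
			*pcount = flush_all_queues(salt);
			return *pcount;
		}

		enqueue_candidates(*pcount);

		/* Only queues that reached an optimal pack size are
		 * processed; the rest are held back and reported as
		 * if newly generated on a later call. */
		*pcount = process_full_queues(salt);
		return *pcount;
	}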

I've since recalled that this trick, like some other optimizations for
various things that I considered before, unfortunately breaks crash
recovery.  With the current formats interface, cracker.c and the
cracking modes will assume that crypt_all() has processed all of the
candidates it was called for, and JtR may record that in its .rec file.
Recovering that session may then effectively skip some candidates, which
were never fully processed, but were queued by crypt_all().
Unfortunately, this can't be fully solved by making that new "flush the
queue" crypt_all() call before process termination.  This would work for
intentionally aborted sessions where the user is willing to wait for
this last call to complete, but it won't work for crash recovery, nor
for impatient user aborts.

The only ways to achieve reliable crash recovery appear to be:

1. Maintain per-queue checkpoints.

-OR-

2. Record the fully processed point only.

-OR-

3. Flush all queues before each .rec file update (so e.g. every 10
minutes, and don't update the .rec file until this flushing is complete).

I don't like any of these very much.
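
For concreteness, option 3 could hook into cracker.c's existing event
handling roughly like this (event_save and rec_save() exist;
crk_flush_queues() is hypothetical):

	#include "signals.h"	/* event_save: periodic-save flag */
	#include "recovery.h"	/* rec_save(): the .rec file writer */

	/* Hypothetical: drain every held-back candidate queue. */
	extern void crk_flush_queues(void);

	/* Don't write the .rec file until all queues have been
	 * flushed, so the recorded position is always a fully
	 * processed one. */
	static void crk_maybe_save(void)
	{
		if (event_save) {
			event_save = 0;
			crk_flush_queues(); /* may take a while */
			rec_save();
		}
	}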

> and may be good for asynchronous calls too (to support FPGAs).

As to asynchronous processing, while your idea happens to be relevant to
that too, I'd rather implement separate support for async crypt_all(),
where recording the crash recovery position would be as simple as
staying one crypt_all() behind.  On my to-do list, the new format method
to achieve this is called set_block(), and it'd allow for crypt_all() to
work on different input and output block indices.  (Block is a new
concept to be introduced along with this change.)  As I had mentioned, I
am also considering using this same blocks concept for higher-level
multi-threading than what we have now.  We could then have one block
per thread and use the blocks for per-thread separation rather than
for async processing.  Or we could have two blocks total, and let our
threads complete asynchronously to key generation for the other block.
The former is more scalable; the latter allows us to keep the simpler
sequential cracking mode implementations.  With per-thread blocks, we'd
also have the crash recovery issue.

To be clear: the above paragraph is about different issues, not the
multi-queue thing that the rest of this posting is about.  I digressed.
But those other issues are subtly relevant, including in bumping into
the same kind of crash recovery problem as soon as we let our multiple
threads, rather than multiple queues within a thread, get out of sync.
As soon as anything is kept out of sync beyond at most a small constant
number of crypt_all() calls (like 1 or 2), crash recovery becomes
significantly more difficult.  And if both per-thread queues and
multiple threads are out of sync, do we have to maintain N*M separate
checkpoints or something?..  It gets tricky and/or nasty.
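
For what it's worth, a sketch of how I imagine the blocks interface;
the signature and usage below are guesses, since set_block() exists
only on my to-do list:

	struct db_salt;	/* real JtR type; opaque here */

	struct fmt_methods_sketch {
		/* Select which block subsequent set_key()/get_key()/
		 * cmp_*() calls operate on: */
		void (*set_block)(int block);
		int (*crypt_all)(int *count, struct db_salt *salt);
	};

	/* Two-blocks-total usage: keys for one block are set while
	 * the other is still being processed, so the .rec position
	 * can simply stay one crypt_all() behind. */
	void async_loop_sketch(struct fmt_methods_sketch *m,
	    struct db_salt *salt, int count)
	{
		int b = 0;

		for (;;) {
			m->set_block(b);
			/* ... set_key() the next batch into block b,
			 * then: */
			m->crypt_all(&count, salt); /* in the async
						     * model, returns
						     * before the block
						     * is done */
			/* ... collect and compare results for block
			 * 1 - b, then advance the .rec position past
			 * its candidates ... */
			b = 1 - b;
		}
	}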

> Problem 2: holding a candidate (worst case: a password) for too long:
> for instance, our queue got 1 candidate that is the password, but we
> don't hash it because 1 is not an optimal pack size, and all other
> candidates go to the other queue by coincidence, so the format waits
> until the very end to hash that 1 candidate, and the user has to wait
> long hours/days/months/years instead of 5 minutes. So the format has
> to do something to not hold candidates for more than N minutes.

It's been a very long time since I wrote this, but it seems the single.c
equivalent to this problem and its workaround is this excerpt:

	if (keys->count && rule_number - keys->rule > (key_count << 1))
		if (single_process_buffer(salt))
			return 1;

Looks like processing is forced if there are any keys queued for the
current salt (keys->count is non-zero) and it's more than two
key_count's worth of rules behind the candidate password generator.
Optimally, exactly key_count candidates would be buffered for each salt
before processing (so we'd process when keys->count == key_count), but
as you can see we are also willing to process when this optimal point
isn't reached in twice the number of rules (meaning that many rules
happened to reject their inputs, not producing any output).

Of course, the above excerpt is a workaround for a different variation
of the problem, but it's similar in concept.
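
A multi-queue analogue of that excerpt could look like this, with
hypothetical names:

	/* Force processing of a queue that has fallen too far behind
	 * the candidate stream, even though it is not optimally
	 * full. */
	struct queue_age {
		int count;		   /* candidates buffered */
		unsigned long long oldest; /* stream position when the
					    * oldest buffered candidate
					    * was queued */
	};

	static int must_flush(struct queue_age *q,
	    unsigned long long stream_pos, int key_count)
	{
		/* Like the single.c check above: any keys queued, and
		 * more than two key_count's worth of stream behind. */
		return q->count &&
		    stream_pos - q->oldest >
		    (unsigned long long)(key_count << 1);
	}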

> Problem 3: partial hashing and holding back would give inaccurate
> speed reports.

Yes, that too.

Personally, I find the problem with crash recovery (which we didn't
discuss at the time, and which you didn't list) to be the worst of the
four(?).

Alexander
