john-dev - notes on sharding the incremental search space

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120614124922.GB16784@cmpxchg8b.com>
Date: Thu, 14 Jun 2012 14:49:22 +0200
From: Tavis Ormandy <taviso@...xchg8b.com>
To: john-dev@...ts.openwall.com
Subject: notes on sharding the incremental search space

Hey, just FYI, I recently implemented parallelisation for john on an
unusual cluster architecture, and had to solve one of the problems
listed as open on the wiki. I don't know if my solution is general
enough to be helpful to anyone else, but I think my notes might at least
be useful to someone else trying to solve a similar problem in future!

Solar describes the problem here: http://www.openwall.com/lists/john-users/2005/11/21/2

The problem is how to shard the incremental search space for k-cores
optimally. Most existing parallel implementations appear to shard the
space on order entry boundaries, this is easy to do as they're natural
boundaries in the incremental algorithm. Unfortunately, as solar
explains, these are uneven and can grow very large.

The parallel external filter distributes the work between nodes
perfectly, but scales very poorly. The problem is that each node still
iterates through the search space but just discards them before the
crypt(), so there's significant work duplicated per core, making it
infeasible for more than a dozen or so cores.

Ideally, I wanted a way to split the entire space into an arbitrary number
of equal shards. For example, it's natural to want to split the seach
space over 8 cores like this:

    [  0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  ]

But this defeats the purpose of using John's optimised search order,
because this happens:

    High Quality -------- Low Quality --------- Bad Quality
    [  0  |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  ]

There are cores working on low quality candidates, while just a handful
are working on good quality pieces of the search space.

For example, a good quality guess might be "catbob12", you want this
password tested early. A bad quality guess might be "$"%%^FwA)", which
you dont want to waste any cycles testing until all the better
candidates have been exhausted.

I would like to split it like this instead, into thousands of small
chunks (many times the number of cores available):

    High Quality -------- Low Quality --------- Bad Quality
    [0|1|2|3|4| ............................... | 9999999 ]

This way each core works on the highest quality chunk available, before
moving on to the next. This can't be done on order boundaries, because
there are only ~3k of them, and even if there were more, they're so
uneven that it wouldn't make sense.

To solve this problem, I needed a way to calculate the number of crypt()
operations required to complete one order entry, and because one order
might span multiple boundaries, the state at an arbitrary point within
each order.

The core part of the incremental algorithm is inc_key_loop(), here is a
minimal version that doesn't do any work, just to simplify the problem:

// A simplified version of the inc_key_loop() algorithm from John the Ripper
static uint64_t inc_key_loop(uint32_t length, uint32_t fixed, uint32_t count)
{
    int32_t numbers_cache;
    int32_t pos = 0;
    uint64_t crypts = 0;
    uint8_t numbers[8] = {0};

    numbers[fixed]  = count;

update_ending:

    numbers_cache   = numbers[length];

update_last:

    // A crypt() operation would happen here.
    crypts++;

    pos = length;

    if (fixed < length) {
        if (++numbers_cache <= count) {
            if (length >= 2) goto update_last;
            numbers[length] = numbers_cache;
            goto update_ending;
        }
        numbers[pos--] = 0;
        while (pos > fixed) {
            if (++numbers[pos] <= count) goto update_ending;
            numbers[pos--] = 0;
        }
    }
    while (pos-- > 0) {
        if (++numbers[pos] < count) goto update_ending;
        numbers[pos] = 0;
    }

    return crypts;
}

A complete incremental search is going to require CharsetLen^NumChars
crypts(), so there are three problems to solve in order to split it into
an arbitrary number of shards:

* How do you calculate how many crypts() each { length, fixed, count }
  triplet requires, so I know which one contains the next shard
  boundary?
* What will the state be in inc_key_loop() after n crypts (where 0 < n <
  total), so that I can skip to the correct position within each entry?
* And once I can calculate the position in the search I want this node
  to search from, how do I use it?

Someone asked a similar question here, but I think they abandoned the idea:

http://www.openwall.com/lists/john-users/2011/06/29/3

This sounds like what I wanted. So, the base complexity of an order
triplet is pretty simple, it will take pow(count + 1, length) crypts()
to complete. This works when fixed=0, but any other value really
complicates the calculation. Here is my solution:

// Given a { length, fixed, count } triplet, calculate the maximum amount of
// work expected in inc_key_loop().
static uint64_t calculate_maximum_attempts(uint32_t length,
                                           uint32_t fixed,
                                           uint32_t count)
{
    int64_t result;
    int32_t pos;

    // A table of coefficients required to compensate for fixed characters in
    // JtR character files. These were found using Mathematica integer sequence
    // analsysis.
    static const int64_t kOrderCoefficients[][8] = {
        { 0,   0,   0,   0,   0,   0,   0,   0 },
        { 0,  -1,   0,   0,   0,   0,   0,   0 },
        { 0,  -2,   1,   0,   0,   0,   0,   0 },
        { 0,  -3,   3,  -1,   0,   0,   0,   0 },
        { 0,  -4,   6,  -4,   1,   0,   0,   0 },
        { 0,  -5,  10, -10,   5,  -1,   0,   0 },
        { 0,  -6,  15, -20,  15,  -6,   1,   0 },
        { 0,  -7,  21, -35,  35, -21,   7,  -1 },
    };

    // Calculate base complexity of this rule.
    result = pow(count + 1, length);

    // Compensate for fixed characters.
    for (pos = 0; pos < kMaxLength; pos++) {
        result += kOrderCoefficients[fixed][pos] * pow(count + 1, length - pos);
    }

    return result;
}

The core to allow predicting the inc_key_loop() is to understand that
numbers[] is effectively a k-digit base-n representation of the number
of crypts executed so far (approximately). Again, fixed characters
complicate this.

The fixed digits can be solved by taking the other digits that would be
displaced by the fixed position, then translating them into one base
lower than the rest of the digits (!!). Here is my solution:

// Given a { length, fixed, count } triplet, calculate the internal state of
// inc_key_loop() after the specified number of crypt operations.
static bool calculate_number_state(uint8_t  *numbers,
                                   uint32_t  length,
                                   uint32_t  fixed,
                                   uint32_t  count,
                                   uint64_t  crypts)
{
    // Compensate for the number_cache in inc_key_loop() skipping certain
    // increments of numbers[].
    if (fixed != length)
        crypts = crypts - (crypts % (count + 1));

    // Convert number of crypts into a k-digit base-n representation
    // of crypts. This closely matches how John stores it's internal state,
    // with the exception of fixed digits.
    convert_to_base(numbers, kMaxLength, count + 1, crypts);

    // Move the digits above the fixed digit down one position, because the
    // fixed digit will displace them. This is the same as partial
    // multiplication of these digits by base, but moving them is more
    // efficient than having to convert them to integers.
    memmove(&numbers[0], &numbers[1], fixed);

    // Now we can force the fixed digit into the correct position.
    numbers[fixed] = count;

    // Finally, we need to adjust the digits above the fixed position
    // independently. This is because John will never set a digit in those
    // positions to count, however other arithmetic rules still apply.
    //
    // We can interpret this algorithm behaviour as setting these digits to one
    // base lower than the rest of the digits. Yes, this is confusing.
    convert_to_base(numbers,
                    fixed,
                    count,
                    convert_from_base(numbers, fixed, count + 1));

    // Complete.
    return true;
}

With these routines you can calculate John's internal state after an
arbitrary number of crypts, so to skip to the 81365th shard, you do
this (in pseudocode):

total_search_space = CharsetLen^NumChars;
total_number_shards = 9999999;
this_shard_start = 81365 * (total_search_space / total_number_shards);

Now skip through order entries until you find the one that would pass
the this_shard_start boundary, and then call calculate_number_state() on
crypts_expected_after_this_entry - crypts_from_previous_entry_until_boundary.

Once you know the entry number and the numbers[] state, I found it
simple to generate a john.rec file for each node, and just -restore it,
immediately jumping to the correct position. I used a format like this:

  static const char kJohnSessionFormat[] =
    "REC3\n"                                   // Magic
    "4\n"                                      // Number of Arguments
    "--incremental=Whatever\n"                 // Incremental Mode
    "--external=Whatever\n"                    // External Filter
    "%s\n"                                     // Password File
    "0\n"                                      // Time         (unused)
    "0\n"                                      // Count        (unused)
    "00000000\n"                               // CryptNum Lo  (unused)
    "00000000\n"                               // CryptNum Hi  (unused)
    "0\n"                                      // Pass         (unused)
    "-1\n"                                     // Progress     (unused)
    "%08x\n"                                   // Check        (CHR Checksum)
    "%u\n"                                     // Entry        (Incremental)
    "0\n"                                      // Compat       (unused)
    "8\n"                                      // Length       (MaxPasswordLen)
    "%u\n%u\n%u\n%u\n%u\n%u\n%u\n%u\n";        // Numbers      (State)

And you can stop after a workunit has completed like this:

[List.External:Whatever]
int i;
void init () { i = 0; }
void filter () { abort = i++ > 1234454; }

I've attached a .c file demonstrating this, it expects all.chr on stdin
and then tries to skip to the final 123456th candidate of each order. I
used similar code to generate the next REC file on each node, then
collect the results when it's done (I know the FAQ says not to rely on
the REC file format, but presumably if you're reading this you're not
afraid of reading the code) :-)

If exposing this via a supported interface (e.g. the config language, or
a command line option) has a chance of making the official distribution,
let me know and I would be happy to write patch.

Hope this helps someone in future.

Tavis.

p.s. I also have a sha-1 implementation that's a little faster than the
jumbo version, would this be the right list to send that to? Is there a
jumbo cvs repo I can checkout to patch against?

-- 
-------------------------------------
taviso@...xchg8b.com | pgp encrypted mail preferred
-------------------------------------------------------

View attachment "john_inc_search.c" of type "text/plain" (9795 bytes)
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.