john-dev - Generic parsing functions -- prototype

Message-ID: <55188AEA.9010803@openwall.com>
Date: Mon, 30 Mar 2015 02:29:46 +0300
From: Alexander Cherepanov <ch3root@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Generic parsing functions -- prototype

Hi!

I've tried to create a prototype of generic parsing functions. Not 
much is implemented yet, but it's enough for the 7z format (more or 
less). It looks like this:

----------------------------------------------------------------------

#define HASH_FORMAT             "$7z$ %0-0d $ %1-24d $ %0-16d $ %16h $ " \
                                "%0-16d $ %16h $ %d $ %l $ %d $ %*h"

...

static int valid(char *ciphertext, struct fmt_main *self)
{
         return proc_valid(ciphertext, HASH_FORMAT, BIG_ENOUGH);
}

static void *get_salt(char *ciphertext)
{
         static union {
                 struct custom_salt _cs;
                 ARCH_WORD_32 dummy;
         } un;
         struct custom_salt *cs = &(un._cs);

         size_t SaltSize, ivSize, length;
         proc_extract(ciphertext, HASH_FORMAT,
                      &cs->type, &cs->NumCyclesPower,
                      IGNORE_NUM, &SaltSize, cs->salt,
                      IGNORE_NUM, &ivSize, cs->iv,
                      &cs->crc, &cs->unpacksize,
                      &length, cs->data);
         cs->SaltSize = SaltSize;
         cs->ivSize = ivSize;
         cs->length = length;

         return (void *)cs;
}

----------------------------------------------------------------------

After some tuning it should become even shorter. IMHO it's much better 
than the current approach of manual parsing.

The attached patch contains the new files parsing_plug.c/parsing.h and 
changes to 7z_fmt_plug.c. I've only checked that the self-tests pass.
I don't think it's worth committing yet, but it should be enough to 
start a discussion and to take into account while making the GSoC plans 
more precise.

Some notes.

I hope, for each john format, to have one format string describing the 
hash structure so that it's enough to validate a hash and to extract 
info from it a-la scanf (and to create a hash a-la printf if the need 
arises). Probably not for every john format, but for most of them.
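
To illustrate the idea of one format string serving both directions, 
here is a toy sketch built on the standard snprintf/sscanf pair. 
TOY_FORMAT, toy_create and toy_parse are made-up names for this 
illustration only and are not part of the patch.

```c
#include <stdio.h>

/* Hypothetical toy format: two unsigned fields separated by '$'. */
#define TOY_FORMAT "$toy$%u$%u"

/* Create a hash string a-la printf. Returns 1 if it fit into out. */
static int toy_create(char *out, size_t size, unsigned type, unsigned cost)
{
    int n = snprintf(out, size, TOY_FORMAT, type, cost);
    return n >= 0 && (size_t)n < size;
}

/* Extract the fields a-la scanf. Returns 1 if both fields were read. */
static int toy_parse(const char *hash, unsigned *type, unsigned *cost)
{
    return sscanf(hash, TOY_FORMAT, type, cost) == 2;
}
```

Note that sscanf itself is too lenient for validation: %u accepts 
leading whitespace and a sign, and trailing garbage after the last 
field goes unnoticed. That is one reason to want dedicated functions 
rather than plain scanf.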

It's also possible to expose intermediate functions (to parse a number 
etc.) but I'm not yet sure how useful that is. IMHO the fewer functions 
we expose the better.

Elements of the format string implemented so far:

- spaces are ignored;

- everything special starts with %, everything else is treated as literals;

- %d for unsigned decimal numbers (uint32_t), can have a range for 
accepted values like %1-24d. Returns the result via uint32_t *;

- %h for binary data of variable length, encoded in hex. The max length 
has to be indicated. Returns two(!) things -- actual length via size_t * 
and data via unsigned char *;

- %l for a length of the next field of variable length. Returns nothing.

All lengths are for decoded data. '*' can be used in place of any number, 
then the number is taken from the arguments a-la printf.
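
The rules above can be sketched as a small validator. This is not the 
code from the patch: read_u32 and sketch_valid are hypothetical names, 
'*' is not handled, spaces are ignored only in the format string, and 
%l is assumed to mean "the next %h must decode to exactly this many 
bytes".

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read an unsigned decimal at *s and advance *s.
 * Fails on no digits or on a value that does not fit in uint32_t. */
static int read_u32(const char **s, uint32_t *out)
{
    const char *p = *s;
    uint64_t v = 0;

    if (!isdigit((unsigned char)*p))
        return 0;
    for (; isdigit((unsigned char)*p); p++) {
        v = v * 10 + (uint64_t)(*p - '0');
        if (v > UINT32_MAX)
            return 0;
    }
    *s = p;
    *out = (uint32_t)v;
    return 1;
}

/* Hypothetical validator: returns 1 if hash matches fmt, else 0. */
static int sketch_valid(const char *hash, const char *fmt)
{
    uint32_t want_len = 0;  /* length announced by the last %l */
    int have_len = 0;

    while (*fmt) {
        uint32_t a = 0, b = 0, v;
        int have_a = 0, have_b = 0;
        size_t digits = 0;

        if (*fmt == ' ') {          /* spaces in the format are ignored */
            fmt++;
            continue;
        }
        if (*fmt != '%') {          /* literal: must match exactly */
            if (*hash++ != *fmt++)
                return 0;
            continue;
        }
        fmt++;                      /* skip '%' */
        if (isdigit((unsigned char)*fmt)) {
            if (!read_u32(&fmt, &a))
                return 0;
            have_a = 1;
            if (*fmt == '-') {
                fmt++;
                if (!read_u32(&fmt, &b))
                    return 0;
                have_b = 1;
            }
        }
        switch (*fmt++) {
        case 'd':                   /* plain %d, or a full lo-hi range */
            if (have_a != have_b)
                return 0;
            if (!read_u32(&hash, &v))
                return 0;
            if (have_a && (v < a || v > b))
                return 0;
            break;
        case 'l':                   /* remember length for the next %h */
            if (!read_u32(&hash, &want_len))
                return 0;
            have_len = 1;
            break;
        case 'h':                   /* %Nh: N is the max decoded length */
            if (!have_a || have_b)
                return 0;
            while (isxdigit((unsigned char)hash[digits]))
                digits++;
            if (digits % 2 != 0 || digits / 2 > a)
                return 0;
            if (have_len && digits / 2 != want_len)
                return 0;
            have_len = 0;
            hash += digits;
            break;
        default:                    /* unknown specifier */
            return 0;
        }
    }
    return *hash == '\0';
}
```

For example, sketch_valid("$x$5", "$x$ %1-9d") accepts, while "$x$0" 
is rejected by the range. This sketch accepts both lower- and 
upper-case hex as well as leading zeros in numbers; a real 
implementation may want to be stricter.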

Future elements:

- %% -- literal %;

- %m -- base64/mime-encoded string without padding;

- %M -- base64/mime-encoded string with padding;

- %b -- base64/crypt-encoded string without padding;

- %B -- base64/crypt-encoded string with padding;

- %s -- arbitrary string (like usernames).

There are naturally many questions:

- spaces. Do we have hashes with spaces in them?

- numbers. Should we require the range to always be indicated? Do we 
need negative numbers (they are used only in pdf hashes)?

- types. Types are probably not very convenient. The idea was that for 
numbers extracted from a hash a type of fixed size should be used, and 
for numbers like sizes size_t should be used. But in the example above 
this leads to 3 intermediate variables, which is not very nice;

- hex. Do we need variants for lower- and upper-case?

- fixed-length data. Do we have cases of fixed-length data without a 
separator after it? Or cases when there is no separator and the length 
is extracted from the hash, like this: 
$<length-of-data1>$<length-of-data2>$<data1-in-hex><data2-in-hex>? LDAP 
formats have salt+binary base64-encoded together, they should probably 
be split by hand;

- variable-length data. Do we need ranges for lengths?

- is the scanf-like approach good at all? It seems to be quite compact 
but types of arguments are not checked, and mistakes there are fatal 
and hard to debug. The mismatch between the number of specifiers and 
the number of arguments (2 for %h and 0 for %l) doesn't help either;

- are the chosen letters for specifiers good (e.g. %b vs. %m)?

- which other types of field do we need?
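
The concern about types can be shown with a toy varargs function. 
Everything here is hypothetical (toy_extract, toy_salt, the 'd'/'z' 
letters, the stored values), and the int SaltSize field is an 
assumption made only for the illustration. The point is that the 
callee trusts the spec string blindly, so the caller has to go 
through an intermediate size_t instead of passing &cs->SaltSize.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy extractor: for each 'd' in spec it stores a value
 * via uint32_t *, for each 'z' via size_t *. Nothing verifies that the
 * caller really passed pointers of those types. */
static void toy_extract(const char *spec, ...)
{
    va_list ap;

    va_start(ap, spec);
    for (; *spec; spec++) {
        if (*spec == 'd')
            *va_arg(ap, uint32_t *) = 7;    /* dummy "parsed" number */
        else if (*spec == 'z')
            *va_arg(ap, size_t *) = 16;     /* dummy "parsed" length */
    }
    va_end(ap);
}

/* Hypothetical salt struct; SaltSize is assumed to be an int. */
struct toy_salt {
    uint32_t type;
    int SaltSize;
};

static void toy_fill(struct toy_salt *cs)
{
    size_t SaltSize;    /* intermediate: &cs->SaltSize is not a size_t * */

    toy_extract("dz", &cs->type, &SaltSize);
    cs->SaltSize = (int)SaltSize;
}
```

Passing &cs->SaltSize directly would compile without any diagnostic 
for the variadic call and then write through a pointer of the wrong 
type, which is exactly the kind of mistake that is hard to debug.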

Comments?

-- 
Alexander Cherepanov

View attachment "0001-Test-generic-parsing-function-on-7z-format.patch" of type "text/x-patch" (11782 bytes)