john-dev - Re: proof of concept converter of rexgen-like syntax into john rules

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140415104307.GA18823@openwall.com>
Date: Tue, 15 Apr 2014 14:43:07 +0400
From: Aleksey Cherepanov <lyosha@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: proof of concept converter of rexgen-like syntax into
 john rules

On Mon, Apr 14, 2014 at 07:06:28PM +0200, magnum wrote:
> On 2014-04-14 18:42, Aleksey Cherepanov wrote:
> >On Sun, Mar 23, 2014 at 12:17:05PM +0100, magnum wrote:
> >>On 2014-03-22 22:13, Aleksey Cherepanov wrote:
> >>>I think rexgen-like syntax could be a part of rules and performed at
> >>>preprocessor level like []. For instance
> >>
> >>This would only work for trivial regexes, a complicated one would produce
> >>too many rules. I'd still want a pure regexp mode.
> >>
> >>Anyways, your script is something we could supply with Jumbo almost as-is. A
> >>"rule generator script" is probably what many people need, including Nima.
> >>Just add a shebang line, license blurb, brief docs and wrap "while ($rule =
> >><>)" around the lot and I think we're set.
> >
> >I am sorry for the delay.
> >
> >I've implemented all missing features except -i option.
> >
> >I did not test things well. Tests are still TODO. Also I'd like to
> >improve interface and add other whistles like named groups and
> >relative back refs.
> 
> Thanks! Committed now. I haven't had time to test it a lot yet but I believe
> this can be an excellent tool even once we got a native rexgen mode.

Thanks!

My pipeline is quite easy. I do not store expression as a tree. I
expand each rule into a list of variants that have only \0 as
active construction (also [] because they are passed into john).
Most transformations are straight forward and are not tightly
connected. The only complex part is an expansion of
alternatives/groups because there is a tree/recursion and back
references should be expanded together with alternatives to support
back references from group to itself in previous iteration of
quantifier (like in (a(?:\1|b)){2} ).

Well, this is a sequence of transformations starting with rule as a
string:

1) Split the string into tokens.

2) Transform tokens to expand aliases: ? -> {0,1} and so on.

1 and 2 are in function to_tokens(). (expand_questions() is obsolete,
I forgot to cut it.)

3) Traverse all tokens to check parity and enumerate groups. Show
errors on unbalanced parenthesis and back refs to not yet seen groups.

Ok: (a)\1, (a\1)
Error: \1(a)

3 is check_parity().

4) expand_quantifiers(): say, we have group with number 1:
(#1...){2,3}

After the expansion we have two variants:
(#1...)(#1...)
(#1...)(#1...)(#1...)

All these groups have number 1. Things are done using copying.

Then I apply join_tokens to merge some tokens into bigger strings , it
should not be done at this point because I parse tokens again and
again later. It is a flaw of implementation, and not a part of
algorithm.

5) expand_parenthesis() for each variant: expands groups/alternatives
and back refs.

This part is recursive. At the deepest level of recursion we expand
groups into a set of variant and combine sets together:
(a|b|c)(x|y|z) -> [a, b, c] * [x, y, z] -> [ax, ay, az, bx, ...]
asdf(a|b|c) -> [asdf] * [a, b, c] -> ...

Things are not that easy in practice because we add metainformation
about each group to each variant.
(a|b|c)(x|y|z) ->
  [a (group 1 is "a"),
   b (g1 is "b"),
   c (g1 is "c")]
  * [x (g2 is "x"),
     y (g2 is "y"),
     z (g2 is "z")]
  -> [ax (g1 is "a", g2 is "x"),
      ay (g1 is "a", g2 is "y"),
      ...]

Actually we use numbers of groups that they got during enumeration.

We put group's value into metainformation of a variant when variant is
fully formed.

So combine() combines sets of variants tracking metainformation and
expanding back refs: it traverses all variants from left set, for each
of them it traverses all variants from right set:
- if left variant has back refs to groups defined in metainformation
of this left variant then it is back refs to group itself we skip such
variants.
- if right variant has back refs to groups defined in metainformation
of the left variant then substitute them.
- merge metainformation: every value defined in right set is keeped,
values from left set are not defined in right set.

Skipping variants changes global var in my implementation. I think it
could be put into metainformation too. Counter is not very accurate:
(a(?:\1|b)){2} is 4 variants including 2 skipped, but counter say 1
skipped because it skips (a\1) before it could be counted as 2
variants.

At the end of the expansion I expand all back refs to defined groups
and skip variants with back refs to undefined groups. At this moment
metainformation gets wiped and we have pure list of variants again.

6) Last action: expand \0 and other \. for other chars.
expand_0_john() transforms variants into rules. Word generator would
do different thing. Also word generator should expand []. This point
is the right place to join tokens into strings.

expand_to_john() runs the whole pipeline.

BTW should -i affect \0?

-- 
Regards,
Aleksey Cherepanov
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.