john-users - Re: Wordlist Mangling Rule

Message-ID: <20101116232016.GA23967@openwall.com>
Date: Wed, 17 Nov 2010 02:20:16 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Wordlist Mangling Rule

On Sat, Nov 13, 2010 at 10:23:50AM +1300, Al Grant wrote:
> I have tried from the FAQ rule page to decrypt how the rules you have
> written work.

I'm not sure what page you refer to.  There's one documenting the rules
syntax, but it's not a FAQ:

http://www.openwall.com/john/doc/RULES.shtml

> Would you mind breaking it down? Ie [c:c] does what etc?

Let's start with a simpler line:

<B >7 [clu]

The square brackets trigger preprocessor expansion.  So this line gets
expanded into 3 separate rules:

<B >7 c
<B >7 l
<B >7 u

Each rule is individually applied to all words from your wordlist.
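The expansion step can be sketched in a few lines of Python.  This is only an illustration of the idea, not John's actual preprocessor: the names expand_class and expand_line are made up for this example, and the sketch ignores escapes and the "magic escape sequences" entirely.

```python
import itertools
import re

def expand_class(body):
    """Turn a bracket body such as 'clu' or '0-9' into a list of characters."""
    chars = []
    i = 0
    while i < len(body):
        # "a-b" denotes an inclusive character range
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars

def expand_line(line):
    """Expand every [...] in a ruleset line into all combinations."""
    parts = re.split(r"\[([^\]]*)\]", line)
    literals = parts[0::2]                       # text between the brackets
    classes = [expand_class(b) for b in parts[1::2]]
    rules = []
    for combo in itertools.product(*classes):    # rightmost class varies fastest
        out = literals[0]
        for ch, lit in zip(combo, literals[1:]):
            out += ch + lit
        rules.append(out)
    return rules

for rule in expand_line("<B >7 [clu]"):
    print(rule)
```

A line with two bracket expressions is expanded into all combinations, which is where the larger rule counts for multi-range lines come from.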

Let's look at the first one of these rules:

<B >7 c

It contains three rule commands.  Unlike separate rules (above), the
rule commands in the same rule (on the same line post-expansion) are
applied one after another - that is, the second command is applied to
the result of the first (not to the original word), the third one is
applied to the result of the second, etc.  Also, if one of the commands
rejects the input word, further commands are not used for that word;
the entire rule (one line above) produces no output for such a word.

The first command above is "<B".  The "<" character is the command code.
It is documented in doc/RULES as:

<N	reject the word unless it is less than N characters long

The "B" character corresponds to the "N" placeholder in the
documentation - that is, it is the position code.  These are also
documented in doc/RULES:

"Numeric constants may be specified and variables referred to with the
following characters:

0...9	for 0...9
A...Z	for 10...35
[...]"

According to this, "B" specifies the number 11.
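As a quick check of that mapping, here is a small Python sketch.  The function name position_value is hypothetical, and it handles only the two ranges quoted above; the other codes that doc/RULES lists (elided as "[...]" in the quote) are deliberately not handled.

```python
def position_value(code):
    """Map a single position code character to its numeric value."""
    if len(code) != 1:
        raise ValueError("expected exactly one character")
    if "0" <= code <= "9":
        return ord(code) - ord("0")        # 0...9 for 0...9
    if "A" <= code <= "Z":
        return ord(code) - ord("A") + 10   # A...Z for 10...35
    raise ValueError("code not covered by this sketch: %r" % code)

print(position_value("B"))
```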

Thus, the command "<B" will reject its input word (and not let it be
processed with further commands on the same line) "unless it is less
than 11 characters long".  In other words, it will insist that words be
no longer than 10 - that's one of the requirements you had mentioned for
words that we're not going to append digits to.

The next command is ">7".  (This one is only reached if "<B" did not
reject the word.)  Similarly, this one insists that words be no shorter
than 8 characters (8 being the smallest number that is "greater than 7").

Finally, the last command in that rule is "c".  It is documented as:

c	capitalize

Thus, the entire "<B >7 c" rule will capitalize words that are 8 to 10
characters long, but it will reject others.  The next two rules:

<B >7 l
<B >7 u

are similar, except they will "convert to lowercase" and "convert to
uppercase", respectively.
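To make the "one command after another, stop on reject" behavior concrete, here is a rough Python model of these three rules.  It is a sketch under assumptions, not John's rules engine: apply_rule and position_value are names made up for this example, commands are assumed to be separated by spaces, only the five commands discussed so far are handled, and "c" is modeled with Python's str.capitalize (first character uppercased, the rest lowercased).

```python
def position_value(code):
    """Numeric value of a position code: 0...9 and A...Z (10...35) only."""
    if "0" <= code <= "9":
        return ord(code) - ord("0")
    if "A" <= code <= "Z":
        return ord(code) - ord("A") + 10
    raise ValueError("code not covered by this sketch: %r" % code)

def apply_rule(rule, word):
    """Return the mangled word, or None if any command rejects it."""
    for cmd in rule.split():
        op = cmd[0]
        if op == "<":                       # reject unless shorter than N
            if not len(word) < position_value(cmd[1]):
                return None
        elif op == ">":                     # reject unless longer than N
            if not len(word) > position_value(cmd[1]):
                return None
        elif op == "c":
            word = word.capitalize()
        elif op == "l":
            word = word.lower()
        elif op == "u":
            word = word.upper()
        else:
            raise ValueError("command not covered by this sketch: %r" % cmd)
    return word

for candidate in ["short", "password", "elevenchars"]:
    print(candidate, "->", apply_rule("<B >7 c", candidate))
```

Only "password" (8 characters) passes both length checks here; the 5-character and 11-character words are rejected before "c" is ever reached.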

That's all for the simple line discussed so far:

<B >7 [clu]

Now let's see what the next line does:

<8 >6 [clu] $[0-9]

This one gets expanded into as many as 30 rules:

<8 >6 c $0
<8 >6 c $1
[...]
<8 >6 c $9
<8 >6 l $0
[...]
<8 >6 l $9
<8 >6 u $0
[...]
<8 >6 u $8
<8 >6 u $9

(I've omitted many of them above.)

So that's 30 rules, each consisting of 4 commands.  The first 3 of the
commands were already discussed above (although the length limits are
different now).  The fourth one appends a digit:

$X	append character X to the word

(where a specific digit is substituted for the "X" placeholder mentioned
in the documentation).

The next ruleset lines may be:

<7 >5 [clu] Az"[0-9][0-9]"
<6 >4 [clu] Az"[0-9][0-9][0-9]"
<5 >3 [clu] Az"[0-9][0-9][0-9][0-9]"

The last one of these is expanded into as many as 30,000 rules:

<5 >3 c Az"0000"
<5 >3 c Az"0001"
[...]
<5 >3 u Az"9998"
<5 >3 u Az"9999"

Each of the above rules consists of 4 commands, the first 3 of which
we've already discussed.  The fourth is:

AN"STR"	insert string STR into the word at position N

The documentation also says:

"To append a string, specify "z" for the position."

which is also documented in its proper section:

z	"infinite" position or length (beyond end of word)

So we're inserting the "string STR" beyond the end of the word - or in
other words, we're indeed appending the string.  In each of the 30,000
rules (produced for us by the preprocessor on the fly), only one
specific string to append is specified (e.g., only "0000" initially).
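The string insertion command and the rule count can both be illustrated in Python.  Again this is only a sketch: insert_string is a hypothetical helper, it accepts only "z" or a single digit for the position, and what it does with a position beyond the end of the word (Python slicing simply appends) is an assumption of this example rather than a statement about John.

```python
import itertools

def insert_string(word, pos_code, s):
    """Model of AN"STR": insert s into word at position N ("z" = end)."""
    if pos_code == "z":
        pos = len(word)          # "infinite" position, i.e. append
    else:
        pos = int(pos_code)      # sketch: digits 0...9 only
    return word[:pos] + s + word[pos:]

# 3 case commands times 10 choices for each of 4 digits
rule_count = len(list(itertools.product("clu", *["0123456789"] * 4)))

print(insert_string("pass", "z", "0000"))
print(rule_count)
```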

Now let's consider these more complicated ruleset lines:

-\r[c:c] <B >7 \p[clu]
-\r[c:c] <8 >6 \p[clu] $[0-9]
-\r[c:c] <7 >5 \p[clu] Az"[0-9][0-9]"
-\r[c:c] <6 >4 \p[clu] Az"[0-9][0-9][0-9]"
-\r[c:c] <5 >3 \p[clu] Az"[0-9][0-9][0-9][0-9]"

These differ from those we've discussed so far by the addition of
"-\r[c:c]" to the beginning and "\p" into the middle.  Let's see what
these achieve.

First, "[c:c]", with its non-escaped use of square brackets, is indeed a
preprocessor expression, much like "[clu]" and "[0-9]", which we've
discussed above.

"\r" and "\p" are "magic escape sequences" to the preprocessor.  These
are documented closer to the end of doc/RULES:

"Finally, the preprocessor supports some magic escape sequences.  These
start with a backslash and use characters that you would not normally
need to escape.

[...]

"\p" before a range to have that range processed "in parallel" with
preceding ranges

[...]

"\r" to allow the range to produce repeated characters."

Thus, this line:

-\r[c:c] <B >7 \p[clu]

is expanded into three rules:

-c <B >7 c
-: <B >7 l
-c <B >7 u

We needed "\r" because we have two instances of the "c" character in
"[c:c]" and we wanted to preserve both (without "\r", the range would
not be allowed to produce the repeated character).
We needed "\p" to have the two character lists - "[c:c]" and "[clu]" -
processed "in parallel".  In other words, we wanted only the three lines
above to be produced, not 9 lines for all combinations, which is what
we would get from the preprocessor by default (and which we relied upon
when appending digits, above).
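The difference between the default and the "in parallel" expansion is roughly the difference between a Cartesian product and a zip.  The Python sketch below models just this one line; the expand function and its parameters are invented for the example, and dropping repeated characters when "\r" is absent is inferred from the documentation quote above rather than taken from John's source.

```python
import itertools

def unique(chars):
    """Drop repeated characters, keeping first occurrences in order."""
    return list(dict.fromkeys(chars))

def expand(flag_range, cmd_range, keep_repeats, parallel):
    """Model -[flag_range] <B >7 [cmd_range] with optional \\r and \\p."""
    flags = list(flag_range) if keep_repeats else unique(flag_range)
    if parallel:
        pairs = zip(flags, cmd_range)                  # \p: walk ranges together
    else:
        pairs = itertools.product(flags, cmd_range)    # default: all combinations
    return ["-%s <B >7 %s" % (flag, cmd) for flag, cmd in pairs]

for rule in expand("c:c", "clu", keep_repeats=True, parallel=True):
    print(rule)

print(len(expand("c:c", "clu", keep_repeats=True, parallel=False)))
```

With both "\r" and "\p" modeled, the sketch yields the three rules listed above; with "\r" only, it yields all 9 combinations.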

Now, what does "-c" at the start of a rule do?  This is a "rule reject
flag", documented as:

-c	reject this rule unless current hash type is case-sensitive

Note that unlike "<B" and other "rule commands", which reject individual
input words, the "rule reject flags" reject entire rules.

Thus, if the current hash type is case-insensitive - which pretty much
means LM hashes in practice - the entire rule (which is "<B >7 c") will
be rejected.  Indeed, with a case-insensitive hash there's no point in
capitalizing words when we're going to try them as-is as well (by the
next rule).  If we did not reject the rule, then effectively duplicate
candidate passwords would be generated and hashed, thereby wasting time.

The next rule is:

-: <B >7 l

This one uses a rule reject flag too, but a dummy one:

-:	no-op: don't reject

The only reason why this rule uses the flag, and why this flag is even
supported, is to allow for our use of the preprocessor: "[c:c]" has to
expand into some flag on every line, including the line that should
never be rejected.  These flags have almost no performance cost anyway -
they're applied per-rule, not per-word.  As you can see in the log, the
rules being applied per-word have their rule reject flags, if any,
already removed from them.

Finally, we have:

-c <B >7 u

which is similar to the first one of these three rules - it is applied
to case-sensitive hashes only.
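The per-rule (rather than per-word) nature of the reject flags can be modeled as a filter that runs once over the expanded rules, before any word is processed.  This Python sketch is an illustration only: active_rules is a made-up name, it knows just the two flags discussed here, and it assumes the flag is the first space-separated token of the rule.

```python
def active_rules(rules, case_sensitive):
    """Drop rules rejected by their flag; strip flags from the survivors."""
    kept = []
    for rule in rules:
        flag, _, rest = rule.partition(" ")
        if flag == "-c":
            if not case_sensitive:
                continue             # whole rule rejected, no word ever sees it
            rule = rest
        elif flag == "-:":
            rule = rest              # no-op flag: never rejects
        kept.append(rule)
    return kept

expanded = ["-c <B >7 c", "-: <B >7 l", "-c <B >7 u"]
print(active_rules(expanded, case_sensitive=False))
print(active_rules(expanded, case_sensitive=True))
```

For a case-insensitive hash type only the lowercase rule survives, so each word is tried once instead of three times.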

As to the rest of the original ruleset lines:

-\r[c:c] <8 >6 \p[clu] $[0-9]
-\r[c:c] <7 >5 \p[clu] Az"[0-9][0-9]"
-\r[c:c] <6 >4 \p[clu] Az"[0-9][0-9][0-9]"
-\r[c:c] <5 >3 \p[clu] Az"[0-9][0-9][0-9][0-9]"

these are expanded into larger numbers of rules.  The last one of these
is expanded into 30,000 rules like:

-c <5 >3 c Az"0000"
-c <5 >3 c Az"0001"
[...]
-: <5 >3 l Az"0000"
[...]
-c <5 >3 u Az"9999"

...and we've already discussed the meaning and the rationale of the
individual rule reject flags and rule commands in use by these rules.

Whew, looks like that's all.  This is simple stuff for me, but I see how
it can be complicated for others given that explaining it takes a while.

Does this help?

Alexander