john-dev - --utf8 option, proof of concept

Message-ID: <4D7555F6.5040506@bredband.net>
Date: Mon, 07 Mar 2011 23:02:30 +0100
From: magnum <rawsmooth@...dband.net>
To: john-dev@...ts.openwall.com
Subject: --utf8 option, proof of concept

Here is a PoC of how I think we could get much better flexibility for 
all Unicode formats. The code quality is so-so, so I don't intend to put 
this on the wiki.

This patch adds the option flag "--utf8". Without this flag, John 
behaves as usual, that is, for any format internally converting to 
Unicode (most notably NT), the conversion assumes ISO-8859-1 input.

Using this flag makes John assume UTF-8 input instead. That is, you 
should feed it wordlists encoded in UTF-8, or hash files with user 
info encoded in UTF-8, for --single mode to work best.
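To make the difference concrete, here is a minimal sketch (not the code from the patch) of the two input assumptions. The function names are made up for illustration, and the UTF-8 decoder is deliberately simplified: it handles only 1- to 3-byte sequences and does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ISO-8859-1 assumption: every input byte is the code point itself.
 * Returns the number of UTF-16 code units written. */
size_t latin1_to_utf16(const unsigned char *src, size_t len,
                       uint16_t *dst, size_t max)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (n < len && n < max) {
        dst[n] = src[n];
        n++;
    }
    return n;
}

/* UTF-8 assumption, BMP only (1- to 3-byte sequences).
 * Stops at the first malformed or 4-byte sequence.
 * Returns the number of UTF-16 code units written. */
size_t utf8_to_utf16(const unsigned char *src, size_t len,
                     uint16_t *dst, size_t max)
{
    size_t i = 0, n = 0;

    assert(src != NULL && dst != NULL);
    while (i < len && n < max) {
        unsigned char c = src[i];

        if (c < 0x80) {
            dst[n++] = c;
            i += 1;
        } else if ((c & 0xE0) == 0xC0 && i + 1 < len &&
                   (src[i + 1] & 0xC0) == 0x80) {
            dst[n++] = (uint16_t)(((c & 0x1F) << 6) | (src[i + 1] & 0x3F));
            i += 2;
        } else if ((c & 0xF0) == 0xE0 && i + 2 < len &&
                   (src[i + 1] & 0xC0) == 0x80 &&
                   (src[i + 2] & 0xC0) == 0x80) {
            dst[n++] = (uint16_t)(((c & 0x0F) << 12) |
                                  ((src[i + 1] & 0x3F) << 6) |
                                  (src[i + 2] & 0x3F));
            i += 3;
        } else {
            break;
        }
    }
    return n;
}
```

Fed the two bytes C3 BC, the UTF-8 path yields the single code unit U+00FC (ü), while the ISO-8859-1 path yields two code units, U+00C3 and U+00BC, which hash to something else entirely.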

The base for the UTF-8 conversion is ConvertUTF.[ch] from Unicode Inc. 
I have stripped and modified it a lot, but the original files are 
included too. It's apparently free for us to use.

There are bound to be hideous bugs in the code and it probably does not 
even compile properly for all targets. It works fine on Linux-x86-64. 
It's primarily meant for experimenting with the pros and cons of this 
idea, not for real use. Or maybe try out how many new passwords 
you can crack using "--single --utf8" on your favourite raw-md5 dataset. 
Parts of the code are barely working and optimisations can be made for sure.

Supported formats right now are NT, the various NET*LM* formats, 
mschapv2 and both mscash formats. Plus there is a separate 
raw-md5-unicode mode included that does unicode($p) and supports the 
--utf8 flag. This format is based on the old "thick" format so it 
performs at half the speed of the latest'n'greatest. My suggestions for 
md5_gen come from this.

Formats not yet fixed are mssql and probably a couple more.

Other ideas on my to-do list unless someone talks me out of this:
* A couple of new reject rules, maybe -u for rejecting a rule unless the 
--utf8 flag is used, and -U for the opposite.
* Maybe even a few utf-8 aware word rules. It's not that complicated.
* I can see a need for a new format property flag, FMT_UNICODE, that 
tells that this format uses Unicode internally. In particular, a format 
that does not yet support --utf8 should bail out if you try.

Some problems:
* I currently have all 8-bit test strings commented out from the code, 
as they need to be different when the --utf8 flag is used. For example,

{"$NT$8bd6e4fb88e01009818749c5443ea712", "\xC3\xBC"}, // ü, UTF-8
{"$NT$8bd6e4fb88e01009818749c5443ea712", "\xFC"},     // ü in 8859-1
{"$NT$cc1260adb6985ca749f150c7e0b22063", "\xFC\xFC"}, // Two of them

If I leave the first one uncommented, I can build and test the --utf8 
mode, but the normal mode will fail. And vice versa. This is a big 
problem when e.g. trying to optimise the conversions. I normally work 
with all three lines commented out, so sometimes bugs are not discovered 
immediately. Somehow one would want the correct line to be picked at 
runtime but they are all constants. I'm not sure what's the best way to 
achieve this. Maybe an optional third string for utf8? Or a separate struct.
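The "optional third string" idea could look roughly like the sketch below. The struct and function names are hypothetical and are not John's existing test struct; the UTF-8 bytes for the two-character vector are simply the ISO-8859-1 plaintext re-encoded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test-vector layout with an optional UTF-8 plaintext. */
struct utf8_test {
    const char *ciphertext;
    const char *plaintext;      /* bytes used in the default (8859-1) mode */
    const char *plaintext_utf8; /* bytes used with --utf8, or NULL if same */
};

static const struct utf8_test tests[] = {
    {"$NT$8bd6e4fb88e01009818749c5443ea712", "\xFC", "\xC3\xBC"},
    {"$NT$cc1260adb6985ca749f150c7e0b22063", "\xFC\xFC", "\xC3\xBC\xC3\xBC"},
    {NULL, NULL, NULL}
};

/* Pick the plaintext for the current mode at runtime. Pure 7-bit
 * vectors leave plaintext_utf8 as NULL and work in both modes. */
const char *test_plaintext(const struct utf8_test *t, int utf8_mode)
{
    assert(t != NULL);
    if (utf8_mode && t->plaintext_utf8 != NULL)
        return t->plaintext_utf8;
    return t->plaintext;
}
```

The table stays constant; only the selection happens at runtime, so both modes can be self-tested from the same build.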

* In a couple of formats (e.g. NT) I have made two versions of the 
set_salt function and call the right one via a pointer, in order to 
mitigate the performance hit for non-utf8. I'm not sure how to do it 
better, but I'm not particularly satisfied. It's a hack.
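For readers who have not looked at the patch, the pointer trick is along these lines. This is only an illustration with made-up names (store_latin1, store_utf8, select_encoding), not the actual format code, and the UTF-8 variant here only handles 1- and 2-byte sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_UNITS 64

static uint16_t key_buf[MAX_UNITS]; /* UTF-16 code units to be hashed */
static size_t key_len;

/* Fast path: ISO-8859-1 bytes widen directly to 16 bits. */
static void store_latin1(const unsigned char *s, size_t len)
{
    key_len = 0;
    while (key_len < len && key_len < MAX_UNITS) {
        key_buf[key_len] = s[key_len];
        key_len++;
    }
}

/* Slower path: decode UTF-8 (1- and 2-byte sequences only in this
 * sketch), stopping at anything else. */
static void store_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    key_len = 0;
    while (i < len && key_len < MAX_UNITS) {
        if (s[i] < 0x80) {
            key_buf[key_len++] = s[i++];
        } else if ((s[i] & 0xE0) == 0xC0 && i + 1 < len &&
                   (s[i + 1] & 0xC0) == 0x80) {
            key_buf[key_len++] =
                (uint16_t)(((s[i] & 0x1F) << 6) | (s[i + 1] & 0x3F));
            i += 2;
        } else {
            break;
        }
    }
}

/* Callers always go through this pointer. */
static void (*store_input)(const unsigned char *, size_t) = store_latin1;

/* Called once at startup, so the hot loop never tests the flag. */
void select_encoding(int utf8_mode)
{
    store_input = utf8_mode ? store_utf8 : store_latin1;
    assert(store_input != NULL);
}
```

The cost for the default mode is one indirect call instead of a per-candidate branch on the flag inside the conversion loop.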

Just try it out and flame or praise. One test I did was to pick out all 
lines in the Rockyou dataset that have 8-bit characters. Most of these 
are UTF-8 but not all. Then I created fake NT and raw-md5-unicode 
password files from them and tried to crack them with and without 
--utf8. Works like a charm.
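If you want to repeat that kind of test, a filter like the following sketch can pick out the 8-bit lines and tell which of them are well-formed UTF-8. It is not part of the patch, and the function names are my own for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if any byte has the high bit set. */
int has_8bit(const unsigned char *s, size_t len)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < len; i++)
        if (s[i] & 0x80)
            return 1;
    return 0;
}

/* Returns 1 if the buffer is well-formed UTF-8: correct continuation
 * bytes, no overlong forms, no surrogates, nothing above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < len) {
        unsigned char c = s[i];
        size_t need, k;
        uint32_t cp, min;

        if (c < 0x80) {
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0; /* stray continuation byte or invalid lead byte */
        }
        if (len - i - 1 < need)
            return 0; /* sequence truncated at end of buffer */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += need + 1;
    }
    return 1;
}
```

Lines where has_8bit() is true and is_valid_utf8() is false are the ones that are most likely ISO-8859-1 or some other legacy encoding.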

magnum

Download attachment "john-1.7.6-jumbo12-utf8.diff.gz" of type "application/x-gzip" (14533 bytes)