Date: Fri, 2 Apr 2010 01:20:08 +0400
From: Solar Designer <>
Subject: Re: rule and encoding wordlist

On Wed, Mar 31, 2010 at 09:23:47PM +0200, wrote:
> I use JTR 1.7.5 with latest patches, os X, terminal is UTF-8.
>  With following rule (below) and a wordlist (1 word "tro") encoded 
> Western (Windows Latin 1) , end of line Windows (CRLF)
>  >\r[00-9A-C] A\p0[0-9A-D],him, $1
>  I get 
> iMac-de-xxx-xx:run xxxxx$ ./john -w:testmot.txt -rules -stdout
> himtro1
> thimro1
> trhimo1
> trohim1
> words: 4  time: 0:00:00:00 100.00% (ETA: Wed Mar 31 21:10:40

Looks good.  However, if you actually have any 8-bit character of the
iso-8859-1 encoding in a wordlist entry, then it may/will be displayed
improperly on your UTF-8 terminal, and indeed it will be tested in the
iso-8859-1 encoding against your hashes (which may or may not be what
you want).

> With the same rule but this time my wordlist is unicode UTF-8 , end of 
> line Windows (CRLF) or Unix (LF)
>  I get :
> john -w:testmot.txt -rules -stdout
> himtro1
> ?him??tro1
> ?him?tro1
> himtro1
> thimro1
> trhimo1
> trohim1
> words: 7  time: 0:00:00:00 100.00% (ETA: Wed Mar 31 21:12:24 2010) 
>  As you can see some unwanted "?" are now included in the word 
> generated.

Did you mean the question mark character, or did some other character
get replaced by a question mark when you sent your message?

Anyhow, JtR does not support multi-byte characters (in fact, it is
unaware of what character encoding your wordlist is in).  Most of the
time, this is not a problem for UTF-8, because JtR will simply pass any
UTF-8 characters from a wordlist into its password hashing routines
verbatim (treating the multi-byte characters as multiple single-byte
characters, which works just fine).  However, when you try to insert
strings at arbitrary character positions, you have a problem.  The rule
you have mentioned may try inserting the string "him" in between
individual bytes of a single multi-byte UTF-8 character, thereby
breaking that character.
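
To make this concrete, here is a minimal Python sketch (not JtR code;
the helper names insert_at_byte and is_valid_utf8 are made up for
illustration).  It assumes a sample word "né", whose last character is
two bytes long in UTF-8, and inserts "him" at every byte offset the way
a byte-oriented rules engine would:

```python
def insert_at_byte(word: bytes, pos: int, s: bytes) -> bytes:
    """Insert s at byte offset pos, ignoring character boundaries."""
    return word[:pos] + s + word[pos:]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "né" is 2 characters but 3 bytes in UTF-8: b'n\xc3\xa9'
word = "né".encode("utf-8")

# Map each byte offset to whether the result is still valid UTF-8.
results = {
    pos: is_valid_utf8(insert_at_byte(word, pos, b"him"))
    for pos in range(len(word) + 1)
}

for pos, ok in results.items():
    print(pos, insert_at_byte(word, pos, b"him"), "ok" if ok else "BROKEN")
```

Only offset 2 falls between the two bytes of "é", so that is the one
candidate that stops being valid UTF-8; the other offsets coincide with
character boundaries.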

In your sample JtR output above, the input word appears to be treated as
being 6 single-byte characters long.  Since it does not appear to
actually use any multi-byte characters, a guess is that your wordlist
file starts with the BOM character, which is 3 bytes in UTF-8.  It is
this character that gets broken by having "him" inserted in between its
bytes.  This also results in the extra output lines.
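
The guess above can be checked with a rough Python model of what the
rule appears to do for short words: insert "him" at each byte offset
from 0 to the word length, then append "1".  This is only a sketch of
the observed behaviour, not of the rules engine; the function name
candidates is hypothetical, and the model ignores the upper limit on
positions that the rule's ranges impose.

```python
import codecs


def candidates(word: bytes, insert: bytes = b"him", suffix: bytes = b"1") -> list:
    """One candidate per byte offset 0..len(word), with suffix appended."""
    return [word[:p] + insert + word[p:] + suffix for p in range(len(word) + 1)]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same word without and with a leading UTF-8 BOM (EF BB BF).
plain = candidates(b"tro")
with_bom = candidates(codecs.BOM_UTF8 + b"tro")

# Candidates where "him" landed inside the 3-byte BOM.
broken = [c for c in with_bom if not is_valid_utf8(c)]

print(len(plain), len(with_bom), len(broken))
for c in with_bom:
    print(c)
```

Under these assumptions the plain word yields 4 candidates and the
BOM-prefixed word yields 7, of which 2 have the BOM split.  This is
consistent with the "words: 4" and "words: 7" counts quoted above and
with the two lines containing question marks.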

>  I use very large wordlist (up to 60 gigas) , I can't reencode them.

You also posted another message, stating that you "reencoded file with
Unicode (UTF-8, no BOM)".  That's
nice, however please be aware that this won't save you from having other
multi-byte UTF-8 characters broken and extra output lines produced by
the rule you mentioned.  You might not care, though, because hopefully
there are relatively few wordlist entries with multi-byte UTF-8
characters, and the only impact is having JtR try extra candidate
passwords (it will also generate and try all the correct/intended ones).

A way to avoid this problem would be to have your wordlist in the
iso-8859-1 encoding (which uses single-byte 8-bit characters) and to
have an external filter() convert from iso-8859-1 to UTF-8.  The
filter() would be applied after the rules, thereby avoiding the problem
with the lack of UTF-8 support in the rules engine, yet probing UTF-8
encoded strings against your hashes.  Someone should write such an
external filter().
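
The conversion such a filter would need is simple, because every
iso-8859-1 byte value equals its Unicode code point.  Below is a sketch
of the byte-level logic in Python, not an actual external filter();
the function name latin1_to_utf8 is made up for illustration.  It
assumes true iso-8859-1 input; Windows-1252 differs in the 0x80-0x9F
range and would need a lookup table for those bytes.

```python
def latin1_to_utf8(word: bytes) -> bytes:
    """Convert an iso-8859-1 byte string to UTF-8, one byte at a time."""
    out = bytearray()
    for b in word:
        if b < 0x80:
            # 7-bit ASCII is encoded identically in UTF-8.
            out.append(b)
        else:
            # Code points 0x80..0xFF become two bytes: 110000xx 10xxxxxx.
            out.append(0xC0 | (b >> 6))
            out.append(0x80 | (b & 0x3F))
    return bytes(out)


# A rule has already inserted "him" before the single-byte 0xE9 ("é"),
# so the insertion cannot split a character.
after_rules = b"tr" + b"him" + b"\xe9" + b"1"
converted = latin1_to_utf8(after_rules)
print(converted)
```

Note that each 8-bit character grows to two bytes, so the converted
candidate may be up to twice as long as the word the rules produced.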


P.S.  I'd like to remind you that whenever you need to post a follow-up
to your own posting, you need to "reply" to that posting (either as you
received it via the list or as you sent it - you do save sent messages
in a "Sent" folder, don't you?).  Don't post such follow-ups to the list
anew.  Doing so starts a new thread in the list archives, which is what
happened this time (you started two separate threads for the same topic).
