john-dev - Re: Adding multi-byte support into more parts of JtR.

Message-ID: <863988bd292a7b5db4ec57c7e7336388@smtp.hushmail.com>
Date: Sun, 09 Aug 2015 23:48:22 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Adding multi-byte support into more parts of JtR.

On 2015-08-09 18:22, jfoug@....net wrote:
> I have added 2 new issues (there might be more added soon) about encodings.
>
> These are NOT going to be trivial (one may not be bad).  These are to add multi-byte support into rules and into incremental.
>
> https://github.com/magnumripper/JohnTheRipper/issues/1628
> https://github.com/magnumripper/JohnTheRipper/issues/1627
>
> Incremental as it stands does a limited job of outputting UTF-8 if every UTF-8 character is 3 bytes or less.  inc is not treating it as 1 character, but as an 'often seen' group of bytes.  This did allow incremental mode to be somewhat useful.  However, if we can get it to work with larger character sets, then it would work properly.  This may actually not be all that hard to do, and hopefully it will only slow things down a bit, and only with chr sets built for data with > 8-bit characters.  Then during the run, inc would work with 8-bit or 16-bit (or 32-bit?) characters, and if the characters are 16-bit and the wanted encoding is UTF-8, inc would convert the wide-char word into UTF-8 upon completion of the 'word'.  This actually should be somewhat straightforward.

I guess if we do this at all, we should go with UTF-32.

I too think this is basically trivial. Just change "all" char[] and 
char * in inc.h and inc.c to UTF32 (which is a typedef in unicode.h) and 
then fix everything that broke ;-)
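
As a rough illustration of the "convert on completion of the word" step 
from the quoted text, here is a minimal sketch of a UTF-32 to UTF-8 
encoder. The function name is made up for this example and is not the 
API in unicode.c, and plain uint32_t is used instead of the UTF32 typedef 
so the snippet stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not JtR's unicode.c API): encode a zero-terminated
 * UTF-32 candidate as zero-terminated UTF-8.
 * Returns the number of bytes written, excluding the terminator, or
 * (size_t)-1 if a code point is invalid or the buffer is too small.
 */
size_t utf32_to_utf8_sketch(const uint32_t *src, unsigned char *dst,
                            size_t dst_size)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    if (dst_size == 0)
        return (size_t)-1;

    for (; *src; src++) {
        uint32_t c = *src;
        size_t len = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;

        /* Reject surrogates and anything beyond U+10FFFF. */
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return (size_t)-1;
        /* Always keep one byte free for the terminator. */
        if (n + len + 1 > dst_size)
            return (size_t)-1;

        switch (len) {
        case 1:
            dst[n++] = (unsigned char)c;
            break;
        case 2:
            dst[n++] = (unsigned char)(0xC0 | (c >> 6));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
            break;
        case 3:
            dst[n++] = (unsigned char)(0xE0 | (c >> 12));
            dst[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
            break;
        default:
            dst[n++] = (unsigned char)(0xF0 | (c >> 18));
            dst[n++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
            break;
        }
    }
    dst[n] = 0;
    return n;
}
```

With something like this, the generator itself only ever sees one array 
element per character, and the variable-length encoding is confined to 
a single call once the candidate is complete.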

However, what if the format that is going to receive this key is a 
Unicode one (eg. NT)? Then we'll have two conversions inc->utf8->utf16 
slowing things down for no good reason, instead of just using the key 
that was generated. Perhaps there should be a new format flag 
FMT_SET_KEY_32. If that is set, set_key() expects the input to be UTF-32 
and incremental mode doesn't need to convert it.
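
To make the idea concrete, below is a rough sketch of what the receiving 
side could look like for an NT-like format. Everything here is 
hypothetical: FMT_SET_KEY_32 does not exist yet so its value is invented, 
struct fmt_sketch is not the real format struct, and the function does 
not mirror the real set_key() signature. It only shows that the UTF-32 
key can go to UTF-16 in one step, with no UTF-8 in between.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FMT_SET_KEY_32  0x00000001u  /* hypothetical flag, value invented */
#define SKETCH_MAX_KEY  27           /* arbitrary limit for this sketch */

/* Hypothetical stand-in for a format's state, not JtR's struct. */
struct fmt_sketch {
    unsigned int flags;
    uint16_t key[SKETCH_MAX_KEY + 1];  /* UTF-16 code units */
    size_t key_len;                    /* length in code units */
};

/*
 * Sketch of a set_key() that receives UTF-32 directly and stores UTF-16.
 * Returns 0 on success, -1 on an invalid code point or an over-long key.
 */
int set_key32_sketch(struct fmt_sketch *fmt, const uint32_t *key)
{
    size_t n = 0;

    /* The caller may only pass UTF-32 if the format asked for it. */
    assert(fmt->flags & FMT_SET_KEY_32);

    for (; *key; key++) {
        uint32_t c = *key;

        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return -1;

        if (c < 0x10000) {
            if (n + 1 > SKETCH_MAX_KEY)
                return -1;
            fmt->key[n++] = (uint16_t)c;
        } else {
            /* Outside the BMP: emit a surrogate pair. */
            if (n + 2 > SKETCH_MAX_KEY)
                return -1;
            c -= 0x10000;
            fmt->key[n++] = (uint16_t)(0xD800 | (c >> 10));
            fmt->key[n++] = (uint16_t)(0xDC00 | (c & 0x3FF));
        }
    }
    fmt->key[n] = 0;
    fmt->key_len = n;
    return 0;
}
```

A format without the flag would keep receiving a char * key in the 
target encoding, as today.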

>
> The rules is going to be a much larger undertaking, since the rules will have to work WITH the multi-byte encoded string (the actual utf-8 string). This means that everything dealing with characters can no longer assume a character is 8 bits.  This is a much larger undertaking, BUT it is something that is sorely needed.  When we added code page support into rules, we got part way there.  That is because the CP's all are 8bit (not sure all, but I think so).  Thus, rules can work like it does.  We did have some nuances to handle for things like handling digits, upper case, lower case, symbols, etc.  But the 'core' of rules was maintained.  For this change, it may actually be better to build a separate rules, unless we can hook things in without impact to the existing 8 bit logic (performance wise).

I think you got confused here. Forget about UTF-8, that encoding is for 
disk files and not for strings in memory that are processed.

I think wordlist.c and rules.c simply need to be UTF-32 internally, just 
like incremental mode.
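
A small sketch of why this helps, using a "reverse" rule as the example. 
The helper names are invented for illustration and are not what rules.c 
or unicode.c provide. The decoder is deliberately minimal: it checks 
continuation bytes but does not reject overlong forms or surrogates. 
Once the word is UTF-32, the rule is the same plain array loop it would 
be for 8-bit characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode zero-terminated UTF-8 into zero-terminated
 * UTF-32. dst_len is the capacity of dst in elements.
 * Returns the number of code points, or (size_t)-1 on malformed input
 * or insufficient space.
 */
size_t utf8_to_utf32_sketch(const unsigned char *src, uint32_t *dst,
                            size_t dst_len)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    if (dst_len == 0)
        return (size_t)-1;

    while (*src) {
        uint32_t c = *src++;
        int extra;

        if (c < 0x80) {
            extra = 0;
        } else if ((c & 0xE0) == 0xC0) {
            c &= 0x1F; extra = 1;
        } else if ((c & 0xF0) == 0xE0) {
            c &= 0x0F; extra = 2;
        } else if ((c & 0xF8) == 0xF0) {
            c &= 0x07; extra = 3;
        } else {
            return (size_t)-1;  /* stray continuation or invalid lead */
        }

        while (extra--) {
            /* Also stops at a premature terminator, so no overread. */
            if ((*src & 0xC0) != 0x80)
                return (size_t)-1;
            c = (c << 6) | (uint32_t)(*src++ & 0x3F);
        }

        if (n + 1 >= dst_len)  /* keep room for the terminator */
            return (size_t)-1;
        dst[n++] = c;
    }
    dst[n] = 0;
    return n;
}

/* A "reverse" rule on code points: one element per character. */
void rule_reverse_sketch(uint32_t *word, size_t len)
{
    size_t i;

    for (i = 0; i < len / 2; i++) {
        uint32_t tmp = word[i];
        word[i] = word[len - 1 - i];
        word[len - 1 - i] = tmp;
    }
}
```

Doing the same reversal on the UTF-8 bytes would put continuation bytes 
in front of their lead bytes and produce an invalid string.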

>
> NOTE, we also need to provide better support in the MASK mode, and make sure that rexgen also properly supports multi-byte. Internally rexgen does, we just need to make sure it is properly working. Markov mode may also need changes, I am not sure.

Mask and Markov should be rewritten the same way as Incremental. 
Actually more or less all of JtR core and formats should use UTF32 strings.

>
> This may be a big enough task for a GSOC16 ;)

Yes, perhaps. Actually none of this is very tricky. We just need to 
decide at what point a wordlist entry gets converted to UTF-32, and at 
what point it gets converted back to some target encoding. Everything in 
between those points will be simpler than today (no quirks needed, no 
internal encoding needed) and probably not much slower. And all 
non-wordlist modes would be UTF-32 from the ground up.

magnum