john-dev - Adding multi-byte support into more parts of JtR.


Message-ID: <20150809122246.0VXCM.5190.imail@eastrmwml302>
Date: Sun, 9 Aug 2015 12:22:46 -0400
From:  <jfoug@....net>
To: "john-dev@...ts.openwall.com" <john-dev@...ts.openwall.com>
Subject: Adding multi-byte support into more parts of JtR.

I have added 2 new issues (there might be more added soon) about encodings. 

These are NOT going to be trivial (though one may not be too bad).  They cover adding multi-byte support to rules and to incremental mode.

https://github.com/magnumripper/JohnTheRipper/issues/1628 
https://github.com/magnumripper/JohnTheRipper/issues/1627 

Incremental as it stands 'does' do a limited job of outputting UTF-8, provided every UTF-8 character is 3 bytes or less.  Inc is not treating such a sequence as 1 character; it is treating it as an 'often seen' group of bytes.  That has allowed incremental mode to be somewhat useful.  However, if we can get it to work with larger character sets, then it would handle these characters 'properly'.  This may actually not be all that hard to do, and hopefully it will only slow things down a bit, and only with chr sets built from data containing characters wider than 8 bits.  During the run, inc would work with 8-bit or 16-bit (or 32-bit?) characters.  If the characters are 16-bit and the encoding wanted is UTF-8, then upon completion of the 'word', inc would convert that wide-char word into UTF-8.  This should be fairly straightforward.
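
To make that last conversion step concrete, here is a minimal sketch of turning a finished wide-char word into UTF-8. It is only an illustration under stated assumptions: the candidate is held as an array of 32-bit code points, and the names cp_to_utf8() and word_to_utf8() are hypothetical, not existing JtR functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a JtR function): encode one code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is not encodable. */
static size_t cp_to_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* lone surrogates are not valid characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* beyond the Unicode range */
}

/* Hypothetical helper: convert a completed word of n code points into a
 * NUL-terminated UTF-8 string. Returns the length in bytes (without the
 * NUL), or (size_t)-1 if a code point is invalid or the buffer is too small. */
static size_t word_to_utf8(const uint32_t *word, size_t n,
                           char *out, size_t out_size)
{
    size_t used = 0;

    assert(word != NULL && out != NULL);
    if (out_size == 0)
        return (size_t)-1;

    for (size_t i = 0; i < n; i++) {
        unsigned char tmp[4];
        size_t len = cp_to_utf8(word[i], tmp);

        /* Always keep one byte free for the terminating NUL. */
        if (len == 0 || used + len + 1 > out_size)
            return (size_t)-1;
        memcpy(out + used, tmp, len);
        used += len;
    }
    out[used] = '\0';
    return used;
}
```

The point of the sketch is that the generator itself never sees bytes: it picks whole characters, and the variable-length encoding only happens once per finished word, so the 8-bit path would not need to pay for it.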

The rules are going to be a much larger undertaking, since they will have to work WITH the multi-byte encoded string (the actual UTF-8 string).  This means that everything dealing with characters can no longer assume a character is 8 bits.  It is a lot of work, BUT it is something that is sorely needed.  When we added code page support into rules, we got part way there.  That worked because the CPs are all 8-bit (not sure about all of them, but I think so), so rules could keep working the way they do.  We did have some nuances to handle, such as classifying digits, upper case, lower case, symbols, etc., but the 'core' of rules was maintained.  For this change, it may actually be better to build a separate rules engine, unless we can hook things in without a performance impact on the existing 8-bit logic.
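
As an illustration of why the 8-bit assumption breaks, consider a "reverse the word" rule. Reversing byte by byte would scramble any multi-byte character, so the rule has to move whole characters. The following is a rough sketch only; utf8_char_len() and utf8_reverse() are hypothetical names and do not reflect how the JtR rules engine is actually structured.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length in bytes of the UTF-8 character that starts
 * with this lead byte, or 0 if the byte cannot start a character. */
static size_t utf8_char_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 0; /* continuation byte or invalid lead byte */
}

/* Hypothetical rule body: reverse a UTF-8 string character by character.
 * 'out' must have room for strlen(in) + 1 bytes and must not alias 'in'.
 * Returns 0 on success, -1 if the input is malformed (the contents of
 * 'out' are then unspecified). */
static int utf8_reverse(const char *in, char *out)
{
    size_t len = strlen(in);
    size_t pos = 0;   /* read position in 'in' */
    size_t dst = len; /* write position in 'out', moving backwards */

    out[len] = '\0';
    while (pos < len) {
        size_t n = utf8_char_len((unsigned char)in[pos]);

        if (n == 0 || pos + n > len)
            return -1;
        /* Every byte after the lead byte must be a continuation byte. */
        for (size_t i = 1; i < n; i++)
            if (((unsigned char)in[pos + i] & 0xC0) != 0x80)
                return -1;

        /* Copy the whole character, keeping its bytes in order. */
        dst -= n;
        memcpy(out + dst, in + pos, n);
        pos += n;
    }
    return 0;
}
```

Every rule that indexes, inserts, deletes or classifies characters would need the same kind of treatment, which is why the extra scanning is a concern for the existing 8-bit fast path.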

NOTE: we also need to provide better support in MASK mode, and make sure that rexgen also properly supports multi-byte.  Internally rexgen does; we just need to make sure it is working properly.  Markov mode may also need changes, I am not sure.

This may be a big enough task for a GSOC16 ;)
