john-dev - Character encoding 'how-to' and patch 0009

Message-ID: <DE16BB1A47E54B92B07D410B9E8F676A@D9VGLK61>
Date: Mon, 25 Jul 2011 09:26:42 -0500
From: "JimF" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: Character encoding 'how-to' and patch 0009

Magnum has made a patch and posted it to the wiki.  It disables koi8-r and 
cp1251 (or any encoding other than 8859-1/utf8) for any format which uses 
Unicode internally.  This is the right patch for now.  However, I am going 
to get the other encodings working properly, so that it can be undone.

Which brings me to the point of this email.

Here are the steps we need to take when adding a new code page encoding to 
john.  I am listing them here and now because there are other code pages 
which we may want to add in the near or not so near future.  When I added 
the 2 new code pages, I only performed some of these steps.  Some of what 
is listed here has not been done yet, which is why magnum made changes to 
protect the logic inside of the 'internal' Unicode formats.

Ok, to add a new encoding.

These steps apply to a simple '8-bit' fixed size character encoding (wide 
char encodings are not covered in this howto).

1. Build arrays of to-upper and to-lower values in rules.c.  These arrays 
must hold the upper case values and their matching lower case values, listed 
in the same order.  If there are letters that are upper case only or lower 
case only, then build a separate array for them.

2. Add these upper/lower case arrays to the proper character recognition 
setup in the rules_init_classes() function of rules.c.  If there are any 
'singleton' upper/lower case values, they would get added here.  Several 
character classes are affected: 'a', 'l', 'u', 'x', etc.  I have NOT yet 
tried to do 'c' or 'v'.  This probably should also be done, but I do not 
know the languages of the existing encodings, so I do not know what is a 
vowel and what is a consonant.  Solar may want to chime in with his thoughts 
on whether these should also be adjusted.

3. Add the matched set of upper/lower case arrays to the upper/lower 
convert setup (rules_init_convs() in rules.c).  This drives the to-upper and 
to-lower switching.  The array items have to be matched, and in the same 
order.

This should give proper character classification, and allow rules to work 
as a user would 'expect' them to work.
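
As a rough illustration of steps 1 to 3, here is a minimal sketch of matched 
upper/lower arrays feeding a pair of conversion tables and a pair of class 
tables.  All names here (init_case_tables, conv_toupper, class_upper, and so 
on) are made up for this example and are not the actual identifiers or 
layout used in rules.c.  The cp1251 arrays are only a small sample of the 
Cyrillic letters, not a complete list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tables, NOT the real ones from rules.c. */
static unsigned char conv_toupper[256];
static unsigned char conv_tolower[256];
static unsigned char class_upper[256];   /* non-zero if upper case */
static unsigned char class_lower[256];   /* non-zero if lower case */

/* Partial cp1251 sample: matched pairs, listed in the same order. */
static const unsigned char cp1251_upper[] = { 0xC0, 0xC1, 0xC2, 0xA8 };
static const unsigned char cp1251_lower[] = { 0xE0, 0xE1, 0xE2, 0xB8 };

/* Fill the tables from ASCII plus one matched upper/lower array pair. */
static void init_case_tables(const unsigned char *upper,
                             const unsigned char *lower, size_t n)
{
    size_t i;
    int c;

    assert(upper != NULL && lower != NULL);

    /* Start with identity conversions and empty classes. */
    for (c = 0; c < 256; c++) {
        conv_toupper[c] = conv_tolower[c] = (unsigned char)c;
        class_upper[c] = class_lower[c] = 0;
    }

    /* Plain ASCII letters. */
    for (c = 'A'; c <= 'Z'; c++) {
        int l = c - 'A' + 'a';
        conv_tolower[c] = (unsigned char)l;
        conv_toupper[l] = (unsigned char)c;
        class_upper[c] = 1;
        class_lower[l] = 1;
    }

    /* Code page letters: element i of upper[] pairs with lower[i]. */
    for (i = 0; i < n; i++) {
        conv_tolower[upper[i]] = lower[i];
        conv_toupper[lower[i]] = upper[i];
        class_upper[upper[i]] = 1;
        class_lower[lower[i]] = 1;
    }
}
```

Calling init_case_tables(cp1251_upper, cp1251_lower, sizeof(cp1251_upper)) 
fills the tables for this sample.  Letters that exist in only one case would 
be set in the class tables only, with no entry in the conversion tables.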

4. Within unicode.c, add code to the plaintowcs() function to convert the 
charset properly when options.this_charset is set.  These 8-bit encodings 
are trivial to convert to Unicode: it is just a 256 element translation 
table.  ISO-8859-1 was super easy, because it is a 1 to 1 conversion.  The 
other encodings will not be as easy, and will require a 256 element array of 
UTF-16 values to perform the conversion.

5. Within unicode.c, add code to utf16toplain() to handle the conversion 
from UTF-16 back into the 8-bit character set.
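
To make steps 4 and 5 more concrete, here is a minimal sketch of a 256 
element table and the two conversions built on it.  The function names and 
signatures (plain_to_utf16, utf16_to_plain, init_cp1251_partial) are 
hypothetical and are not the real plaintowcs() / utf16toplain() code in 
unicode.c.  The table is deliberately partial: it covers ASCII and the 
cp1251 Cyrillic letters only, and marks everything else as U+FFFD.  A real 
table would list all 256 entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-bit -> UTF-16 table (partial cp1251, for illustration). */
static uint16_t cp_to_utf16[256];

static void init_cp1251_partial(void)
{
    int c;

    for (c = 0; c < 0x80; c++)          /* ASCII maps to itself */
        cp_to_utf16[c] = (uint16_t)c;
    for (c = 0x80; c < 0x100; c++)      /* not covered by this sketch */
        cp_to_utf16[c] = 0xFFFD;
    for (c = 0xC0; c <= 0xFF; c++)      /* 0xC0..0xFF -> U+0410..U+044F */
        cp_to_utf16[c] = (uint16_t)(0x0410 + (c - 0xC0));
    cp_to_utf16[0xA8] = 0x0401;         /* upper case IO */
    cp_to_utf16[0xB8] = 0x0451;         /* lower case io */
}

/* 8-bit -> UTF-16: one table lookup per byte.  Returns units written. */
static size_t plain_to_utf16(uint16_t *dst, size_t dst_max,
                             const unsigned char *src, size_t src_len)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < src_len && i < dst_max; i++)
        dst[i] = cp_to_utf16[src[i]];
    return i;
}

/* UTF-16 -> 8-bit: search the same table; '?' if there is no match. */
static size_t utf16_to_plain(unsigned char *dst, size_t dst_max,
                             const uint16_t *src, size_t src_len)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < src_len && i < dst_max; i++) {
        unsigned char out = '?';
        int c;

        for (c = 0; c < 256; c++) {
            if (src[i] != 0xFFFD && cp_to_utf16[c] == src[i]) {
                out = (unsigned char)c;
                break;
            }
        }
        dst[i] = out;
    }
    return i;
}
```

The linear search in utf16_to_plain() keeps the sketch short.  A separate 
reverse lookup table would be the obvious choice if speed matters.  For 
ISO-8859-1 the forward table is simply the identity, which is why that case 
needed no table at all.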


Steps 4 and 5 have not yet been done for any character encoding other than 
iso-8859-1.  I will work on getting koi8-r and cp1251 working today (and 
will remove the changes to john.c contained within patch 0009 at that 
time).  This will allow these encodings to work not only in the rules, but 
also in formats like NTLM, with the rules of course working properly.


If people see anything that I am missing in the 'how to add an encoding', 
then please speak up, so that we get this procedure down properly.  Once we 
get a couple of encodings added properly, getting the rest in should be 
child's play.  It really does seem to be pretty easy to hook them into john.

Wide character / Unicode / utf8 / etc. is not trivial, and is beyond the 
scope of this procedure.

Jim.