Date: Mon, 25 Jul 2011 09:26:42 -0500
From: "JimF" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: Character encoding 'how-to' and patch 0009

Magnum has made and posted a patch to the wiki which disables koi8-r/cp1251 (or any encoding other than 8859-1/utf8) for any format which uses Unicode internally. This is the proper patch for now. However, I am going to get the other encodings working properly, so that this can be undone.

Which brings me to the point of this email. Here are the steps we need to take when adding a new code page encoding to john. I am listing them here and now because there are other code pages which we may want added in the near or not so near future. When I added the 2 new code pages, I only performed some of this logic. Some of what is listed here has not been done yet, which is why magnum made changes to protect the logic inside of the 'internal' Unicode formats.

Ok, to add a new encoding, if it is a simple '8-bit' fixed size character encoding (wide char encodings are not covered in this howto):

1. Build arrays of to-upper and to-lower values in rules.c. These arrays have to hold the upper case values and the matching lower case values, listed in the same order. If there are letters which exist in upper case only, or in lower case only, then build a separate array for them.

2. Add these upper/lower case arrays to the proper character recognition setup in the rules_init_classes() function of rules.c. Any 'singleton' upper/lower case values would also get added here. There are several of these classes: 'a', 'l', 'u', 'x', etc. I currently have NOT tried to do 'c' or 'v'. This probably should also be done, but I do not know the languages of the existing encodings, so I do not know what is a vowel and what is a consonant. Solar may want to chime in with his thoughts on whether these should also be adjusted.

3. Add the matched set of upper/lower case arrays into the upper/lower convert setup (rules_init_convs() in rules.c). This is the to-upper and to-lower switching. The array items have to be matched, and in the same order. This should give proper character classifications, and allow rules to work as a user would 'expect' them to work.

4. Within unicode.c, add code to the plaintowcs() function to convert the charset properly when options.this_charset is set. These 8-bit encodings are trivial to convert into Unicode: it is just a 256 element translation table. ISO-8859-1 was super easy, because it is a 1 to 1 conversion. The other encodings will not be as easy, and will require a 256 element array of UTF-16 values to perform the conversion.

5. Within unicode.c, add code to utf16toplain() to handle the conversion from UTF-16 back into the 8-bit character set.

Steps 4 and 5 have not yet been done for any character encoding other than iso-8859-1. I will work on getting koi8-r and cp1251 working today (and will remove the changes to john.c contained within patch 0009 at that time). This will allow these encodings to work not only in the rules, but also in formats like NTLM, with the rules working properly of course.

If people see anything that I am missing in the 'how to add an encoding' steps, then please speak up, so that we get this procedure down properly. Once we get a couple of encodings added properly, getting the rest in should be child's play. It really does seem to be pretty easy to hook them into john.

Wide character / Unicode / utf8 / etc. is not trivial, and is beyond the scope of this procedure.

Jim.
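
For illustration only, steps 1, 3, 4 and 5 above can be sketched roughly as follows. This is not the code in rules.c or unicode.c: every demo_* name and the UTF16 typedef are hypothetical stand-ins, and only six KOI8-R letters are mapped, where a real table would cover all 256 byte values.

```c
#include <stddef.h>

typedef unsigned short UTF16;   /* hypothetical stand-in for john's UTF-16 type */

/* Steps 1 and 3: matched upper/lower arrays, same order, 0-terminated.
 * Only three KOI8-R letter pairs are listed; a real array lists them all. */
static const unsigned char demo_upper[] = { 0xE1, 0xE2, 0xF7, 0 };
static const unsigned char demo_lower[] = { 0xC1, 0xC2, 0xD7, 0 };

/* Steps 4 and 5: sparse sample of the byte -> UTF-16 mapping for KOI8-R. */
static const struct { unsigned char byte; UTF16 wc; } demo_map[] = {
    { 0xE1, 0x0410 }, { 0xE2, 0x0411 }, { 0xF7, 0x0412 },
    { 0xC1, 0x0430 }, { 0xC2, 0x0431 }, { 0xD7, 0x0432 },
};

static unsigned char demo_to_upper[256], demo_to_lower[256];
static UTF16 demo_to_utf16[256];

/* Build the three 256-entry tables from the data above. */
void demo_init(void)
{
    size_t i;

    for (i = 0; i < 256; i++) {
        demo_to_upper[i] = demo_to_lower[i] = (unsigned char)i;
        /* ASCII maps 1 to 1; unmapped high bytes become U+FFFD. */
        demo_to_utf16[i] = (i < 0x80) ? (UTF16)i : 0xFFFD;
    }
    for (i = 'a'; i <= 'z'; i++) {
        demo_to_upper[i] = (unsigned char)(i - 'a' + 'A');
        demo_to_lower[i - 'a' + 'A'] = (unsigned char)i;
    }
    /* The arrays must be matched and in the same order for this to work. */
    for (i = 0; demo_upper[i]; i++) {
        demo_to_upper[demo_lower[i]] = demo_upper[i];
        demo_to_lower[demo_upper[i]] = demo_lower[i];
    }
    for (i = 0; i < sizeof(demo_map) / sizeof(demo_map[0]); i++)
        demo_to_utf16[demo_map[i].byte] = demo_map[i].wc;
}

/* Step 4: 8-bit string to UTF-16, one table lookup per byte.
 * Writes at most max units, adds no terminator, returns the count. */
size_t demo_plain_to_utf16(UTF16 *dst, size_t max, const unsigned char *src)
{
    size_t n = 0;

    while (n < max && src[n]) {
        dst[n] = demo_to_utf16[src[n]];
        n++;
    }
    return n;
}

/* Step 5: UTF-16 back to the 8-bit set by searching the same table.
 * Characters with no mapping become '?'. Adds no terminator. */
size_t demo_utf16_to_plain(unsigned char *dst, size_t max,
                           const UTF16 *src, size_t len)
{
    size_t i, b;

    for (i = 0; i < len && i < max; i++) {
        unsigned char out = '?';

        if (src[i] < 0x80) {
            out = (unsigned char)src[i];
        } else if (src[i] != 0xFFFD) {
            for (b = 0x80; b < 256; b++) {
                if (demo_to_utf16[b] == src[i]) {
                    out = (unsigned char)b;
                    break;
                }
            }
        }
        dst[i] = out;
    }
    return i;
}
```

After calling demo_init(), demo_to_upper[0xC1] gives 0xE1, and the bytes 'a', 0xC1, 0xE2 convert to U+0061, U+0430, U+0411 and back again. The linear search in the reverse direction is only there to keep the sketch short; a real implementation would more likely build a reverse lookup table once.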