Date: Thu, 14 Jul 2011 10:18:36 -0500
From: "JimF" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: Upper casing (and lower casing) in john

Solar,

What logic is there within john for casing (up/down, etc.)? From my knowledge, there is:

1. rules: l u c C ?l ?u t TN (p P I may also be impacted). S V are also likely candidates.
2. Formats (but these are one-by-one issues which need to be addressed directly). Oracle/mssql have been handled. LM has not, but by my understanding, what we have done already is the 'correct' method.

Now, what about external mode? I do not think there is case conversion in there now, but is this something we 'should' add, toupper and tolower type functions?

Is there anything that needs to be looked at within the pre-processor code?

Are there other places where letter case, or changing case, is required within john?

The reason I ask is that there are now valid toupper/tolower character macros, which will properly up/down all 8 bit ANSI characters (with a couple of caveats). We should be able to make most of these changes with no impact on performance. However, we will now crack a LOT more hashes if they contain many of the European accent/umlaut type characters. I believe that the changes to do this should be pretty easy, depending upon how the original code (especially in rules) is put together. I have not looked 'yet', so I am not sure.

We also 'can' perform utf8 case changing, but that is quite a bit more complex, since utf8 is really just a place-holding representation of the real character data (which is Unicode). Thus for utf8, to perform case switching, you have to convert out of utf8 into the true Unicode, then perform the case change, and then convert back into utf8. Since we only deal with UCS2 within john (UTF16), we only have 256kb of translation table data.
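As a quick check on the 256kb figure, and for comparison with a table indexed directly by raw utf8 bytes, here is a minimal sketch of the arithmetic. The function names are made up for illustration and are not taken from the john source. It assumes two tables (upper and lower), 2-byte entries for UCS2, and 3-byte entries in a 2^24-slot table for the direct utf8 case.

```c
#include <stddef.h>

/* Bytes needed for `tables` lookup tables of `entries` elements,
 * each element `elem_size` bytes wide. */
static size_t table_bytes(size_t tables, size_t entries, size_t elem_size)
{
    return tables * entries * elem_size;
}

/* Assumption: two tables (upper, lower), indexed by a 16-bit UCS2
 * code unit, each entry being one 16-bit code unit. */
static size_t ucs2_case_table_bytes(void)
{
    return table_bytes(2, (size_t)1 << 16, 2);
}

/* Assumption: two tables indexed directly by up to three raw utf8
 * bytes (2^24 slots), each entry holding a 3-byte utf8 sequence. */
static size_t utf8_direct_case_table_bytes(void)
{
    return table_bytes(2, (size_t)1 << 24, 3);
}
```

Under these assumptions the UCS2 tables come to 262144 bytes (256 KiB), while the direct utf8 tables come to 100663296 bytes (96 MiB, i.e. roughly 100MB).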
If we were to try to do this for utf8 without the conversion into and then out of UTF16, there would be 2 tables for straight utf8 (one for up, one for low), each with 2^24 elements, and each element would be 3 bytes (at least). That is 100MB of translation array. This is simply out of the question, and the better way is the 3 step method: utf8 -> Unicode -> case mod -> utf8.

NOTE, this can be done 1 char at a time. A single char is obtained from the utf8 (it may be 1, 2 or 3 physical bytes). This is converted into the UTF16 char, which is then case translated (possibly into MULTIPLE chars), and the result is converted back into utf8 and inserted into the new cased utf8 string.

A couple of strange notes about casing in Unicode. There are many 'normal' 1 to 1 case conversions. However, there are also 1 to many conversions!?! This is where a single character converts into multiple characters. There is even one of these for an 'ANSI' character: 0xDF. This letter is the German lower case 'ss' character (sharp s). It translates into SS when upper cased. There is no way to go back.

This is the second 'strangeness' of real Unicode casing: there is no guarantee that uc(lc(word)) == word, nor is there a guarantee that lc(uc(word)) == word. The 0xDF -> SS conversion is a prime example: lc(uc(0xDF)) == ss and not 0xDF. There are many, many other examples which do not cycle.

In Unicode, there are also 3 cases a character can be 'converted to'. These are ToUpper, ToLower, and ToTitle. Title case is NOT the same thing as upper case. Often the title case and upper case of a character 'are' the same, but they may be different characters. We only handle char->upper and char->lower. The title case is not handled, and I am not sure it would be of value to john.

Jim.
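The 3 step method described above (utf8 -> Unicode -> case mod -> utf8, one char at a time) could be sketched roughly as follows. This is only an illustration, not john's actual code. All function names are hypothetical, and toy_upper() is a tiny stand-in that covers only ASCII and the Latin-1 letters rather than a real 64K-entry UCS2 table. It handles 1 to 3 byte sequences only, does not reject overlong encodings, and leaves 1 to many cases such as sharp s unchanged.

```c
#include <stddef.h>
#include <stdint.h>

/* Step 1: decode one utf8 sequence of 1 to 3 bytes into a 16-bit code
 * point. Returns bytes consumed, or 0 on malformed/unsupported input. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint16_t *cp)
{
    if (len >= 1 && s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if (len >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = (uint16_t)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
        return 2;
    }
    if (len >= 3 && (s[0] & 0xF0) == 0xE0 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        *cp = (uint16_t)(((s[0] & 0x0F) << 12) |
                         ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
        return 3;
    }
    return 0;
}

/* Step 2 (toy stand-in for a real table): upper-case ASCII and Latin-1
 * letters. 0xF7 is the division sign; 0xDF (sharp s) would need a
 * 1 to many mapping and is left as-is here. */
static uint16_t toy_upper(uint16_t cp)
{
    if (cp >= 'a' && cp <= 'z')
        return (uint16_t)(cp - 32);
    if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7)
        return (uint16_t)(cp - 32);
    if (cp == 0xFF)
        return 0x0178; /* y with diaeresis: upper case is outside Latin-1 */
    return cp;
}

/* Step 3: encode a 16-bit code point as utf8. Returns bytes written. */
static size_t utf8_encode(uint16_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Upper-case a utf8 buffer one char at a time. Returns the number of
 * bytes written, or (size_t)-1 on bad input or insufficient space. */
static size_t utf8_upper(const unsigned char *in, size_t in_len,
                         unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    while (i < in_len) {
        uint16_t cp;
        size_t n = utf8_decode(in + i, in_len - i, &cp);

        if (n == 0 || out_cap - o < 3) /* keep room for a 3-byte char */
            return (size_t)-1;
        o += utf8_encode(toy_upper(cp), out + o);
        i += n;
    }
    return o;
}
```

For example, passing the 3 bytes 'h' 0xC3 0xA9 (h followed by e-acute) to utf8_upper() yields 'H' 0xC3 0x89. The output is built in a separate buffer because a cased character is not guaranteed to occupy the same number of utf8 bytes as the original.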