john-dev - Upper casing (and lower casing) in john

Message-ID: <A4617FA0C56D43B68A3A4E7FF975F6B3@D9VGLK61>
Date: Thu, 14 Jul 2011 10:18:36 -0500
From: "JimF" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: Upper casing (and lower casing) in john

Solar,

What logic is there within john for case conversion (up/down, etc.)?  To 
my knowledge, there is:

1. rules:  l u c C ?l  ?u t TN    (p P I may also be impacted).  S V are 
also likely candidates.

2. Formats (but these are one by one issues which need to be addressed 
directly).  Oracle/mssql have been handled.  LM has not, but by my 
understanding, what we have done already is the 'correct' method.

Now, what about external mode?   I do not think there is case conversion 
in there now, but is this something we 'should' add, i.e. toupper and 
tolower type functions?

Is there anything that needs to be looked at within the pre-processor code?

Are there other places where letter case, or changing case is required 
within john?


The reason I ask is that there are now valid toupper/tolower character 
macros, which will properly up/down case all 8 bit ANSI characters (with a 
couple of caveats).   We should be able to make most of these changes with 
no impact on performance.  However, we will now crack a LOT more hashes if 
they contain European accent/umlaut type characters.  I believe that the 
changes to do this should be pretty easy, depending upon how the original 
code (especially in rules) is put together.  I have not looked 'yet', so I 
am not sure.
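
To illustrate what such 8 bit case conversion could look like, here is a 
minimal sketch.  It assumes ISO-8859-1 as the code page, and the names 
latin1_toupper, latin1_tolower and latin1_strupper are made up for this 
example; they are not the actual macros in john.

```c
#include <stddef.h>

/* Hypothetical helpers, assuming ISO-8859-1.  Not john's real macros. */

/* Lower case letters are a-z plus 0xE0-0xFE, except 0xF7 (division sign). */
unsigned char latin1_toupper(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        return (unsigned char)(c - 0x20);
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return (unsigned char)(c - 0x20);
    /* Caveats: 0xDF (sharp s), 0xB5 (micro sign) and 0xFF (y with
     * diaeresis) have no single-byte upper case form in ISO-8859-1,
     * so they are returned unchanged. */
    return c;
}

/* Upper case letters are A-Z plus 0xC0-0xDE, except 0xD7 (multiplication sign). */
unsigned char latin1_tolower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(c + 0x20);
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return (unsigned char)(c + 0x20);
    return c;
}

/* Upper case a NUL-terminated buffer in place. */
void latin1_strupper(unsigned char *s)
{
    for (; *s; s++)
        *s = latin1_toupper(*s);
}
```

The range checks could equally be precomputed into two 256 byte lookup 
tables, which is one way to keep the cost the same as the plain ASCII 
versions.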

We 'can' also perform utf8 case changing, but that is quite a bit more 
complex, since utf8 is really just an encoding of the real character data 
(which is Unicode).  Thus, to change case in utf8, you have to decode the 
utf8 into true Unicode, perform the case change, and then encode back into 
utf8.  Since we only deal with UCS2 within john (UTF16), we only have 
256 KB of translation table data.  If we were to try to do this directly 
on utf8, without the conversion into and then out of UTF16, there would be 
2 tables (one for up, one for low), each with 2^24 elements of at least 3 
bytes.  That is about 100 MB of translation arrays.  This is simply out of 
the question, and the better way is the 3 step method:  utf8 -> Unicode -> 
case mod -> utf8.  NOTE, this can be done 1 char at a time.  A single char 
is read from the utf8 (it may be 1, 2 or 3 physical bytes) and converted 
into the UTF16 char.  That char is translated (possibly into MULTIPLE 
chars), and the result is converted back into utf8 and appended to the new 
cased utf8 string.
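
The one char at a time loop described above can be sketched roughly as 
follows.  All function names are hypothetical, and ucs2_toupper here is 
only a stand-in that covers ASCII and the Latin-1 range; a real version 
would index the full UCS2 translation table.  The sketch handles 1 to 1 
conversions only and does not reject overlong encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one character of 1-3 bytes.  Returns the number of bytes
 * consumed, or 0 if the input is malformed or outside UCS2. */
size_t utf8_decode(const unsigned char *s, uint16_t *out)
{
    if (s[0] < 0x80) {
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *out = (uint16_t)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *out = (uint16_t)(((s[0] & 0x0F) << 12) |
                          ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
        return 3;
    }
    return 0;
}

/* Encode one UCS2 character as 1-3 bytes.  Returns the bytes written. */
size_t utf8_encode(uint16_t c, unsigned char *dst)
{
    if (c < 0x80) {
        dst[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        dst[0] = (unsigned char)(0xC0 | (c >> 6));
        dst[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    dst[0] = (unsigned char)(0xE0 | (c >> 12));
    dst[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    dst[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}

/* Stand-in for the real table lookup: ASCII and U+00E0-U+00FE only. */
uint16_t ucs2_toupper(uint16_t c)
{
    if (c >= 'a' && c <= 'z')
        return (uint16_t)(c - 0x20);
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return (uint16_t)(c - 0x20);
    return c;
}

/* utf8 -> Unicode -> case mod -> utf8, one character at a time.
 * Returns the length written, or (size_t)-1 on bad input or if dst
 * is too small. */
size_t utf8_strupper(const unsigned char *src, unsigned char *dst,
                     size_t dstsize)
{
    size_t out = 0;

    if (dstsize == 0)
        return (size_t)-1;
    while (*src) {
        uint16_t c;
        size_t n = utf8_decode(src, &c);

        /* Need room for up to 3 bytes plus the terminating NUL. */
        if (n == 0 || out + 4 > dstsize)
            return (size_t)-1;
        out += utf8_encode(ucs2_toupper(c), dst + out);
        src += n;
    }
    dst[out] = 0;
    return out;
}
```

Only the 16 bit code point is ever looked up, so the translation data 
stays at the size of the UCS2 tables instead of growing with the number of 
possible utf8 byte sequences.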

A couple of strange notes about casing in Unicode.   Most case conversions 
are 'normal' 1 to 1 conversions.  However, there are also 1 to many 
conversions!?!  This is where a single character converts into multiple 
characters.  There is even one of these for an 'ANSI' character:  the 0xDF 
letter.  This letter is the German lower case 'ss' (sharp s) character.  It 
translates into SS when upper cased.  There is no way to go back.

This is the second 'strangeness' of real Unicode casing.   There is no 
guarantee that uc(lc(word))==word, nor that lc(uc(word))==word.   The 
0xDF -> SS conversion is a prime example:  lc(uc(0xDF)) == ss and not 0xDF. 
There are many, many other examples which do not cycle.
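
As a small illustration of a conversion that does not cycle, the sketch 
below upper cases ISO-8859-1 text with the sharp s (0xDF in that code 
page) expanded to "SS", then lower cases the result and compares it to the 
original.  The function names and the 256 byte working buffer are 
assumptions made for this example only.

```c
#include <stddef.h>
#include <string.h>

/* Upper case ISO-8859-1 text, expanding 0xDF (sharp s) to "SS".
 * Returns the length written, or (size_t)-1 if dst is too small. */
size_t latin1_upper_expand(const unsigned char *src, unsigned char *dst,
                           size_t dstsize)
{
    size_t out = 0;

    if (dstsize == 0)
        return (size_t)-1;
    for (; *src; src++) {
        unsigned char c = *src;

        /* Need room for up to 2 bytes plus the terminating NUL. */
        if (out + 3 > dstsize)
            return (size_t)-1;
        if (c == 0xDF) {
            dst[out++] = 'S';   /* 1 to many: one char becomes two */
            dst[out++] = 'S';
        } else if ((c >= 'a' && c <= 'z') ||
                   (c >= 0xE0 && c <= 0xFE && c != 0xF7)) {
            dst[out++] = (unsigned char)(c - 0x20);
        } else {
            dst[out++] = c;
        }
    }
    dst[out] = 0;
    return out;
}

/* Lower case ISO-8859-1 text in place (always 1 to 1). */
void latin1_lower_inplace(unsigned char *s)
{
    for (; *s; s++) {
        if ((*s >= 'A' && *s <= 'Z') ||
            (*s >= 0xC0 && *s <= 0xDE && *s != 0xD7))
            *s = (unsigned char)(*s + 0x20);
    }
}

/* Returns 1 if lc(uc(word)) == word, 0 if not, -1 if word is too long
 * for the working buffer. */
int latin1_roundtrips(const unsigned char *word)
{
    unsigned char buf[256];

    if (latin1_upper_expand(word, buf, sizeof buf) == (size_t)-1)
        return -1;
    latin1_lower_inplace(buf);
    return strcmp((const char *)buf, (const char *)word) == 0;
}
```

A word with a sharp s comes back one byte longer after uc() and lc(), so 
any rule that assumes case changes preserve length would need a second 
look.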

In Unicode, there are also 3 cases possible for a character to be 
'converted to'.  These are ToUpper, ToLower, and ToTitle.  Title case is 
NOT the same thing as upper case.  Often the title case and upper case 
forms 'are' the same, but they can be different characters.  We only 
handle char->upper and char->lower.  The title case is not handled, and I 
am not sure it would be of value to john.
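
For reference, a few characters where the title case form differs from the 
upper case form are the Latin digraphs at U+01C4 through U+01CC.  The 
sketch below only covers those three digraphs; the struct and function 
names are hypothetical, and nothing like this exists in john as far as 
this message describes.

```c
#include <stddef.h>
#include <stdint.h>

/* The three case forms of one character. */
struct case_forms {
    uint16_t upper, title, lower;
};

/* Unicode digraphs whose title case differs from their upper case. */
static const struct case_forms digraphs[] = {
    { 0x01C4, 0x01C5, 0x01C6 },   /* DZ with caron: DZ, Dz, dz */
    { 0x01C7, 0x01C8, 0x01C9 },   /* LJ, Lj, lj */
    { 0x01CA, 0x01CB, 0x01CC },   /* NJ, Nj, nj */
};

/* Returns the title case form for the digraphs above, and the input
 * unchanged for every other character. */
uint16_t digraph_totitle(uint16_t c)
{
    size_t i;

    for (i = 0; i < sizeof digraphs / sizeof digraphs[0]; i++) {
        if (c == digraphs[i].upper || c == digraphs[i].title ||
            c == digraphs[i].lower)
            return digraphs[i].title;
    }
    return c;
}
```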

Jim.