john-dev - Found cp1251 issue (and likely 8850-1) or many code pages.

Message-ID: <024201cc4c7f$009082d0$01b18870$@net>
Date: Wed, 27 Jul 2011 12:02:43 -0500
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: Found cp1251 issue (and likely 8850-1) or many code pages.

Ok, here is the issue.

 

CP1251 0xB5 -> Unicode U+00B5.  This is the 'micro' character.  Unicode
defines an upcase for it: U+00B5 upcases to U+039C.  There is no upcase for
this char in CP1251 (or 8859-1), meaning U+039C is NOT a valid character on
either of these code pages.  This was the problem in mssql, where I had to
'hand edit' the pass_gen generated data (when building the 'ansi' or 8859-1
input files).  I am now seeing similar behavior for CP1251.  The micro
character at 0xB5 is valid in both CP1251 and 8859-1 (not in KOI8-R,
however, where 'slot' 0xB5 is actually U+2562).  Thus, when we get that char
in 1251 or 8859-1, it 'is' a valid character.  However, there is NO WAY to
upcase it within the code page, and for MSSQL/Oracle we want to upcase.

 

To make things even more strange, follow this Unicode 'logic':

 

uc(U+00B5) == U+039C  (title case is same).

lc(U+039C) ==  U+03BC

uc(U+03BC) ==  U+039C

 

Thus, lc(uc(U+00B5)) != U+00B5, and lc(uc(U+00B5)) == lc(uc(U+03BC)).
Once you upcase B5, you cannot go back.  Gotta love Unicode, lol
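
The chain above can be checked against any Unicode-aware runtime.  A minimal
sketch in Python (not john code; the constant and function names are made up
for illustration) that relies only on Python's built-in str.upper() and
str.lower():

```python
MICRO = "\u00b5"       # MICRO SIGN, byte 0xB5 in CP1251 and 8859-1
CAPITAL_MU = "\u039c"  # GREEK CAPITAL LETTER MU
SMALL_MU = "\u03bc"    # GREEK SMALL LETTER MU


def round_trip(ch: str) -> str:
    """Upcase, then downcase, one character using the Unicode case tables."""
    return ch.upper().lower()


# Print each step of the chain as a code point.
print(f"uc(U+00B5)     = U+{ord(MICRO.upper()):04X}")
print(f"lc(U+039C)     = U+{ord(CAPITAL_MU.lower()):04X}")
print(f"uc(U+03BC)     = U+{ord(SMALL_MU.upper()):04X}")
print(f"lc(uc(U+00B5)) = U+{ord(round_trip(MICRO)):04X}")
```

The last line reports the Greek small mu, not the micro sign we started with.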

 

I think how we are handling this in john is just fine, but we do need to
'audit' how things are handled in certain ways (as I show below).  We DO NOT
upcase xB5 in 1251 (or 8859-1).  The problem comes when a dict file has a
line with xB5 in it, and in perl we try to UC that line.  Perl knows that B5
is a Unicode char and knows how to upcase it (into U+039C), but U+039C is
NOT in the code page (which is correct).  Thus, perl rightly claims that you
cannot do this, and actually changes THAT character into the literal text
\x{039c} (an 8 byte string).  When we feed that into a cryptor, it fails.
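
The same failure mode can be reproduced outside perl.  A rough Python
analogue (an illustration only, not the pass_gen code; upcase_via_unicode is
a hypothetical name): decode the line, upcase it as Unicode, then encode it
back to the code page.  Python's "backslashreplace" error handler plays the
role of perl's escape here, though it writes \u039c (6 bytes) rather than
perl's \x{039c} (8 bytes).

```python
def upcase_via_unicode(raw: bytes, codepage: str) -> bytes:
    """Upcase by going through Unicode, escaping anything that will not fit."""
    text = raw.decode(codepage)
    upper = text.upper()
    # U+039C has no slot in cp1251 or latin-1, so it gets escaped as text.
    return upper.encode(codepage, errors="backslashreplace")


line = b"a\xb5b"                      # 'a', micro sign, 'b'
mangled = upcase_via_unicode(line, "cp1251")
print(mangled, len(line), "->", len(mangled))

# With the default strict handler the same step raises instead of escaping.
try:
    line.decode("cp1251").upper().encode("cp1251")
except UnicodeEncodeError as err:
    print("strict encode failed:", err.reason)
```

Either way the 3 byte candidate no longer reaches the hash function as 3
bytes, which is why the generated test data had to be fixed by hand.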

 

In john, if we do this (logic, since this is not the code):

 

Str16 = plaintowc(input_8);

Str16_2 = utf16_uc(Str16)

 

and we are in cp1251/8859-1, and a 0xB5 is in the original input_8 word,
then we will have a 2 byte character of 0x039C somewhere in there.  That is
wrong.

 

But if we do  Str16 = utf8_uc(input_8), I think we will do it 'as right' as
we can.  We upcase the string, but leave 0xB5 in that character.  It is a
lowercase char, but this code page has no upcase for it.  I 'believe' that
this handling is correct, and that converting to Unicode, then using Unicode
to upcase, is NOT valid (which must be what is happening within Perl).
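
To make the two orderings concrete, here is a small Python sketch.  It is not
the john implementation: widen_then_upcase, upcase_then_widen and upcase_byte
are hypothetical names, and the 'keep the byte if its upcase does not fit in
one byte of the code page' rule is my reading of the behavior described
above.

```python
def upcase_byte(b: int, codepage: str) -> int:
    """Upcase one byte inside its code page; keep it if no upcase fits."""
    try:
        upper = bytes([b]).decode(codepage).upper().encode(codepage)
    except UnicodeError:      # undefined slot, or upcase not in the code page
        return b
    return upper[0] if len(upper) == 1 else b


def widen_then_upcase(raw: bytes, codepage: str) -> str:
    """Convert to Unicode first, then upcase with the Unicode tables."""
    return raw.decode(codepage).upper()


def upcase_then_widen(raw: bytes, codepage: str) -> str:
    """Upcase within the code page first, then convert to Unicode."""
    return bytes(upcase_byte(b, codepage) for b in raw).decode(codepage)


word = b"\xe0\xb5"   # cp1251: Cyrillic small 'a', then the micro sign
for fn in (widen_then_upcase, upcase_then_widen):
    out = fn(word, "cp1251")
    print(fn.__name__, [f"U+{ord(c):04X}" for c in out])
```

The first ordering produces U+039C, the second leaves U+00B5 in place while
still upcasing the Cyrillic letter.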

 

Again, without knowing the INTERNAL guts of how the actual implementation of
the password crypting functions were done, I am not sure what is correct
behavior.

 

The behavior I am working towards is that when we upcase a string with B5
in it (for cp1251/8859-1), there will still be a xB5 left in the upcased
string in the end.  If someone 'knows' this is not right for some format,
then please speak up.  This also means that I will have to hand edit a few
'input' files built by perl, since it is doing things incorrectly, and
actually messes up the strings while 'telling' you it cannot do it (with the
\x{039C} string).

 

I am glad I started coding this with case switching arrays (like those
implemented in rules.c), vs converting into UTF16, casing, and converting
back.  It is much faster to simply use an array lookup, and if there is a
lower case char without an upper case (or the other way around), it is easy
to simply leave it out of the change-case array.
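
For anyone who wants to play with the idea outside john, here is a sketch of
such a lookup table in Python.  This is not the rules.c code: the table there
is a C array, while this one is derived from Python's codecs purely for
illustration, and build_upcase_table is a made-up name.

```python
def build_upcase_table(codepage: str) -> bytes:
    """Build a 256-entry byte -> upcased byte table for one code page."""
    table = bytearray(range(256))          # default: every byte maps to itself
    for b in range(256):
        try:
            upper = bytes([b]).decode(codepage).upper().encode(codepage)
        except UnicodeError:
            continue                       # no upcase in this code page
        if len(upper) == 1:
            table[b] = upper[0]
    return bytes(table)


CP1251_UPCASE = build_upcase_table("cp1251")

# Upcasing a candidate is then one table lookup per byte.
word = b"\xe0\xb5abc"
print(word.translate(CP1251_UPCASE))
```

Bytes such as 0xB5 simply map to themselves, so no special casing is needed
at lookup time.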

 

 

This was the last issue I had.  I now have all the work done to get the code
pages running 'right' with the formats (I had to modify numerous formats'
set_key/get_salt, etc. functions).  I have a little clean up left, and then
I will get the patch in.  I have also updated the test suite to test these
code pages.  The only 'testing' done right now is in the internal Unicode
formats.  If there were 8 bit formats that upcased or downcased, then I
would have to check them also.  However, there is no reason to check the
other formats: to them this is simply binary data, and the existing test
suite has found any 8-bit problems.

 

Jim.

 

