Date: Fri, 12 Aug 2011 13:34:41 -0500
From: "jfoug" <>
To: <>
Subject: Unicode, casing, obtaining data, and some real-world MSSQL (2000) data.

Long email, but created from in-the-field testing examples, not from
‘theory’ of what should happen.



Unicode is NOT easy, especially when dealing with casing.  There are
numerous pieces of data which we may or may not be able to figure out.  Even
within the Unicode documentation, there are discrepancies on just ‘how’ to
case.  Thus, we need to determine just HOW the OSes, and the packages
running on them, actually perform casing.


Take, for instance, the files section of the Unicode documentation.  There
are 3 main documents to use in ‘casing’.  The 2 I used are UnicodeData.txt
and SpecialCasing.txt.  In the first file, you find all of the ‘simple’
single letter to single letter casing rules.  The second file lists the
single character to multiple character rules, and some locale specific rules.
From these 2 files, I generated the original unicode_data.h arrays, and the
‘extra’ multi-character casing logic.  There is a 3rd file, CaseFolding.txt,
which is only casing data, and is supposed to allow handling some of the
problems found in the above 2 documents (such as characters which do have a
single char to single char upcase, but then also have a single char to
multi-char upcase).  To get this data, I wrote specialized C code to parse
this file, and make the ‘correct’ array entries.
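
As a rough illustration of the kind of parsing described above (a sketch
only, not the actual generator behind unicode_data.h), the simple one-to-one
mappings can be pulled from a UnicodeData.txt line by splitting on ';' and
reading field 12 (simple uppercase) and field 13 (simple lowercase).  The
function names below are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a pointer to the start of the n-th (0-based) ';'-separated field,
   or NULL if the line has fewer fields.  Empty fields are kept, which is
   why strtok() is not used here. */
static const char *field_start(const char *line, int n)
{
    while (n-- > 0) {
        line = strchr(line, ';');
        if (line == NULL)
            return NULL;
        ++line;                     /* step past the ';' */
    }
    return line;
}

/* Read a hex code point at the start of a field; an empty or missing
   field yields 'fallback'. */
static unsigned long field_hex(const char *f, unsigned long fallback)
{
    char *end;
    unsigned long v;

    if (f == NULL)
        return fallback;
    v = strtoul(f, &end, 16);
    return (end == f) ? fallback : v;
}

/* Hypothetical helper: parse one UnicodeData.txt line.  Returns 1 on
   success, 0 if the line does not start with a code point field.
   A character with no mapping is reported as mapping to itself. */
int parse_simple_case(const char *line, unsigned long *cp,
                      unsigned long *upper, unsigned long *lower)
{
    char *end;

    assert(line && cp && upper && lower);
    *cp = strtoul(line, &end, 16);
    if (end == line || *end != ';')
        return 0;
    *upper = field_hex(field_start(line, 12), *cp);
    *lower = field_hex(field_start(line, 13), *cp);
    return 1;
}
```

Fed the UnicodeData.txt line for U+0061, this reports an uppercase of U+0041
and a lowercase equal to the character itself.  The multi-character rules in
SpecialCasing.txt use a different line layout and are not handled here.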


Well, getting that 100% workable, and being able to do things like properly
collate things such as "MASSE" and "Maße" is not the real ‘purpose’ we need
in john.  But that IS the purpose of the Unicode documentation.  In john, we
need to know EXACTLY what binary bytes were used when performing the
encryption, so that we can match an after the fact hash, and know if the
specific password presented is indeed the correct one.  Thus, john has to do
whatever the original OS/DB/application did when it produced the hash
against the user’s password (and name/userid, etc, etc).
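
To make the "exact bytes" point concrete, here is a minimal sketch (my own
illustration, not john's code) that upcases the UTF-16 string "Maße" two
ways: with 1:1 mappings only, and with the single SpecialCasing.txt rule
U+00DF -> "SS" added.  The mapping table is a tiny demo subset and the
function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short u16;   /* one UTF-16 code unit */

/* Simple (1:1) upcase for a demo subset: ASCII and U+00E0..U+00FE.
   U+00DF (sharp s) has no simple uppercase mapping, so it is unchanged. */
static u16 simple_upper(u16 c)
{
    if (c >= 'a' && c <= 'z')
        return (u16)(c - 0x20);
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)   /* U+00F7 is the division sign */
        return (u16)(c - 0x20);
    return c;
}

/* Upcase using only 1:1 mappings; 'out' needs room for n units. */
size_t upcase_simple(const u16 *in, size_t n, u16 *out)
{
    size_t i;

    assert(in && out);
    for (i = 0; i < n; i++)
        out[i] = simple_upper(in[i]);
    return n;
}

/* Upcase with the one 1:many rule needed for this demo: U+00DF -> "SS".
   'out' needs room for 2 * n units. */
size_t upcase_full(const u16 *in, size_t n, u16 *out)
{
    size_t i, o = 0;

    assert(in && out);
    for (i = 0; i < n; i++) {
        if (in[i] == 0x00DF) {
            out[o++] = 'S';
            out[o++] = 'S';
        } else {
            out[o++] = simple_upper(in[i]);
        }
    }
    return o;
}
```

For "Maße" the simple version gives 4 code units (M A ß E) and the full
version gives 5 (M A S S E).  The two would hash differently, which is why
john has to copy whichever behavior the original system actually used.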

*** MSSQL-2000 information ***

So, not knowing exactly how MSSQL (2000) performed this conversion, I set
off down the road of actually producing hashes with it, feeding it data and
observing exactly what was going on.  I also did the same for some M$ OSes,
to see what they are doing, since most applications installed on them will
be using the OS level APIs to do this work.  NOTE, I did not (at this time)
test C#, or any .Net code.  I imagine that will also need to be done, and
added to the actual set of information.

OK, to start with, I had the original data from the UnicodeData.txt and
SpecialCasing.txt files.  In these, there are 967 upper case letters which
convert to lower case, and 1001 lower case letters which can convert into an
upper case character.  Also, from SpecialCasing.txt, there are 75 lower case
letters which convert into upper case using multiple characters, and there
are 7 upper case letters which convert into multiple letters in lower case.
However, 6 of the 7 single upper into multiple lower are locale specific
(which john simply ignores).

I set this up, generated some sample data within MSSQL, and started to see
just what worked, and what did not.  The MSSQL was MSSQL-2000 SP5 (I think
sp5).  It generates both the sql and sql05 hashes.  I wrote this SQL, which
will give me a set of ALL Unicode characters (totally ignores code

declare @count int
declare @pass nvarchar(24)
declare @uchar nchar(1)
declare @hash varbinary (255)
CREATE TABLE tempdb..h (id_num int, hash varbinary (255))
set @count = 0
-- walk every 16-bit code point, U+0000 through U+FFFF inclusive
WHILE (@count <= 65535)
BEGIN
  set @uchar = NCHAR(@count)
  -- the surrounding letters are already upper case, so only @uchar can change
  set @pass = 'PASS' + @uchar + 'WORD'
  set @hash = pwdencrypt(@pass)
  INSERT into tempdb..h (id_num, hash) values (@count, @hash )
  set @count = (@count + 1)
END
select * from tempdb..h order by id_num
drop table tempdb..h

What the above SQL does is produce hashes for U+0000 to U+FFFF.  It builds
each character into a password like this:  “PASS” . U+xxxx . “WORD” and then
uses MSSQL to perform the exact encryption it would have done.  Thus, it is
easy to determine EXACTLY which Unicode characters are case converted.  If
the sql and sql05 hashes are the same, then there was no casing performed at
all on this Unicode character.
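
The comparison just described can be sketched in C as follows.  This assumes
the commonly documented layout of a MSSQL 2000 pwdencrypt() result, which is
not spelled out in this message: a 2-byte header, a 4-byte salt, a 20-byte
SHA-1 over the password as typed, then a 20-byte SHA-1 over the upcased
password.  The function name is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASSUMED layout of the binary hash, see the note above. */
enum {
    HDR_LEN  = 2,
    SALT_LEN = 4,
    SHA1_LEN = 20,
    HASH_LEN = HDR_LEN + SALT_LEN + 2 * SHA1_LEN   /* 46 bytes */
};

/* Returns 1 if the two SHA-1 halves differ (the probe character was
   changed by upcasing), 0 if they are identical (no casing happened),
   and -1 if the buffer does not have the expected length. */
int probe_char_was_upcased(const unsigned char *hash, size_t len)
{
    const unsigned char *as_typed, *upcased;

    assert(hash != NULL);
    if (len != HASH_LEN)
        return -1;
    as_typed = hash + HDR_LEN + SALT_LEN;
    upcased  = as_typed + SHA1_LEN;
    return memcmp(as_typed, upcased, SHA1_LEN) != 0;
}
```

Because both halves share the same salt and the fixed parts of the password
("PASS" and "WORD") are already upper case, any difference between the
halves can only come from the one probe character.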

Once I ran this, I came up with a list of 673 uc into lc characters and 635
lc to uc characters.  I made some changes to Unicode.c to remove some of
the upcasing data (turned the 1001 into 636), and when that was done, john
cracked every single password generated by the actual instance of MSSQL
(other than U+000D, which I do not care about, since it is the carriage
return).

The SQL server was installed on a Win2003Server.  I believe that this is the
last version of the OS capable of running this version of MSSQL.

*** Windows OS information ***

Next, I wanted to see just what the ‘native’ APIs within Windows were doing.
I used MSVC to test with, so this may not be 100% valid.  I probably need to
test with some API calls, and also dig into how .Net, and even different
versions of .Net, behave, and then likely look at other apps, like PHP, Perl,
etc, to make sure that these 3rd party apps behave the same as the Windows

OK, with the above limitations listed, I created a program, a small part of
which I will show here.  This application did pretty much the same as the
SQL.  It used C to upcase/downcase each character, and output it when the
result was different.  I wrote the full app so that it generated arrays in
the EXACT format produced by the code that chopped up UnicodeData.txt.  Here
is some of this code.

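#include <windows.h>
#include <stdio.h>
#include <wctype.h>
#include <locale.h>

/* Part of the full app and not shown here: emits one array entry. */
void Output(int ch, int upcased);

wchar_t iAr[6];
int uc, lc;

int main(void)
{
      int i;
      wchar_t *cpLocale = _wsetlocale( LC_CTYPE, L"English" );

      /* Upcase every 16-bit code point; count and output those that change. */
      for (i = 0, iAr[0] = 0; i < 0x10000; iAr[0] = ++i) {
            /* ASSUMPTION: the casing call is missing from this listing;
               towupper() stands in for whatever was really used. */
            iAr[0] = towupper(iAr[0]);
            if (iAr[0] != i) { ++uc; Output(i,1); }
      }

      /* Same pass again for downcasing. */
      for (i = 0, iAr[0] = 0; i < 0x10000; iAr[0] = ++i) {
            iAr[0] = towlower(iAr[0]);   /* assumed, as above */
            if (iAr[0] != i) { ++lc; Output(i,0); }
      }

      // ..

      /* arguments given in the same order as the format string */
      printf ("/* There are %d uc's and %d lc's */\n", uc, lc);
      return 0;
}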


What I found here is several things.  First, if _wsetlocale() was not
called, then the only upcasing/lowcasing was A..Z <-> a..z.  Then, if
_wsetlocale() was called (with a valid locale), the exact same casing
happened NO MATTER WHAT locale was used.  Remember, we are in Unicode, so
the OS simply turns on the casing rules above 0x7F, and they are the same
across the OS whatever the locale.

What I found from doing this is that what this C program returned and what
MSSQL returned were almost exactly the same.  I think there was 1 casing
difference.  The output of this program when run on XP, Win2k, or Win2k3 was
almost identical; there may have been a letter or 2 difference.  IIRC, the
one letter missing on Win2k3 in this test (compared to WinXP) was the exact
same letter as was different in MSSQL.  Thus, because of this similarity, I
am pretty sure that the technique used in my C app jibes with how MS was
handling things under the hood in MSSQL.  So, I am also pretty sure that
this is how casing would be handled for other M$ formats, if we had any
Unicode casing issues to deal with.

One other BIG thing to point out is that there are no instances where the 1
character into multiple character case folding was being performed by

Now, I was not able to install MSSQL2k on a Win7 box.  However, I was able to
run the C program.  On the Win7 instance, I get 973 lc2uc and 973 uc2lc.  In
this instance, the Unicode data I originally produced is ‘very’ close.
However, it looks like I have missed a couple (or there are a couple of ‘new
characters’).  I had 967.  Also, it looks like MS removed instances that are
not cyclic (i.e. one way only).  It should also be noted that there is NO one
character to multiple character casing within this version of Windows.
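
For anyone unsure what "not cyclic" means here, a small sketch: a mapping is
one way only when upcasing a character and then downcasing the result does
not give the original character back.  The two non-ASCII examples below come
from UnicodeData.txt (U+0131 -> U+0049 and U+017F -> U+0053); I am not
claiming these are the exact entries Windows dropped.  All names are made up
for the example.

```c
#include <assert.h>

typedef unsigned short u16;          /* one UTF-16 code unit */
typedef u16 (*case_fn)(u16);

/* Demo-only upcase: ASCII letters, plus dotless i and long s. */
u16 demo_upper(u16 c)
{
    if (c >= 'a' && c <= 'z') return (u16)(c - 0x20);
    if (c == 0x0131) return 0x0049;  /* dotless i -> 'I' */
    if (c == 0x017F) return 0x0053;  /* long s    -> 'S' */
    return c;
}

/* Demo-only downcase: ASCII letters only. */
u16 demo_lower(u16 c)
{
    if (c >= 'A' && c <= 'Z') return (u16)(c + 0x20);
    return c;
}

/* Returns 1 if upcasing c and then downcasing the result gives c back.
   Only meaningful for characters that have an upcase mapping at all. */
int upcase_is_cyclic(u16 c, case_fn up, case_fn low)
{
    assert(up && low);
    return low(up(c)) == c;
}
```

With these demo tables, 'a' round-trips ('a' -> 'A' -> 'a'), while U+0131
does not (U+0131 -> 'I' -> 'i').  Dropping such one-way entries is one way
to end up with equal lc2uc and uc2lc counts like the 973/973 seen on Win7.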

At this time, I do not have access to a WinVista box.  However, I would bet
that it works much closer to Win7 than it does to WinXP/Win2k3, meaning
that it has fuller Unicode support.  However, for the MSSQL format, I do
not think either of these OSes matters at all, since MSSQL will not run (at
least will not install) on either of them.

*** Results ***

At this time, I am not fully sure of all of the proper Unicode casing methods
needed.  Right now, we have only 2 formats in which this is even used.  Those
formats are Oracle and MSSQL (old).  I have added multiple arrays of
Unicode casing data (right now, the original UnicodeData.txt file and my
sample app run under XP).  It is being laid out so that john can properly
set itself up to use the right one.  However, at this time (likely in the
first release), this may be set to use the XP conversions.  The problem is
that Unicode needs to be initialized prior to initializing the formats.
Thus, we do not at this time ‘know’ what format is being used, IF ANY.

Thus, when I do release this, it will likely be an initial release, and will
need some work tweaking it.  Also, I had some problems with magnum's recent
UTF-32 changes.  I need to work through some of that with him, as I do not
fully understand all of that code.
