john-dev - RE: Unicode, casing, obtaining data, and some real-world MSSQL (2000) data.

Message-ID: <00fb01cc5931$5dec2b50$19c481f0$@net>
Date: Fri, 12 Aug 2011 15:49:44 -0500
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: RE: Unicode, casing, obtaining data, and some real-world MSSQL (2000) data.

>From: magnum [mailto:rawsmooth@...dband.net]
>> What I found here, is several things. First, if the _wsetlocale() was not
>> called, then the only upcasing/lowcasing was A..Z <-> a..z  Then, if
>> _wsetlocale() was called (with a valid locale), then the exact same casing
>> was happening, NO MATTER WHAT locale is used.  Remember, we are in Unicode,
>> so the OS simply turns on the above 0x7F casing rules, but they are the same
>> for the OS.
>
>Are you saying that if you set a locale it would go from just a-z to
>complete Unicode - BUT using the system locale instead of the one you
>specified? That's weird, kinda defeats the whole purpose of wsetlocale().

No. What I meant to say is that if you do NOT call _wsetlocale(), then the _wcsupr() C function (which otherwise simply drills down to the Win32 API) only cases the lower 128 ASCII values.  It may be that _wcsupr() knows whether you have called setlocale with LC_CTYPE or LC_ALL (the second one sets 'all' the locale items), and if you have not, it does not drill down to the API but simply falls back to the built-in clib strupr-type functionality.  I do not know for sure.

However, no matter what locale information I fed into the _wsetlocale() function, the casing changes which showed up in the call to _wcsupr() affected the exact same characters.
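For anyone who wants to repeat this kind of check, here is a rough sketch of how the comparison could be done. It is only an illustration: it uses the standard C setlocale() and towupper() calls rather than the MSVC-specific _wsetlocale() and _wcsupr() discussed above, and the helper names (upcase_wide, count_upcased, count_upcased_in) are made up for this example. Results above 0x7F will depend on the C library and the locales installed.

```c
#include <locale.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: upcase a wide string in place using whatever
 * LC_CTYPE setting is currently active. */
void upcase_wide(wchar_t *s)
{
    for (; *s != L'\0'; s++)
        *s = (wchar_t)towupper((wint_t)*s);
}

/* Hypothetical helper: count the code points in [lo, hi] that towupper()
 * changes under the current locale.  An unsigned long counter is used so
 * the loop also terminates where wchar_t is only 16 bits wide. */
size_t count_upcased(unsigned long lo, unsigned long hi)
{
    size_t n = 0;
    unsigned long c;

    for (c = lo; c <= hi; c++)
        if (towupper((wint_t)c) != (wint_t)c)
            n++;
    return n;
}

/* Hypothetical helper: select a locale for LC_CTYPE, then count.
 * Returns 1 on success, 0 if the locale name was rejected. */
int count_upcased_in(const char *locale_name,
                     unsigned long lo, unsigned long hi, size_t *count)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return 0;
    *count = count_upcased(lo, hi);
    return 1;
}
```

Calling count_upcased_in() over 0..0xFFFF with "C" and then with a few other locale names, and comparing the counts, would show whether the chosen locale changes which characters get cased at all.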

>> Thus, when I do release this, it will likely be an initial release, and need
>> some work tweaking it.  Also, I had some problems with magnums recent UTF-32
>> changes.  I need to work through some of that with him, as I do not fully
>> understand all of that code.
>
>Do you mean the reinstated "third case" in utf8towcs()? 

I believe so.  There were a couple of if blocks which printed error codes and 'tried' to correct their position within the data stream.  I commented both of those out for now.  I know that is not right, and we will have to work through what 'is' right, but it allowed the format to process every data point from U+0 to U+FFFF.

It is likely that I was simply spitting out invalid nonsense data, and that the code was correct in 'expecting' another UTF-16 character which was not present.  However, I think this is simply garbage-avoidance code.  We just have to get it to where it keeps the process image 'safe' and does not output unneeded warnings.  Like I said, what I initially publish will likely need some tuning.  However, I do not think this would cause anyone's cracking job any heartburn at all right now.
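To make the 'safe and quiet' idea concrete, here is a minimal sketch of what I have in mind. It is NOT the utf8towcs() code from the tree. It is a hypothetical stand-alone UTF-16 decoder (the name utf16_decode_safe and the choice of substituting U+FFFD are assumptions for illustration only). The point is just that a missing second half of a surrogate pair never causes a read past the end of the input and never prints anything.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Hypothetical helper, not taken from the John the Ripper source.
 * Decodes UTF-16 code units into code points.  Unpaired surrogates are
 * silently replaced with U+FFFD.  Never reads beyond in[in_len - 1] and
 * never writes beyond out[out_cap - 1].  Returns the number of code
 * points written. */
size_t utf16_decode_safe(const uint16_t *in, size_t in_len,
                         uint32_t *out, size_t out_cap)
{
    size_t i = 0, n = 0;

    while (i < in_len && n < out_cap) {
        uint32_t u = in[i++];

        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            if (i < in_len && in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
                u = 0x10000 + ((u - 0xD800) << 10) + (in[i] - 0xDC00u);
                i++;
            } else {
                /* The expected second unit is not present. */
                u = REPLACEMENT_CHAR;
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            /* Low surrogate with no preceding high surrogate. */
            u = REPLACEMENT_CHAR;
        }
        out[n++] = u;
    }
    return n;
}
```

Whether the real code should substitute a character, skip the unit, or stop converting is exactly the part we still need to work out. The sketch only shows that the bounds checks and the lack of output are cheap to get.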

Jim.