Date: Fri, 12 Aug 2011 15:49:44 -0500
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: RE: Unicode, casing, obtaining data, and some real-world MSSQL (2000) data.

>From: magnum [mailto:rawsmooth@...dband.net]
>> What I found here, is several things. First, if the _wsetlocale() was not
>> called, then the only upcasing/lowcasing was A..Z<->  a..z  Then, if
>> _wsetlocale() was called (with a valid locale), then the exact same casing
>> was happening, NO MATTER WHAT locale is used.  Remember, we are in Unicode,
>> so the OS simply turns on the above 0x7F casing rules, but they are the same
>> for the OS.
>
>Are you saying that if you set a locale it would go from just a-z to
>complete Unicode - BUT using the system locale instead of the one you
>specified? That weird, kinda defeats the whole purpose of wsetlocale().

No. What I meant to say is that if you did NOT call _wsetlocale(), then the ONLY data upcased by the _wcsupr() C function (which simply drills down to the Win32 API) was the lower 128 (7-bit ASCII) values.  It may be that _wcsupr() knows whether you have called setlocale for LC_CTYPE or LC_ALL (the 2nd one sets 'all' the locale items), and if you have not, it does not drill down to the API but simply falls back to the builtin clib strupr-type functionality.  I do not know for sure.

However, no matter what locale information I fed into the _wsetlocale() function, the casing changes which showed up in the call to _wcsupr() were for the exact same set of characters.
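
For anyone who wants to repeat this kind of check, here is a minimal sketch.  It is NOT the test program I used: it swaps the MSVC-only _wsetlocale()/_wcsupr() pair for the standard C setlocale()/towupper() calls so it builds anywhere, and the helper names upcase_wide() and count_cased() are made up for illustration.  Which locale names are valid, and how many characters change case under each, depends on the platform.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: upcase a NUL-terminated wide string in place,
 * using whatever LC_CTYPE locale is currently active. */
static void upcase_wide(wchar_t *s)
{
    assert(s != NULL);
    for (; *s != L'\0'; s++)
        *s = (wchar_t)towupper((wint_t)*s);
}

/* Hypothetical helper: count the characters in [first, last] whose
 * upper-case mapping differs from the character itself under the named
 * locale.  Returns -1 if the locale cannot be set.
 * Note: this changes the process-wide LC_CTYPE and does not restore it. */
static long count_cased(const char *locale_name, wchar_t first, wchar_t last)
{
    unsigned long c;   /* wider than wint_t on Windows, so the loop ends */
    long n = 0;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    for (c = (unsigned long)first; c <= (unsigned long)last; c++)
        if (towupper((wint_t)c) != (wint_t)c)
            n++;
    return n;
}
```

Calling count_cased() over 0..0xFFFF once for "C" and once for each locale of interest, then comparing the counts, shows whether the locale name makes any difference to casing on a given system.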

>> Thus, when I do release this, it will likely be an initial release, and need
>> some work tweaking it.  Also, I had some problems with magnums recent UTF-32
>> changes.  I need to work through some of that with him, as I do not fully
>> understand all of that code.
>
>Do you mean the reinstated "third case" in utf8towcs()? 

I believe so.  There were a couple of if blocks which printf'd error codes and 'tried' to correct their location within the data stream.  I have commented both of those out for now.  I know that is not right, and we will have to work through what 'is' right, but it allowed the format to process every data point from U+0000 to U+FFFF.

It is likely that I was simply spitting out invalid nonsense data, and that the code was correct in 'expecting' another UTF-16 code unit which was not present.  However, I think this is simply garbage-avoidance code.  We simply have to get it to where it keeps the process image 'safe' and does not output unneeded warnings.  Like I said, what I initially publish will likely need some tuning.  However, I do not think this would cause anyone's cracking job any heartburn at all right now.
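
To make the 'safe, but quiet' idea concrete, here is a rough sketch of one way to handle a missing second UTF-16 code unit.  This is NOT the code in utf8towcs() or anywhere else in the tree; the function name utf16_next() and the choice to substitute U+FFFD are assumptions made purely for illustration.  The point is only that the reader never runs past the end of the buffer and never prints anything.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu   /* assumed policy: substitute U+FFFD */

/* Hypothetical helper: decode one code point from src[*pos .. len).
 * The caller must ensure *pos < len.  Always advances *pos by at least
 * one code unit and never reads at or beyond src[len]. */
static uint32_t utf16_next(const uint16_t *src, size_t len, size_t *pos)
{
    uint16_t u = src[(*pos)++];

    if (u >= 0xD800 && u <= 0xDBFF) {            /* high surrogate */
        if (*pos < len && src[*pos] >= 0xDC00 && src[*pos] <= 0xDFFF) {
            uint16_t lo = src[(*pos)++];
            return 0x10000u + ((uint32_t)(u - 0xD800) << 10)
                            + (uint32_t)(lo - 0xDC00);
        }
        return REPLACEMENT_CHAR;                 /* partner missing */
    }
    if (u >= 0xDC00 && u <= 0xDFFF)
        return REPLACEMENT_CHAR;                 /* stray low surrogate */
    return u;                                    /* ordinary BMP value */
}
```

Whether garbage input should be replaced, skipped, or cause the candidate to be rejected is a separate decision that we still need to work through.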

Jim.
