john-dev - UTF-8 rules engine (was: What is a digit?)



Message-ID: <e05baa25886015220a3a4f6d1eaf0bca@smtp.hushmail.com>
Date: Fri, 4 Jan 2013 18:53:06 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: UTF-8 rules engine (was: What is a digit?)

On 4 Jan, 2013, at 18:37 , magnum <john.magnum@...hmail.com> wrote:
>> Even if ², ³, ¼ are digits, why aren't these characters digits if utf-8
>> is used?
> 
> Solely because the rules engine has almost no support for UTF-8. We'd need to make a separate alternative UTF-16 rules engine. It would probably be easier than one might think at first, but at this time it's low prio to me. And I doubt anyone else cares.

That was poor wording. I meant that we would need a UTF-8 rules engine that internally uses UTF-16 (or even UTF-32) from start to end. I think you could basically copy rules.c to rules16.c and change all 'char' to 'UTF16' (which is typedef'd to unsigned short in unicode.h), then go from there. But I'm not sure how to interface that rules engine: where do we want the conversion to and from UTF-16 to happen? For a format like NT, we'd really want this:

wordlist (utf8) -> conv2utf16 -> rules16 -> NT_fmt (still utf16 so we'd need to have an alternative set_key() for this case)

while for raw-md5 using UTF-8, you'd want:

wordlist (utf8) -> conv2utf16 -> rules16 -> conv2utf8 -> raw-md5_fmt

...and you might even want an input encoding that differs from the output (hashed) encoding. Say you have raw-md5 hashes made from ISO-8859-1 but you want to use a UTF-8 wordlist:

wordlist (utf8) -> conv2utf16 -> rules16 -> conv2ansi -> raw-md5_fmt
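
To make the three pipelines above concrete, here is a rough sketch in C of one rule (reverse) running on UTF-16 code units, with the output encoding chosen at the end. Everything in it is an assumption for illustration: the function names (utf8_to_utf16, rule16_reverse, apply_reverse and so on) are hypothetical and are not the helpers in unicode.h, the UTF16 typedef only stands in for the real one, and the converters handle BMP characters only, without rejecting overlong sequences or surrogates.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t UTF16; /* stand-in for the typedef in unicode.h */

enum out_enc { OUT_UTF8, OUT_LATIN1 };

/* UTF-8 -> UTF-16, 1- to 3-byte sequences only. Returns unit count or -1. */
static int utf8_to_utf16(const unsigned char *s, UTF16 *dst, int dst_max)
{
    int n = 0;
    while (*s) {
        uint32_t cp;
        if (s[0] < 0x80) {
            cp = s[0]; s += 1;
        } else if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
            cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F); s += 2;
        } else if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
                   (s[2] & 0xC0) == 0x80) {
            cp = ((uint32_t)(s[0] & 0x0F) << 12) |
                 ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F); s += 3;
        } else {
            return -1; /* malformed, or outside the BMP */
        }
        if (n >= dst_max) return -1;
        dst[n++] = (UTF16)cp;
    }
    return n;
}

/* UTF-16 -> NUL-terminated UTF-8. Returns byte count or -1 if it won't fit. */
static int utf16_to_utf8(const UTF16 *src, int len, unsigned char *dst, int dst_max)
{
    int n = 0;
    for (int i = 0; i < len; i++) {
        UTF16 c = src[i];
        int need = c < 0x80 ? 1 : c < 0x800 ? 2 : 3;
        if (n + need >= dst_max) return -1; /* keep room for the NUL */
        if (need == 1) {
            dst[n++] = (unsigned char)c;
        } else if (need == 2) {
            dst[n++] = (unsigned char)(0xC0 | (c >> 6));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            dst[n++] = (unsigned char)(0xE0 | (c >> 12));
            dst[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    dst[n] = 0;
    return n;
}

/* UTF-16 -> NUL-terminated ISO-8859-1. Fails on anything above U+00FF. */
static int utf16_to_latin1(const UTF16 *src, int len, unsigned char *dst, int dst_max)
{
    if (len >= dst_max) return -1;
    for (int i = 0; i < len; i++) {
        if (src[i] > 0xFF) return -1;
        dst[i] = (unsigned char)src[i];
    }
    dst[len] = 0;
    return len;
}

/* The "rules16" stage: reverse the word by code unit, not by byte. */
static void rule16_reverse(UTF16 *w, int len)
{
    for (int i = 0, j = len - 1; i < j; i++, j--) {
        UTF16 t = w[i]; w[i] = w[j]; w[j] = t;
    }
}

/* wordlist (utf8) -> conv2utf16 -> rules16 -> conv2utf8 or conv2ansi */
int apply_reverse(const char *word_utf8, enum out_enc enc,
                  unsigned char *out, int out_max)
{
    UTF16 buf[128];
    assert(out_max > 0);
    int len = utf8_to_utf16((const unsigned char *)word_utf8, buf, 128);
    if (len < 0) return -1;
    rule16_reverse(buf, len);
    return enc == OUT_UTF8 ? utf16_to_utf8(buf, len, out, out_max)
                           : utf16_to_latin1(buf, len, out, out_max);
}
```

For the NT case the last conversion would simply be skipped and the UTF16 buffer handed to the format directly, which is where the alternative set_key() would come in. Reversing the raw UTF-8 bytes instead would swap the two bytes of a character like "é" and produce invalid UTF-8, which is the kind of breakage the char-based engine cannot avoid.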

This would be powerful and I'd like to have it. But again, I doubt many others care :-)

magnum
