john-dev - RE: --utf8 option, proof of concept

Message-ID: <006801cbdea8$2bcf9ed0$836edc70$@net>
Date: Wed, 9 Mar 2011 16:20:17 -0600
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: RE: --utf8 option, proof of concept

>-----Original Message-----
>From: jfoug [mailto:jfoug@....net]
>
>Ignore the above.  I AM having problems.  I  will produce some changes.
>However, it should be easy to get 27 unicode chars from an 'unknown'
>utf8 string (but things may get truncated).  I will be making changes
>to the ConvertUTF8.c, so that it returns back how much of the original
>utf8 string was used (if we max out the Unicode characters).  Also, I
>will have to do work in nt_fmt.c, so that longer passwords can be
>handled (in places other than the Unicode SSE buffers).  But when done,
>it should be able to process in -utf8 mode, and 'fill' up 27 characters.
>Right now, it can only handle 27 ansi characters (the ones which convert
>1 char into 1 unicode).  It will currently only handle 13-15 unicode
>characters created from utf8 multi-byte inputs.  We can get that to 27.
>
>Jim.

I believe the changes I made now work for NT utf-8 to Unicode, up to 27
Unicode characters (or up to however many Unicode characters can be built
from 95 utf8 characters).

There was a bug in ConvertUTF.c (a buffer was 2x too long).  Also, some
endian conversion was missing in the utf8towcs function, and utf8towcs
would miss a 'broken' last character and could start walking 'random'
memory in that case.  I have fixed all of those issues.
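
A minimal sketch of the kind of checks involved, for illustration only.
The helper names utf8_seq_len, utf8_seq_fits and store_utf16le are
hypothetical and are not taken from the attached ConvertUTF.c.  The sketch
assumes the output wanted is little-endian UTF-16, and that a 'broken' last
character means a multi-byte sequence cut off by the end of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length a UTF-8 sequence should have, judged only
 * from its lead byte.  Returns 0 if the byte cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: ASCII        */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx               */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx               */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx               */
    return 0;                             /* continuation or invalid */
}

/* Hypothetical helper: 1 if a whole sequence starting at src[pos] lies
 * inside src[0..src_len), else 0.  Checking this before reading the
 * continuation bytes keeps a truncated last character from sending the
 * reader past the end of the buffer. */
int utf8_seq_fits(const unsigned char *src, size_t src_len, size_t pos)
{
    size_t need;

    if (pos >= src_len)
        return 0;
    need = utf8_seq_len(src[pos]);
    return need != 0 && need <= src_len - pos;
}

/* Hypothetical helper: store one UTF-16 code unit low byte first, so the
 * bytes in memory are the same on little- and big-endian hosts. */
void store_utf16le(unsigned char *dst, uint16_t unit)
{
    dst[0] = (unsigned char)(unit & 0xFF);
    dst[1] = (unsigned char)(unit >> 8);
}
```

With these, a buffer ending in the first two bytes of a three-byte sequence
(for example 0xE2 0x82) fails utf8_seq_fits at that position, so a caller
can stop cleanly instead of reading on.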

If we max out at 27 Unicode characters, the function will now return a
negative number.  What is returned is the negative of the length of the
input string used.  So, if we pass in 62 utf8 chars, and want to convert to
at most 27 Unicode chars, and 53 of those utf8 chars were used in converting
to those 27 Unicode chars, then -53 is returned.  However, if all 62 chars
were converted into 25 Unicode characters, then 25 is returned (the length
of the Unicode 'string').
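
A minimal sketch of that return convention, for illustration only.  This is
not the code from the attached ConvertUTF.c: the function name
utf8_to_utf16_sketch and its parameters are hypothetical, it only handles
1- to 3-byte sequences, it does no overlong or surrogate checks, and it
assumes max_units is greater than zero.  It also assumes that input which
exactly fills the output with nothing left over counts as a normal
(positive) return.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the convention described above.
 * Converts src[0..src_len) into at most max_units UTF-16 code units.
 * Returns:
 *   >= 0  number of units written (decoding ran to the end of the input,
 *         or stopped early at a byte sequence this sketch cannot decode)
 *   <  0  output filled up first; the value is minus the number of input
 *         bytes that were consumed to produce the max_units units */
int utf8_to_utf16_sketch(const unsigned char *src, size_t src_len,
                         uint16_t *dst, int max_units)
{
    size_t in = 0;   /* input bytes consumed so far */
    int out = 0;     /* UTF-16 units written so far */

    while (in < src_len) {
        unsigned char c = src[in];
        size_t need, k;
        uint32_t cp;
        int ok = 1;

        if (c < 0x80)                { need = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { need = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { need = 3; cp = c & 0x0F; }
        else break;                  /* 4-byte or invalid lead: stop */

        if (need > src_len - in)
            break;                   /* 'broken' last character: stop */

        for (k = 1; k < need; k++) {
            if ((src[in + k] & 0xC0) != 0x80) { ok = 0; break; }
            cp = (cp << 6) | (src[in + k] & 0x3F);
        }
        if (!ok)
            break;                   /* bad continuation byte: stop */

        if (out == max_units)
            return -(int)in;         /* more input, but no room left */

        dst[out++] = (uint16_t)cp;
        in += need;
    }
    return out;
}
```

A caller can then test the sign: a negative result means the output buffer
is full and the absolute value says how much of the input was used, while a
non-negative result is simply the length of the Unicode 'string'.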

Knowing the length of the Unicode 'string' also allowed a little
optimization in the loading code (it gains a couple of percent in the SSE2
loading, and I have no idea what gains there are in XP-64).

In SSE2, it appears that -utf8 works at about 70% of the speed of 'non'
-utf8.  For testing, I made a 50 MB file of 'pure' utf8 data (your
utf80dic.txt file, appended to itself until it was 50 MB), then made a rule
set that was simply 30 ':' lines (i.e. the same line played 30 times).  Then
I ran it, loading the whole file into memory.  Doing this gave me the best
'valid' test, where the Unicode multi-byte conversion routines were in play.
For the 'non-SSE2' code (x86 build), -utf8 runs at about 75% of the speed.

View attachment "NT_fmt.c" of type "text/plain" (23077 bytes)

View attachment "ConvertUTF.h" of type "text/plain" (2345 bytes)

View attachment "ConvertUTF.c" of type "text/plain" (6847 bytes)
