john-dev - RE: --utf8 option, proof of concept

Message-ID: <004f01cbde6d$00366f00$00a34d00$@net>
Date: Wed, 9 Mar 2011 09:16:44 -0600
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: RE: --utf8 option, proof of concept

>-----Original Message-----
>From: magnum [mailto:rawsmooth@...dband.net]
>Sent: Tuesday, March 08, 2011 5:41 PM
>To: john-dev@...ts.openwall.com
>Subject: Re: [john-dev] --utf8 option, proof of concept
>
>I have solved the fmt_tests problem (so a couple of utf8 tests kick in
>when --utf8 is used together with --test) and I just realised mscash and
>the other salted formats do NOT fully work with utf8 yet, because I
>have yet to properly convert the salt from utf8 too. I just need to
>digest the code a while.


  if (options.flags & FLG_UTF8) {
     fmt_NT.methods.set_key = set_key_utf8;
+    fmt_NT.params.plaintext_length = 40;  // kick it up from 27; we will 'adjust' in set_key_utf8
     tests[0].plaintext = "\xC3\xBC";         // German u-umlaut in UTF-8
     tests[1].plaintext = "\xC3\xBC\xC3\xBC"; // two of them
+    tests[2].plaintext = "\xE2\x82\xAC"; // euro sign
+    tests[2].ciphertext = "$NT$030926b781938db4365d46adc7cfbcb8";
  }

The 'fix' above gets the Euro check in there.  I copy-pasted the 2nd ASCII
test (so there are now 2 of them); in utf8 mode its plaintext/ciphertext
are then overwritten with the UTF-8 values.  I also increased
plaintext_length when in utf8 mode.  40 may not be large enough, since we
want up to 27 Unicode characters.  The call to utf8towcs() within
set_key_utf8 specifies PLAINTEXT_LENGTH (27) as the maximum number of
characters, so utf8towcs will stop at 27.  I have also placed this change
into set_key_utf8, so that if we are given too many characters, we process
only the first 27 of them.

  static void set_key_utf8(char *_key, int index)
  {
     int len = strlen(_key);
     unsigned char *key = (unsigned char *)_key;
     unsigned int i=0;
     unsigned int md4_size=0;
     unsigned int last_length, saved_base=index<<5;
     UTF16 temp;
     int buff_base;
-    UTF16 utf16key[PLAINTEXT_LENGTH];
+    UTF16 utf16key[PLAINTEXT_LENGTH+1];
+    utf16key[PLAINTEXT_LENGTH] = 0;

     utf8towcs(utf16key, key, PLAINTEXT_LENGTH);
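
To put a number on "40 may not be large enough": a rough sketch, not code
from the patch. The helper names below are made up for illustration, and
the input is assumed to be well-formed UTF-8. A character that fits in one
UTF-16 unit takes at most 3 UTF-8 bytes, and a 4-byte UTF-8 sequence
becomes a surrogate pair (2 units), so the worst case is 3 bytes per
UTF-16 unit.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UTF16_UNITS 27  /* the 27-character limit discussed above */

/* Worst-case UTF-8 byte count for a given number of UTF-16 code units.
 * BMP characters need at most 3 bytes per unit; supplementary characters
 * need 4 bytes but occupy 2 units, so 3 bytes per unit is the bound. */
static size_t max_utf8_bytes(size_t utf16_units)
{
    return 3 * utf16_units;
}

/* Hypothetical helper: count the UTF-16 code units that a NUL-terminated,
 * well-formed UTF-8 string would need. No validation is done here. */
static size_t utf16_units_needed(const unsigned char *s)
{
    size_t units = 0;

    assert(s != NULL);
    for (; *s != 0; s++) {
        if ((*s & 0xC0) == 0x80)
            continue;                   /* continuation byte, already counted */
        units += (*s >= 0xF0) ? 2 : 1;  /* 4-byte lead => surrogate pair */
    }
    return units;
}
```

Under these assumptions max_utf8_bytes(MAX_UTF16_UNITS) comes to 81, so a
byte buffer sized for the worst case would need 81 bytes plus a terminator
to hold 27 UTF-16 units.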


I have tested by putting a LONG utf8 line into the utf8 dictionary file
(the same double-length word, on every other line).  There were 30 UTF
characters on that line, and it had no problem.  The program ran fine, and
did not 'lose' anything (all 100 test hashes were found).  It will test
truncated passwords, but I think this is default behavior for john in
general.  Unfortunately, I do not know what the plaintext length of the
utf8 data 'should' be, and I do not know if there is a way to know the
length in advance.  However, I think 3 utf8 chars into 1 unicode is
average.  It can be 1 to 1, up to 5 to 1.  NOTE,
if run in --utf8 mode against a .....

Ignore the above.  I AM having problems, and I will produce some changes.
However, it should be easy to get 27 Unicode characters from an 'unknown'
utf8 string (though the input may get truncated).  I will be making changes
to ConvertUTF8.c, so that it returns how much of the original utf8 string
was used (for the case where we max out the Unicode characters).  I will
also have to do work in nt_fmt.c, so that longer passwords can be handled
(in places other than the Unicode SSE buffers).  When done, it should be
able to run in --utf8 mode and 'fill' up all 27 characters.  Right now, it
can only handle 27 ANSI characters (the ones where 1 char converts into 1
Unicode character), and only 13-15 Unicode characters created from
multi-byte utf8 input.  We can get that to 27.
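
For illustration only, here is a rough sketch of the kind of bounded
conversion described above. It is NOT the ConvertUTF8.c interface; the
function name and signature are assumptions made up for this example. It
writes at most max_units UTF-16 units and reports how many source bytes
were consumed, so a caller can tell where the input was truncated. It
rejects stray continuation bytes, truncated sequences, encoded surrogates
and values above U+10FFFF, but does not check for overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert NUL-terminated UTF-8 to UTF-16, writing at
 * most max_units code units to dst. Returns the number of units written
 * and stores in *consumed the number of source bytes actually used.
 * Stops at NUL, at the first malformed sequence, or when the next
 * character would not fit in the remaining space. */
static size_t utf8_to_utf16_bounded(const unsigned char *src, uint16_t *dst,
                                    size_t max_units, size_t *consumed)
{
    size_t in = 0, out = 0;

    assert(src != NULL && dst != NULL && consumed != NULL);

    while (src[in] != 0) {
        unsigned char b = src[in];
        uint32_t cp;
        size_t len, i;

        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else break;                  /* stray continuation or invalid lead */

        /* A NUL fails the continuation test, so we never read past the end. */
        for (i = 1; i < len; i++) {
            if ((src[in + i] & 0xC0) != 0x80)
                break;
            cp = (cp << 6) | (uint32_t)(src[in + i] & 0x3F);
        }
        if (i < len)
            break;                   /* truncated or malformed sequence */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            break;                   /* not a valid Unicode scalar value */

        if (cp >= 0x10000) {
            if (out + 2 > max_units)
                break;               /* surrogate pair would not fit */
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (out + 1 > max_units)
                break;               /* output buffer is full */
            dst[out++] = (uint16_t)cp;
        }
        in += len;
    }
    *consumed = in;
    return out;
}
```

With a limit of 27 units, a set_key-style caller could use the consumed
count to keep only the part of the utf8 key that was actually converted.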

Jim.