john-dev - Re: Markov UTF-8 magic

Date: Sun, 6 Jan 2013 10:22:02 +0100
From: Frank Dittrich <frank_dittrich@...mail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Markov UTF-8 magic

On 01/06/2013 04:52 AM, magnum wrote:
> On 6 Jan, 2013, at 3:23 , magnum <john.magnum@...hmail.com> wrote:
> 
>> On 5 Jan, 2013, at 14:29 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
>>> On 01/05/2013 01:11 PM, Frank Dittrich wrote:
>>>> Since Markov mode generates words based on 2-byte-frequencies, and since
>>>> it generates passwords shorter than maximum length, there will be a
>>>> non-negligible number of words with invalid utf-8 characters,
>>>> especially at the end of the word. So you might need to combine --markov
>>>> with an --external filter.
>>>
>>> If you don't want to write a general-purpose utf-8 validity check, but
>>> just one which checks --markov output based on stats files which have
>>> been generated using a word list encoded in (valid) UTF-8, then this
>>> task is quite simple:
>>>
>>> If the last byte is < 0x80, the word is valid.
>>> Else if the last byte is > 0xbf, the word is invalid.
>>> Else if the second to last byte is >= 0xc0 and <= 0xdf, the word is valid.
>>> Else if the third to last byte is >= 0xe0 and <= 0xef, the word is valid.
>>> Else if the fourth to last byte is >= 0xf0 and <= 0xf7, the word is valid.
>>> Else the word is invalid.
>>
>> I'm thinking I could include this in the Markov mode itself, provided we run with --enc=utf8. Would that be sane?
> 
> I tried doing so. Please test.

Unfortunately, if your input contains both characters which are
represented by two bytes and characters which are represented by three
bytes, then you can get wrong sequences if you have characters with the
same second byte. I.e., you could have the third byte of a 3-byte
character appended to a 2-byte character, or the third byte of a 3-byte
character skipped, because 2-byte characters and 3-byte (or 4-byte)
characters use the same continuation bytes. It is just the first byte
that determines how long the UTF-8 representation of a character is.
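
To make the failure concrete, here is a small C sketch of the tail-only
rules quoted above. It is not John code: the function name tail_check is
made up for illustration, and I am assuming the check added to Markov mode
follows those quoted rules, which I have not verified.

```c
#include <stddef.h>

/* Hypothetical helper: applies only the end-of-word rules quoted above.
 * Returns 1 if the word is accepted, 0 if it is rejected. */
static int tail_check(const unsigned char *w, size_t n)
{
    unsigned char last;

    if (n == 0)
        return 1;               /* empty word: nothing to reject */
    last = w[n - 1];
    if (last < 0x80)
        return 1;               /* last byte is ASCII */
    if (last > 0xbf)
        return 0;               /* word ends with a lead byte */
    /* Last byte is a continuation byte: look for a matching lead byte. */
    if (n >= 2 && w[n - 2] >= 0xc0 && w[n - 2] <= 0xdf)
        return 1;
    if (n >= 3 && w[n - 3] >= 0xe0 && w[n - 3] <= 0xef)
        return 1;
    if (n >= 4 && w[n - 4] >= 0xf0 && w[n - 4] <= 0xf7)
        return 1;
    return 0;
}
```

With these rules the bytes e2 82 61 (the first two bytes of a 3-byte
character followed by 'a') are accepted, because only the end of the word
is inspected and the last byte is ASCII. That is exactly the "third byte
skipped" case described above.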

I tried to implement a better UTF-8 check as an external mode, but so
far I have had no time to test it:

# Check for valid UTF-8 encoding
[List.External:UTF-8]
void filter()
{
	int i, j, c;	/* j: index of the last continuation byte still expected */
	i = 0;
	j = -1;

	while (c = word[i]) {
		if (c >= 0x80) {
			if (c < 0xc0) {
				if (i > j) {
					word[0] = 0;
					return;
				}
			}
			else if (i <= j) {
				word[0] = 0;
				return;
			}
			else {
				j = i+1;
				if (c >= 0xe0) {
					j++;
					if (c > 0xf7) {
						word[0] = 0;
						return;
					}
					else if (c >= 0xf0)
						j++;
				}
			}
		}
		else if (i <= j) {
			word[0] = 0;
			return;
		}
		i++;
	}
	if (i <= j)
		word[0] = 0;
}

If there are bugs, you should still be able to get the idea of what I
tried to do.
Some bytes (0xc0, 0xc1, and 0xf5 - 0xff) are invalid. Those shouldn't be
generated by Markov mode if the input for generating the stats file was
valid UTF-8, but for a general-purpose UTF-8 checker, they should be
excluded.
See http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
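
For comparison, a general-purpose check along those lines might look
roughly like the following in plain C. This is only a sketch, not John
code, and the name utf8_ok is made up. It rejects the invalid bytes listed
above plus stray, missing or surplus continuation bytes. It does not try
to catch overlong 3-byte or 4-byte forms or surrogates.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the n bytes at s are structurally
 * valid UTF-8, 0 otherwise. */
static int utf8_ok(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i++];
        size_t need;            /* continuation bytes that must follow */

        if (c < 0x80)
            continue;           /* ASCII */
        if (c < 0xc2)
            return 0;           /* stray continuation byte, 0xc0 or 0xc1 */
        else if (c < 0xe0)
            need = 1;           /* 2-byte character */
        else if (c < 0xf0)
            need = 2;           /* 3-byte character */
        else if (c < 0xf5)
            need = 3;           /* 4-byte character */
        else
            return 0;           /* 0xf5 - 0xff are never valid */

        if (n - i < need)
            return 0;           /* word ends in the middle of a character */
        while (need--) {
            if ((s[i] & 0xc0) != 0x80)
                return 0;       /* expected a continuation byte */
            i++;
        }
    }
    return 1;
}
```

Unlike a check that only looks at the end of the word, this walks the
whole word, so a sequence like e2 82 61 is rejected while e2 82 ac is
accepted.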

Frank