john-dev - Re: Markov UTF-8 magic


Message-ID: <2b7b4cab88dacf40b1e2773275c568bb@smtp.hushmail.com>
Date: Sun, 6 Jan 2013 04:52:20 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Markov UTF-8 magic

On 6 Jan, 2013, at 3:23 , magnum <john.magnum@...hmail.com> wrote:

> On 5 Jan, 2013, at 14:29 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
>> On 01/05/2013 01:11 PM, Frank Dittrich wrote:
>>> Since Markov mode generates words based on 2-byte-frequencies, and since
>>> it generates passwords shorter than maximum length, there will be a
>>> non-negligible number of words with invalid UTF-8 characters,
>>> especially at the end of the word. So you might need to combine --markov
>>> with an --external filter.
>> 
>> If you don't want to write a general-purpose utf-8 validity check, but
>> just one which checks --markov output based on stats files which have
>> been generated using a word list encoded in (valid) UTF-8, then this
>> task is quite simple:
>> 
>> If the last byte is < 0x80, the word is valid.
>> Else if the last byte is > 0xbf, the word is invalid.
>> Else if the second to last byte is >= 0xc0 and <= 0xdf, the word is valid.
>> Else if the third to last byte is >= 0xe0 and <= 0xef, the word is valid.
>> Else if the fourth to last byte is >= 0xf0 and <= 0xf7, the word is valid.
>> Else the word is invalid.
> 
> I'm thinking I could include this in the Markov mode itself, provided we run with --enc=utf8. Would that be sane?

I tried doing so. Please test.
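
For anyone who wants to test against the rules Frank listed, here is a rough standalone C sketch of that tail check. It is only an illustration of the quoted rules, not the code that went into the Markov mode. The function name utf8_tail_ok is made up for this example, the length guards are my addition (the rules do not say what to do with words shorter than four bytes), and treating an empty word as valid is an assumption. Like the original rules, it only inspects the end of the word and relies on the stats having been built from valid UTF-8; it is not a general-purpose validator.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the end of the candidate looks like a
 * complete UTF-8 sequence according to the quoted rules, 0 otherwise.
 * Only the last (up to) four bytes are inspected.
 */
static int utf8_tail_ok(const unsigned char *word, size_t len)
{
	unsigned char last;

	if (len == 0)
		return 1;	/* assumption: nothing to truncate */

	last = word[len - 1];

	if (last < 0x80)
		return 1;	/* ends in plain ASCII */
	if (last > 0xbf)
		return 0;	/* ends in a lead byte with no continuation */

	/* Last byte is a continuation byte (0x80..0xbf): find its lead byte. */
	if (len >= 2 && word[len - 2] >= 0xc0 && word[len - 2] <= 0xdf)
		return 1;	/* complete 2-byte sequence */
	if (len >= 3 && word[len - 3] >= 0xe0 && word[len - 3] <= 0xef)
		return 1;	/* complete 3-byte sequence */
	if (len >= 4 && word[len - 4] >= 0xf0 && word[len - 4] <= 0xf7)
		return 1;	/* complete 4-byte sequence */

	return 0;		/* truncated or stray continuation byte */
}
```

A candidate for which this returns 0 would simply be skipped instead of being passed on to the hash function.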

magnum
