john-dev - Re: "valid character" class

Message-ID: <20110809120231.GA27064@openwall.com>
Date: Tue, 9 Aug 2011 16:02:31 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: "valid character" class

On Tue, Aug 09, 2011 at 01:00:31PM +0200, magnum wrote:
> OK, I think we'll go for ?y for 'valid' then.

Sounds good.

> Question to *all*: There are some characters that are truly invalid for 
> a codepage, like 0x98 in cp1251. There are also characters that are not 
> really invalid per the Unicode spec, but are control characters. For 
> example, in most (all?) ISO-8859-xx codepages, the characters 
> 0x80..0x9F. Should we treat the latter as invalid? There are pros and 
> cons. My personal vote is that we should treat them as invalid, i.e. the 
> rule !?Y would drop any candidate that contains 0x80..0x9F if we're 
> using --enc=iso-8859-1 but only 0x98 if using --enc=cp1251.

I concur.
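
To make the proposed behavior concrete, here is a minimal sketch in Python
(not John's rule engine). It assumes Python's own codec tables as a stand-in
for John's codepage tables, handles single-byte codepages only, and uses
hypothetical helper names. A byte counts as "invalid" if the codepage leaves
it undefined or maps it to a control character:

```python
import unicodedata

def invalid_high_bytes(encoding):
    """Bytes in 0x80..0xFF treated as invalid for a single-byte codepage.

    Assumption: Python's codec for `encoding` stands in for John's tables.
    """
    invalid = []
    for b in range(0x80, 0x100):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            invalid.append(b)  # undefined in this codepage
            continue
        if unicodedata.category(ch) == "Cc":
            invalid.append(b)  # defined, but a control character
    return invalid

def rejected(word, encoding):
    """True if the candidate (a bytes object) contains any invalid byte."""
    bad = set(invalid_high_bytes(encoding))
    return any(b in bad for b in word)

print([hex(b) for b in invalid_high_bytes("cp1251")])
print(len(invalid_high_bytes("iso-8859-1")))
```

Under these assumptions, a candidate such as b"abc\x85" would be rejected
for iso-8859-1 but kept for cp1251.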

We may also want to introduce a class for control chars, though.
By default, it'd cover whatever chars are usually the control ones on
terminals - see the DumbForce sample.  However, for example,
--encoding=cp1251 will turn most chars in the 0x80 to 0x9f range into
non-control ones, even though they will remain risky to the terminal...

In practice, I'd expect the complement of this class (non-control) to be
more useful.  We'll get that one automatically.

So we'll have ?y for valid and ?O for non-control - similar, but
different (as you explained above).

Oh, and we may want to reserve a consecutive range of character class
letters (maybe a very small range) for user-defined classes.  Maybe we
could use digits rather than letters, but then there won't be automatic
complements.

> One effect of doing so is the ability to reject/accept any UTF-8 encoded 
> words (from a mixed wordlist like RockYou.txt) using such rules because 
> *all* non-ASCII characters in UTF-8 contain octets in that range.

In what range?  Sorry, I don't understand what you mean here.  There are
UTF-8 characters that are not ASCII yet that do not contain octets in
the 0x80 to 0x9f range.  So perhaps you meant something else.
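
A quick way to check this is to look at the UTF-8 octets directly.  The
sketch below is plain Python, unrelated to John's code, and the helper name
is hypothetical.  For example, U+00E0 encodes as C3 A0, and neither octet
falls in 0x80 to 0x9f, whereas U+00C0 encodes as C3 80 and does hit the
range:

```python
def has_octet_in_c1_range(ch):
    """True if the UTF-8 encoding of ch contains an octet in 0x80..0x9F."""
    return any(0x80 <= b <= 0x9F for b in ch.encode("utf-8"))

# Non-ASCII characters whose UTF-8 octets all lie outside 0x80..0x9F
for ch in ("\u00e0", "\u00e9", "\u0430"):
    print("U+%04X" % ord(ch), ch.encode("utf-8").hex(), has_octet_in_c1_range(ch))

# Non-ASCII characters that do contain an octet in 0x80..0x9F
for ch in ("\u00c0", "\u044f"):
    print("U+%04X" % ord(ch), ch.encode("utf-8").hex(), has_octet_in_c1_range(ch))
```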

Thanks,

Alexander
