john-dev - Re: revised incremental mode and charset files (was: Bleeding-jumbo branch updated from core)

Message-ID: <b3bafd74ac9f00a39108646069a73846@smtp.hushmail.com>
Date: Fri, 26 Apr 2013 03:06:23 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: revised incremental mode and charset files (was: Bleeding-jumbo branch updated from core)

On 26 Apr, 2013, at 0:31 , Solar Designer <solar@...nwall.com> wrote:
> I've just pushed another related change out to the public CVS
> repository: speeding up charset file generation, as this becomes more
> important with increased CHARSET_* settings that we're going to use.

Cool, I'll try to merge things day by day in order to not lose control.

> I considered keeping/introducing some backwards compatibility
> code, like we had for JtR 1.6's (not a typo) incremental mode revision
> until now.  This would be reasonably easy if we kept CHARSET_MAX and
> CHARSET_LENGTH intact, but clearly we're not going to.  Since at least
> the default CHARSET_LENGTH will change, which would break compatibility
> with old charset files even without code changes, my options were either
> to revise/add a lot of code to support charset files with arbitrary
> CHARSET_LENGTH settings (different from the current build's) or to drop
> the backwards compatibility altogether and simply standardize on new and
> much increased default CHARSET_LENGTH (and maybe new CHARSET_MAX as
> well).  I chose the latter.

It would be mighty cool to support any old charset file, even one produced with a different max & length, but it mostly just affects resumed sessions, and in Jumbo we can't really guarantee resuming between versions anyway. Changes in wordlist dupe suppression, single mode max-pairs and numerous other little things make it infeasible to really *guarantee* 100% resumability, although in general it will "work" just fine.

> What do you think the new CHARSET_LENGTH and CHARSET_MAX defaults should
> be?  I think the minimum for new defaults is CHARSET_LENGTH 15
> (increased from the current/old default of 8) and CHARSET_MAX 0x7E (same
> as it is now), but should we go higher?

I urge you to use a CHARSET_MAX of 0xFF. This does not mean you have to supply charset files that actually use it a lot, or at all. But if anyone chooses to generate his own charset, it will happily co-exist with the supplied files. That is a great benefit.

For length, 15 is barely enough. Again, unless there is some kind of bad drawback, why not make a significant increase and ship it with 24 or even more, useful or not. And again you do not have to supply charsets that really go that far.

If there is a significant tradeoff (speed? size? precision?), limit length a little from what I just suggested but not charset_max. Just my opinion.

> If we support 8-bit chars by default (increase CHARSET_MAX), then should
> the supplied .chr files include those or not?  

Like I said, I think this is less important. Perhaps supply the ones we are used to having, generated from mostly the same sources as always, and add a rockyou.chr (or two, see below) on top of that.

> If yes, then where should
> the input data come from?  RockYou has about 18k strings with 8-bit
> chars, but it is unclear how many of those are actual passwords; I think
> the percentage of junk is a lot higher within this portion of RockYou
> than it is in RockYou overall.

I believe it isn't - it's just that some of it is ISO-8859-1 while most is UTF-8. I can upload a cleaned version of rockyou-with-dupes to Bull that you can consider - or you could produce your own like this: each line that was ASCII or strictly valid UTF-8 is left intact, and other lines are assumed to be CP1252 [so converted to UTF-8]. This results in very little garbage, if any. A charset generated from that produces a lot of cool valid UTF-8 words (e.g. lots of Spanish) and not that much bogus UTF-8 (I'm testing latest code now).
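The cleaning rule described above could be sketched roughly as follows. This is only an illustration in Python, not the script that was actually used; the function name `to_utf8` is hypothetical, and dropping lines that contain bytes undefined in CP1252 (such as 0x81) is an assumption, since the message does not say how those are handled.

```python
from typing import Optional


def to_utf8(line: bytes) -> Optional[bytes]:
    """Normalize one password line (without its newline) to UTF-8.

    ASCII and strictly valid UTF-8 lines are returned unchanged.
    Anything else is assumed to be CP1252 and re-encoded as UTF-8.
    Assumption: lines with bytes undefined in CP1252 are dropped (None).
    """
    try:
        line.decode("utf-8")  # strict decoding; ASCII is a subset of UTF-8
        return line
    except UnicodeDecodeError:
        pass
    try:
        return line.decode("cp1252").encode("utf-8")
    except UnicodeDecodeError:
        return None  # assumption: not representable as CP1252 either


if __name__ == "__main__":
    for raw in (b"password", "to\u00f1o".encode("utf-8"), b"to\xf1o"):
        print(raw, "->", to_utf8(raw))
```

Input would need to be read in binary mode and split on newlines so that no decoding happens before this check.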

This file can be used as-is to produce charset files, or it can be converted to CP1252 (dropping lines that do not fit that). Or perhaps both - two different charset files. As we can't (yet) internally convert input to another codepage, both versions (and theoretically even more versions, like CP437 but that would be overkill) might be needed. But for some time now, most leaked datasets contain UTF-8 passwords.
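For the CP1252 variant, a minimal Python sketch of "convert, and drop lines that do not fit" might look like this. The function name `to_cp1252` is hypothetical, and it assumes the input has already been cleaned to UTF-8; lines that are not valid UTF-8 are dropped as well.

```python
from typing import Optional


def to_cp1252(line: bytes) -> Optional[bytes]:
    """Convert one UTF-8 line (without its newline) to CP1252.

    Returns None when the line is not valid UTF-8 or contains a
    character that CP1252 cannot represent, so the caller can drop it.
    """
    try:
        return line.decode("utf-8").encode("cp1252")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return None


if __name__ == "__main__":
    samples = ["l0v\u20ac", "miksz\u00f3", "\u043f\u0430\u0440\u043e\u043b\u044c"]
    for text in samples:
        print(text, "->", to_cp1252(text.encode("utf-8")))
```

The euro sign and the accented Latin letters fit in CP1252, while the Cyrillic sample does not and would be dropped.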

$ ../run/john -inc -stdout -ext:filter_utf8 | perl -ne 'use bytes; print if /[\x80-\xff]/ and $len{length($_)}++ < 3'
Warning: only 192 characters available
toño
toña
toñe
l0v€
1q!"à
1q!"é
mikszó
mikszç
mikszñ
words: 289894131  time: 0:00:01:35  w/s: 3024K  current: 20vmmu
Session aborted

BTW feel free to include filter_utf8 in core, under any conditions/license you want. It seems to do the job well, although slowly (and obviously, feel free to improve that :-)

magnum