john-dev - Re: revised incremental mode and charset files (was: Bleeding-jumbo branch updated from core)

Message-ID: <20130426201657.GB24771@openwall.com>
Date: Sat, 27 Apr 2013 00:16:57 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: revised incremental mode and charset files (was: Bleeding-jumbo branch updated from core)

On Fri, Apr 26, 2013 at 03:06:23AM +0200, magnum wrote:
> On 26 Apr, 2013, at 0:31 , Solar Designer <solar@...nwall.com> wrote:
> > I've just pushed another related change out to the public CVS
> > repository: speeding up charset file generation, as this becomes more
> > important with increased CHARSET_* settings that we're going to use.
> 
> Cool, I'll try to merge things day by day in order to not lose control.

Thank you!  Please do a merge again, as I pushed some additional changes
to incremental mode to the public CVS tree.

> I urge you to use a CHARSET_MAX of 0xFF. This does not mean you have to supply charset files that actually use it a lot, or at all. But if anyone chooses to generate his own charset, it will happily co-exist with the supplied files. That is a great benefit.
> 
> For length, 15 is barely enough. Again, unless there is some kind of bad drawback, why not make a significant increase and ship it with 24 or even more, useful or not. And again you do not have to supply charsets that really go that far.

OK, here are the new defaults:

#define CHARSET_MIN                     0x01
#define CHARSET_MAX                     0xff
#define CHARSET_LENGTH                  24

This brings (CHARSET_SIZE + 1) to exactly 0x100, which simplifies
address calculation (I actually checked "size inc.o" for different
builds, and benchmarked with an --stdout run with the puts() call
commented out).  We can't allow NUL, because we use it for CHARSET_ESC
and because it'd terminate C strings that we use between inc.c,
cracker.c, and the format's set_key() and get_key().  I think we don't
want to anyway.
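
To illustrate the two points above, here is a minimal sketch.  It assumes
CHARSET_SIZE is defined as the inclusive range CHARSET_MAX - CHARSET_MIN + 1;
flat_index() and visible_length() are hypothetical helpers made up for this
example, not functions from inc.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHARSET_MIN			0x01
#define CHARSET_MAX			0xff
/* Assumption: inclusive range, which gives CHARSET_SIZE + 1 == 0x100 */
#define CHARSET_SIZE			(CHARSET_MAX - CHARSET_MIN + 1)

/*
 * Hypothetical helper: flat index into a table that has
 * (CHARSET_SIZE + 1) entries per row.  With a row stride of 0x100,
 * a compiler can turn the multiplication into a shift by 8.
 */
unsigned int flat_index(unsigned int row, unsigned int col)
{
	assert(col <= CHARSET_SIZE);
	return row * (CHARSET_SIZE + 1) + col;
}

/*
 * Hypothetical helper: the key length as seen by any code that
 * receives the key as a C string.
 */
size_t visible_length(const char *key)
{
	return strlen(key);
}
```

With these definitions, flat_index(3, 5) equals (3 << 8) | 5, and a
candidate such as "ab\0cd" would be seen as just "ab" by anything that
treats it as a C string, which is why a NUL inside a key can't survive
the trip to set_key().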

> If there is a significant tradeoff (speed? size? precision?), limit length a little from what I just suggested but not charset_max. Just my opinion.

There was a significant performance hit and memory usage increase from the
increase of either CHARSET_SIZE or CHARSET_LENGTH, let alone both.  I've
implemented changes to reduce the impact of the increase of these
settings greatly.  Now there's still a memory usage increase, but it's
not as bad, and the performance hit is almost eliminated.  There's
further room for improvement in this area, but I will likely leave it
for post-1.8 (can't afford more scope creep now).

> > If we support 8-bit chars by default (increase CHARSET_MAX), then should
> > the supplied .chr files include those or not?  
> 
> Like I said, I think this is less important. Perhaps supply the ones we are used to having, generated from mostly the same sources as always, and add a rockyou.chr (or two, see below) on top of that.

I intend to use only RockYou for the new .chr files, perhaps with some
simple and documented filtering applied to it (perhaps in the form of
revised external mode filters).  This will let JtR users regenerate the
same .chr files, or make minor adjustments and then regenerate the files -
as long as the RockYou list is available for download somewhere (and I
expect that it will remain available).

> > If yes, then where should
> > the input data come from?  RockYou has about 18k strings with 8-bit
> > chars, but it is unclear how many of those are actual passwords; I think
> > the percentage of junk is a lot higher within this portion of RockYou
> > than it is in RockYou overall.
> 
> I believe it isn't - it's just that some of it is ISO-8859-1 while most is UTF-8. I can upload a cleaned version of rockyou-with-dupes to Bull that you can consider - or you could produce your own like this: Each line that was ASCII or strictly valid UTF-8 is intact, other lines are assumed to be CP1252 [so converted to UTF-8]. This results in very little garbage, if any. A charset produced from that produces a lot of cool valid UTF-8 words (eg. lots of Spanish) and not that much bogus UTF-8 (I'm testing latest code now).

Curious.  Perhaps you can post a script that does such preprocessing of
the RockYou list?
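
For reference, the rule described in the quoted text (keep lines that are
ASCII or strictly valid UTF-8, treat everything else as CP1252 and convert
it to UTF-8) could be sketched in C roughly as follows.  This is not the
script actually used to produce the cleaned list; the function names are
hypothetical, and mapping the five bytes that CP1252 leaves undefined to
the C1 control code points of the same value is an assumption made here.

```c
#include <stddef.h>
#include <string.h>

/* CP1252 code points for bytes 0x80..0x9f; the five undefined bytes
 * (0x81, 0x8d, 0x8f, 0x90, 0x9d) keep their own value (assumption). */
static const unsigned short cp1252_high[32] = {
	0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
	0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
	0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
	0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178
};

/* Strict check: rejects overlong forms, surrogates, and > U+10FFFF */
int is_valid_utf8(const unsigned char *s, size_t n)
{
	size_t i = 0, k, len;
	unsigned int cp, min;

	while (i < n) {
		unsigned char c = s[i];
		if (c < 0x80) { i++; continue; }
		if ((c & 0xe0) == 0xc0) { len = 2; cp = c & 0x1f; min = 0x80; }
		else if ((c & 0xf0) == 0xe0) { len = 3; cp = c & 0x0f; min = 0x800; }
		else if ((c & 0xf8) == 0xf0) { len = 4; cp = c & 0x07; min = 0x10000; }
		else return 0;
		if (n - i < len) return 0;
		for (k = 1; k < len; k++) {
			if ((s[i + k] & 0xc0) != 0x80) return 0;
			cp = (cp << 6) | (s[i + k] & 0x3f);
		}
		if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
			return 0;
		i += len;
	}
	return 1;
}

/* Encode a code point below U+10000 (enough for all of CP1252) */
size_t put_utf8(unsigned int cp, unsigned char *out)
{
	if (cp < 0x80) { out[0] = cp; return 1; }
	if (cp < 0x800) {
		out[0] = 0xc0 | (cp >> 6);
		out[1] = 0x80 | (cp & 0x3f);
		return 2;
	}
	out[0] = 0xe0 | (cp >> 12);
	out[1] = 0x80 | ((cp >> 6) & 0x3f);
	out[2] = 0x80 | (cp & 0x3f);
	return 3;
}

/* Copy valid UTF-8 lines as-is; otherwise convert from CP1252.
 * out must have room for 3 * n bytes.  Returns the output length. */
size_t normalize_line(const unsigned char *in, size_t n, unsigned char *out)
{
	size_t i, o = 0;

	if (is_valid_utf8(in, n)) {
		memcpy(out, in, n);
		return n;
	}
	for (i = 0; i < n; i++) {
		unsigned char c = in[i];
		o += put_utf8((c >= 0x80 && c <= 0x9f) ? cp1252_high[c - 0x80] : c,
		    out + o);
	}
	return o;
}
```

A caller would read the list line by line, strip the newline, and pass
each line through normalize_line() before writing it back out.  Lines
that are pure ASCII pass the validity check trivially, so they are never
modified.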

> BTW feel free to include filter_utf8 in core, under any conditions/license you want. It seems to do the job well, although slow (and obviously, feel free to improve that :-)

Thanks.  I will likely leave this for later.

Alexander