john-users - Re: source of information for John's charset files

Message-ID: <20210502212134.GA4367@openwall.com>
Date: Sun, 2 May 2021 23:21:34 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: source of information for John's charset files

On Sun, May 02, 2021 at 10:29:29AM -0800, Royce Williams wrote:
> On Sun, May 2, 2021 at 9:50 AM Solar Designer <solar@...nwall.com> wrote:
> 
> > (I had heard folks cracked almost the entire HIBP set by downloading and
> > testing against it various lists of breached passwords.  After all, HIBP
> > is supposed to only contain passwords that were breached or leaked in
> > plaintext, so if Troy could compile this collection then others could as
> > well.  However, for my test above I only used what was crackable without
> > usage of plaintext leaks beyond RockYou.)
> 
> Just to make sure that everyone's aware, it wasn't just a matter of
> acquiring the component breaches. Many other techniques were needed to
> fully "recover" the plains for the HIBP hashes as published. Many of them
> are not "real-world" passwords - they're full of nested hashes, conversion
> errors, HTML escapes, truncations, untrimmed separators, and many other
> non-password artifacts. And even after reverse-engineering those, some
> remain. Just something to keep in mind when measuring cracking success
> rates against that corpus, or trying to use that corpus as a wordlist for
> other attacks.

Thank you, Royce.

> For more detail, CynoSure Prime and m33x and I did some work on the first
> couple of HIBP releases, and wrote up the results here:
> 
> https://blog.cynosureprime.com/2017/08/320-million-hashes-exposed.html
> 
> Hard to believe it was four years ago. :)

This appears to show that at least the nested hashes are a small
minority of the total.  Also, being unusually long for passwords, they
wouldn't affect incremental mode much, since its statistics are mostly
per-length.  And they're easy to exclude.
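
As a rough illustration of the "easy to exclude" point, here is a minimal
sketch of a filter that drops plaintexts that look like bare hex digests.
The function names and the rule itself (exactly 32 or 40 hex characters,
i.e. MD5- or SHA-1-sized) are assumptions made for this example, not
something taken from the write-up linked above.  Other nested formats
would need their own patterns.

```python
import re

# Assumption: a "nested hash" plaintext is a bare hex digest of
# MD5 size (32 hex chars) or SHA-1 size (40 hex chars).
# This is a heuristic; a real password could match it by accident.
HEX_DIGEST = re.compile(r"(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{40})")


def looks_like_nested_hash(plaintext):
    """Return True if the whole string is a 32- or 40-char hex digest."""
    return HEX_DIGEST.fullmatch(plaintext) is not None


def exclude_nested_hashes(plaintexts):
    """Keep only the plaintexts that do not look like hex digests."""
    return [p for p in plaintexts if not looks_like_nested_hash(p)]
```

The filter would be applied to the cracked plaintexts before they are
used as a training set, so that the remaining entries are closer to what
people actually typed.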

A few other things you list are really bad for this use, indeed, but
again it matters how common they are in that corpus.

Anyway, I just ran some tests the other way around - "cracking" RockYou
passwords.  I didn't try excluding RockYou itself from the training sets
here - can't do that while including our current .chr files in the
comparison.  So this is in-sample testing, which is generally the wrong
thing to do, but with that in mind here are the results for different
training sets (all are for incremental mode and 1 billion candidates):

RockYou with dupes - 20.2%
RockYou unique - 21.9%
HIBPv7 cracked - 17.9%

The percentages cracked are those of RockYou unique.
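
For anyone wanting to reproduce this kind of measurement, here is a
minimal sketch of how such a percentage can be computed.  It assumes the
candidates arrive as an iterable of strings (for example, lines read from
a candidate generator's output) and that the target passwords fit in
memory.  The function name and parameters are made up for this example,
and this is not necessarily how the numbers above were obtained.

```python
from itertools import islice


def percent_cracked(candidates, targets, max_candidates):
    """Percentage of unique target passwords hit by the first
    max_candidates candidates.

    candidates     -- iterable of candidate password strings
    targets        -- iterable of target passwords (duplicates allowed)
    max_candidates -- how many candidates to consume at most
    """
    # The denominator is the number of unique targets, matching
    # "percentages cracked are those of RockYou unique".
    unique_targets = set(targets)
    if not unique_targets:
        return 0.0

    cracked = set()
    for candidate in islice(candidates, max_candidates):
        if candidate in unique_targets:
            cracked.add(candidate)

    return 100.0 * len(cracked) / len(unique_targets)
```

Collecting hits in a set means a candidate that is generated twice is
only counted once.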

Not surprisingly, RockYou is the best fit for itself.  HIBP is an
acceptable fit as well.  It could potentially have performed better than
RockYou on this test due to its larger size, but as we can see, that was
not enough to make up for it being a less perfect fit than RockYou itself.

So this is inconclusive.

Royce (and others), please feel free to try generating .chr files from
RockYou vs. HIBP too, run them on real-world test sets you have, and
share your results here.  Thanks!

Alexander
