Date: Wed, 15 May 2024 22:03:08 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

Hi,

This thread was very timely for my talk, but I didn't have time to
comment here, so let me do that now.

On Tue, May 14, 2024 at 04:16:28PM +0200, Jens Timmerman wrote:
> I guess you could try to train a large language model on large lists of 
> known leaked passphrases?
> 
> These might perform better at learning the underlying patterns people 
> use when thinking of passphrases than a simple Markov model.
> 
> However this might end up being computationally expensive, and probably 
> also storage intensive if you want to create a nice list of passphrases.

Right.  The current unsolved problem with generative NNs is that as you
get them to produce progressively lower-weight outputs, they also
produce progressively more duplicates.  The duplicate ratio is on the
order of 50% at 1 billion candidate passwords/phrases generated (but it
varies greatly).  I wonder what it would be at, say, 10 billion - 90%
maybe?  So yes, this also becomes storage intensive if we try to
eliminate the duplicates.  Still, I'd like researchers of generative NN
based candidate password/phrase generators to release deduplicated
output lists of, say, 1 billion entries each, so that we could use them
and run comparisons against other tools without having to recreate
those time-consuming and unreliable setups ourselves (if documented and
reproducible at all, which unfortunately is usually not the case so
far).
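
For reference, here's a minimal sketch (Python, illustrative only, not
from any actual tool) of measuring the duplicate ratio of a generator's
output stream, assuming candidates arrive one per line on stdin:

import sys

# Count total vs. unique candidates read from stdin, e.g. piped in
# from a candidate generator.
seen = set()
total = 0
for line in sys.stdin:
    total += 1
    seen.add(line.rstrip("\n"))

if total:
    dup = (total - len(seen)) / total
    print(f"total={total} unique={len(seen)} duplicates={dup:.1%}")

At billions of candidates an in-memory set no longer fits in RAM, which
is exactly the storage cost mentioned above - external sorting (sort
-u) or approximate structures such as Bloom filters become necessary,
the latter trading occasional false drops for memory.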

See slides 74 and 77.  Here are the excerpts from the Markdown source:

---
-> Probabilistic candidate password generation with neural networks (2010s+) <-

* William Melicher et al., "Fast, Lean, and Accurate: Modeling Password
  Guessability Using Neural Networks", 2016
  - Recurrent neural network (RNN) predicts next character, no duplicates
  - 60 MB model outperforms other generators, but apparently was too slow to
    actually go beyond 10 million candidates, so that is only simulated
  - 3 MB performs almost as well, takes ~100 ms per password in JavaScript

* Generative Adversarial Networks (GAN) produce duplicates (~50% at 1 billion)
  - "PassGAN: A Deep Learning Approach for Password Guessing" (2017)
  - "Improving Password Guessing via Representation Learning" (2019)
  - "Generative Deep Learning Techniques for Password Generation" (2020)
    - David Biesner et al., VAE, WAE, fine-tuned GPT2 - maybe currently best?
  - "GNPassGAN: Improved Generative Adversarial Networks For Trawling Offline
    Password Guessing" "guessing 88.03% more passwords and generating 31.69%
    fewer duplicates" than PassGAN, which had already been outperformed (2022)

-> Future <-

[...]

* Focus
  - Better passphrase support (tools, datasets), arbitrary tokenization
  - Further neural networks, tackling the duplicates problem of generative NNs
    - Meanwhile, publicly release pre-generated and pre-filtered output
  - Application of NNs for targeting (scraping and training on user data)
---
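
To illustrate the "no duplicates" point in the Melicher et al. bullet
above: a model that explicitly outputs next-character probabilities
lets you enumerate candidates by best-first search over prefixes, where
each string is reached exactly once by construction.  Here's a hedged
sketch with a toy stand-in for the model (the fixed distribution below
is purely illustrative, not their actual RNN):

import heapq
import math

END = "\x00"  # sentinel marking end-of-candidate

def toy_model(prefix):
    # Placeholder for P(next char | prefix); a trained RNN goes here.
    return {"a": 0.5, "b": 0.3, END: 0.2}

def enumerate_candidates(limit):
    # Pop the most likely prefix (smallest negative log-probability),
    # extend it by each possible next character, repeat.  Every prefix
    # is pushed exactly once, so no candidate is ever emitted twice.
    heap = [(0.0, "")]
    emitted = 0
    while heap and emitted < limit:
        neg_logp, prefix = heapq.heappop(heap)
        for ch, p in toy_model(prefix).items():
            if ch == END:
                print(f"{prefix!r} p={math.exp(-neg_logp) * p:.3f}")
                emitted += 1
            else:
                heapq.heappush(heap, (neg_logp - math.log(p), prefix + ch))

enumerate_candidates(5)

GANs, by contrast, sample from the model independently per candidate,
so collisions are unavoidable and their ratio grows as you sample
deeper into the distribution - hence the duplicate figures above.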

> And that would be by design; I think the entire idea of using 
> passphrases over passwords is that it makes password cracking a lot 
> harder/more expensive.

Passphrases offer a good balance between the cost or risk of being
cracked and user friendliness.

> But it would be interesting to see the results of such an approach.

We have some results up to the GPT-2 era so far; see above.

Alexander
