Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [day] [month] [year] [list]
Date: Mon, 29 Jun 2015 11:09:02 -0400
From: Matt Weir <>
Subject: Using Probabilistic Context Free Grammars (Was precomputed attacks)

>> I see, there is UnLock and it generates better wordlists than john in
>> some cases. Is it Free Software? Why is not it integrated into john?

I decided to fork this thread to keep things from getting to mixed up
between PCFGs and pre-computed attacks. I'm not a huge fan of the name
UnLock so I'll just refer to the basic idea as using Probabilistic Context
Free Grammars, (PCFG), instead.

The version I worked on is free software, (GPLv2) and it does use bits and
pieces from JtR. There is another version that is a re-write that students
and staff at Florida State University are still developing but I don't have
the source code for that.

I have a very poorly maintained github page for it available at:

but I basically was using that for my own development efforts and the code
is in a very unstable state. A more stable, but older version of it is
available at:

I never bothered to directly integrate it into JtR simply because the
-stdin option worked well enough for my needs. Direct integration into JtR
would be nice but I don't think the PCFG cracker is mature enough for that
to make sense right now. Aka I'd like to limit the amount of buggy software
in JtR ;p

Ok, I probably should take a step back and give a high level overview of
using PCFGs to model human password creation strategies.

The easiest way to think about it is to think of the PCFG attack as a
model. Now we currently do a lot of models in JtR and many of them are
wicked advanced. Wordlists model password creation strategies using
dictionaries and mangling rules. Markov and Incremental mode model password
creation strategies by using the conditional probability of individual
characters to generate guesses. PRINCE models passwords by combining words
from a dictionary together. PCFGs are just another way to model users.

I started using PCFGs when I was trying to make an automated rule generator
that I could train on previously disclosed passwords. My first attempts
were very much like what Bartavelle has done below though his code is way
better then what I did:

Unfortunately that approach had some limitations and I ran into issues when
trying to combine different mangling techniques and optimizations. Aka if
you don't see "append 4 digits" + "append "$" together in your training set
should you still try that? Now one solution is to make it more generic and
modify the rule to be "append 4 digits" + "append 1 special char", but then
you loose a lot of the potential optimizations since some special
characters are way more common than others. The same goes for digits where
you might want to try years, (or series like '1234'), before you try
generic numbers.

In addition, I really wanted to add support for multiple input dictionaries
since the current setup of having to run multiple cracking sessions with
different input dictionaries, (small targeted ones and then bigger ones),
with different rulesets is annoying. For example some words are way, way
more common than others, (for example "password"), so you might want to try
advanced mangling rules on them before you try a basic mangling rule on a
less common word. But at the same time, you might want to try a basic
mangling rule on a less common password, such as 'chair123', before you try
some crazy mangling rules such as 'PasSword423&'. Balancing that in the
current one dictionary setup is difficult.

This lead me and my group to start using PCFGs. Breaking down that term,
basically we assign everything probabilities. We assign a probability to
the length of a password, a probability to individual dictionary words, a
probability to individual digits, a probability that digits appear at the
end vs the beginning, a probability to the capitalizations used, etc.
That's the P in PCFGs.

My group then designed a CFG based on those probabilities. We made it
context free since we assumed in *most* cases that the different
transitions were independent of each other. Aka the digit at the end of the
password does not have anything to do with the base dictionary word used or
the capitalization chosen. Now there are exceptions and there's a couple of
things we did to sneak context specific info, (such as password complexity
requirements), into our grammar but that can wait for a later discussion.

By using a PCFG it makes training easier since thanks to it being context
free if we see a date such as '2015' show up anywhere in the training
dataset it gets a probability vs only detecting it in the context of a
specific mangling rule. Also you can modify the grammar later to perform
target specific modifications, (for example you can raise the probability
of dates and words that hold significance to the target such as birthdays
and names of pets).

The current version of my PCFG cracker outputs guesses to stdout for use of
a tool like JtR. Another option though could be for it to create very fine
grained mangling rules to be used in a traditional wordlist style attack.
The only downside then is you would loose the ability to have multiple
input dictionaries of different probabilities.

Now for the downsides. The overhead is pretty high so against fast hashes
the increase in precision isn't worth the slower guessing rate. Also the
grammar is imprecise enough that against really slow hashes hand written
mangling rules still are the way to go. So it currently performs best
against medium speed hashes.

Long story short, there is a lot of work that needs to be done if you
really want to make this useful outside of an academic setting for cracking
passwords. If anyone is interested I think it's worth mentioning that the
actual grammar itself has some huge areas that I suspect could be hugely
improved. I could talk about that in more detail but the grammar my tool
currently uses is pretty naive and was more of a proof of concept vs a well
designed model of human behavior. I'm sure there are better ways to model
password creation strategies using PCFGs.

As a side note, right now most of the focus with PCFGs has been on the
defensive side, using them to generate fake passwords, (honeywords), or
evaluate the strength of passwords as part of a password strength checker.


Content of type "text/html" skipped

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.