john-users - Re: GSoC 2011

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4D6B752E.4060404@gmail.com>
Date: Mon, 28 Feb 2011 11:13:02 +0100
From: bartavelle <bartavelle@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: GSoC 2011

On 28/02/2011 06:56, Solar Designer wrote:
> I'd appreciate any comments and suggestions.

"Further research on and implementation of automatic rule set generation
based on previously-cracked passwords"

I am spending my spare time on such a project. My approach is to start
from a dictionnary and mutate it with as many rules as possible. I am
only using rules that do not add prefixes or suffixes. Right now I am
working with a simple rule list, but if this proves workable I will use
other approaches (exhaustive search or genetic algorithm optimization
come to my mind).

For each mutated dictionnary, I try to match them in a real password
list, such as the rockyou list, and I store what comes before and after
my mutated word in the actual password. I can then compute which
passwords can be found with a given (mutation, prefix, suffix) tuple.

I have done this part, and will implement coverage searching real soon,
so that I could select a "good" 100 rules set (I'll go for a greedy
algorithm, so it will be a heuristic).

My hypothesis is that the most common password selection technique, for
hard passwords, is to take a known word, mutate it, and add stuff before
and after.

There are two challenges that must be overcome :
* the matching part, to be efficient, requires a lot of memory if you
want to work with a big enough dictionnary. Same for the coverage
resolution. My implementation requires around 8G RAM and 60s per rule
for a 21965166 words dictionnary and 6025204 passwords, on a i7 860.
This means I cannot run several instances in parallel without swapping
as I only have 16G RAM. If somebody could point me on a time and memory
efficient multistring matching algorithm, I'm interested.
* I'm hacking stuff as it goes, and this produces unreliable programs.
Moreover, I don't know how to test properly if a program works, as there
is no reference that I know of. Given the memory requirement of the C
program, I can't even prototype it in a "simpler" language.

There are probably tons of other approaches to this problem, such as
that described by Matt Weir in his thesis. There is however a
distinctive lack of interest in them, probably due to the fact that it
is much harder than coding yet another md5 gpu bruteforcer, and that it
requires a lot of material, trial and error.

If somebody else has ideas on that topic, I would be really interested.
A better approach would be to start from the dictionnary and password
list, and, for each password, efficiently compute rules that could
transform some of the dictionnary words into it. I have no idea on how
to do this however.

BTW, I already computed a large prefix/suffix list from rockyou (and
other sources) and a wikipedia based dictionnary. It works quite well
and starts with :

:
$1
Al"123"
$2
$a
$3
Al"12"
$7
Al"23"
$s
$5
Al"13"
$4

It works quite well and you can find prefixes such as "ilove", "lil" and
suffixes such as "boy", "man", "love" that crack a lot of passwords.

I also wrote a quick hack that helps computing that kind of huge
dictionnaries that works like "sort -u" except it doesn't sort, and is
thus MUCH faster. That kind of tools, optimized for large data sets,
would be incredibly helpful as contributed tools for JtR.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.