john-users - Rules for realistic words

Message-ID: <BLU159-W380118EAA144CEEA8874E2A4930@phx.gbl>
Date: Sat, 31 Dec 2011 14:17:55 +0000
From: Alex Sicamiotis <alekshs@...mail.com>
To: <john-users@...ts.openwall.com>
Subject: Rules for realistic words


From an analysis I conducted on a file of ~1500 DES passwords (max length 8), containing Greeklish (Greek words written in English letters) and English passwords, the following came up:

Very high frequency letters:
a=850 i=597 o=584 e=525

Medium to low frequency letters:
s=498 r=472 n=418 t=405 l=366 m=277 p=247 c=211 d=201 k=193 g=159 u=148 h=144 b=113 y=97 f=87

Very low frequency letters:
v=66 w=53 x=47 j=31 z=31 q=18


Digit frequency:
1=448 occurrences
2=293 occurrences
3=249 occurrences
9=219 occurrences
0=203 occurrences
4=185 occurrences
6=175 occurrences
5=174 occurrences
7=156 occurrences
8=132 occurrences
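
Counts like the ones above could be produced with something along these lines. This is only a minimal sketch: the sample list is hypothetical (it is not the analysed file), and it assumes the cracked passwords are already available as plain strings.

```python
from collections import Counter

def char_frequencies(passwords):
    """Count how often each ASCII letter and digit occurs across all passwords."""
    letters = Counter()
    digits = Counter()
    for pw in passwords:
        for ch in pw.lower():
            if not ch.isascii():
                continue  # ignore anything outside plain ASCII
            if ch.isalpha():
                letters[ch] += 1
            elif ch.isdigit():
                digits[ch] += 1
    return letters, digits

# Hypothetical sample, standing in for the real list of cracked passwords.
sample = ["maria123", "kostas1", "italy"]
letters, digits = char_frequencies(sample)

print(letters.most_common(3))
print(digits.most_common(3))
```

Sorting each counter with most_common() gives the frequency ranking directly.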
 
What this means is that a new method of brute forcing could be used.

Currently the order is something like:

1) single
2) dictionary
3) dictionary with rules
4) incremental with Digits, Alpha, LanMan, All, going from shorter lengths to longer ones.

Now for the 26 letters of Alpha at length 8, it goes like 26x26x26x26x26x26x26x26 = 208.8 billion combos.
For Alpha+Digits it goes 36x36x36x36x36x36x36x36 = 2.82 trillion combos.

What if there were reduced character sets of frequently used characters, as an intermediate step between dictionaries with rules and incremental with the full character sets? For example, the top 16 letters and top 4 digits = 20 characters in total. In such a case it's only 20^8 = 25.6 billion combos for 8 char length - and with multiple hashes, it's always worth checking these first in order to crack the easy ones and speed up the rest. I think incremental mode already applies some sort of "more frequent first" ordering, but I don't know how optimized it is in relation to this. If it already covers this, ignore this comment.
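
To make the arithmetic concrete, here is a rough sketch that builds the 20-character set from the counts quoted earlier in this message and compares the keyspaces. The helper names (top_chars, keyspace) are made up for illustration and have nothing to do with how John itself is configured.

```python
# Letter and digit counts as quoted earlier in this message.
letter_counts = {
    "a": 850, "i": 597, "o": 584, "e": 525, "s": 498, "r": 472, "n": 418,
    "t": 405, "l": 366, "m": 277, "p": 247, "c": 211, "d": 201, "k": 193,
    "g": 159, "u": 148, "h": 144, "b": 113, "y": 97, "f": 87, "v": 66,
    "w": 53, "x": 47, "j": 31, "z": 31, "q": 18,
}
digit_counts = {
    "1": 448, "2": 293, "3": 249, "9": 219, "0": 203,
    "4": 185, "6": 175, "5": 174, "7": 156, "8": 132,
}

def top_chars(counts, n):
    """Return the n most frequent characters, most frequent first."""
    return "".join(sorted(counts, key=counts.get, reverse=True)[:n])

def keyspace(charset_size, length):
    """Number of candidates of exactly `length` characters."""
    return charset_size ** length

reduced = top_chars(letter_counts, 16) + top_chars(digit_counts, 4)
print(reduced)                             # aioesrntlmpcdkgu1239

alpha = keyspace(26, 8)
alnum = keyspace(36, 8)
small = keyspace(len(reduced), 8)
print(f"{alpha:,}")                        # 208,827,064,576
print(f"{alnum:,}")                        # 2,821,109,907,456
print(f"{small:,}")                        # 25,600,000,000
print(f"{alpha / small:.1f}x smaller than Alpha")           # 8.2x
print(f"{alnum / small:.1f}x smaller than Alpha+Digits")    # 110.2x
```

These figures are for exactly 8 characters; shorter lengths add comparatively little on top.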

Another aspect that can be improved (not in raw cracking speed, but in getting the easier ones out first) is to emulate how language is constructed. For example, Greek and Italian use a lot of alternation between consonants and vowels. This means that you can have a rule which goes like this:

(V)owel
(C)onsonant
(B)oth+numbers+symbols

Lengths 1-4 are cracked in incremental mode.
Beyond 4 characters in length:

VCVCV => italy
CVCVC => begar
VCVCB => epic6
CVCVB => nike@
VCVCVC
CVCVCV
VCVCVB
CVCVCB
VCVCVCV
CVCVCVC
VCVCVCB
CVCVCVB
VCVCVCVC
CVCVCVCV
VCVCVCVB
CVCVCVCB

By splitting words into human-like syllables, I achieved a hefty increase in effective cracking speed, because instead of 26x26x26... it goes like 18x8x18x8x18, which means far fewer combinations, since non-words like zzxaeseq are never tried.
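
As a rough illustration of the savings, the sketch below regenerates the pattern list and computes the keyspace of each pattern against plain 26^n. The V and C sets are the Greeklish ones used in this message (8 and 18 letters). The size of the "B" class is not specified here, so the code assumes lowercase letters + digits + ASCII punctuation (68 characters); treat that as a guess.

```python
import string

VOWELS = "aehiouwy"                  # Greeklish vowel set (8 letters)
CONSONANTS = "bcdfgjklmnpqrstvxz"    # the remaining 18 letters
# Assumption: "B" = lowercase letters + digits + ASCII punctuation.
BOTH = string.ascii_lowercase + string.digits + string.punctuation

SIZES = {"V": len(VOWELS), "C": len(CONSONANTS), "B": len(BOTH)}

def pattern_keyspace(pattern):
    """Multiply the class sizes for each position of a V/C/B pattern."""
    total = 1
    for symbol in pattern:
        total *= SIZES[symbol]
    return total

def alternating_patterns(min_len=5, max_len=8):
    """Yield strictly alternating patterns, plus variants ending in B."""
    for n in range(min_len, max_len + 1):
        for first, second in ("VC", "CV"):
            body = "".join(first if i % 2 == 0 else second for i in range(n))
            yield body
            yield body[:-1] + "B"

for p in alternating_patterns():
    full = 26 ** len(p)
    print(f"{p:<9} {pattern_keyspace(p):>14,}  vs 26^{len(p)} = {full:,}")

print(pattern_keyspace("CVCVC"))     # 18*8*18*8*18 = 373248
```

For CVCVC that is 373,248 candidates against 11,881,376 for five arbitrary lowercase letters, roughly 32 times fewer.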

(The following is a Greeklish example. You may see some letters listed as vowels which are consonants in English; in Greeklish, for example, w is used phonetically as o, since it stands for the letter omega.)

[bcdfgjklmnpqrstvxz][aehiouwy][bcdfgjklmnpqrstvxz][aehiouwy][bcdfgjklmnpqrstvxz][aehiouwy]"
[aehiouwy][bcdfgjklmnpqrstvxz][aehiouwy][bcdfgjklmnpqrstvxz][aehiouwy][bcdfgjklmnpqrstvxz]"
[bcdfgjklmnpqrstvxz][aehiouwy][bcdfgjklmnpqrstvxz][aehiouwy][bcdfgjklmnpqrstvxz][aehiouwy][bcdfgjklmnpqrstvxz]"
[aehiouwy][bcdfgjklmnpqrstvxz][aehiouwy][bcdfgjklmnpqrstvxz][aehiouwy][bcdfgjklmnpqrstvxz][aehiouwy]"
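
For anyone who wants to feed such masks to a cracker as a plain candidate stream, a bracketed mask can be expanded roughly like this. It is a standalone sketch, not John's own syntax handling: the tiny parser only understands literal [...] sets with no ranges or escapes, and the 4-position mask is just a short demo.

```python
import itertools
import re

def parse_mask(mask):
    """Split a mask like "[ab][cd]" into its per-position character sets."""
    return re.findall(r"\[([^\]]+)\]", mask)

def candidates(mask):
    """Lazily yield every string matching the mask, last position fastest."""
    for combo in itertools.product(*parse_mask(mask)):
        yield "".join(combo)

C = "bcdfgjklmnpqrstvxz"
V = "aehiouwy"
mask = f"[{C}][{V}][{C}][{V}]"       # short CVCV demo mask

first = list(itertools.islice(candidates(mask), 3))
print(first)                          # ['baba', 'babe', 'babh']
```

Because candidates() is a generator, the words can be written to stdout one per line and piped into a cracker without holding them all in memory.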

In some cases it needs tweaking to account for two consonants or two vowels in some part of the word (for example peNTagon, aCRopolis, bicyCLe, AErodynamic), so a few variations of the above are necessary to cover a large percentage of words.
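
One way to find out which variations are worth adding is to reduce every word of a wordlist to its V/C pattern and count the patterns. A minimal sketch, using the Greeklish vowel set from this message and a hypothetical five-word sample; here "B" simply marks anything that is not an ASCII letter, which is narrower than the B class described above.

```python
from collections import Counter

VOWELS = set("aehiouwy")   # Greeklish vowel set used in this message

def word_pattern(word):
    """Map a word to its V/C pattern; non-letters are marked B."""
    out = []
    for ch in word.lower():
        if ch in VOWELS:
            out.append("V")
        elif ch.isascii() and ch.isalpha():
            out.append("C")
        else:
            out.append("B")
    return "".join(out)

def is_alternating(pattern):
    """True if no two adjacent positions share the same class."""
    return all(a != b for a, b in zip(pattern, pattern[1:]))

# Hypothetical sample; a real run would read a full wordlist.
words = ["pentagon", "acropolis", "bicycle", "italy", "begar"]
patterns = Counter(word_pattern(w) for w in words)
covered = sum(n for p, n in patterns.items() if is_alternating(p))

print(word_pattern("pentagon"))      # CVCCVCVC
print(patterns.most_common())
print(f"{covered}/{len(words)} words fit a strictly alternating pattern")
```

The most common non-alternating patterns in the output are the variations that would pay off first.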

An analysis of the English language and its linguistic patterns might give a significant increase in cracked human-like words or composite words (that the dictionaries do not contain, like name&surname). Ideally, we could have a statistics program or an AI program extract rules covering 95%+ of the words in a given language, so that combinations could be based on this structure (with possible twists like adding stuff at the end). English is a bit more difficult to do in a letter-by-letter format than Greek or Italian but, ultimately, it's just more variations. A syllable approach (i.e. combos of one-, two- and three-letter sequences) might also be appropriate for English or other languages. For example, instead of combining words, we could combine ready-made syllables: the syllable MO + the syllable RE = the word MORE. The number of combinations will drop dramatically compared to 26^8.
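
The syllable idea can be sketched as follows. The syllable list is a made-up placeholder; a real one would have to be extracted from a wordlist for the target language.

```python
import itertools

def syllable_candidates(syllables, max_syllables, max_len=8):
    """Yield every concatenation of 1..max_syllables syllables that fits max_len."""
    for n in range(1, max_syllables + 1):
        for combo in itertools.product(syllables, repeat=n):
            word = "".join(combo)
            if len(word) <= max_len:
                yield word

# Hypothetical syllable list, for illustration only.
syllables = ["mo", "re", "ta", "ni"]
words = list(syllable_candidates(syllables, 2))

print(len(words))          # 4 single syllables + 16 pairs = 20
print("more" in words)     # True
```

With overlapping syllables (say "mo", "m" and "o") the same word can come out more than once, so a real generator would want to drop duplicates before hashing.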

Have a great 2012...