Date: Mon, 28 Feb 2011 17:05:36 -0500 From: Matt Weir <cweir@...edu> To: john-users@...ts.openwall.com Subject: Automatic Rule Generation (was GSoC 2011) Hey Bartavelle, That sounds like a very cool project. I'd really like to hear more about it! Please take everything I say below with a grain of salt, (I certainly don't have the problem solved), but I just wanted to share my own experiences. When I first started working on a rule generator for JtR, my original approach was to use a modified version of edit distance. Basically I attempted to reverse mangle known passwords by applying different combinations of 'reverse mangling' rules to each training password and then seeing if the result matched a base word in an input dictionary. For example stripping off digits/special characters from the front/back, applying common replacements ('@'->'a'), etc. This kept the memory requirements relatively low, (I only had to load up the input dictionary and the current password I was checking.), and was actually fairly fast since each password only had a limited number of 'reverse mangling' rules that could be applied to it. There were numerous problems I ran into though. First of all, it had a high false positive rate, (especially when I applied 'letter' additions, which it sounds like your method has much better success with). Second, it missed a lot of the passwords because the base word was not in the input dictionary. Yes I could use a larger input dictionary such as the Wikipedia list, but then the false positives go through the roof, (I might as well just analyze it based entirely on the possible mangling rules and drop the input dictionary altogether). Neither of these problems were game-stoppers though. The biggest issue I ran into was dealing with multiple word mangling rules applied to the same password. Detecting it wasn't the problem. Where I ran into issues though was that the resulting word mangling rules were either too specific or I had to make them too generic. Aka if you learn the mangling rules exactly and have the training set: Password123 Password245 Password874 222password You might generate the rules: Cap first, append 123 Cap first, append 245 Cap first, append 874 pre-pend 222 But that would never generate the guess: Password222 On the other hand you could make it generic and create the rules: Cap first, append three digits pre-pend three digits But then you loose the fact that the number 123 is much more common than all the other digits. That's why I eventually moved to using probabilistic context free grammars to model user behavior and generate my rules. Basically this is just another layer of abstraction from your traditional rules. In your typical CFG you start with non-terminals and replace them until everything is a terminal value, or in this case a password guess. The probabilistic part just means each replacement has a probability assigned to it. In the current incarnation of my program it first selects a mangling rule, (such as add three digits to the end of a word) which has a set probability associated to it. It replaces the three digits with an actual value, (such as '123') and multiplies the total probability of the guess by the probability of '123' appearing. It then replaces the dictionary word with an actual dictionary word, such as 'password', and multiplies it by the probability of that word. Finally it applies mangling rules to that dictionary word, (such as capitalizing the first letter), and modifies the total probability appropriately. This leaves you with a final guess which is composed entirely of terminal replacements, (in this case 'Password123'), with a final probability associated to it. The advantage of this is that it will eventually generate guesses such as 'Password222' from the above training set while still trying 'Password123' much earlier in the cracking session. The downside: A whole lot of overhead in the cracking session, along with various other issues. For example I haven't incorporated letter replacements into it. Also it assumes that mangling rules are context free, and that's not always true. Aka '***password***', and 'p1a2s3s4w5o6r7d' are rules that are context sensitive, (aka the mangling rules are related to each other). Oh and my code sucks, but that's a whole other issue ;) There's certainly other ways to solve the problem. For example, one option I thought of is you could have a specific version of the rule: Cap first, append '123', and then a generic version of the rule: Cap first, append three digits. It's just that personally I feel that there are a lot of benefits of using PCFGs that are hard to implement using other methods. Right now my program generates the guesses and then outputs them for use in JtR using the -stdin option. I considered having it generate JtR rules instead, but since JtR doesn't support multiple input dictionaries, or interweaving brute force and dictionary based attacks I haven't done that yet. In all honesty I should probably revisit that and directly integrate the guess generator into JtR, but that's low on my to-do list, (watching season 1 of Heros is much higher). All in all, the whole project is still in the proof of concept stage, (despite me working on it over a number of years), and there's a million problems with it, which is why I'm so interested in your ideas and results Bartavelle ;) Matt
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.