john-users - Re: Markov Sampling

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJ9ii1F2zTtHmmi0TYJLbRzscNpFw5p4NkX5=PLWCW0dVOS9Bg@mail.gmail.com>
Date: Thu, 15 Feb 2018 10:00:28 -0500
From: Matt Weir <cweir@...edu>
To: john-users@...ts.openwall.com
Subject: Re: Markov Sampling

This is for JtR's Markov mode. You can do similar options with more
advanced Markov methods like OMEN but I'm going to skip that for now.

1) Make sure you have your stats file trained already. The following
instructions will calculate the level based upon that.
2) Open up the stats file. You'll see a lot of lines like:

97=proba1[32]
51=proba2[32*256+35]
44=proba2[32*256+38]
51=proba2[32*256+39]
40=proba2[32*256+48]
35=proba2[32*256+49]
37=proba2[32*256+50]
....

3) The format is a bit wonky if you have never dug into it before. The
main thing to keep in mind is the level is the leftmost value.
'proba1' means the probability for the first letter. 'proba2' means a
conditional probability for a letter following the previous letter.

These examples taken from the code comments in my pcfg implementation.
Note when I'm using probability/level interchangeably, (not ideal I
know) : https://github.com/lakiw/pcfg_cracker/blob/master/python_pcfg_cracker/pcfg_trainer/markov.py

# Format:
# Probability=proba1[ORD_REP1]
# Probability=proba2[ORD_REP1*256+ORD_REP2]
#
# Example:
# 27=proba1[97] //'a' has probability 27
# 85=proba2[97*256+114] //'r' given 'a' has a probability of 85
# 83=proba2[97*256+100] //'d' given 'a' has a probability of 83

4) This is all based on the ASCII representation so 'a' = '97'. You
can ignore the *256. That's a holdover from the JtR implementation. I
suspect at one point there was a plan to support non-ASCII characters.

5) Now take the password you want to find the Markov level/probability
for. Let's assume it is 'cats'. You start out by looking up the proba1
for [c]. That's your starting level. Then look up the prob2['c' +
'a']. Add that level to your running total. Next look up proba2['a' +
t']. Add that to your running total. Finally look up proba2['t' + s'].
Add that to your running total and the sum of all of that is your
Markov level.

6) If you have a transition that isn't listed, (for example it uses a
non-ascii character) that guess will never be generated.

7) Use ./genmkvpwd to calculate how many password guesses would be
made for that level of the target password. While it won't tell you
where in that level the password would be guessed, you can probably
say it'll be (keysize for level)/2 and have it average out. With
Markov it isn't weighted to the front which is why cracking sessions
look a lot like a straight line with a few bumps in it.

./genmkvpwd statfile max_lvl [max_len] [start] [stop]

Example (if you are looking up for level 120, max length 120)
./genmkvpwd statfile 120 12 1 120

8) Yeah you have to specify a maximum length for guesses and that does
impact the keysize. Also I can't remember if the results of genmkvpwd
are total for 0-level or just that level and you have to add all the
previous levels to get the guesses for it so you might need to play
with that. I think they are additive, (aka take the last value for the
total keysize), but I could be wrong as it's been a while since I
played with that.

Good luck, and if this is confusing or didn't make sense let me know.

Matt

On Thu, Feb 15, 2018 at 9:10 AM, Matlink <matlink@...link.fr> wrote:
> I am interested by knowing how you calculate the level required to
> generate plaintext passwords (as you explain in 1)).
>
>
> Le 13/02/2018 à 19:16, Matt Weir a écrit :
>> So my initial reaction was that I'm doubtful printing out 1/10 of all
>> the password guesses will give you the end results you are looking
>> for. Here are a couple of other options:
>>
>> 1) You can always work backwards if your target plaintexts are known.
>> Aka calculate the level required to generate them, (that's easy to do.
>> Sum up all of the transitions based on your character file). Once you
>> have that, calculate how many guesses are required to get to that
>> level which the included tools in JtR will do for you. If you need
>> additional help on this approach I can write up something more
>> detailed. This is *super* fast and gives you a pretty good
>> approximation.
>>
>> 2) You can rely on JtR's own logging to see when passwords would be
>> cracked so that way you are not slowed down by piping the output to
>> stdout or using a 3rd party tool to figure out how many passwords you
>> cracked. If you want to go this route I'm sure the people on this list
>> can help out.
>>
>> There's other options as well I guess. You could always modify the
>> Markov code directly in JtR. If you want other examples of the Markov
>> code, I re-implemented it in Python for my PCFG cracker. You can see
>> the guess generation code at:
>>
>>  https://github.com/lakiw/pcfg_cracker/blob/master/python_pcfg_cracker/pcfg_manager/markov_cracker.py
>>
>> Of course, my code is slow so that almost certainly is not the way to
>> go, but it may be useful as a reference. Long story short, if you let
>> us know more what you are trying to do I'm sure we can brainstorm some
>> better options.
>>
>> Matt
>>
>>
>>
>> On Tue, Feb 13, 2018 at 11:47 AM, Solar Designer <solar@...nwall.com> wrote:
>>> On Tue, Feb 13, 2018 at 04:44:03PM +0100, Matlink wrote:
>>>>> The pre-defined --external=Parallel mode will do what you ask for.
>>>>> You'll just need to customize the "node" and "total" numbers in its
>>>>> init() in john.conf.
>>>> Well, I guess it's only 'not printing' generated candidates? Does it
>>>> really speed up the process, since generating a password candidate is
>>>> more costly than printing it?
>>> It doesn't speed up the processing inside JtR; it actually adds extra
>>> processing.
>>>
>>>> Concretely, is --markov --stdout --external=Parallel with node 1/100,
>>>> 100 times faster than with node 1/1?
>>> No.  It's probably roughly same speed: the external mode adds overhead
>>> internally to JtR, but then those skipped candidates don't need to be
>>> printed to the Unix pipe.
>>>
>>>>> However, note that "every 10th" doesn't necessarily produce a
>>>>> representative sample: the underlying cracking mode (in this case,
>>>>> Markov) might happen to have some periodicity in its output, and one of
>>>>> its period lengths might just happen to be a multiple of 10 or whatever.
>>>>> So ideally you'd want to randomize the order (if the order somehow
>>>>> doesn't matter for your research) over a larger number of candidate
>>>>> passwords - say, pass a million of them through GNU coreutils' shuf(1) -
>>>>> and then take every 10th out of that randomized list.
>>>> My issue is that I can't get the whole output because it is too costly
>>>> for me to gather them due to UNIX pipe. I would like to my
>>>>
>>>>     john --stdout --markov --sample=100 | my_sublime_post-process
>>>>
>>>> be somewhat 100 times faster than
>>>>
>>>>     john --stdout --markov --sample=1 | my_sublime_post-process
>>> You could use the built-in --node=1/100 feature, which probably will
>>> speed things up a lot, but then it almost certainly doesn't result in a
>>> representative sample - it's just a way to split the work between
>>> multiple nodes, without regard as to whether each node would get a
>>> representative sample and be expected to crack a similar percentage of
>>> real-world passwords that other nodes crack or not (so this probably
>>> won't be the case, making this approach unsuitable for use in research).
>>>
>>> The same applies to incremental mode.
>>>
>>>> Your solution requires to get the whole output of john and then
>>>> post-process it, but I can't find a satisfiable way to get its whole
>>>> output (since john is really fast to generate candidates).
>>> A question is whether you actually need to get this many candidates (or
>>> a sample from this many), or whether fewer would suffice.  That depends
>>> on what your ultimate goal is.
>>>
>>> Alexander
>
> --
> Matlink - Sysadmin matlink.fr
> Sortez couverts, chiffrez vos mails : https://café-vie-privée.fr/
> XMPP/Jabber : matlink@...link.fr
> Clé publique PGP : 0x186BB3CA
> Empreinte Off-the-record : 572174BF 6983EA74 91417CA7 705ED899 DE9D05B2
>
>
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.