john-users - Re: Cracking stats: p/s, c/s and C/s. Hashing cost factors.

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20210515121001.GA23510@openwall.com>
Date: Sat, 15 May 2021 14:10:02 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Cracking stats: p/s, c/s and C/s. Hashing cost factors.

David,

On Fri, May 14, 2021 at 09:00:51PM -0700, David Sontheimer wrote:
> I should add the high-level goal:
> 
> I'd like to test my password generation heuristics as you are testing
> your .chr files' candidate generation - using the same number of
> max-candidates across trials, and recording the percentage of passwords
> cracked in each trial. This is preferable to comparing heuristics by total
> time to crack all passwords.

OK.

> Further, I'd like to calculate the average number of candidates required to
> successfully guess a hash.

I assume it'd be average only out of those that are successfully guessed
in your experiments.  This might not be a very useful metric, such as
because it'd be growing the longer you let your experiments run.  Those
longer runs would have a larger total number and percentage of guesses,
but since the additional guesses would be the harder ones the average
would also grow.  In fact, if you were to include the unreached guesses
as well - since all hashes would theoretically be cracked in infinite
time - then your average is kind of half of infinity, or just nonsense.

If you do reach 50% guessed in your experiments, then that's your median
(not average) number of candidates to guess a password.  I think that's
a more reasonable metric.  If you don't reach 50%, then you just call
the metrics e.g. "number of candidates to successfully guess 10% of
passwords in the test set", etc.

Average is a useful metric when the total keyspace is within reach.  For
example, you can compare how much more efficient "--incremental=Digits
--length=6" is than "--mask='?d' --length 6" and how much more efficient
the latter possibly is than "--mask='[0-9]' --length 6" (yes, there's a
difference between the latter two), including in terms of average
candidates to achieve a successful guess within this well-defined subset
of 1 million possible passwords.

> A binary search wrapper was discussed, but
> StatusShowCandidates appears to be perfect (and more efficient).
> 
> Using StatusShowCandidates = Y shows the candidate count of a successful
> candidate

This is a feature I wasn't involved in implementing and have never used.
However, my understanding is that yes, it's suitable for your needs.

> for a single fork, not all candidates across all forks, correct?

Correct.

> If so, then the best use of my hardware appears to be running each trial on
> a single fork,

I think so, given that your use case is research of candidate password
generators, not maximizing the number of passwords cracked per unit of
real time.

> and running many trials in parallel.

Either that, or you can use OpenMP.

With OpenMP, you can adjust OMP_NUM_THREADS to your liking (such as to
maximize real time speeds), which won't affect your research.

> Unique, session-named
> log files will be required for exact candidate counts (as opposed to the
> multiple of batch size printed to stdout) and to differentiate the results
> between trials.

Sure.  You have that by default.

> On Fri, May 14, 2021 at 8:46 PM David Sontheimer <david.sontheimer@...il.com> wrote:
> >> You might find slides 7 to 10 in this presentation helpful:
> >>
> >> https://www.openwall.com/presentations/BSidesLjubljana2017-Yescrypt-Large-scale-Password-Hashing/
> >
> > This is helpful - thank you.
> >
> > When parallelized, would optimal be that the set s0 for fork 0 contains
> > the most-likely candidate, the set s1 for fork 1 contains the second
> > most-likely candidate, ... , set sN for fork N contains the Nth most-likely
> > candidate?

I'm not sure what sets you refer to.  The entire sets that the forked
process will ever test, or the smaller sets being tested at a time.
Regardless, no, especially in the former case what you describe would be
very far from be optimal.

Rather, with "--fork" we generally use a round-robin distribution of
candidates or of groups of candidates taken off a (virtual) list
ordered from most likely to least likely.  So e.g. with two processes,
process 1 will take candidates/groups 1, 3, 5, 7, ..., and process 2
will take 2, 4, 6, 8, ...  This way, e.g. the two most likely candidates
or groups (1 and 2 in this example) would be tested first
(concurrently), and the less likely candidates would be tested later.
(Of course, this is a simplified example.  In reality, each process'
candidate stream will also be buffered and tested in batches.)

Here, by "groups" I refer to a cracking mode specific grouping - for
some modes it's non-existent (individual candidates are in round-robin),
for some others (incremental mode) it can be pretty large groups (e.g.,
commonly much larger than what we refer to as "batches").  In the latter
case, optimality if the search order is negatively impacted by the large
group size.  In general, "--fork" is potentially less optimal than
OpenMP in terms of candidate password ordering (with OpenMP there's just
one optimally-ordered candidate passwords stream), but its lack of
synchronization achieves a higher cumulative c/s rate.

> >  Now, with a fixed runtime/fixed candidates to test, I'd like to run
> > separate cracking sessions in parallel. Where do I find the named log file
> > based on session name, as described in:
> >
> > https://github.com/openwall/john/blob/bleeding-jumbo/doc/OPTIONS?
> >
> > It should be in the same directory as ${JOHN}/john.log, correct?

Yes.  However, it is wrong to use those curly braces to refer to $JOHN,
because it is not an environment - it just looks like one, but it's
actually a different concept specific and internal to JtR.  (I now see
we have that bug in the descriptions of SessionFileProtect and
LogFileProtect.  We should fix those.)

> > Here is my command (for example, SESSION =
> > "experiment_2021-05-12_16:13:06.153197"):
> >
> > ${JOHN}/john  --session=${SESSION}  --fork=1
> >> --pot=${EXPERIMENTS}/${SESSION}/cracked_len_10.pot   --incremental=Digits
> >>  --min-len=10  --max-len=10 --max-candidates=10000000
> >> ${EXPERIMENTS}/${SESSION}/pwdfile_sha1_crypt_10.txt  2>
> >> ${EXPERIMENTS}/${SESSION}/cracked.err 1>
> >> ${EXPERIMENTS}/${SESSION}_cracked_len_10.out

Looks OK to me, except that "--fork=1" is pointless.  If you effectively
don't use "--fork", you'd want to control OMP_NUM_THREADS instead to
ensure you never overbook the system - OpenMP's performance is very
sensitive to overbooking.

> > Further, how does --session interact with LogFileProtect in john.conf?
> >
> > I would expect "LogFileProtect = Named" to not allow named sessions to
> > write to john.log, nor to write to other named session logs, yet they would
> > write to their own session's NAME.log.
> >
> > Instead, when uniquely naming sessions and with "LogFileProtect = Named" I
> > do not see session-named logs created, and no new data is added to john.log
> > (the latter which is expected).

These are features I wasn't involved in implementing and have never
used, and have no comments on.

> > I have looked through both doc/OPTIONS and john.conf for a separate,
> > independent variable to enable session-named logs, but no luck.

This is enabled by default.  In fact, there's no way to disable it short
of disabling the logging altogether.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.