Date: Sat, 26 Mar 2011 09:55:22 -0500
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: RE: john scalability

>From: magnum
>
>Someone please explain briefly what those different size tables are
>about. What happens without this patch that makes such a huge
>performance hit? How come it affects initial loading too? I looked at the
>code and found sizes and thresholds, but I didn't get any wiser.

There are several (many, actually) places where hash tables are needed. They
cut down lookup times.

If you have 500 candidates, your hash function is random (enough), and you
have 5 hash buckets, then on average there will be 100 items in each hash
bucket. On average you will look through half of them for a success (which
may be rare), and look through all of them on a non-success (often very
common).   This makes checking each computed value take quite a lot of time
(testing against roughly 100 candidates each time).   However, the memory
usage is small (only 5 hash buckets).

Now, if you had the same 500 candidates but 50 million hash buckets, and the
hashing function again was random, then it would be very likely that no
bucket contained more than 1 value, and almost ALL of the buckets would
contain zero.   This makes for much quicker searching.  However, in this
case, you waste a BUNCH of memory.
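
To make the tradeoff concrete, here is a little stand-alone toy program (NOT
john code, just an illustration using made-up random values) that shows the
average number of entries sharing a bucket with any given candidate, for a
tiny table versus a big one:

/* Toy illustration only: average number of entries that land in the
 * same bucket as a given candidate, for two bucket counts.  Assumes a
 * "random enough" hash; here we just use rand() values directly. */
#include <stdio.h>
#include <stdlib.h>

#define NCAND 500

static double avg_chain(const unsigned *h, unsigned nbuckets)
{
    unsigned long total = 0;
    int i, j;

    for (i = 0; i < NCAND; i++)
        for (j = 0; j < NCAND; j++)
            if (h[i] % nbuckets == h[j] % nbuckets)
                total++;    /* i itself counts too */

    return (double)total / NCAND;
}

int main(void)
{
    unsigned h[NCAND];
    int i;

    for (i = 0; i < NCAND; i++)
        h[i] = ((unsigned)rand() << 16) ^ (unsigned)rand();

    printf("5 buckets:  %.1f entries per lookup on average\n",
        avg_chain(h, 5));
    printf("1M buckets: %.3f entries per lookup on average\n",
        avg_chain(h, 1U << 20));
    return 0;
}

The first number should come out around 100 and the second is essentially 1,
which is the whole point of using a big enough table.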

During load time in john, some hash tables get built.  These may be driven
by data in the params.h file. Solar may have to fill in some of the details,
I am sort of sketchy here.  I do know that if you do NOT have a proper
salt_hash function in your format, then loading can get very slow.  That
function is used by the loader, and not during the run of john (I think).
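
For reference, the salt_hash in a format usually looks something like this
(a common pattern borrowed from existing formats; the exact bytes hashed and
the SALT_HASH_SIZE value from params.h depend on the format):

/* Sketch of a typical salt_hash(): fold a few bytes of the salt into an
 * index for the loader's salt table.  Assumes the salt blob starts with
 * reasonably well-distributed binary data; SALT_HASH_SIZE (params.h) is
 * a power of two. */
static int salt_hash(void *salt)
{
    return *(ARCH_WORD_32 *)salt & (SALT_HASH_SIZE - 1);
}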

During the run of john, there is a hash table used to check computed hashes
against the possible candidates.  If we had 1 million candidates, we would
NOT want to check our computed hash value against all 1 million of them.  We
would want to narrow that search down, as quickly as possible, to none or
just a few candidates.  This is where the hashing comes in.

When the loader loads the cipher file, all of the hashes we want to find are
there.  The loader then builds a hash table using those values.  Then, when a
candidate is run, the result is put through the same hash function, and only
the wanted values which match that hash (if any) are looked at.
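
Here is a stand-alone sketch of that idea (again NOT the actual loader or
cracker code, just the shape of it, with made-up names and sizes):

/* Bucket the loaded target hashes once at load time, then per computed
 * result compare only against the (usually empty) matching bucket instead
 * of the whole list.  Illustration only. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define HASH_SIZE   0x1000              /* 4096 buckets for the example */
#define HASH_MASK   (HASH_SIZE - 1)
#define MAX_CHAIN   8
#define BINARY_SIZE 16                  /* e.g. a raw MD5 binary */

static struct {
    int count;
    const uint8_t *entry[MAX_CHAIN];
} table[HASH_SIZE];

/* plays the role of binary_hash/get_hash: a few low bits of the binary */
static unsigned small_hash(const uint8_t *bin)
{
    return (bin[0] | (bin[1] << 8)) & HASH_MASK;
}

static void load_target(const uint8_t *bin)        /* load time */
{
    unsigned h = small_hash(bin);

    if (table[h].count < MAX_CHAIN)
        table[h].entry[table[h].count++] = bin;
}

static int check_result(const uint8_t *computed)   /* run time */
{
    unsigned h = small_hash(computed);
    int i;

    for (i = 0; i < table[h].count; i++)
        if (!memcmp(table[h].entry[i], computed, BINARY_SIZE))
            return 1;                   /* a crack */
    return 0;       /* usually count is 0, so no memcmp() at all */
}

int main(void)
{
    static const uint8_t target[BINARY_SIZE] = { 0xde, 0xad, 0xbe, 0xef };
    static const uint8_t miss[BINARY_SIZE]   = { 0x01, 0x02, 0x03, 0x04 };

    load_target(target);
    printf("hit: %d  miss: %d\n", check_result(target), check_result(miss));
    return 0;
}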

To do this (as john is laid out), there is an array of hash generation
functions in the format (binary_hash) which compute hashes from the values
found in the cipher text file (used by the loader), and there is an array of
functions used at run time (get_hash), which john calls after crypt_all.
The hash generated by each 'pair' of these functions MUST be the same for
the same password.   The reason for having a pair of them is that
binary_hash can be written to do things like ascii/hex/base64-to-binary
conversion, etc., while get_hash can be written to perform no conversion and
simply use the binary data (and thus run much faster).  The speed of
binary_hash is of little consequence (within reason).  get_hash should be
fast, since it will be called billions of times.
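
As a hedged example of what the smallest pair in those arrays tends to look
like in a fast format (crypt_out here is just an assumed name for the buffer
that crypt_all fills with binary results; the larger functions in the arrays
use wider masks such as 0xFF, 0xFFF, 0xFFFF, 0xFFFFF):

/* Smallest binary_hash/get_hash pair for a hypothetical raw-hash format.
 * Both must reduce the same password's hash to the same 4-bit value. */
static int binary_hash_0(void *binary)
{
    return *(ARCH_WORD_32 *)binary & 0xF;   /* used by the loader */
}

static int get_hash_0(int index)
{
    return crypt_out[index][0] & 0xF;       /* used after crypt_all() */
}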

Now, when john starts, it determines how many unique candidates it has.  It
then computes which of the 5 hash table sizes to use.  NOTE: it may be
constrained by the format; a format with only 3 hash functions (that was the
original number, before I 'optionally' expanded it to 5 a couple of years
back) can only use the 3 smaller table sizes.  There is a table in params.h
which specifies when to bump up to the next table size (based on candidate
count).
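
The selection boils down to something like the sketch below; the names and
threshold numbers here are made up for illustration (the real ones live in
params.h and the loader):

/* Illustration only: pick the largest table level whose threshold the
 * loaded hash count has reached, capped by how many hash functions the
 * format provides. */
static const struct {
    int threshold;          /* bump to this level at this many hashes */
    unsigned table_size;    /* number of buckets at this level */
} hash_levels[5] = {
    { 0,     0x10     },
    { 8,     0x100    },
    { 128,   0x1000   },
    { 2048,  0x10000  },
    { 32768, 0x100000 }
};

static int pick_hash_level(int num_hashes, int levels_in_format)
{
    int level = 0, i;

    for (i = 1; i < 5 && i < levels_in_format; i++)
        if (num_hashes >= hash_levels[i].threshold)
            level = i;
    return level;   /* index into the binary_hash[]/get_hash[] arrays */
}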

There used to be 3 hash table sizes.  I found serious slow-downs that started
at a couple of thousand candidates, and things went totally to pot when doing
350k or so.  I bumped that up to 5, and Solar has since put that into the
jumbo.  However, at the time, I did not know how to get the proper hash
values for some of the formats, so I made the hash size selection code smart
enough to check whether the format's hash function pointers were non-NULL.
Thus, if a format only had 3 hash functions, then the john loader (and
cracker) would only use up to the 3 smaller hash table sizes.
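
Roughly like this (a sketch of the idea, not the actual loader code; the
binary_hash[]/get_hash[] arrays in the format's methods are the real
structure members, the rest is illustrative):

/* Count how many hash levels a format actually implements by walking its
 * function pointer arrays until a NULL is hit.  Assumes formats.h is
 * included for struct fmt_main. */
static int format_hash_levels(struct fmt_main *format)
{
    int n = 0;

    while (n < 5 &&
        format->methods.binary_hash[n] &&
        format->methods.get_hash[n])
        n++;
    return n;       /* only table sizes 0 .. n-1 get used */
}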

It now sounds like some of the ones I 'left out' need to be added (they ALL
should be added).  The faster the hash, the more important it is to check
zero or just a very small number of candidates.  If the format is very slow,
then checking more will not add up to a large percentage of runtime.
However, for fast saltless or single-salted hashes (like rawmd5, rawsha, lm,
ntlm, etc.), checking as few candidates as possible speeds things up a lot.

Now, Solar has posted that he plans on upping this from 5 to 7 hash levels.
That may speed things up when we get REALLY large data sets.  However, there
are also memory caching issues that need to be looked at when we do up the
size.  Hash tables are notoriously bad at expanding the memory 'working set'
and causing page faults, due to a large number of memory pages which are
only accessed occasionally.  Going up to 256M hash table entries may cause
certain systems to really start to thrash.  However, I am sure he will look
into these types of issues when implementing it (along with setting up some
logic in the --save-memory type settings).

I am sure that if I do not have this logic right, Solar will correct my
mistakes.  But even if I do not have everything listed exactly right, it is
close enough that it should give you some insight as to why it makes such a
huge impact.

Jim.

