john-dev - RE: john scalability

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <015b01cbebe9$b7615df0$262419d0$@net>
Date: Sat, 26 Mar 2011 14:12:14 -0500
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: RE: john scalability

>From: magnum 
>
>So if we use the "size 2" (4096) hash, we will have 4096 buckets? And if
>we use the "size 4" (1M) hash, we'll have 1M buckets with a lot less in
>each.

Yes.

>The loader produce binaries from the hashes once and for all, and calls
>binary_hash() to decide what bucket each one should be put in?

There are 3 (or 5) binary_hash functions for each format. There are also a
matching 3 (or 5) get_hash() function in the format.  The loader has counted
how many candidates there are.  It then uses a table (in params.h??) to
select which of the hashes to use. Both 'sides' (i.e. binary_hash[x]() and
get_hash[x]()) MUST be the same, and the hash table generated must be
generated the proper size.  

The hash tables are all generated on a even 2^(4*(size+1)) value. So hash[0]
hash 16 elements, hash[1] has 256, hash[2] hash 4096, etc. Thus, that is why
you see all the same hash[0](Val&0xF)  hash[1](Val&0xFF)  hash[2](Val&0xFFF)
...   This is so the number returned contains the proper distribution.  

Thus, we currently allow for hash tables of 16, 256, 4096, 64k, and 1M.
Solar is adding 16M (0xFFFFFF) and 256M (0xFFFFFFF)

>And after we produce a candidate binary, say we call get_hash() where
>some part of it is (in the 1M case) and'ed with 0xFFFFF and it returns
>0x32154, then we only have to look through the bucket that contains the
>"0x32154" hashes for a match. Something like that?

That is exactly how it works.  NOTE, that binary_hash() and get_hash() MUST
be exactly the same (the end result). It is 2 different function, since the
data they work on is different.  Binary_hash works with the base-16 or
base-16, while the get_hash works with the native results of the crypt
function.  Binary hash does not need to be optimized.  It is only used once
per candidate at startup.  Get_hash must be fast.  It is used once per each
dictionary word for unsalted or once for each dictionary word * each salt
for salted hashes.  It is called a LOT.

>It seems very easy to add them to most formats and I have already added
>them to a couple I needed. It may be academic for some formats but I can
>volunteer to produce a diff with the "missing" functions added to all or
>most formats, if that is something we should have.

It likely is.  I took the ones I was 'sure' about at the time. There were
some formats, I simply was not comfortable enough to know that my changes
would be right. That is why I did not do them. What I did do, was put NULLs
into the hash functions, and made loader.c NOT use those null functions.  

If you have the time and knowledge to get them right, then go for it.  It
will only speed things up, for anyone doing a large bulk search, of those
formats.

Jim

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.