john-dev - binary hashes and BINARY

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110329172520.GA16758@openwall.com>
Date: Tue, 29 Mar 2011 21:25:20 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: binary hashes and BINARY_SIZE

Hi,

I think there's some confusion regarding what goes into the "binary"
hashes and thus what BINARY_SIZE and params.binary_size of a "format"
should be.  I'll try to explain.

A typical "hash encoding" that we load from a password hash file might
look like:

$1$7Uu2iTBB$Y4hQl2WvrOA3LBbLDxbAf0

This one consists of a hash type identifier ("$1$"), a salt
("7Uu2iTBB"), and the hash itself ("Y4hQl2WvrOA3LBbLDxbAf0").

Now, John could simply keep these strings in memory, compute similar
strings for every candidate password being tested, and compare the
strings.  However, this is suboptimal because in that case it has to
ASCII-encode all computed hashes, whereas instead it can ASCII-decode
the loaded hashes into their binary representations and be comparing
those while cracking.  For example, "Y4hQl2WvrOA3LBbLDxbAf0" may be
decoded into a 16-byte binary value at startup.  (For this specific hash
type, this does not make much of a performance difference because these
hashes are so slow to compute.  For some other hash types, it does.)

However, John still needs the ASCII strings for storage into john.pot
whenever a password gets cracked.  Thus, it could either keep both
things in memory or it could encode the loaded binary representations
back into ASCII when needed.  However, the latter would require extra
code (per hash type), and more code means potentially more bugs.

So John uses the former approach - keep both things in memory - but then
many "formats" make an optimization where only partial binary hashes are
kept in memory.  Really, say, 32 bits per hash is enough to weed out
most candidate passwords.  And in the relatively rare cases that partial
hashes match, John can always re-decode the ASCII strings (which it has
code for anyway) and do the full comparison.

Thus, cmp_all(), cmp_one(), and all of binary_hash[]() and get_hash[]()
functions work on potentially partial binary hashes, whereas cmp_exact()
works on full ASCII-encoded hashes.

To continue the same example, MD5_fmt.c uses a BINARY_SIZE of 4.  That's
4 bytes, or 32-bit partial binary hashes.  Compared to the full size of
16, this saves 12 bytes of memory per hash loaded for cracking.  (It may
also improve cache usage, which matters for faster hashes.)

To give another example, here's how I converted the "dummy" "format" (to
be introduced in 1.7.7) from full to partial binary hashes:

http://cvsweb.openwall.com/cgi/cvsweb.cgi/Owl/packages/john/john/src/dummy.c.diff?r1=1.2;r2=1.3

This has resulted in huge memory savings because otherwise I'd have to
make BINARY_SIZE large enough to hold the longest supported plaintext
password (we don't want any user-visible "hash collisions", nor do we
want to use a real and thus not-very-fast cryptographic hash function
in "dummy").

Now that I look at NT_fmt.c in jumbo, I see that its BINARY_SIZE is 16.
So it wastes memory on storage of full binary hashes, even though their
halves (8 bytes) would be more than enough.  We could want to correct
this, moving some code from cmp_one() to cmp_exact().  Alain?  Many
other "formats" are similarly not optimized in this respect.  Not that I
care much (with modern RAM sizes), but Alain started to compare memory
usage of different tools here:
http://www.hashsuite.webuda.com/index.php?p=1_6 ;-)
JtR's memory usage at NTLM will grow by 4 MB or 8 MB with the large hash
tables patch, but we can reduce it by 8 MB or 12 MB with the BINARY_SIZE
optimization.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.