john-dev - memory usage (was: [john-users] JTR against 135 millions MD5 hashes)

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120904014900.GA25165@openwall.com>
Date: Tue, 4 Sep 2012 05:49:00 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: memory usage (was: [john-users] JTR against 135 millions MD5 hashes)

Jim -

Thank you for sharing your thoughts on this.

On Mon, Sep 03, 2012 at 11:16:44AM -0400, jfoug@....net wrote:
> The memory reduction due to the source() function, and also due to removing some wasteful allocations within the prepare() and valid() within dynamic.  I believe the wasteful allocations in dynamic have been removed, all the way back to the J7-RC.   The memory requirements were greatly reduced early on, but have slowly started to be lost some (the bitmaps).  However, total mem usage is still much less than it was, when we were keeping the full source buffer for each hash.

I did not realize that there were wasteful allocations in prepare() and
valid().  Weren't they temporary?

Another potential source of memory usage reduction are the alignments.
For raw MD5 hashes, a 4-byte alignment should suffice (they're 4x4
bytes), yet we were using 8-byte alignment on 64-bit builds.  I've just
checked: bleeding's rawMD5_fmt_plug.c is still using DEFAULT_ALIGN.
I think this should be corrected to be just "4".  We should review other
formats for excessive alignment like this as well.

As to the bitmaps, they use less memory than hash tables used before.
For the same size in terms of number of entries, bitmaps are obviously
size-of-pointer-in-bits times smaller than hash tables.  Yes, we also
use hash tables along with bitmaps, but they're smaller by
PASSWORD_HASH_SHR (right shift by 2, as specified in the current
params.h).  So overall we need almost 4 times less memory for the same
number of entries, due to the introduction of bitmaps.

However, at the same time the thresholds have been changed to be
allocating bitmaps and hash tables for a larger number of entries.  This
is what may be increasing memory usage.  For hash counts as large as
146 million (from my test), this shouldn't matter, though - since the
largest size should have been picked anyway (and it corresponds to
almost 4 times smaller memory usage now).

Also, for saltless hashes we only need one bitmap and one hash table.
That's for the entire john process.  When its total allocation size is
multi-gigabyte, spending 272 GiB (if I calculated correctly) on a bitmap
and a hash table combined is not a problem.  (In fact, I think we should
introduce even larger bitmaps and hash tables to better handle cases
like that - this is on my to-do.)  Before the introduction of bitmaps,
we were spending 1 GiB.

> There are still other areas where significant memory reduction can be made:
>
> When we do use source(), we currently have to keep the full binary in memory, which is a waste, it is ONLY needed to regenerate source to print it back out. In 'non' source() mode, we only need to keep 4 bytes in hot memory, for a binary check.  I think we could reduced this from 16 bytes per hash (or 64 bytes for sha512!!!), to only 8 bytes per hash, with an auxiliary file storing the full array of binary hash data points.  Since the binary data is all fixed size, JtR only needs to use an index, vs a real offset pointer, computing the true offset into the file at runtime.. This should allow JtR to only need to have 4 bytes for the pointer into this file.  That would allow running against 4 billion hashes at once, which should be an 'adequate' limitation.   So this would drop us from 16 bytes to 8 bytes per hash.  Probably make up for the memory lost to the bitmaps (and quite a bit more). Run time should not be impacted much, except VERY early on, when a huge number of hashes are being found.  Once you get past that early onslaught, having to lookup data in an external file should not impact runtime at all.  That file would ONLY be needed to be accessed when printing a found factor (or in some of the -show functions).

Yeah, but I think hot/cold separation inside the process' virtual memory
will be better than using external files.  If a person wants swapping
out to disk, they will probably have swap space configured in the OS -
so we just need to be friendly to that.  If a person has a lot of RAM
and no other tasks competing for the RAM, then the swapping will likely
not happen even for the "cold" portions.  I think that us supporting
essentially a swap file of our own would only be beneficial in some
relatively obscure cases.

Yes, using 32-bit offsets instead of full 64-bit pointers is a way to
save memory.  It is usable with inside-address-space cold storage as
well; those don't have to be _file_ offsets.

In fact, there may be semi-hot items that we may reasonably access via
similar 32-bit or smaller offsets as well, with the same goal (saving
memory in 64-bit builds).  unique.c uses 32-bit offsets in this fashion.
single.c uses 16-bit offsets for buffered candidate passwords (since
there's a separate buffer for each salt, the savings are significant
when the salt count is large).  We might identify other places where
this is beneficial and not too cumbersome.

> Second, IIRC JtR uses way too much memory when the input file is largely cracked.

Yes.

> If the input file has 140 million, but 115 million of them are cracked, there is a lot of memory used loading the hashes, then loading the .pot file hashes, then things get marked off, and a new list of the remander candidates is created.

No, that's not true.  The memory usage remains the same as if no
passwords were previously cracked.

> But IIRC, this memory is not released (or not fully released).

Yes.  The memory usage stays the same as if no passwords were removed.

> It would be nice if JtR would be able to load 140m into the same machine, no matter if there was a .pot file with 115m or not.

This is already true.

> Also, it would be nice if JtR had the same final runtime foot print no matter if you run against 25m hashes, or an input file with 140m where 115m were already found.

Yes, but this is tricky and it is a tradeoff.

> For the binary in a separate file, the biggest issue, is that separate file, making sure it gets cleaned up, and what to do upon restart, if one already exists.  Leaving 2gb or larger files scattered on a users system is not a really good thing to do. However, I think changing having the full binary in memory, to having most of the binary live in an external file, should reduce memory about 30-40% beyond what it has been reduced already.

Compared to the swap-friendliness (hot/cold separation inside the
address space), using an external file would only have cosmetic
benefits (except under some unusual circumstances e.g. when the OS
doesn't have swap configured), and not to everyone's taste.

Simply being swap-friendly also avoids the problem of cleaning up.

That said, on Unix-like systems (and appropriate filesystems), cleaning
up is easy: we keep fd to the opened file, but unlike the file right
after we've created it.  Then we just use the already-unlinked file.
When the process terminates, the only instance of the fd to the file is
automatically closed and the filesystem blocks are automatically freed.
It may be a bit nasty if the system crashes with the unlinked file
still opened, though.  The blocks would then be freed on next bootup,
which is generally handled just fine - but a warning may be printed.
Another nastiness is in user confusion ("df" displays less free space
than the person expects from listing the directories).

Overall, I am for swap-friendliness rather than explicit swap files.

> For trying to use same memory for 25m vs 140m with 115m in the .pot file, likely it would require something like offloading things to disk, being able to FULLY scrub memory, then reloading properly.  There may be many other ways to do this, but often when there is this many allocations, that is about the only way to really free up the temp used pages, and reclaim memory.

We could similarly do this inside the process' address space, causing
some swapping on low RAM systems.  We'd rely on libc's malloc/free being
able to release a large contiguous region of memory back to the OS,
though - which won't always be the case (especially not if we allocated
that region in small chunks rather than all at once).  We could use
mmap() to make this explicit, but that's less portable.

Another approach would be to use a memmove-like approach to compact
the data in-place (adjusting the pointers indeed).  That is, rather than
just exclude a removed entry from linked lists, hash tables, etc., move
another entry (from the end of the allocated region) into its storage
and re-point pointers for that other entry.

I'd just ignore this problem/wish for now.  Solving this might not be
worth it.  This involves tradeoffs.

> For loading 140m vs 140m with 115m in .pot file loading on same machine (i.e. using same memory), we may have to do something along the lines of how the unique is done, vs loading the whole world into hot memory.  This 'unique' like reduction may also help with the problem talked about in the prior paragraph.

This is what we're doing already.

> For a short cracking session (or no cracking session, such as -show), load time is a very large part of overall time.

BTW, I think we need to replace the hard-coded
PASSWORD_HASH_SIZE_FOR_LDR with a range of settings and choose a likely
optimal one based on file size (have some thresholds).  Things become
trickier when multiple files are given on the command-line, though (we'd
have to loop over and stat() the filenames prior to loading to determine
the total size).  This is on my to-do, but it is not a priority.

> Again, the ideas are mind food.  I have not put any time into code.  But it is similar to when I was looking at doing source().  I KNEW JtR was using way more memory than was needed.  It is all in finding the largest of the low hanging fruit, and picking it, and doing it in a way that has the minimal adverse affect runtime speed.

Whatever we do on this, if anything, should not introduce much complexity.
I am concerned that some of these ideas do involve a substantial
complexity increase.

Thanks again,

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.