john-dev - Re: memory usage within JtR and possible ways to significantly reduce it.



Message-ID: <20120524202624.GA4014@openwall.com>
Date: Fri, 25 May 2012 00:26:24 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: memory usage within JtR and possible ways to significantly reduce it.

On Thu, May 24, 2012 at 02:33:26PM -0500, jfoug wrote:
> The hot/cold split does make a lot of sense. Very good way to try to keep
> locality.  For a 10 million candidate search, you still have 40 MB of hot
> memory and 120 MB of 'cold' memory, vs 160 MB of arbitrarily accessed memory.  I
> am not sure that reducing from 160 MB to 40 MB will be that 'huge' of a
> help, but it might.

Yes.  And Frank is correct: this mostly helps for sizes that fit or almost
fit into some sort of cache.  I imagine that 40 MB will fit into an L3
cache a few years from now, though.  A non-negligible portion of it
may already fit, resulting in some speedup now.  In practice, the
bitmaps will be the hottest portion of the working set.  However, when
there's both a bitmap hit and a hash table hit, the hashes themselves
also come into play.
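
The access order described above can be sketched in C as follows.  This
is only a minimal illustration, not the actual JtR code: the sizes, the
struct layout, the use of the first 4 bytes of the binary as a partial
hash, and all names (add_hash, lookup, etc.) are assumptions made for
the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BITMAP_BITS 1024u /* hypothetical size, a power of two */
#define TABLE_SIZE  256u  /* hypothetical size, a power of two */
#define BINARY_SIZE 16u   /* hypothetical binary hash length in bytes */

struct entry {
	struct entry *next; /* chain within one hash table bucket */
	unsigned char binary[BINARY_SIZE];
};

static uint32_t bitmap[BITMAP_BITS / 32]; /* hottest data */
static struct entry *table[TABLE_SIZE];   /* touched only on a bitmap hit */

/* Assumption: the first 4 bytes of the binary serve as a partial hash. */
static uint32_t partial(const unsigned char *binary)
{
	uint32_t h;
	memcpy(&h, binary, sizeof(h));
	return h;
}

static void add_hash(struct entry *e)
{
	uint32_t h, bit;
	assert(e != NULL);
	h = partial(e->binary);
	bit = h % BITMAP_BITS;
	bitmap[bit / 32] |= (uint32_t)1 << (bit % 32);
	e->next = table[h % TABLE_SIZE];
	table[h % TABLE_SIZE] = e;
}

/* Returns 1 if the computed hash matches a loaded one, 0 otherwise. */
static int lookup(const unsigned char *computed)
{
	uint32_t h = partial(computed);
	uint32_t bit = h % BITMAP_BITS;
	const struct entry *e;

	/* Common case: a bitmap miss, nothing else is read. */
	if (!(bitmap[bit / 32] & ((uint32_t)1 << (bit % 32))))
		return 0;

	/* Bitmap hit: walk the bucket; only now are the hashes read. */
	for (e = table[h % TABLE_SIZE]; e; e = e->next)
		if (!memcmp(e->binary, computed, BINARY_SIZE))
			return 1;
	return 0;
}
```

In this sketch most wrong candidates are rejected by the bitmap alone,
which is why the bitmap would be the hottest part of the working set.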

> But I certainly understand the goal, and it certainly should not hurt
> overall memory usage.

Actually, if we split binary into hot/cold portions, then we'll need an
extra pointer per hash.  (Well, unless there's some obvious way to
derive the address of the cold portion from the address of the hot
portion, but that's tricky.)
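
To make the extra pointer concrete, here is a rough C sketch of a
binary split into hot and cold portions.  The split point (HOT_SIZE),
the struct, and the function names are hypothetical and chosen only for
illustration; the cold pool is assumed to be a separate, caller-provided
buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BINARY_SIZE 16 /* hypothetical full binary length in bytes */
#define HOT_SIZE    4  /* hypothetical hot prefix length in bytes */
#define COLD_SIZE   (BINARY_SIZE - HOT_SIZE)

struct hot_entry {
	unsigned char hot[HOT_SIZE]; /* compared on every hash table hit */
	unsigned char *cold;         /* the extra pointer per hash */
};

/* Store one binary: the prefix goes into the hot entry, the rest into
 * slot 'index' of a separate cold pool. */
static void split_store(struct hot_entry *e, unsigned char *cold_pool,
    size_t index, const unsigned char *binary)
{
	assert(e != NULL && cold_pool != NULL && binary != NULL);
	memcpy(e->hot, binary, HOT_SIZE);
	e->cold = cold_pool + index * COLD_SIZE;
	memcpy(e->cold, binary + HOT_SIZE, COLD_SIZE);
}

/* Compare the hot part first; touch cold memory only if it matches. */
static int split_match(const struct hot_entry *e,
    const unsigned char *computed)
{
	if (memcmp(e->hot, computed, HOT_SIZE))
		return 0;
	return !memcmp(e->cold, computed + HOT_SIZE, COLD_SIZE);
}
```

The 'cold' member is the per-hash cost in question: it is needed here
because nothing else ties a hot entry to its cold bytes.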

For the current binary/source setup, you're right: moving source to
cold doesn't cost us extra memory overall, except maybe for a small and
fixed amount to maintain the second pool.  In fact, it might save
memory on alignment gaps.

We could save memory by eliminating the source pointer and having it
calculated as (char *)binary + binary_size, though.  The hot/cold thing
prevents us from doing that.
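
A minimal C sketch of the pointer-less layout mentioned above, assuming
the source string is stored NUL-terminated immediately after the binary
in one allocation.  The function names are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocate one block holding the binary followed by the source string.
 * Returns a pointer to the binary, or NULL on allocation failure. */
static void *alloc_binary_with_source(const void *binary,
    size_t binary_size, const char *source)
{
	size_t source_size = strlen(source) + 1; /* include the NUL */
	unsigned char *p;

	assert(binary_size > 0);
	p = malloc(binary_size + source_size);
	if (!p)
		return NULL;
	memcpy(p, binary, binary_size);
	memcpy(p + binary_size, source, source_size);
	return p;
}

/* No stored source pointer: the address is computed from the binary. */
static const char *source_of(const void *binary, size_t binary_size)
{
	return (const char *)binary + binary_size;
}
```

This only works while binary and source are adjacent, which is exactly
what moving source to a separate cold pool would break.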

> Do you have any problem with me starting on the source() (or rebuild_hash()
> or whatever) method within the current jumbo john (prior to 1.8), just to start
> working through any unforeseen issues?  If you have no problem, then what
> interface would you like to use, so that I could start on that and have
> it built using the interface you would like?

My problem is that it's not one of my priorities now.  I can't even
afford to spend time on discussing this further now.  Overall, I thought
the interface would be very similar to what you proposed.

> I am not looking at starting to split the binary (just yet), but am looking
> at starting on the 'optional' source() method, to eliminate having to have
> the hashes allocated (IF a format can recreate the hash, some may not be
> able to do that)?

Please feel free to experiment with that if you like.
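
For illustration only, an optional source() method of the kind discussed
above might look roughly like the C sketch below.  This is not an agreed
or existing JtR interface: the struct, the signature, the "$dummy$" tag,
and the hex encoding are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BINARY_SIZE 4         /* hypothetical binary length in bytes */
#define TAG         "$dummy$" /* hypothetical hash string prefix */

struct format_sketch {
	/* Optional: NULL if the format can't recreate the hash string. */
	char *(*source)(char *out, size_t out_size, const void *binary);
};

/* Rebuild "TAG + lowercase hex" from the binary. */
static char *hex_source(char *out, size_t out_size, const void *binary)
{
	static const char hex[] = "0123456789abcdef";
	const unsigned char *b = binary;
	size_t i, pos = strlen(TAG);

	assert(out_size >= pos + 2 * BINARY_SIZE + 1);
	memcpy(out, TAG, pos);
	for (i = 0; i < BINARY_SIZE; i++) {
		out[pos++] = hex[b[i] >> 4];
		out[pos++] = hex[b[i] & 15];
	}
	out[pos] = '\0';
	return out;
}

/* Use source() when available, else fall back to the stored string. */
static const char *get_source(const struct format_sketch *f,
    const char *stored, char *buf, size_t buf_size, const void *binary)
{
	if (f->source)
		return f->source(buf, buf_size, binary);
	return stored;
}
```

A format that sets source to NULL would keep its stored strings, so only
formats able to recreate the hash would drop the allocation.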

Thanks,

Alexander
