john-dev - calloc()/mmap() vs. fork() (was: Judy array)

Message-ID: <20150917185315.GA18545@openwall.com>
Date: Thu, 17 Sep 2015 21:53:15 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: calloc()/mmap() vs. fork() (was: Judy array)

On Tue, Sep 15, 2015 at 06:27:37PM +0300, Solar Designer wrote:
> On Tue, Sep 15, 2015 at 06:40:38AM +0300, Solar Designer wrote:
> > The change to PASSWORD_HASH_SIZE_FOR_LDR speeds up startup by a few
> > seconds.  The table size used is 16M elements, so 128 MB on 64-bit or
> > 64 MB on 32-bit systems.  I think that's acceptable these days,
> > especially given that for tiny files (which is what people might process
> > on tiny systems) that's just address space rather than memory
> > allocation.  This memory is freed after loading is complete.
> 
> I was wrong in that it's "just address space rather than memory
> allocation".  We have to zeroize the hash table, so it's memory (unless
> we were to explicitly use mmap, or used an implementation of calloc that
> uses mmap internally for sizes like this and skips zeroization then).

I've just tried changing our mem_calloc() to make it actually use
calloc(), and using it for the hash tables and bitmaps in loader.c.
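
To illustrate the kind of change meant here, a minimal sketch of a
calloc()-backed wrapper follows.  The name mem_calloc_sketch() and its
error handling are assumptions made for this example, not the actual
code in the tree.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a calloc()-backed mem_calloc(); the real
 * function in the tree may differ in signature and error handling. */
static void *mem_calloc_sketch(size_t nmemb, size_t size)
{
	void *res;

	assert(nmemb && size); /* callers are assumed to pass non-zero sizes */

	/* For large requests the C library may obtain the block from mmap(),
	 * in which case the kernel hands out pages that are already zero. */
	res = calloc(nmemb, size);
	if (!res) {
		perror("calloc");
		exit(1);
	}
	return res;
}
```

A loader hash table of 16M pointers would then be requested as
mem_calloc_sketch(1 << 24, sizeof(void *)) and released with free().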

The result is surprising: fork() became slow.  Like, it literally takes
multiple seconds just to spawn the 7 child processes.  So e.g. if I
press a key a few seconds after all 8 processes are finally cracking, I
see them display very different timestamps (up to 10 seconds apart).
Perhaps our partial and out-of-order filling of the hash tables and
bitmaps results in page tables that are costly to fork().
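
One way to check this guess outside of John might be a small test
program like the sketch below.  It calloc()'s a large table, touches
scattered bytes in a pseudo-random order, and times a single fork() as
seen by the parent.  The function name, the multiplier used to scatter
the writes, and the sizes are all assumptions for illustration; it does
not reproduce the loader's actual access pattern.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static double now_seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Returns the seconds the parent spent in fork(), or -1.0 on error. */
static double time_fork_after_sparse_fill(size_t bytes, size_t touches)
{
	unsigned char *table;
	double start, elapsed;
	size_t i;
	pid_t pid;

	assert(bytes > 0);
	table = calloc(bytes, 1);
	if (!table)
		return -1.0;

	/* Touch scattered bytes out of order, so that only some pages
	 * of the zero-filled region end up populated. */
	for (i = 0; i < touches; i++)
		table[(size_t)(((unsigned long long)i * 2654435761ULL) % bytes)] = 1;

	start = now_seconds();
	pid = fork();
	if (pid == 0)
		_exit(0); /* child: nothing to do */
	elapsed = now_seconds() - start;

	if (pid < 0) {
		free(table);
		return -1.0;
	}
	waitpid(pid, NULL, 0);
	free(table);
	return elapsed;
}
```

Comparing the result for different table sizes and touch counts, and
against a table that was fully written with memset(), should show
whether the sparse fill is what makes fork() expensive on a given
kernel.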

I think the ones we allocate just for the loader's own use and free
before fork()'ing may still be efficiently calloc()'ed, though.  Perhaps
the page table complexity is undone when we free those, with the free()
translating to munmap() and thus letting the kernel undo the magic.
I am getting similar total running times for this combination as I was
before any calloc() use.  Well, maybe just slightly worse.

And yes, per strace calloc() uses mmap() for large enough sizes, and I
hope it's smart enough to skip explicit zeroization then (in fact, if it
weren't, there wouldn't be the performance regressions above).
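
For comparison, the explicit mmap() alternative mentioned in the quoted
text could look roughly like the sketch below.  Anonymous private
mappings are zero-filled by the kernel, so no explicit zeroization is
needed.  The helper names are made up for this example.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: returns size bytes of zero-filled memory,
 * or NULL on failure. */
static void *alloc_zero_pages(size_t size)
{
	void *p;

	assert(size > 0); /* mmap() rejects a zero length */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/* Returns 0 on success, like munmap(); size must match the allocation. */
static int free_zero_pages(void *p, size_t size)
{
	return munmap(p, size);
}
```

Unlike with calloc(), the caller has to remember the size for the
matching munmap(), but the mapping and unmapping points are then under
our control rather than the C library's.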

Smart isn't always good.

We might want to revisit this once we implement keeping the loaded
hashes in a shared memory region.

Alexander
