Date: Sun, 22 Apr 2018 22:00:10 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: What's wrong with musl's malloc

On Sun, Apr 22, 2018 at 09:34:50PM +0200, Markus Wichmann wrote:
> On Fri, Apr 20, 2018 at 04:09:04PM -0400, Rich Felker wrote:
> > I've talked on and off about musl's malloc needing to be overhauled or
> > replaced, and gaining the ability to experiment with that is one of
> > the motivations for making malloc interposable/replaceable. Here's a
> > brief write-up of what's wrong that needs to be addressed:
> 
> Yeah, I was about to ask for one.
> 
> > The main benefits of musl's malloc vs the standard dlmalloc algorithms
> > it's based on is the fine-grained locking. As long as there are binned
> > free chunks of various sizes available, threads calling malloc will
> > only contend for a lock when they're requesting allocations of the same or
> > similar size. This works out well under artificial random loads; I'm
> > not sure how much it helps on typical real-world loads.
> 
> That chiefly depends on the design of the programs. Mine tend to avoid
> heap allocation wherever possible, and wherever impossible tend to
> coalesce allocations into few large requests. However, a lot of

For such programs, the properties of malloc are largely irrelevant, so
the only optimization that makes sense for them is ensuring that the
malloc implementation not do anything dumb like allocate tens to
hundreds of kB or even MB for expensive bookkeeping structures, etc.

> object-oriented code I have seen (even in C) allocates many small
> objects.

Indeed, this is the class of programs that are most affected, and
includes things like web browsers.

> But that is good: In the small numbers, there are many bins available.
> If the program makes random requests for anything between 16 and 128
> bytes, there are 4 bins for that. Add 2k to each of these and they all
> fall into one bin.

This is incorrect. Up to size 512 bytes (1024 on 64-bit archs), every
single multiple of 16 (or of 32, on 64-bit) is its own bin. Only past
that point does sparse spacing of bins begin.

> > In any case, the fine-grained locking also created a problem I didn't
> > anticipate when I designed it: when allocating memory that involves
> > splitting a free chunk, or when freeing memory that will get merged
> > with an adjacent free chunk, the operation as a whole is not atomic;
> > rather, a large free chunk is consumed, possibly emptying the bin it
> > lies in, split or merged, then returned either to the same or a
> > different bin. By saying this operation is non-atomic, I mean other
> > threads see the intermediate state where the large free chunk has been
> > consumed but not yet returned, and when this happens, they
> > unnecessarily split another free chunk or expand the heap rather than
> > waiting for it to be returned. This yields bad, potentially
> > catastrophic, fragmentation.
> > 
> 
> Fragmentation is bad with the current malloc() even in the single
> threaded case. A simple example is a request for fifty bytes followed by
> a request for two thousand fifty bytes. If the heap was empty before,
> the first request will allocate a page from the OS. Assuming that was
> 4k, malloc will now split off the rest of the page and put it in the bin
> for "chunks sized 2k - 4k". The second request however will be rounded
> up, and seeing as the bin for "chunks sized 4k - 8k" is still empty, two
> more pages will be allocated from the system. Then the requested 2k and
> change will be split off, leaving the heap with one free chunk in the
> "2k - 4k" bin that is just a bit beneath 4k, and one free chunk in the
> "4k - 8k" bin that is circa 6k in size. Three pages were allocated where
> one would have sufficed. (That is one definition of fragmentation: The
> request could not be serviced although the resources were available.)

This is not how it works, at least not at scale. When the heap is
expanded via brk, the expansion merges with the existing top of the
heap; discrete pages are not relevant. When it's expanded via mmap
(because brk doesn't work or is blocked by some other mapping), it's
slightly true for the first few allocations, but asymptotically
irrelevant since we don't keep allocating small maps. The amount of
new anon memory mapped grows exponentially, doubling on every second
expansion.

There is also no bin for "sized 2k to 4k". It's "sized 2048 bytes to
2560 bytes", etc. -- the granularity of bin sizes is in 1/4 steps up
to the next power of 2. So if you do start with a 4k chunk, after
splitting off 50 bytes to allocate, you are left with a chunk in the
"sized 3584 up to 4096" bin, and the 2050-byte allocation is perfectly
satisfiable from it. You could adjust the example to work, but then
the fragmentation you find as a result is much lower.

> The benefit of this scheme, of course, is that allocations in the
> single-threaded case are bounded time: The algorithm is to pop off the
> first chunk in the bins large enough to support the request, or to
> allocate the memory necessary for that from the OS. In the worst case,
> allocation is
>     - OS memory allocation
>     - allocation of a chunk from the bin
>     - splitting that chunk
> 
> Each of these is constant time. I am not sure optimizing fragmentation
> is worth reducing the performance for. Especially in today's desktop
> and server systems, where anything below 16GB of RAM will just get
> laughed off the stage (slight exaggeration).

I've rarely used a system with 16GB of RAM, and plenty of systems
using musl have under 1GB. The old musl git/web server had 256MB
before the host shut down; now we're stuck with a more expensive one
with something like 768MB or 1GB. Also, 32-bit archs have a hard
*virtual* limit of 2, 3, or 4 GB (mostly 2 or 3) regardless of how
much physical memory you have. Treating 16GB as something reasonable
to have is how you get the glibc & mainstream browser ecosystem...

> > In the long term, I think the whole malloc design we're using now is
> > wrong, and should be replaced, but until the time comes to actually do
> > that, it may make sense to try to improve some of the worst
> > properties, or even to reduce or eliminate the granularity of the
> > locking if it proves not to be very valuable on real-world loads.
> > 
> 
> Care to elaborate on what is wrong with the current design of the
> malloc? Everything you named so far was just implementation issues, but
> nothing that's design related.

Designs that require splitting/merging chunks inherently require
excessive synchronization that makes them not scale well to many
threads/cores (and awful on NUMA, if anyone cares). They also have
inherent fragmentation problems with certain allocation patterns, like
alternating between small/large allocations then freeing all the large
ones, which is a reasonable pattern (e.g. allocating large working
spaces and small result spaces, then freeing all the working spaces).

Designs where all the metadata (and especially freelist pointers) are
inline are highly vulnerable to heap overflow and use-after-free
attacks. If possible, I'd rather have less metadata inline and have it
efficiently coupled with something out-of-line that's harder to
attack. This might also improve performance and reduce contention,
e.g. if we used atomic bitmasks in an index of chunks to allocate/free
them rather than having to move them in and out of linked lists.

> So far, I am also not impressed with the competition. dietlibc for
> instance runs essentially the same algorithm, but with a big global lock
> in the multi-threaded case. tcmalloc runs essentially the same
> algorithm, but with an arena for each thread. Therefore every chunk has
> to know which arena it came from, thus increasing the overhead by a
> pointer. But it is all fundamentally the same. Not like with sorting
> algorithms, where you get meaningful differences and tradeoffs. No, the

Here when I say "whole design is wrong" I'm considering all these as
the same basic design -- they're all dlmalloc variants. Doing better
in important ways while not doing worse in any important way is a hard
problem. But it seems like one worth solving.

Rich
