Date: Sun, 22 Apr 2018 21:34:50 +0200
From: Markus Wichmann <nullplan@....net>
To: musl@...ts.openwall.com
Subject: Re: What's wrong with musl's malloc

On Fri, Apr 20, 2018 at 04:09:04PM -0400, Rich Felker wrote:
> I've talked on and off about musl's malloc needing to be overhauled or
> replaced, and gaining the ability to experiment with that is one of
> the motivations for making malloc interposable/replacable. Here's a
> brief write-up of what's wrong that needs to be addressed:
> 

Yeah, I was about to ask for one.

> 
> The main benefits of musl's malloc vs the standard dlmalloc algorithms
> it's based on is the fine-grained locking. As long as there are binned
> free chunks of various sizes available, threads calling malloc will
> only contend for a lock when they're requesting allocations of same or
> similar size. This works out well under artificial random loads; I'm
> not sure how much it helps on typical real-world loads.
> 
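
The scheme, as I understand it, is roughly this (a hypothetical
sketch with made-up names, not the actual musl source):

    #include <pthread.h>
    #include <stddef.h>

    #define NBINS 64

    struct chunk { struct chunk *next; };  /* free-list node, simplified */

    static struct bin {
        pthread_mutex_t lock;   /* one lock per size class */
        struct chunk *head;     /* free chunks of this class */
    } bins[NBINS];              /* locks assumed initialized at startup */

    static int size_to_bin(size_t n);      /* size-class map, see below */

    /* Two threads contend only if their request sizes map to the
     * same bin index. */
    static struct chunk *bin_pop(size_t n)
    {
        int i = size_to_bin(n);
        pthread_mutex_lock(&bins[i].lock);
        struct chunk *c = bins[i].head;
        if (c) bins[i].head = c->next;
        pthread_mutex_unlock(&bins[i].lock);
        return c;
    }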

That chiefly depends on the design of the programs. Mine tend to avoid
heap allocation wherever possible, and where that is impossible, to
coalesce allocations into a few large requests. However, a lot of
object-oriented code I have seen (even in C) allocates many small
objects.

But that is good: at small sizes, many bins are available. If a
program makes random requests for anything between 16 and 128 bytes,
four bins cover that range; add 2k to each of those sizes, and they
all fall into a single bin.
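
A toy size-class map makes the point; with logarithmic spacing, all
of the resolution sits at the small sizes (illustrative only, not
musl's exact spacing):

    #include <stddef.h>

    /* toy: bin 0 holds sizes up to 32 bytes, each further bin
     * covers roughly a doubling */
    static int size_to_bin(size_t n)
    {
        int i = 0;
        for (n >>= 5; n; n >>= 1) i++;
        return i;
    }

    /* size_to_bin(16) == 0, size_to_bin(128) == 3: four bins.
     * size_to_bin(16 + 2048) == size_to_bin(128 + 2048) == 7:
     * one bin. */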

> 
> In any case, the fine-grained locking also created a problem I didn't
> anticipate when I designed it: when allocating memory that involves
> splitting a free chunk, or when freeing memory that will get merged
> with an adjacent free chunk, the operation as a whole is not atomic;
> rather, a large free chunk is consumed, possibly emptying the bin it
> lies in, split or merged, then returned either to the same or a
> different bin. By saying this operation is non-atomic, I mean other
> threads see the intermediate state where the large free chunk has been
> consumed but not yet returned, and when this happens, they
> unnecessarily split another free chunk or expand the heap rather than
> waiting for it to be returned. This yields bad, potentially
> catastrophic, fragmentation.
> 
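
Spelled out in code, continuing the bins[] sketch from above (split(),
chunk_size(), and chunk_to_mem() are hypothetical helpers, declared
but not defined here):

    static struct chunk *split(struct chunk *, size_t);
    static size_t chunk_size(struct chunk *);
    static void *chunk_to_mem(struct chunk *);

    static void *alloc_split(int k, size_t n)
    {
        pthread_mutex_lock(&bins[k].lock);
        struct chunk *c = bins[k].head;     /* consume the large chunk */
        if (c) bins[k].head = c->next;      /* the bin may now be empty */
        pthread_mutex_unlock(&bins[k].lock);
        if (!c) return NULL;

        /* <-- the window: the remainder of c is invisible to other
         * threads here, so a concurrent malloc() of a similar size
         * sees the bin empty and splits yet another chunk, or
         * expands the heap, instead of waiting for the remainder. */

        struct chunk *rest = split(c, n);
        int j = size_to_bin(chunk_size(rest));
        pthread_mutex_lock(&bins[j].lock);
        rest->next = bins[j].head;          /* finally give it back */
        bins[j].head = rest;
        pthread_mutex_unlock(&bins[j].lock);
        return chunk_to_mem(c);
    }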

Fragmentation is bad with the current malloc() even in the
single-threaded case. A simple example is a request for 50 bytes
followed by a request for 2050 bytes. If the heap was empty before,
the first request will allocate a page from the OS. Assuming that
page was 4k, malloc will split off the rest of the page and put it
in the bin for "chunks sized 2k - 4k". The second request, however,
will be rounded up, and since the bin for "chunks sized 4k - 8k" is
still empty, two more pages will be allocated from the system. Then
the requested 2k and change will be split off, leaving the heap with
one free chunk in the "2k - 4k" bin that is just a bit below 4k, and
one free chunk in the "4k - 8k" bin that is circa 6k in size. Three
pages were allocated where one would have sufficed. (That is one
definition of fragmentation: a request could not be serviced although
the resources were available.)
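
In code the pattern is nothing exotic:

    #include <stdlib.h>

    int main(void)
    {
        void *a = malloc(50);    /* heap empty: one 4k page from the
                                    OS; the ~4k remainder goes into
                                    the "2k - 4k" bin */
        void *b = malloc(2050);  /* the "4k - 8k" bin is empty, so
                                    two more pages come from the OS,
                                    although the "2k - 4k" remainder
                                    would have fit */
        free(b);
        free(a);
        return 0;
    }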

The benefit of this scheme, of course, is that allocations in the
single-threaded case run in bounded time: the algorithm is to pop the
first chunk off the first bin large enough to support the request,
or, failing that, to allocate the necessary memory from the OS. In
the worst case, an allocation is
    - OS memory allocation
    - allocation of a chunk from the bin
    - splitting that chunk

Each of these is constant time. I am not sure reducing fragmentation
is worth giving up that performance for, especially on today's
desktop and server systems, where anything below 16GB of RAM just
gets laughed off the stage (slight exaggeration).
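
The constant-time bound hinges on finding a non-empty bin without
scanning; a bitmap of non-empty bins does that in one step (a sketch,
assuming a GCC-style count-trailing-zeros builtin):

    #include <stdint.h>

    static uint64_t binmap;   /* bit i set <=> bins[i] is non-empty */

    /* First non-empty bin at index k or above, in O(1): mask off
     * the bins below k, then find the lowest set bit. */
    static int first_usable_bin(int k)
    {
        uint64_t m = binmap & ~((1ull << k) - 1);
        return m ? __builtin_ctzll(m) : -1;  /* -1: go to the OS */
    }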

[...]
> 
> In the long term, I think the whole malloc design we're using now is
> wrong, and should be replaced, but until the time comes to actually do
> that, it may make sense to try to improve some of the worst
> properties, or even to reduce or eliminate the granularity of the
> locking if it proves not to be very valuable on real-world loads.
> 

Care to elaborate on what is wrong with the current design of the
malloc? Everything you have named so far has been an implementation
issue, nothing that is design-related.

So far, I am also not impressed with the competition. dietlibc for
instance runs essentially the same algorithm, but with a big global lock
in the multi-threaded case. tcmalloc runs essentially the same
algorithm, but with an arena for each thread. Therefore every chunk has
to know which arena it came from, thus increasing the overhead by a
pointer. But it is all fundamentally the same. Not like with sorting
algorithms, where you get meaningful differences and tradeoffs. No, the
tradeoffs appear to be entirely in the area I'd call tuning: Where do
you put the mmap threshold, how do you lock, do you optimize locality of
reference and tiny allocations, all that stuff. But as far as I can
tell, they all suffer from the problem described above. Well, OK, for
32-bit versions of dietlibc, 2050 bytes is already beyond the mmap
threshold. But if it weren't, my point would still stand.
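
The per-arena overhead I mean is just this (a hypothetical layout,
not tcmalloc's actual header):

    #include <stddef.h>

    struct arena;              /* per-thread allocator state */

    struct chunk_hdr {
        struct arena *owner;   /* the extra pointer per chunk, so
                                  free() can find the right arena */
        size_t size;
    };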

> Rich

Ciao,
Markus
