Date: Sun, 22 Apr 2018 21:34:50 +0200 From: Markus Wichmann <nullplan@....net> To: musl@...ts.openwall.com Subject: Re: What's wrong with musl's malloc On Fri, Apr 20, 2018 at 04:09:04PM -0400, Rich Felker wrote: > I've talked on and off about musl's malloc needing to be overhauled or > replaced, and gaining the ability to experiment with that is one of > the motivations for making malloc interposable/replacable. Here's a > brief write-up of what's wrong that needs to be addressed: > Yeah, I was about to ask for one. > > The main benefits of musl's malloc vs the standard dlmalloc algorithms > it's based on is the fine-grained locking. As long as there are binned > free chunks of various sizes available, threads calling malloc will > only contend for a lock when they're requesting allocations of same or > similar size. This works out well under artificial random loads; I'm > not sure how much it helps on typical real-world loads. > That chiefly depends on the design of the programs. Mine tend to avoid heap allocation wherever possible, and wherever impossible tend to coalesce allocations into few large requests. However, a lot of object-oriented code I have seen (even in C) allocates many small objects. But that is good: In the small numbers, there are many bins available. If the program makes random requests for anything between 16 and 128 bytes, there are 4 bins for that. Add 2k to each of these and they all fall into one bin. > > In any case, the fine-grained locking also created a problem I didn't > anticipate when I designed it: when allocating memory that involves > splitting a free chunk, or when freeing memory that will get merged > with an adjacent free chunk, the operation as a whole is not atomic; > rather, a large free chunk is consumed, possibly emptying the bin it > lies in, split or merged, then returned either to the same or a > different bin. By saying this operation is non-atomic, I mean other > threads see the intermediate state where the large free chunk has been > consumed but not yet returned, and when this happens, they > unnecessarily split another free chunk or expand the heap rather than > waiting for it to be returned. This yields bad, potentially > catastrophic, fragmentation. > Fragmentation is bad with the current malloc() even in the single threaded case. A simple example is a request for fifty bytes followed by a request for two-thousand fifty bytes. If the stack was empty before, the first request will allocate a page from the OS. Assuming that was 4k, malloc will now split off the rest of the page and put it in the bin for "chunks sized 2k - 4k". The second request however will be rounded up, and seeing as the bin for "chunks sized 4k - 8k" is still empty, two more pages will be allocated from the system. Then the requested 2k and change will be split off, leaving the heap with one free chunk in the "2k - 4k" bin that is just a bit beneath 4k, and one free chunk in the "4k - 8k" bin that is circa 6k in size. Three pages were allocated where one would have sufficed. (That is one definition of fragmentation: The request could not be serviced although the resources where available.) The benefit of this scheme, of course, is that allocations in the single-threaded case are bounded time: The algorithm is to pop off the first chunk in the bins large enough to support the request, or to allocate the memory necessary for that from the OS. In the worst case, allocation is - OS memory allocation - allocation of a chunk from the bin - splitting that chunk Each of these is constant time. I am not sure optimizing fragmentation is worth reducing the performance for. Especially in today's desktop and server systems, where anything below 16GB of RAM will just get laughed off the stage (slight exaggeration). [...] > > In the long term, I think the whole malloc design we're using now is > wrong, and should be replaced, but until the time comes to actually do > that, it may make sense to try to improve some of the worst > properties, or even to reduce or eliminate the granularity of the > locking if it proves not to be very valuable on real-world loads. > Care to elaborate on what is wrong with the current design of the malloc? Everything you named so far was just implementation issues, but nothing that's design related. So far, I am also not impressed with the competition. dietlibc for instance runs essentially the same algorithm, but with a big global lock in the multi-threaded case. tcmalloc runs essentially the same algorithm, but with an arena for each thread. Therefore every chunk has to know which arena it came from, thuns increasing the overhead by a pointer. But it is all fundamentally the same. Not like with sorting algorithms, where you get meaningful differences and tradeoffs. No, the tradeoffs appear to be entirely in the area I'd call tuning: Where do you put the mmap threshold, how do you lock, do you optimize locality of reference and tiny allocations, all that stuff. But as far as I can tell, they all suffer from the problem described above. Well, OK, for 32-bit versions of dietlibc, 2050 bytes is already beyond the mmap threshold. But if it weren't, my point would still stand. > Rich Ciao, Markus
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.