musl - Re: What's wrong with musl's malloc

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-id: <0EC2E771-368F-454E-B08D-7D6A5321F598@mac.com>
Date: Mon, 23 Apr 2018 09:40:08 +1200
From: Michael Clark <michaeljclark@....com>
To: musl@...ts.openwall.com
Subject: Re: What's wrong with musl's malloc


> On 21/04/2018, at 8:09 AM, Rich Felker <dalias@...c.org> wrote:
> 
> I've talked on and off about musl's malloc needing to be overhauled or
> replaced, and gaining the ability to experiment with that is one of
> the motivations for making malloc interposable/replacable. Here's a
> brief write-up of what's wrong that needs to be addressed:
> 
> 
> The main benefits of musl's malloc vs the standard dlmalloc algorithms
> it's based on is the fine-grained locking. As long as there are binned
> free chunks of various sizes available, threads calling malloc will
> only contend for a lock when they're requesting allocations of same or
> similar size. This works out well under artificial random loads; I'm
> not sure how much it helps on typical real-world loads.
> 
> 
> In any case, the fine-grained locking also created a problem I didn't
> anticipate when I designed it: when allocating memory that involves
> splitting a free chunk, or when freeing memory that will get merged
> with an adjacent free chunk, the operation as a whole is not atomic;
> rather, a large free chunk is consumed, possibly emptying the bin it
> lies in, split or merged, then returned either to the same or a
> different bin. By saying this operation is non-atomic, I mean other
> threads see the intermediate state where the large free chunk has been
> consumed but not yet returned, and when this happens, they
> unnecessarily split another free chunk or expand the heap rather than
> waiting for it to be returned. This yields bad, potentially
> catastrophic, fragmentation.
> 
> The malloc side already partially mitigates this with the "pretrim"
> function, which is used whenever the leftover part after splitting
> would remain in the same bin it was already in. In particular his
> covers splitting small allocations off a large "top of heap" zone, but
> it doesn't help with splitting small chunks. And the free side is much
> worse -- large free zones momentarily disappear every time something
> adjacent to them is freed.
> 
> In order to overcome this fully while maintaining the fine-grained
> locking, we'd need the ability to hold multiple locks concurrently,
> which would require a protocol for lock acquisition order to avoid
> deadlock. But there may be cheap mitigations that could be made in
> free like on the malloc side. For instance it might be possible to
> make a simpler protocol whereby we could always hold the lock for bin
> 63, thereby avoiding exposing a state where large free zones are
> unavailable. If there's a way to make this work, moving the binmap
> masking for newly-emptied bins from unbin() to unlock_bin() would
> ensure that malloc doesn't "see" a "temporary empty" state.
> 
> 
> In the long term, I think the whole malloc design we're using now is
> wrong, and should be replaced, but until the time comes to actually do
> that, it may make sense to try to improve some of the worst
> properties, or even to reduce or eliminate the granularity of the
> locking if it proves not to be very valuable on real-world loads.

Hi Rich,

Have you considered importing jemalloc?

It’s been battle tested for 12 years in FreeBSD and in many other apps. Given it’s bundled and used in MariaDB which is undoubtedly a demanding user and presumably Monty approved the choice of allocator, so it shouldn’t be a bad choice. I’d trust the allocator choice of a well respected database architect.

From looking at the Lockless, Inc benchmarks jemalloc has good performance from very small to very large allocations. From the Lockless Inc memory allocator comparisons: “The disadvantage of the jemalloc allocator is its memory usage. It uses power-of-two sized bins, which leads to a greatly increased memory footprint compared to other allocators. This can affect real-world performance due to excess cache and TLB misses.”

From the wiki:

- https://github.com/jemalloc/jemalloc/wiki/Background

“jemalloc is a general purpose malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support. It is intended for use as the system-provided memory allocator, as in FreeBSD's libc library, as well as for linking into C/C++ applications. jemalloc provides many introspection, memory management, and tuning features beyond the standard allocator functionality.”

It’s a libc allocator.

Here are some of the major users (who it’s fair to say must have exercised due consideration in choosing their memory allocator given there are browsers (Firefox’s mozjemalloc fork), several databases and caches):

- FreeBSD
- NetBSD
- Mozilla Firefox
- Varnish
- Cassandra
- Redis
- hhvm
- MariaDB
- Aerospike
- Android

Curious how big the linked code is... assuming code size is an important consideration...

Looking at the code, ... it doesn’t look too big... and it’s interesting to note that the recent changes to glibc to support per thread pools uses the same tcache feature name as jemalloc whose per thread pool implementation is in tcache.c

- https://sourceware.org/ml/libc-alpha/2017-01/msg00452.html
- https://sourceware.org/ml/libc-alpha/2017-01/msg00524.html

musl libc could vendor the jemalloc code with some sed scripts to modify includes so that jemalloc fits into the musl build structure and can be easily synced with upstream. Ideally musl would want to import/vendor vs using a submodule so musl remains stand-alone. I assume many projects have done the same. MariaDB bundles it. Firefox forked it.

I think the primary disadvantage of jemalloc is power of 2 size bins. Non power of 2 size strides for small structures have better caching performance for many workloads due to stochastic cache eviction when not using power of 2 strides. I’ve noticed that power of 2 alignments can have pathological performance in some cases. i.e. witnessing an array with sizeof(struct) == 28 or 36 perform much better than sizeof(struct) == 32. That said most apps would be using a single malloc call for these types of performance critical arrays so it’s only an issue for apps that individually malloc many small objects.

Worth considering.

BTW - I owe you an update on the riscv musl port. I’ll post one soon... I’ve been busy on the QEMU RISC-V port, whose ‘virt’ board has recently been used by Fedora and Debian distro porters for riscv64 arch support (glibc based distros). Stefan O’Rear got MTTCG working on qemu/target/riscv so we can run SMP Linux in QEMU with reasonable performance. Debian for example now has QEMU RISC-V builders for the riscv64 arch.

- https://github.com/riscv/riscv-qemu/wiki

It will be nice to get the riscv musl support upstream. The major issue at present is ELF thread local storage (pthreads are working but GCC’s __thread is not). The RISC-V TLS model is not documented so it requires some reverse engineering of the riscv support in GCC/binutils/glibc. I haven’t been able to isolate the issue and could  benefit from some help by someone more knowledgable in that area (ELF TLS and architecture specific TLS models).

Regards,
Michael
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.