musl - Non-candidate allocator designs, things to learn from them

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191106034605.GB16318@brightrain.aerifal.cx>
Date: Tue, 5 Nov 2019 22:46:05 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Non-candidate allocator designs, things to learn from them

On Tue, Oct 22, 2019 at 01:40:51PM -0400, Rich Felker wrote:
> Alongside time64 and other enhancements, over the next few months I'll
> be working on a major project to replace musl's malloc, which has
> known fundamental problems that can't really be fixed without
> significant redesign.
>
> [...]
>
> I'll follow up soon with some thoughts/findings so far on designs that
> might work for us, or that don't work but have interesting properties
> that suggest they may lead to workable designs.

The following was written incrementally over the past week or so, and
as I indicated before it's not so much about a design that works, but
about identifying properties that will motivate a good design.

The current dlmalloc-style split/merge of free chunks is problematic
in a lot of ways (complex interaction with fragmentation, hard to make
efficient with threads because access to neighbors being merged has to
be synchronized, etc.), so I want to first explore some simple designs
that completely lack it.


1. Extending a bump allocator

The simplest of all is a bump allocator with free stacks. This is just
one step beyond musl's current simple_malloc (used for static linking
without free). The only changes are that you round up requests to
discrete size classes, and then implement free as pushing onto a free
stack, with one stack per size class. Because it's a stack, a singly
linked list suffices, and you can push/pop solely with atomics.

One big pro is that there is no progressive fragmentation. If you
successfully make a complex sequence of allocations, free some subset,
then allocate the previously-freed sizes again in different order,
it's guaranteed to succeed. This is a really desirable property for
long-lived processses in certain kinds of memory-constrained systems.

Of course, the permanence of how memory is split up can also be a con.
After successfully performing very large numbers of small allocations
and freeing them, future large allocations may not be possible, and
vice versa.

Stack order is also undesirable for catching or mitigating double-free
and use-after-free. LRU order will give best results there. Switching
to LRU/FIFO order is possible, but no longer admits simple atomics for
alloc/free.


2. Zoned size classes

The above bump allocators with free essentially partition address
space into separate memory usable for each size class in a way that's
(one-time-per-run) adaptive, and completely unorganized. Another
simple type of allocator partitions (virtual) address space
*statically*, reserving (PROT_NONE, later mprotectable to writable
storage) for each size a large range of addresses usable as
slabs/object pools for objects of that particular size. This type of
design appears in GrapheneOS's hardened_malloc:

https://github.com/GrapheneOS/hardened_malloc

At first this seems pretty magical -- you automatically get a complete
lack of fragmentation... of *virtual* memory. It's possible to
fragment the zones themselves, requiring far more physical memory than
the logical amount allocated, but the physical memory usage by a given
size class is bounded by the peak usage in that size class at any
point in the program's lifetime. So that's pretty good.

In order to be able to do this, however, your virtual address space
has to be vastly larger than physical memory available. Ideally, each
size's zone should be larger than the total amount of physical memory
(ram+swap) you might have, so that you're not imposing an arbitrary
limit on what the application can do with it. Figuring around 50 size
classes and leaving at least half of virtual address space free for
mmap/large allocations, I figure your virtual address space should be
at least 100x larger than physical memory availability in order for
this to be viable. For 32-bit, that means it's not viable beyond
around 30 MB of physical memory. Ouch.

Ineed, GrapheneOS's hardened_malloc is explicitly only for 64-bit
archs, and this is why.

Aside from that, this kind of design also suffers from extreme
over-allocation in small programs, and it's not compatible with nommu
at all (since it relies on the mmu to cheat fragmentation). For
example, if you allocate just one object of each of 50 sizes classes,
you'll be using at least 50 pages (200k) already.

One advantage of this approach is that it admits nice out-of-band
metadata approaches, which let you make strong guarantees about the
consistency of the allocator state under heap corruption/overflows.


So..

Neither of these approaches is appropriate for the future of musl's
malloc. They're not sufficiently general-purpose/generalizable to all
environments and archs musl supports. But they do suggest properties
worth trying to achieve in an alternative. I've made some good
progress on this, which I'll follow up on in a separate post.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.