musl - [PATCH 1/1] Add documentation for mallocng

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <20211105135323.36524-2-yoan.picchi@foss.arm.com>
Date: Fri,  5 Nov 2021 13:53:23 +0000
From: yoan.picchi@...s.arm.com
To: musl@...ts.openwall.com
Cc: Yoan Picchi <yoan.picchi@....com>
Subject: [PATCH 1/1] Add documentation for mallocng

From: Yoan Picchi <yoan.picchi@....com>

This is an attempt at describing the various data structures of
mallocng and how they interract together.

Signed-off-by: Yoan Picchi <yoan.picchi@....com>
Change-Id: I37659bbbc6fb0c23dbd93c9eba27e51bfeb4715d
---
 src/malloc/mallocng/readme.txt | 284 +++++++++++++++++++++++++++++++++
 1 file changed, 284 insertions(+)
 create mode 100644 src/malloc/mallocng/readme.txt

diff --git a/src/malloc/mallocng/readme.txt b/src/malloc/mallocng/readme.txt
new file mode 100644
index 00000000..c0c04847
--- /dev/null
+++ b/src/malloc/mallocng/readme.txt
@@ -0,0 +1,284 @@
+This doc is intended to give the reader a high level view of how musl's malloc
+works through explaining its data structures, and is targeted for whomever
+wants to work on malloc or is just curious about how it works.
+
+You can find an alternative unofficial shorter explanation here:
+https://gist.github.com/MaskRay/ac54b26d72452ac77ac578f2e625369f
+And some other information can be found in the mailing list archive:
+https://www.openwall.com/lists/musl/2021/10/04/2
+
+
+
+    GROUP
+
+The group is the structure that will actually hold the user data. Given that
+this structure will be required for any size of allocation, it must be small
+to reduce the memory waste for tiny allocations, yet be able to hold a large
+amount of data. There are two slightly different usages depending on the size
+of the data to hold.
+
+    For small allocations (<128KiB):
+A group has several slots, up to 32. The header part of the group is 16B long
+on 64 bits platforms. This is defined by the UNIT constant, which also
+represent the minimal alignment inside a group. Then start the slots.
+
+     Small group:
++---------------------+
+| meta* meta          | \
+| int number_active:5 |  } UNIT (16B)
+| char padding[7]     | /
+|                     |
+|  +---------------+  | \
+|  |      slot     |  |  } sizeclass x UNIT
+|  +---------------+  | /
+|  |      slot     |  |
+|  +---------------+  |
+|       ... x32       |
+|  +---------------+  |
+|  |      slot     |  |
+|  +---------------+  |
++---------------------+
+
+Note that a slot uses some memory before the slot's start to store the in-band
+data (IB). This is safe because a slot's last 4B are always unused. For the
+first slot, "padding" should be at least 4 Bytes long to fit the IB data. This
+IB data stores the information required to find the various fields in the slot.
+IB is not defined by any struct, and is accessed solely through pointer
+arithmetic. If it had a structure, it would be like so:
+
+        Basic slot:                                         IB:
+  +---------------------+   <- before the slot    +----------------------+
+  |       IB (4B)       |                         | char need_big_offset |
+  +---------------------+                         | char index:5         |
++-------------------------+ <- slot's start       | char reserved:3      |
+| +---------------------+ |                       | uint16_t offset      |
+| |                     | |                       +----------------------+
+| |      user data      | |
+| |                     | |
+| +---------------------+ | \
+| |     canary (1B)     | |  \
+| +---------------------+ |   \
+| |        slack        | |    \
+| +---------------------+ |     } size = reserved
+| |   footer size (4B)  | |    /
+| |    (exist only if   | |   /
+| |    reserved == 5)   | |  /
+| +---------------------+ | /
+| |     unused (4B)     | |
+| +---------------------+ |
++-------------------------+
+
+"need_big_offset" is always null for multislot groups.
+"index" is the slot's index. It is used to find the slot's end once we have the
+meta (explained later on). It might be redundant as one can calculate it on the
+fly from the slot's address, the group's address and the sizeclass, but by
+keeping the number of slots to 32, those 5 index bits save us the aforenoted
+calculation.
+"reserved" specifies where the user data actually ends compared to the end of
+the slot. This way a canary can be placed to detect overflows right at the end
+of the user's data. If "reserved" is 0, then we have no space for the canary
+and so, we don't use any. Being only 3 bits long, we deal with larger footers
+by writing the size of the footer directly in the slot when we have enough
+space. "offset" is the offset between this slot's start, and the "storage"
+field in group (the start of the first slot). Given that all group slots are
+contiguous, this is used to access the metadata stored at the start of the
+group. Note that this offset is only a 16 bit value. To increase the range, it
+gets multiplied by UNIT, which is currently set to 16 and is the reason why 16
+appears in a lot of places. This offset of 16 bits explains why the largest
+multislot groups will only have up to 8 slots: 8*8191 = 2^16-8.
+This is the meaning of the basic IB data, and is enough to run malloc, but it
+can't support aligned_alloc yet. The group metadata being 16 Bytes long, what
+if one needs an alignment of 64B ?
+
+The slots being a fixed size means that most malloc calls actually won't fill
+it entirely and that we have some wiggle room as to where to place the user
+data in that slot: this is the "slack". But if the user data starts further
+back in the slot, how to find the IB ? The solution is to move the IB data, as
+well. This would mean though that once we return from the malloc call, we no
+longer know where the user data starts. This is no issue though, because the
+original IB slot has been repurposed. In the modified IB, the "offset_size"
+field gives us the offset between the start of the slot and the start of the
+user data. To make sure one doesn't use this IB as a "normal" IB, the
+"reserved" slot is set to 7, which triggers an assert if used to get the
+metadata.
+
+This offset is not only used in aligned_alloc, it is used in nearly all allocs.
+By cycling through every valid offset in a slot upon free and re-use, we
+increase the time to reuse of the same user address. This makes it easier to
+catch a double free.
+
+      Complex slot:                                    Modified IB:
+  +---------------------+                         +----------------------+
+  |   modified IB (4B)  |                         | need_big_offset = 0  |
+  +---------------------+                         | index:5 = undefined  |
++-------------------------+ <- slot's start       | reserved:3 = 7       |
+| +---------------------+ |                       | uint16_t offset_size |
+| |   optional offset   | |                       +----------------------+
+| +---------------------+ |
+| +---------------------+ |
+| |       IB (4B)       | |
+| +---------------------+ |
+| +---------------------+ |
+| |                     | |
+| |      user data      | |
+| |                     | |
+| +---------------------+ |
+| |     canary (1B)     | |
+| +---------------------+ |
+| |        slack        | |
+| +---------------------+ |
+| |   footer size (4B)  | |
+| +---------------------+ |
+| |     unused (4B)     | |
+| +---------------------+ |
++-------------------------+
+
+
+    For large allocations (≥128KiB):
+A group will only hold a single slot. It is not tracked and reused the same way
+as small groups are. It is mmapped upon allocation, and munmapped when freed.
+Given that this group has no maximum size, it is possible that one needs it to
+be aligned to a few MiB using aligned_alloc. This would be above what the 16
+bit offset (and offset_size) fields can hold. In this case, "need_big_offset"
+is set to 1 and an additional 4B is added to the IB to hold a 32 bits offset.
+Note that it limits alignment to 64GiB (offset*UNIT).
+It may look like we have a problem given that there is only 7 Bytes of padding,
+but given that there is only one slot, we never use "number_active", so we can
+afford overwriting it with the extra IB.
+
+Also note that in most case there is some slack in large groups as well because
+mmap allocates at a page granularity. Internally, musl works with 4KiB page and
+simply considers larger pages as a bunch of 4KiB pages. As such, the size of
+the slot is a multiple of those 4KiB pages and not a multiple of the actual
+page size.
+
+     Large group:
++---------------------+
+| meta* meta          | \
+| int number_active:5 |  } UNIT (16B)
+| char padding[7]     | /
+|                     |
+|  +---------------+  |
+|  | extra IB (4B) |  |
+|  +---------------+  | \
+|  |               |  |  \
+|  |      slot     |  |   } maplen x 4KiB
+|  |               |  |  /
+|  +---------------+  | /
++---------------------+  <- end at a 4KiB boundary
+
+Other than those few changes, large groups behave like small ones.
+
+
+
+    META
+
+Groups only have the first part of the metadata. The second part is the bound
+meta object. There's one for each group and they each have a pointer to each
+other.
+
+        Meta:
++--------------------+
+| meta* prev, next   | <- make a circular buffer
+| group* mem         |
+| int avail_mask     | \ define 3 states for each slot
+| int freed_mask     | /
+| size_t last_idx:5  |
+| size_t freeable:1  |
+| size_t sizeclass:6 | <- size if small group
+| size_t maplen:52   | <- size if large group
++--------------------+
+
+This is where the size of the slots is defined, using the "sizeclass" field. It
+is linked to a global array with 48 fixed size ranging from 1 to 8191. These
+sizes are then multiplied by UNIT to get the actual size in memory, hence the
+small/large group threshold at 128KiB. The "sizeclasses" [48-62] are unused and
+a "sizeclass" of 63 means it's a large group.
+Given the limitation of offset to 16 bits, there is a limitation of the number
+of slots for the largest of the small groups. It is set in "last_idx".
+The "avail_mask" and "freed_mask" are both bitmaps to hold the status of every
+slot in the group. The former holds whether the slot is available and never
+used, the latter whether the slot has been used but has been freed, or is
+currently inactive. When one calls malloc, it'll first attempt to allocate into
+an "avail_mask" slot. Failing this, it'll use a freed slot or activate some new
+slots. This allocation procedure is used to keep memory pages clean. A group
+can easily span several pages. As such, it's better to fill in slots in pages
+that have already been dirtied. This is what the "active_idx" field in group is
+used for. When a new slot is alloc-ed in a group, all the slots that are in the
+same page get activated.
+"maplen" is where the size of a large group is stored. It is the number of 4KiB
+pages the group spans.
+Then there is "freeable", which can be a bit more confusing. While it obviously
+specifies whether the group is freeable or not, currently a group is not
+freeable only when a new meta is allocated on zeroed memory.
+And last, the "prev" and "next" fields. This is so we can make a circular
+buffer out of those meta objects. This way, when one requires a slot, malloc
+can easily go through the various metas until it finds a satisfactory one. In
+practice it doesn't search for the optimal one but just uses some heuristics
+like "no slot available, need to activate a new slot ? take the next meta
+instead".
+
+
+
+    META_AREA
+
+We need some way to hold the memory where the meta will be allocated. That's
+the purpose of the meta_areas.
+
+     Meta_area:
++--------------------+
+| int64_t check      |
+| meta_area* next    | <- form a linked list
+| int nslot          |
+| meta slots[]       | <- slots to store meta objects
++--------------------+
+
+These work like a simple linked list where each link is bigger than the
+previous one. Growing the size exponentially ensures there will only be a small
+amount of links and mmap calls. "nslot" gives the number of slots where the
+meta object can stored in this link (≠ slots in groups). Though, in practice,
+"nslot" seems to be currently unused and the slots are tracked in the
+malloc_context object instead.
+
+
+
+    MALLOC_CONTEXT
+
+And finally, the context. This is a global singleton mainly used to manage the
+meta objects. Some of the fields have been omitted for brevity.
+
+        Malloc_context:
++------------------------------+
+| int64_t secret               |
+| meta* free_meta_head         | <- head of the free meta
+| meta* avail_meta             | <- next meta in meta_area
+| size_t avail_meta_count      |
+| size_t avail_meta_area_count |
+| meta_area head, tail         |
+| meta* active[48]             | <- circular buffer for each sizeclass
+| size_t usage_by_class[48]    | <- number of slot for each sizeclass
+| uint8_t unmap_seq[32]        | \ balance fragmentation/slot reuse
+| uint8_t bounce[32]           | /
++------------------------------+
+
+"head" and "tail" allow easy access to the meta_area. Given that meta slots are
+not tracked, we only need access to the tail of the list. There is some kind of
+memory protection trick with the meta area. When a new meta_area is created,
+only the first 4KiB page is made available for the slots. The number of slots
+fitting in that page is written down in "avail_meta_count". When
+"avail_meta_count" runs out of slots, another 4KiB worth of slots are added and
+"avail_meta_area_count" is decremented once. When it reaches 0, a new meta_area
+is mmap-ed. The point of doing so is that the pages originally mapped are
+initialized with no right to read nor write, and those rights are given back
+just as the malloc needs them. "avail_meta" and "free_meta_head" are the two
+sources of meta objects. The first is a pointer to the first free slot in the
+tail meta_area. The second is a pool of all the metas that have been freed.
+This is important because once again, the meta slots are not tracked in the
+meta area. If we weren't retaining the freed meta in this list, they would be
+lost and it would be a memory leak. This way, we prioritize reusing freed meta
+instead of initializing new ones. For easy access, there are a circular buffer
+for each sizeclass. Along with those there are counters of how many slots exist
+for any given sizeclass: "usage_by_class".
+The "bounce" and "unmap_seq" seems to be used for balancing fragmentation and
+address reuse.
+
-- 
2.27.0
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.