musl - Re: [PATCH 1/1] Add documentation for mallocng

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <52ad54d5-ffc0-6136-6a27-2fe75075c383@foss.arm.com>
Date: Mon, 24 Jan 2022 13:25:01 +0000
From: Yoan Picchi <yoan.picchi@...s.arm.com>
To: musl@...ts.openwall.com
Cc: yoan.picchi@....com
Subject: Re: [PATCH 1/1] Add documentation for mallocng

Hello.
Having received no review nor acknowledgement for the last few month I'm 
sending this reminder.
This patch is about adding some documentation for mallocng as the 
current lack of thereof makes
it time consuming to work with.


From: yoan.picchi@...s.arm.com <yoan.picchi@...s.arm.com>

Sent: 05 November 2021 13:36
To: musl@...ts.openwall.com
Cc: Yoan Picchi <Yoan.Picchi@....com>
Subject: [PATCH 1/1] Add documentation for mallocng

From: Yoan Picchi <yoan.picchi@....com>

This is an attempt at describing the various data structures of mallocng 
and how they interract together.

Signed-off-by: Yoan Picchi <yoan.picchi@....com>
Change-Id: I37659bbbc6fb0c23dbd93c9eba27e51bfeb4715d
---
  src/malloc/mallocng/readme.txt | 284 +++++++++++++++++++++++++++++++++
  1 file changed, 284 insertions(+)
  create mode 100644 src/malloc/mallocng/readme.txt

diff --git a/src/malloc/mallocng/readme.txt 
b/src/malloc/mallocng/readme.txt new file mode 100644 index 
00000000..c0c04847
--- /dev/null
+++ b/src/malloc/mallocng/readme.txt
@@ -0,0 +1,284 @@
+This doc is intended to give the reader a high level view of how musl's
+malloc works through explaining its data structures, and is targeted
+for whomever wants to work on malloc or is just curious about how it works.
+
+You can find an alternative unofficial shorter explanation here:
+https://gist.github.com/MaskRay/ac54b26d72452ac77ac578f2e625369f
+And some other information can be found in the mailing list archive:
+https://www.openwall.com/lists/musl/2021/10/04/2
+
+
+
+    GROUP
+
+The group is the structure that will actually hold the user data. Given
+that this structure will be required for any size of allocation, it
+must be small to reduce the memory waste for tiny allocations, yet be
+able to hold a large amount of data. There are two slightly different
+usages depending on the size of the data to hold.
+
+    For small allocations (<128KiB):
+A group has several slots, up to 32. The header part of the group is
+16B long on 64 bits platforms. This is defined by the UNIT constant,
+which also represent the minimal alignment inside a group. Then start 
the slots.
+
+     Small group:
++---------------------+
+| meta* meta          | \
+| int number_active:5 |  } UNIT (16B)
+| char padding[7]     | /
+|                     |
+|  +---------------+  | \
+|  |      slot     |  |  } sizeclass x UNIT
+|  +---------------+  | /
+|  |      slot     |  |
+|  +---------------+  |
+|       ... x32       |
+|  +---------------+  |
+|  |      slot     |  |
+|  +---------------+  |
++---------------------+
+
+Note that a slot uses some memory before the slot's start to store the
+in-band data (IB). This is safe because a slot's last 4B are always
+unused. For the first slot, "padding" should be at least 4 Bytes long
+to fit the IB data. This IB data stores the information required to 
find the various fields in the slot.
+IB is not defined by any struct, and is accessed solely through pointer
+arithmetic. If it had a structure, it would be like so:
+
+        Basic slot:                                         IB:
+  +---------------------+   <- before the slot +----------------------+
+  |       IB (4B)       |                         | char need_big_offset |
+  +---------------------+                         | char index:5         |
++-------------------------+ <- slot's start       | char reserved:3      |
+| +---------------------+ |                       | uint16_t offset      |
+| |                     | | +----------------------+
+| |      user data      | |
+| |                     | |
+| +---------------------+ | \
+| |     canary (1B)     | |  \
+| +---------------------+ |   \
+| |        slack        | |    \
+| +---------------------+ |     } size = reserved
+| |   footer size (4B)  | |    /
+| |    (exist only if   | |   /
+| |    reserved == 5)   | |  /
+| +---------------------+ | /
+| |     unused (4B)     | |
+| +---------------------+ |
++-------------------------+
+
+"need_big_offset" is always null for multislot groups.
+"index" is the slot's index. It is used to find the slot's end once we
+have the meta (explained later on). It might be redundant as one can
+calculate it on the fly from the slot's address, the group's address
+and the sizeclass, but by keeping the number of slots to 32, those 5
+index bits save us the aforenoted calculation.
+"reserved" specifies where the user data actually ends compared to the
+end of the slot. This way a canary can be placed to detect overflows
+right at the end of the user's data. If "reserved" is 0, then we have
+no space for the canary and so, we don't use any. Being only 3 bits
+long, we deal with larger footers by writing the size of the footer
+directly in the slot when we have enough space. "offset" is the offset 
between this slot's start, and the "storage"
+field in group (the start of the first slot). Given that all group
+slots are contiguous, this is used to access the metadata stored at the
+start of the group. Note that this offset is only a 16 bit value. To
+increase the range, it gets multiplied by UNIT, which is currently set
+to 16 and is the reason why 16 appears in a lot of places. This offset
+of 16 bits explains why the largest multislot groups will only have up 
to 8 slots: 8*8191 = 2^16-8.
+This is the meaning of the basic IB data, and is enough to run malloc,
+but it can't support aligned_alloc yet. The group metadata being 16
+Bytes long, what if one needs an alignment of 64B ?
+
+The slots being a fixed size means that most malloc calls actually
+won't fill it entirely and that we have some wiggle room as to where to
+place the user data in that slot: this is the "slack". But if the user
+data starts further back in the slot, how to find the IB ? The solution
+is to move the IB data, as well. This would mean though that once we
+return from the malloc call, we no longer know where the user data
+starts. This is no issue though, because the original IB slot has been 
repurposed. In the modified IB, the "offset_size"
+field gives us the offset between the start of the slot and the start
+of the user data. To make sure one doesn't use this IB as a "normal"
+IB, the "reserved" slot is set to 7, which triggers an assert if used
+to get the metadata.
+
+This offset is not only used in aligned_alloc, it is used in nearly all 
allocs.
+By cycling through every valid offset in a slot upon free and re-use,
+we increase the time to reuse of the same user address. This makes it
+easier to catch a double free.
+
+      Complex slot:                                    Modified IB:
+  +---------------------+ +----------------------+
+  |   modified IB (4B)  |                         | need_big_offset = 0  |
+  +---------------------+                         | index:5 = undefined  |
++-------------------------+ <- slot's start       | reserved:3 = 7       |
+| +---------------------+ |                       | uint16_t
+| +---------------------+ | offset_size |
+| |   optional offset   | | +----------------------+
+| +---------------------+ |
+| +---------------------+ |
+| |       IB (4B)       | |
+| +---------------------+ |
+| +---------------------+ |
+| |                     | |
+| |      user data      | |
+| |                     | |
+| +---------------------+ |
+| |     canary (1B)     | |
+| +---------------------+ |
+| |        slack        | |
+| +---------------------+ |
+| |   footer size (4B)  | |
+| +---------------------+ |
+| |     unused (4B)     | |
+| +---------------------+ |
++-------------------------+
+
+
+    For large allocations (≥128KiB):
+A group will only hold a single slot. It is not tracked and reused the
+same way as small groups are. It is mmapped upon allocation, and 
munmapped when freed.
+Given that this group has no maximum size, it is possible that one
+needs it to be aligned to a few MiB using aligned_alloc. This would be
+above what the 16 bit offset (and offset_size) fields can hold. In this 
case, "need_big_offset"
+is set to 1 and an additional 4B is added to the IB to hold a 32 bits 
offset.
+Note that it limits alignment to 64GiB (offset*UNIT).
+It may look like we have a problem given that there is only 7 Bytes of
+padding, but given that there is only one slot, we never use
+"number_active", so we can afford overwriting it with the extra IB.
+
+Also note that in most case there is some slack in large groups as well
+because mmap allocates at a page granularity. Internally, musl works
+with 4KiB page and simply considers larger pages as a bunch of 4KiB
+pages. As such, the size of the slot is a multiple of those 4KiB pages
+and not a multiple of the actual page size.
+
+     Large group:
++---------------------+
+| meta* meta          | \
+| int number_active:5 |  } UNIT (16B)
+| char padding[7]     | /
+|                     |
+|  +---------------+  |
+|  | extra IB (4B) |  |
+|  +---------------+  | \
+|  |               |  |  \
+|  |      slot     |  |   } maplen x 4KiB
+|  |               |  |  /
+|  +---------------+  | /
++---------------------+  <- end at a 4KiB boundary
+
+Other than those few changes, large groups behave like small ones.
+
+
+
+    META
+
+Groups only have the first part of the metadata. The second part is the
+bound meta object. There's one for each group and they each have a
+pointer to each other.
+
+        Meta:
++--------------------+
+| meta* prev, next   | <- make a circular buffer
+| group* mem         |
+| int avail_mask     | \ define 3 states for each slot
+| int freed_mask     | /
+| size_t last_idx:5  |
+| size_t freeable:1  |
+| size_t sizeclass:6 | <- size if small group
+| size_t maplen:52   | <- size if large group
++--------------------+
+
+This is where the size of the slots is defined, using the "sizeclass"
+field. It is linked to a global array with 48 fixed size ranging from 1
+to 8191. These sizes are then multiplied by UNIT to get the actual size
+in memory, hence the small/large group threshold at 128KiB. The
+"sizeclasses" [48-62] are unused and a "sizeclass" of 63 means it's a 
large group.
+Given the limitation of offset to 16 bits, there is a limitation of the
+number of slots for the largest of the small groups. It is set in 
"last_idx".
+The "avail_mask" and "freed_mask" are both bitmaps to hold the status
+of every slot in the group. The former holds whether the slot is
+available and never used, the latter whether the slot has been used but
+has been freed, or is currently inactive. When one calls malloc, it'll
+first attempt to allocate into an "avail_mask" slot. Failing this,
+it'll use a freed slot or activate some new slots. This allocation
+procedure is used to keep memory pages clean. A group can easily span
+several pages. As such, it's better to fill in slots in pages that have
+already been dirtied. This is what the "active_idx" field in group is
+used for. When a new slot is alloc-ed in a group, all the slots that 
are in the same page get activated.
+"maplen" is where the size of a large group is stored. It is the number
+of 4KiB pages the group spans.
+Then there is "freeable", which can be a bit more confusing. While it
+obviously specifies whether the group is freeable or not, currently a
+group is not freeable only when a new meta is allocated on zeroed memory.
+And last, the "prev" and "next" fields. This is so we can make a
+circular buffer out of those meta objects. This way, when one requires
+a slot, malloc can easily go through the various metas until it finds a
+satisfactory one. In practice it doesn't search for the optimal one but
+just uses some heuristics like "no slot available, need to activate a
+new slot ? take the next meta instead".
+
+
+
+    META_AREA
+
+We need some way to hold the memory where the meta will be allocated.
+That's the purpose of the meta_areas.
+
+     Meta_area:
++--------------------+
+| int64_t check      |
+| meta_area* next    | <- form a linked list
+| int nslot          |
+| meta slots[]       | <- slots to store meta objects
++--------------------+
+
+These work like a simple linked list where each link is bigger than the
+previous one. Growing the size exponentially ensures there will only be
+a small amount of links and mmap calls. "nslot" gives the number of
+slots where the meta object can stored in this link (≠ slots in
+groups). Though, in practice, "nslot" seems to be currently unused and
+the slots are tracked in the malloc_context object instead.
+
+
+
+    MALLOC_CONTEXT
+
+And finally, the context. This is a global singleton mainly used to
+manage the meta objects. Some of the fields have been omitted for brevity.
+
+        Malloc_context:
++------------------------------+
+| int64_t secret               |
+| meta* free_meta_head         | <- head of the free meta
+| meta* avail_meta             | <- next meta in meta_area
+| size_t avail_meta_count      |
+| size_t avail_meta_area_count |
+| meta_area head, tail         |
+| meta* active[48]             | <- circular buffer for each sizeclass
+| size_t usage_by_class[48]    | <- number of slot for each sizeclass
+| uint8_t unmap_seq[32]        | \ balance fragmentation/slot reuse
+| uint8_t bounce[32]           | /
++------------------------------+
+
+"head" and "tail" allow easy access to the meta_area. Given that meta
+slots are not tracked, we only need access to the tail of the list.
+There is some kind of memory protection trick with the meta area. When
+a new meta_area is created, only the first 4KiB page is made available
+for the slots. The number of slots fitting in that page is written down
+in "avail_meta_count". When "avail_meta_count" runs out of slots,
+another 4KiB worth of slots are added and "avail_meta_area_count" is
+decremented once. When it reaches 0, a new meta_area is mmap-ed. The
+point of doing so is that the pages originally mapped are initialized
+with no right to read nor write, and those rights are given back just
+as the malloc needs them. "avail_meta" and "free_meta_head" are the two
+sources of meta objects. The first is a pointer to the first free slot 
in the tail meta_area. The second is a pool of all the metas that have 
been freed.
+This is important because once again, the meta slots are not tracked in
+the meta area. If we weren't retaining the freed meta in this list,
+they would be lost and it would be a memory leak. This way, we
+prioritize reusing freed meta instead of initializing new ones. For
+easy access, there are a circular buffer for each sizeclass. Along with
+those there are counters of how many slots exist for any given 
sizeclass: "usage_by_class".
+The "bounce" and "unmap_seq" seems to be used for balancing
+fragmentation and address reuse.
+
--
2.27.0
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.