musl - Re: [PATCH 1/1] Add documentation for mallocng

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20220124144234.GG7074@brightrain.aerifal.cx>
Date: Mon, 24 Jan 2022 09:42:36 -0500
From: Rich Felker <dalias@...c.org>
To: Yoan Picchi <yoan.picchi@...s.arm.com>
Cc: musl@...ts.openwall.com, yoan.picchi@....com
Subject: Re: [PATCH 1/1] Add documentation for mallocng

On Mon, Jan 24, 2022 at 01:25:01PM +0000, Yoan Picchi wrote:
> Hello.
> Having received no review nor acknowledgement for the last few month
> I'm sending this reminder.
> This patch is about adding some documentation for mallocng as the
> current lack of thereof makes
> it time consuming to work with.

Hi. I'm sorry for not replying sooner. Documentation is intentionally
not maintained in the musl tree itself. There is a long-term official
doc project that's stalled, but also it's expected only to document
public contracts. The immediate-benefit place to put documentation of
internals, musl-specific behaviors, etc. is on the wiki. If there
doesn't seem to be a good place for this document on the wiki now,
myself or someone else can add one.

Thanks for your interest/work in documenting this.

Rich


> From: yoan.picchi@...s.arm.com <yoan.picchi@...s.arm.com>
> 
> Sent: 05 November 2021 13:36
> To: musl@...ts.openwall.com
> Cc: Yoan Picchi <Yoan.Picchi@....com>
> Subject: [PATCH 1/1] Add documentation for mallocng
> 
> From: Yoan Picchi <yoan.picchi@....com>
> 
> This is an attempt at describing the various data structures of
> mallocng and how they interract together.
> 
> Signed-off-by: Yoan Picchi <yoan.picchi@....com>
> Change-Id: I37659bbbc6fb0c23dbd93c9eba27e51bfeb4715d
> ---
>  src/malloc/mallocng/readme.txt | 284 +++++++++++++++++++++++++++++++++
>  1 file changed, 284 insertions(+)
>  create mode 100644 src/malloc/mallocng/readme.txt
> 
> diff --git a/src/malloc/mallocng/readme.txt
> b/src/malloc/mallocng/readme.txt new file mode 100644 index
> 00000000..c0c04847
> --- /dev/null
> +++ b/src/malloc/mallocng/readme.txt
> @@ -0,0 +1,284 @@
> +This doc is intended to give the reader a high level view of how musl's
> +malloc works through explaining its data structures, and is targeted
> +for whomever wants to work on malloc or is just curious about how it works.
> +
> +You can find an alternative unofficial shorter explanation here:
> +https://gist.github.com/MaskRay/ac54b26d72452ac77ac578f2e625369f
> +And some other information can be found in the mailing list archive:
> +https://www.openwall.com/lists/musl/2021/10/04/2
> +
> +
> +
> +    GROUP
> +
> +The group is the structure that will actually hold the user data. Given
> +that this structure will be required for any size of allocation, it
> +must be small to reduce the memory waste for tiny allocations, yet be
> +able to hold a large amount of data. There are two slightly different
> +usages depending on the size of the data to hold.
> +
> +    For small allocations (<128KiB):
> +A group has several slots, up to 32. The header part of the group is
> +16B long on 64 bits platforms. This is defined by the UNIT constant,
> +which also represent the minimal alignment inside a group. Then
> start the slots.
> +
> +     Small group:
> ++---------------------+
> +| meta* meta          | \
> +| int number_active:5 |  } UNIT (16B)
> +| char padding[7]     | /
> +|                     |
> +|  +---------------+  | \
> +|  |      slot     |  |  } sizeclass x UNIT
> +|  +---------------+  | /
> +|  |      slot     |  |
> +|  +---------------+  |
> +|       ... x32       |
> +|  +---------------+  |
> +|  |      slot     |  |
> +|  +---------------+  |
> ++---------------------+
> +
> +Note that a slot uses some memory before the slot's start to store the
> +in-band data (IB). This is safe because a slot's last 4B are always
> +unused. For the first slot, "padding" should be at least 4 Bytes long
> +to fit the IB data. This IB data stores the information required to
> find the various fields in the slot.
> +IB is not defined by any struct, and is accessed solely through pointer
> +arithmetic. If it had a structure, it would be like so:
> +
> +        Basic slot:                                         IB:
> +  +---------------------+   <- before the slot +----------------------+
> +  |       IB (4B)       |                         | char need_big_offset |
> +  +---------------------+                         | char index:5         |
> ++-------------------------+ <- slot's start       | char reserved:3      |
> +| +---------------------+ |                       | uint16_t offset      |
> +| |                     | | +----------------------+
> +| |      user data      | |
> +| |                     | |
> +| +---------------------+ | \
> +| |     canary (1B)     | |  \
> +| +---------------------+ |   \
> +| |        slack        | |    \
> +| +---------------------+ |     } size = reserved
> +| |   footer size (4B)  | |    /
> +| |    (exist only if   | |   /
> +| |    reserved == 5)   | |  /
> +| +---------------------+ | /
> +| |     unused (4B)     | |
> +| +---------------------+ |
> ++-------------------------+
> +
> +"need_big_offset" is always null for multislot groups.
> +"index" is the slot's index. It is used to find the slot's end once we
> +have the meta (explained later on). It might be redundant as one can
> +calculate it on the fly from the slot's address, the group's address
> +and the sizeclass, but by keeping the number of slots to 32, those 5
> +index bits save us the aforenoted calculation.
> +"reserved" specifies where the user data actually ends compared to the
> +end of the slot. This way a canary can be placed to detect overflows
> +right at the end of the user's data. If "reserved" is 0, then we have
> +no space for the canary and so, we don't use any. Being only 3 bits
> +long, we deal with larger footers by writing the size of the footer
> +directly in the slot when we have enough space. "offset" is the
> offset between this slot's start, and the "storage"
> +field in group (the start of the first slot). Given that all group
> +slots are contiguous, this is used to access the metadata stored at the
> +start of the group. Note that this offset is only a 16 bit value. To
> +increase the range, it gets multiplied by UNIT, which is currently set
> +to 16 and is the reason why 16 appears in a lot of places. This offset
> +of 16 bits explains why the largest multislot groups will only have
> up to 8 slots: 8*8191 = 2^16-8.
> +This is the meaning of the basic IB data, and is enough to run malloc,
> +but it can't support aligned_alloc yet. The group metadata being 16
> +Bytes long, what if one needs an alignment of 64B ?
> +
> +The slots being a fixed size means that most malloc calls actually
> +won't fill it entirely and that we have some wiggle room as to where to
> +place the user data in that slot: this is the "slack". But if the user
> +data starts further back in the slot, how to find the IB ? The solution
> +is to move the IB data, as well. This would mean though that once we
> +return from the malloc call, we no longer know where the user data
> +starts. This is no issue though, because the original IB slot has
> been repurposed. In the modified IB, the "offset_size"
> +field gives us the offset between the start of the slot and the start
> +of the user data. To make sure one doesn't use this IB as a "normal"
> +IB, the "reserved" slot is set to 7, which triggers an assert if used
> +to get the metadata.
> +
> +This offset is not only used in aligned_alloc, it is used in nearly
> all allocs.
> +By cycling through every valid offset in a slot upon free and re-use,
> +we increase the time to reuse of the same user address. This makes it
> +easier to catch a double free.
> +
> +      Complex slot:                                    Modified IB:
> +  +---------------------+ +----------------------+
> +  |   modified IB (4B)  |                         | need_big_offset = 0  |
> +  +---------------------+                         | index:5 = undefined  |
> ++-------------------------+ <- slot's start       | reserved:3 = 7       |
> +| +---------------------+ |                       | uint16_t
> +| +---------------------+ | offset_size |
> +| |   optional offset   | | +----------------------+
> +| +---------------------+ |
> +| +---------------------+ |
> +| |       IB (4B)       | |
> +| +---------------------+ |
> +| +---------------------+ |
> +| |                     | |
> +| |      user data      | |
> +| |                     | |
> +| +---------------------+ |
> +| |     canary (1B)     | |
> +| +---------------------+ |
> +| |        slack        | |
> +| +---------------------+ |
> +| |   footer size (4B)  | |
> +| +---------------------+ |
> +| |     unused (4B)     | |
> +| +---------------------+ |
> ++-------------------------+
> +
> +
> +    For large allocations (≥128KiB):
> +A group will only hold a single slot. It is not tracked and reused the
> +same way as small groups are. It is mmapped upon allocation, and
> munmapped when freed.
> +Given that this group has no maximum size, it is possible that one
> +needs it to be aligned to a few MiB using aligned_alloc. This would be
> +above what the 16 bit offset (and offset_size) fields can hold. In
> this case, "need_big_offset"
> +is set to 1 and an additional 4B is added to the IB to hold a 32
> bits offset.
> +Note that it limits alignment to 64GiB (offset*UNIT).
> +It may look like we have a problem given that there is only 7 Bytes of
> +padding, but given that there is only one slot, we never use
> +"number_active", so we can afford overwriting it with the extra IB.
> +
> +Also note that in most case there is some slack in large groups as well
> +because mmap allocates at a page granularity. Internally, musl works
> +with 4KiB page and simply considers larger pages as a bunch of 4KiB
> +pages. As such, the size of the slot is a multiple of those 4KiB pages
> +and not a multiple of the actual page size.
> +
> +     Large group:
> ++---------------------+
> +| meta* meta          | \
> +| int number_active:5 |  } UNIT (16B)
> +| char padding[7]     | /
> +|                     |
> +|  +---------------+  |
> +|  | extra IB (4B) |  |
> +|  +---------------+  | \
> +|  |               |  |  \
> +|  |      slot     |  |   } maplen x 4KiB
> +|  |               |  |  /
> +|  +---------------+  | /
> ++---------------------+  <- end at a 4KiB boundary
> +
> +Other than those few changes, large groups behave like small ones.
> +
> +
> +
> +    META
> +
> +Groups only have the first part of the metadata. The second part is the
> +bound meta object. There's one for each group and they each have a
> +pointer to each other.
> +
> +        Meta:
> ++--------------------+
> +| meta* prev, next   | <- make a circular buffer
> +| group* mem         |
> +| int avail_mask     | \ define 3 states for each slot
> +| int freed_mask     | /
> +| size_t last_idx:5  |
> +| size_t freeable:1  |
> +| size_t sizeclass:6 | <- size if small group
> +| size_t maplen:52   | <- size if large group
> ++--------------------+
> +
> +This is where the size of the slots is defined, using the "sizeclass"
> +field. It is linked to a global array with 48 fixed size ranging from 1
> +to 8191. These sizes are then multiplied by UNIT to get the actual size
> +in memory, hence the small/large group threshold at 128KiB. The
> +"sizeclasses" [48-62] are unused and a "sizeclass" of 63 means it's
> a large group.
> +Given the limitation of offset to 16 bits, there is a limitation of the
> +number of slots for the largest of the small groups. It is set in
> "last_idx".
> +The "avail_mask" and "freed_mask" are both bitmaps to hold the status
> +of every slot in the group. The former holds whether the slot is
> +available and never used, the latter whether the slot has been used but
> +has been freed, or is currently inactive. When one calls malloc, it'll
> +first attempt to allocate into an "avail_mask" slot. Failing this,
> +it'll use a freed slot or activate some new slots. This allocation
> +procedure is used to keep memory pages clean. A group can easily span
> +several pages. As such, it's better to fill in slots in pages that have
> +already been dirtied. This is what the "active_idx" field in group is
> +used for. When a new slot is alloc-ed in a group, all the slots
> that are in the same page get activated.
> +"maplen" is where the size of a large group is stored. It is the number
> +of 4KiB pages the group spans.
> +Then there is "freeable", which can be a bit more confusing. While it
> +obviously specifies whether the group is freeable or not, currently a
> +group is not freeable only when a new meta is allocated on zeroed memory.
> +And last, the "prev" and "next" fields. This is so we can make a
> +circular buffer out of those meta objects. This way, when one requires
> +a slot, malloc can easily go through the various metas until it finds a
> +satisfactory one. In practice it doesn't search for the optimal one but
> +just uses some heuristics like "no slot available, need to activate a
> +new slot ? take the next meta instead".
> +
> +
> +
> +    META_AREA
> +
> +We need some way to hold the memory where the meta will be allocated.
> +That's the purpose of the meta_areas.
> +
> +     Meta_area:
> ++--------------------+
> +| int64_t check      |
> +| meta_area* next    | <- form a linked list
> +| int nslot          |
> +| meta slots[]       | <- slots to store meta objects
> ++--------------------+
> +
> +These work like a simple linked list where each link is bigger than the
> +previous one. Growing the size exponentially ensures there will only be
> +a small amount of links and mmap calls. "nslot" gives the number of
> +slots where the meta object can stored in this link (≠ slots in
> +groups). Though, in practice, "nslot" seems to be currently unused and
> +the slots are tracked in the malloc_context object instead.
> +
> +
> +
> +    MALLOC_CONTEXT
> +
> +And finally, the context. This is a global singleton mainly used to
> +manage the meta objects. Some of the fields have been omitted for brevity.
> +
> +        Malloc_context:
> ++------------------------------+
> +| int64_t secret               |
> +| meta* free_meta_head         | <- head of the free meta
> +| meta* avail_meta             | <- next meta in meta_area
> +| size_t avail_meta_count      |
> +| size_t avail_meta_area_count |
> +| meta_area head, tail         |
> +| meta* active[48]             | <- circular buffer for each sizeclass
> +| size_t usage_by_class[48]    | <- number of slot for each sizeclass
> +| uint8_t unmap_seq[32]        | \ balance fragmentation/slot reuse
> +| uint8_t bounce[32]           | /
> ++------------------------------+
> +
> +"head" and "tail" allow easy access to the meta_area. Given that meta
> +slots are not tracked, we only need access to the tail of the list.
> +There is some kind of memory protection trick with the meta area. When
> +a new meta_area is created, only the first 4KiB page is made available
> +for the slots. The number of slots fitting in that page is written down
> +in "avail_meta_count". When "avail_meta_count" runs out of slots,
> +another 4KiB worth of slots are added and "avail_meta_area_count" is
> +decremented once. When it reaches 0, a new meta_area is mmap-ed. The
> +point of doing so is that the pages originally mapped are initialized
> +with no right to read nor write, and those rights are given back just
> +as the malloc needs them. "avail_meta" and "free_meta_head" are the two
> +sources of meta objects. The first is a pointer to the first free
> slot in the tail meta_area. The second is a pool of all the metas
> that have been freed.
> +This is important because once again, the meta slots are not tracked in
> +the meta area. If we weren't retaining the freed meta in this list,
> +they would be lost and it would be a memory leak. This way, we
> +prioritize reusing freed meta instead of initializing new ones. For
> +easy access, there are a circular buffer for each sizeclass. Along with
> +those there are counters of how many slots exist for any given
> sizeclass: "usage_by_class".
> +The "bounce" and "unmap_seq" seems to be used for balancing
> +fragmentation and address reuse.
> +
> --
> 2.27.0
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.