Date: Sun, 29 Jul 2018 21:26:18 +0200 From: Markus Wichmann <nullplan@....net> To: musl@...ts.openwall.com Subject: malloc implementation survey: omalloc Hi all, we discussed rewriting malloc() a while back, because, as I recall, Rich wasn't satisfied with the internal storage the current system is using (i.e. metadata is stored with the returned pointer) as well as some corner cases on lock contention, although fine grained locking is a nice feature in itself. I therefore had a look at existing malloc() algorithms, the rationale being that I thought malloc() to be a solved problem, so we only have to find the right solution. As it turns out, it appears Doug Lea was simply too successful: Many allocators follow his pattern in one way or another. But some systems buck the trend. So today I found omalloc, the allocator OpenBSD uses, in this nice repository: https://github.com/emeryberger/Malloc-Implementations Among other implementations, of course. Well, as written, the omalloc implementation there fulfills one of the criteria layed out by Rich, namely external storage (metadata saved away from the returned memory). Locking however is monolithic. Maybe something we can work on ourselves... So I thought I'd describe the algorithm as an overview and we can have a discussion on whether to pursue a similar algorithm ourselves. 1. Data Structures ================== A /region/ is a large, i.e. multiple-of-page-sized chunk of memory. A region info struct knows the region's location and size. A /chunk/ is a small power-of-two byte sized chunk of memory, but at least 16 bytes. I have no idea where that decision came from. A /chunk info struct/ contains the location of a chunk page (chunks are at most half a page in size), the size of each chunk in the page, the number of free chunks, a bitmap of free chunks and a pointer to the next like-sized chunk info struct. Global data structures are: A linked list of free chunk info structs, a linked list of chunk info structs with free elements for each possible size between 16 bytes and half a page, and a cache of 256 free regions. The most important data structure is a hash table. The hash table contains entries of two machine words, of which the first is the key, sans its lowest lb(PAGE_SIZE) bits. In essence, this means it has a key of 20 or 52 bits, one value of 12 bits and another value of one machine word. Adjust by one where appropriate for 8K archs. The hash table contains 512 entries after initialization and can only grow. The aim is to keep the load factor below 3/4 (i.e. if the number of free entries falls below a quarter of the total number of entries, the table is grown by doubling the number of total entries). I'll call the three parts page number, chunk size, and region size. 2. Algorithms ============= a. Allocation ------------- Allocation is split in two cases: Large and small. For large allocations (> 1/2 page) we only allocate a region to service the request. This means, the size is rounded up to the next page size. Then we search our region cache for a region large enough. If we find one that fits exactly, we return it. If we find a larger one, but none that fits exactly, we return the end of the region and adjust the saved size. If we find none of these, we escalate to the OS. If we did manage to find a region, we save in the hash table an entry with page number equal to the page number of the returned page, chunk size set to zero and region size set to the allocation size, and return the allocated page(s). For small allocations (<= 1/2 page) we need to allocate a chunk. For this, we round the request up to the next power of two, but at least to 16. Then we check if we have chunk info header in the slot for that size. If not, we need to allocate one, by getting a header from the list of free chunk info headers. If that is also empty, we allocate a page for it and fill it with free chunk info headers. They are constant sized. With chunk info header in hand, we allocate a page for it, then fill in the chunk info header, most importantly setting the bitmap to 1 for all valid chunks. Then we save in the global hash table the page number of the page containing the actual memory we'll return, as chunk size the binary log of the size of each chunk, and as region size a pointer to the chunk info header. Allocation of a chunk then means finding a one-bit, setting it to zero and returning the corresponding pointer in the page the header is pointing to. b. Freeing ---------- The page number of the pointer is looked up in the hash table. If not found, the pointer is invalid. If found, then we need to look at the chunk size portion of the hash table entry. If that is zero then we need to free a region, else we need to free a chunk. The free a region, we remove the entry from the hash table, then add the region to our cache using a very weird heuristic. In any case, any entry thrown out of the cache in the process, as well as the current region, should it not be added to the cache, will be unmapped. To free a chunk, we set the corresponding bit and increase the counter of free chunks in the chunk info header, whose address we have from the repurposed region size part of the hash table entry. If this set the number of free chunks to one, we add the chunk info header to the list of chunk info headers with free chunks for the given size. Also, if now every chunk in the page is free, we remove the chunk info header from the bucket list, add it to the list of free chunk info headers and unmap the page. c. Reallocation --------------- If both the old and new allocation size are larger than half a page, we try to remap. Else have not much choice but to allocate a new block and copy everything. d. Memory donation ------------------ Entire pages can be added to the cache of free regions. Smaller blocks can be added to the list of free chunk info headers. We have no use for even smaller blocks. 3. Critique =========== As written in the above repo, omalloc pulls a global lock around each operation. This is probably unaccaptable. Also, while the implementation offers a high degree of customizability, it uses syscall upon syscall. Most of which can probably be removed. The hash table uses linear probing to resolve conflicts, albeit backwards. According to wikipedia this encourages primary clustering, i.e. used entries tend to clump together. Also, the hash table growth algorithm isn't realtime; it allocates the new table and completely re-keys all entries from the old one. This means in the worst case, allocation has unbounded run-time, though this amortizes. Let's talk parallelism: The hash table can be secured with a lock. Both allocation and deallocation need to write to the table, so a mutex will have to do. As for the chunk info header lists, they could all be implemented in a lock-free way. Unfortunately this means that a chunk info header can disappear from one list for a moment and re-appear in another a moment later. Which can lead to a thread seeing a list empty which actually isn't. Protecting all lists with the same lock would solve that issue but reduce parallelism more. Finally, the free page cache... could be protected with a different lock. We also need a lock for each chunk info header. Unless we implement the bitmap in a lock-free way, but that means that bitmap and "number of free chunks" marker don't need to be consistent anymore. Decisions, decisions... Finally, the run time. All external storage solutions require iteration over structures. At the moment, in the single threaded case, our malloc requires merely the removal of a single element from a list. omalloc, on the other hand, for small allocations always requires an iteration over a bitmap to find the one that is set. And for large allocations always requires searching the free page cache, or a syscall. Or both, in the expensive case. So, is this a direction to pursue, or should we look further? Ciao, Markus
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.