Date: Tue, 31 Jul 2018 09:49:30 +0200
From: Markus Wichmann <nullplan@....net>
To: musl@...ts.openwall.com
Subject: Re: malloc implementation survey: omalloc

On Mon, Jul 30, 2018 at 11:09:05AM -0400, Christopher Friedt wrote:
> I haven't looked at omalloc, but I wrote a dead simple buddy allocator
> for Bionic some time ago with support for malloc(3), calloc(3), realloc(3),
> free(3), and posix_memalign(3). It would obviously also support
> aligned_alloc(3) for C11.
> 

Most allocators can fit these requirements. omalloc is capable of
delivering naturally aligned chunks of memory for everything up to a
page. Beyond that, we'd have to trick Linux's virtual memory allocator
into giving us the alignment we need, typically via the old "allocate
size+align bytes" trick, or rather size+align pages in this case.
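
For illustration, a minimal sketch of that trick, assuming align is a
power of two and that size and align are both multiples of the page
size (the function name and structure are mine, not omalloc's):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Sketch of the "allocate size+align bytes" trick for alignments
 * beyond a page: over-allocate, then return the unused head and
 * tail pages to the kernel. Assumes align is a power of two and
 * that size and align are multiples of the page size. */
static void *alloc_aligned_pages(size_t size, size_t align)
{
    size_t span = size + align;
    char *base = mmap(0, span, PROT_READ|PROT_WRITE,
                      MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return 0;

    uintptr_t lo = (uintptr_t)base;
    uintptr_t start = (lo + align - 1) & ~(uintptr_t)(align - 1);
    uintptr_t end = start + size;

    if (start > lo)
        munmap(base, start - lo);              /* trim unused head */
    if (end < lo + span)
        munmap((void *)end, lo + span - end);  /* trim unused tail */

    return (void *)start;
}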

> It ran well on everything from an ARM Cortex-M0 to an Intel Core i7. The
> code compiled to 504 .text bytes on Cortex-M0, IIRC.
> 
> I wrote it originally for using on the kernel-side of an rtos, but it could
> easily be extended to make a syscall when a process runs out of ram.
> 
> Obviously, a shortcoming is that the memory blocks must be PoT and there is
> the potential for fragmentation. Otherwise though, the meta-information is
> intrinsic within the pointer, which obviates the need for external
> storage.
> 

That does sound interesting, but I don't really follow. The meta-info
which we need to save is the length of the object returned. How can you
encode that in the pointer without any external storage? Using internal
storage? Yeah, that is where we are right now, and that means there's
only one buffer overflow between us and oblivion at any given time.

Or do you essentially have one large block of memory, with one area for
the 16-byte objects, one for the 32-byte ones, &c.? But then what do you
do when a given area runs out and the application still wants more?

Just so we don't talk past each other: External storage in this case
means the meta-data is saved away from the returned memory; ideally,
away from all user-visible memory, so that only a rampaging memory
corruption issue can destroy it. Internal storage means the information
is saved close to the returned memory.
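
For concreteness, here is one way the size could be a pure function of
the pointer, assuming segregated power-of-two regions; this is my guess
at the kind of scheme Christopher describes, not anything omalloc does,
and all names are made up:

#include <stddef.h>

/* Hypothetical layout: one big reservation carved into fixed regions,
 * one region per power-of-two size class. A chunk's size can then be
 * derived from its address alone: no header, no external table. */
#define MIN_SHIFT   4                 /* smallest chunk: 16 bytes */
#define REGION_SIZE (1UL << 20)       /* 1 MiB reserved per size class */

static char *heap_base;               /* set once at startup, e.g. via mmap() */

static size_t chunk_size(void *p)
{
    size_t cls = (size_t)((char *)p - heap_base) / REGION_SIZE;
    return (size_t)1 << (MIN_SHIFT + cls);
}

The obvious catch is exactly the question above: each size class only
has its fixed region, so something extra is needed once a region runs
out.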

> Every call takes approximately the same average time, which is about
> log2(depth) for binary trees.
> 
> Objectively speaking though, in terms of choosing a new malloc
> implementation, the best one should be based on algorithmic complexity, but
> also size and speed metrics.
> 
> C

That much is clear as well. As I understood it, Rich didn't merely want
to improve the existing malloc, though; he wanted to find a new
implementation, ideally one with some desirable properties, namely
resilience against buffer overflows and fine-grained locking for good
multi-thread performance. omalloc has one of these, and speed can be
optimized.

Reading omalloc is complicated by the sheer amount of customizability
in the algorithm. For some reason, they thought it would be a good idea
to have a file, /etc/malloc.conf, be a symlink that does not point to a
file, but rather carries the configuration string as its target. Ugly.
But it means they can read it in a single syscall.
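
Presumably that read amounts to something like the following (a sketch
from the description above, not the verbatim OpenBSD code):

#include <unistd.h>

/* /etc/malloc.conf is a symlink whose *target* is the option string,
 * so a single readlink() fetches the whole configuration. */
static void read_malloc_conf(char *buf, size_t len)
{
    ssize_t n = readlink("/etc/malloc.conf", buf, len - 1);
    if (n < 0)
        n = 0;           /* no symlink: empty configuration */
    buf[n] = 0;          /* readlink() does not NUL-terminate */
}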

I still have no idea how many calls to the RNG are there for security,
and how many are necessary to the algorithm. For instance, when a chunk
has to be handed out, they don't return the first free chunk, they
return a random free chunk: draw a random number between 0 and the
number of free chunks, then look for that many 0-bits in the bitmap.
Now, is that done to reduce the predictability of the returned address,
to reduce memory fragmentation, or for some other reason? I really
can't tell.
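
To make sure we are talking about the same operation, here is roughly
what that selection looks like; the bitmap layout and the RNG helper
are mine, not omalloc's:

#include <stddef.h>
#include <stdint.h>

uint32_t rng_below(uint32_t bound);   /* placeholder for the allocator's RNG */

/* Pick a random free chunk: draw an index below the number of free
 * chunks, then walk the bitmap (1 = used) until that many free slots
 * have been skipped. */
static size_t pick_random_free(const uint8_t *bitmap, size_t nchunks,
                               size_t nfree)
{
    uint32_t skip = rng_below((uint32_t)nfree);   /* 0 .. nfree-1 */
    for (size_t i = 0; i < nchunks; i++) {
        if (!(bitmap[i / 8] >> (i % 8) & 1)) {
            if (skip == 0)
                return i;             /* this free chunk is chosen */
            skip--;
        }
    }
    return (size_t)-1;                /* unreachable if nfree was correct */
}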

Now, the applications that tax the allocator more than any others are OO
applications that constantly allocate and free small objects, Firefox
and the like. But those also use multiple threads, so having an arena
allocator would be nice, and in fact, including libtcmalloc did speed up
Firefox a lot; libtcmalloc essentially just puts thread-local caches in
front of a central allocator. We could modify omalloc like this as well.
Not sure how, though. Have multiple hash tables? But then how do we
locate the one that contains the object we're supposed to be freeing?
Iterate over all of them? That's not going to do run-time any favors.
And how do we extend the list of hash tables? And, more importantly, how
do we shrink it when a thread exits?
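
For the sake of discussion, one common shape such a modification takes
is a fixed set of arenas with a thread-local pointer to "ours", where
free() finds the owning arena by address range rather than by searching
every hash table. All names below are made up, and none of this is a
concrete proposal for omalloc:

#include <stddef.h>
#include <stdint.h>

#define NARENAS 8

struct arena {
    char  *base;           /* start of this arena's address range */
    size_t span;           /* length of that range */
    /* ... per-arena hash table, free lists, lock ... */
};

static struct arena arenas[NARENAS];
static __thread struct arena *my_arena;   /* picked on a thread's first malloc() */

/* free() path: route the pointer to the arena whose range contains it,
 * instead of iterating over every arena's hash table. */
static struct arena *arena_of(void *p)
{
    uintptr_t a = (uintptr_t)p;
    for (int i = 0; i < NARENAS; i++) {
        uintptr_t lo = (uintptr_t)arenas[i].base;
        if (a >= lo && a < lo + arenas[i].span)
            return &arenas[i];
    }
    return 0;              /* not ours, or not yet mapped */
}

That still leaves the shrinking question open, of course: with a fixed
arena count there is nothing to shrink, which dodges the question rather
than answering it.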

What I am getting at is this: we can't just benchmark a few algorithms
and stick with the fastest one. Nor is a simple algorithmic analysis
enough, since simple lists can often be replaced with more complicated
containers that yield better performance for some loads. It's
complicated...

Ciao,
Markus
