musl - Re: Re: The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220920054149.GK9709@brightrain.aerifal.cx>
Date: Tue, 20 Sep 2022 01:41:50 -0400
From: Rich Felker <dalias@...c.org>
To: baiyang <baiyang@...il.com>
Cc: musl <musl@...ts.openwall.com>
Subject: Re: Re: The heap memory performance (malloc/free/realloc) is
 significantly degraded in musl 1.2 (compared to 1.1)

On Tue, Sep 20, 2022 at 11:53:52AM +0800, baiyang wrote:
> > The ones that return some value larger than the requested size are
> > returning "the requested size, rounded up to a multiple of 16" or
> > similar. Not "the requested size plus 1500 bytes". 
> ...
> > They don't return 8100. They return something like 6608 or 6624.
> 
> No, AFAIK, There are many allocators whose return value of
> malloc_usable_size is 1KB (or more) larger than the requested value
> at malloc time.
> For Example: if you do "void* p = malloc(6700)" on tcmalloc, then
> "malloc_usable_size(p)" will return **8192**. Far more than just
> "rounded up to a multiple of 16".

OK, thanks for checking and correcting.

> > This does not follow at all. tcmalloc is fast because it does not have
> > global consistency, does not have any notable hardening, and (see the
> > name) keeps large numbers of freed slots *cached* to reuse, thereby
> > using lots of extra memory. Its malloc_usable_size is not fast because
> > of returning the wrong value, if it even does return the wrong value
> > (I have no idea). 
> 
> We don't need to refer to these features of tcmalloc, we only need
> to refer to its malloc_usable_size algorithm.

Those (mis)features are what provide a fast path here, regardless of
whether you care about them.

> > It's fast because they store the size in-band right
> > next to the allocated memory and trust that it's valid, rather than
> > computing it from out-of-band metadata that is not subject to
> > falsification unless the attacker already has nearly full control of
> > execution.
> 
> No, if I understand correctly, tcmalloce doesn't store the size
> in-band right next to the allocated memory. On the contrary, when
> executing malloc_usable_size(p) (actually GetSize(p)), it will first
> find the size class corresponding to p through a quick lookup table,
> and then return the length of the size class. See:
> https://github.com/google/tcmalloc/blob/9179bb884848c30616667ba129bcf9afee114c32/tcmalloc/tcmalloc.cc#L1099

OK, I was confusing tcmalloc with the more conventional "thread-local
freelist caching on top of dlmalloc type base" allocator strategy.
Indeed tcmalloc however is one of the gigantic ones.

> My understanding: the biggest impediment to our inability to apply
> similar optimizations is that we have to return 6700, not 8192 (of
> course, you've denied this is the reason).

Your understanding is wrong. I've told you how you can measure that
it's wrong. You insist on being stuck on it for no good reason.

If you want to understand *why* tcmalloc is different, start with the
comments at the top of the file you linked:

> //  4. The pagemap (which maps from page-number to descriptor),
> //     can be read without holding any locks, and written while holding
> //     the "pageheap_lock".
> //
> //     This multi-threaded access to the pagemap is safe for fairly
> //     subtle reasons.  We basically assume that when an object X is
> //     allocated by thread A and deallocated by thread B, there must
> //     have been appropriate synchronization in the handoff of object
> //     X from thread A to thread B.

This is the kind of thing I mean by lack of global consistency (no
synchronization around access to these data structures) and lack of
any meaningful hardening (*assuming* no memory lifetime usage errors
in the calling application).

The GetSize function you cited uses this global pagemap to go straight
from a page address to a sizeclass, via what amounts to a two-level or
three-level table indexed by upper bits of the address (comment says
3-level is only used in slower but lower-mem-use configuration). These
tables, at least in the 2-level form, are utterly *massive*. I'm not
sure if it creates them PROT_NONE and then only instantiates real
memory for the (initially fairly sparse) parts that get used, or if it
just allocates these giant things relying on overcommit, but either
way that's not compatible with small-memory-space systems or with
nommu.

On top of that, this approach relies on laying out whole pages (likely
large slabs of many pages at a time) of identical-sized objects so
that the size and other properties can be looked up by page number. I
have not looked into the details of "how bad" it gets, but it
completely precludes having any small processes, and precludes
promptly returning freed memory to the system, since *changing* the
pagemap is going to be costly and they're going to avoid doing it
(note the above comment on locking).

mallocng does not have any global mapping optimizing translation from
addresses to groups/metadata objects. Because we insist on global
consistency (a prerequisite for being able to define strong hardening
properties) and on being able to return freed memory promptly to the
system, maintaining such a data structure would cost a lot more time
(performance) than anything it could give, and it would make lock-free
operations (like your malloc_usable_size, or trivial realloc calls)
potentially require locking.

Instead of using the numeric value of the address to map to metadata,
we chase offsets from the object base address to the metadata, then
validate that it round-trips back to conclude that we didn't just
follow random junk from the caller passing an invalid/dangling pointer
in, or from this data being overwritten via heap-based buffer
overflows.

Fundamentally, this pointer chasing is going to be a little bit more
expensive than just using address bits as table indices, but not
really all that much. At least half of the cost difference, and
probably a lot more, is not the pointer/offset chasing but the
validation (hardening). If hypothetically you wanted to turn that all
off (e.g. by defining the assert macro to a no-op) you could have it
be a lot faster, and still have low memory usage too. I'm not sure but
for single-threaded loads I would not be surprised if it were getting
close to tcmalloc speed. Just casually building with assert() defined
to nop out the tests, I got double performance on your TEST2. Of
course, I don't recommend doing this. But it's an interesting test for
performing *measurement* (which you so far refuse to do) of what's
actually making the performance differences.

> On the other hand, if the low speed is not caused by having to
> return 6700, then we should be able to use a similar quick lookup
> table optimization ("tc_globals.pagemap().sizeclass(p)") to achieve
> at least dozens of times performance improvement.

Once again, the big difference is not the "6700". The
tc_globals.pagemap().sizeclass(p) in tcmalloc corresponds to lines
6-11 of malloc_usable_size in mallocng, not line 12, and the bulk of
the work here is in lines 6-11, mainly lines 7 and 10. I did a similar
casual test removing line 12 and just returning something based on the
earlier computations, and it made something like a 30% reduction in
test run time (with or without the hardening asserts nopped out).

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.