musl - Re: The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1)

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220919134319.GN9709@brightrain.aerifal.cx>
Date: Mon, 19 Sep 2022 09:43:19 -0400
From: Rich Felker <dalias@...c.org>
To: baiyang <baiyang@...il.com>
Cc: musl <musl@...ts.openwall.com>
Subject: Re: The heap memory performance (malloc/free/realloc) is
 significantly degraded in musl 1.2 (compared to 1.1)

On Mon, Sep 19, 2022 at 03:53:30PM +0800, baiyang wrote:
> Hi there,
> 
> As we have discussed at
> https://github.com/openwrt/openwrt/issues/10752. The
> malloc_usable_size() function in musl 1.2 (mallocng) seems to have
> some performance issues.
> 
> It caused realloc and free spends too long time for get the chunk size.
> 
> As we mentioned in the discussion, tcmalloc and some other
> allocators can also accurately obtain the size class corresponding
> to a memory block and its precise size, and it is also very fast at
> the same time.
> 
> Can we make some improvements to the existing malloc_usable_size
> algorithm in mallocng? This should significantly improve the
> performance of existing algorithms.

Can you please start from a point of identifying the real-world case
in which you're hitting a performance degredation? Made-up tests are
generally not helpful and will almost always lead to focusing on the
wrong problem.

For now I'm going to focus on some things from the linked thread:

> > Considering that realloc itself contains a complete
> > malloc_usable_size (refer to here and here), So actually most
> > (66.7%) of the realloc time is spent doing malloc_usable_size.

In your test that increments the realloc size by one each iteration,
only one in every PAGESIZE calls has any real work to do. The rest do
nothing but set_size after obtaining the metadata on the object
they're acting on. It's completely expected that the runtime of these
will be dominated by obtaining the metadata; this isn't evidence of
anything wrong. And, moreover, it's almost surely a lot more than
66.7%. Most of the 0.8s difference is likely spent on the 2560 mmap
syscalls and page faults accessing the new pages they produce.

> > In implementations such as: glibc, tcmalloc, msw crt (_msize), mac
> > os x (malloc_size), and musl 1.1, even on low-end embedded
> > processors, the consumption of malloc_usable_size per 10 million
> > calls is mostly not more than a few hundred milliseconds.

It looks like mallocng's malloc_usable_size is taking around 150 ns
per call on your system, vs maybe 30-50 for others?

> > In addition, this very slow slab size acquisition algorithm also
> > needs to be called every time free (see here). So we believe it
> > should be the main reason for malloc/free and realloc performance
> > degradation in version 1.2.

Unless you have an application that's explicitly using
malloc_usable_size all over the place, it's highly unlikely that this
is the cause of your real-world performance problems. The vast
majority of reported problems with malloc performance have been in
multithreaded applications, where the dominating time cost is
fundamental: synchronization cost of having global consistency. There
you'll expect to find very similar performance figures from any other
allocator with global consistency, such as hardened_malloc.

If you're really having single-threaded performance problems that
aren't just in made-up benchmarks, please see if you can narrow down
the cause empirically rather than speculatively. For example, running
the program under perf and looking at where the time is being spent.

> > If we can improve its speed and make it close to implementations
> > like tcmalloc (tcmalloc can also accurately return the size of the
> > size class to which the chunk belongs), it should significantly
> > improve the performance of mallocng (at least in single-threaded
> > scenarios) .

tcmalloc is fast by not having global consistency, not being hardened
against memory errors like double-free and use-after-free, and not
avoiding fragmentation and excessive memory usage. Likewise for most
of the others. The run time costs in mallocng for looking up the
out-of-band metadata are largely fundamental to it being out-of-band
(not subject to direct falsification via typically exploitable
application bugs), size-efficient, 32-bit-compatible,
nommu-compatible, etc. Other approaches like in hardened_malloc can be
moderately more efficient to access the metadata, at the price of not
being at all amenable to small systems, which are a core goal of musl
we can't really disregard.

I can't say for sure there's not any room for optimization in the
metadata fetching though. Looking at the assembly output might be
informative, to see if we're doing anything that's making the compiler
emit gratuitously inefficient code.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.