Message-ID: <1119627917.236795.1757086092561@webmail-oxcs.register.com>
Date: Fri, 5 Sep 2025 11:28:12 -0400 (EDT)
From: Greg McPherran <gm@...herranweb.com>
To: Rich Felker <dalias@...c.org>, Markus Wichmann <nullplan@....net>
Cc: musl@...ts.openwall.com
Subject: Re: MUSL - malloc - Multithread Suggestion

> fragmentation

Logical. A thread-local tl_malloc could be a coding strategy for some
application scenarios; however, I see that the general issue remains open.

> mallocng's design is such that, in theory, it could use thread-local
> active groups for each size class,

Interesting.

I've experimented with Alpine on various machines, including on a server.
Alpine is my favorite (and I have test-driven many distributions). The
discussion of multi-thread performance has been a minor cause of pause
for me. However, I am gathering that the multithread performance comments
about Alpine (musl) are not an issue for musl to resolve; they are an
issue for coders to resolve by managing memory with thread-local
considerations in mind as part of application strategy. I actually
already use that strategy; for example, I use thread-local memory pools
for same-type objects.

I propose that it's fair to say that multithread apps that exhibit
reduced performance on Alpine do not expose an issue with musl, but
rather an issue with the apps' code design.

Greg McPherran
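To make the object-handoff constraint discussed in the quoted thread
below concrete: in C, an allocation may outlive the thread that made it,
so a per-thread arena cannot simply be torn down at thread exit. A
minimal sketch of the pattern, using plain pthreads (illustrative only,
nothing musl-specific):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* The producer allocates and exits; the main thread uses and frees
     * the object afterwards. If malloc had carved the object out of a
     * thread-local arena destroyed at thread exit, the use and the free
     * below would touch dead memory. */
    static void *producer(void *arg)
    {
        (void)arg;
        int *p = malloc(sizeof *p);  /* imagine: from this thread's arena */
        if (p) *p = 42;
        return p;                    /* ownership passes to the joiner */
    }

    int main(void)
    {
        pthread_t t;
        void *p;
        if (pthread_create(&t, 0, producer, 0)) return 1;
        pthread_join(t, &p);         /* producer thread is gone now */
        if (p) printf("%d\n", *(int *)p);
        free(p);                     /* must still be valid to free */
        return 0;
    }

Any thread-local allocator design has to keep the producer's arena (or
at least this slot) valid until the final free, regardless of which
thread performs it.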
> On 09/04/2025 2:14 PM EDT Rich Felker <dalias@...c.org> wrote:
> 
> On Thu, Sep 04, 2025 at 07:29:50PM +0200, Markus Wichmann wrote:
> > Am Thu, Sep 04, 2025 at 07:33:10AM -0400 schrieb Greg McPherran:
> > > Hi, perhaps a thread_local (e.g. C23) memory pool would separate
> > > malloc cleanly for each thread, with no performance issue, e.g.
> > > mutex etc?
> > 
> > The idea is well known, and a library that implements it can be found
> > under the name tcmalloc.
> > 
> > The thread_local keyword, however, cannot be used, for a few reasons.
> > One is general: you must keep a thread-local arena alive for as long
> > as it has active allocations, since in C, one thread can allocate an
> > object and give it to another thread. So automatic deallocation when
> > the thread ends is out of the question.
> > 
> > The other is technical: the keyword is implemented on Linux basically
> > the same as the __thread extension keyword, which in the end uses ELF
> > TLS. But musl cannot at the moment use ELF TLS for itself in dynamic
> > linking mode, because the dynamic loader is not set up for that.
> > Fixing that would require making the stage 2 relocation of the linker
> > skip TLS relocations as well, and also not using any code that uses
> > TLS until the stage 3 relocation has happened. Putting TLS in the
> > allocator is therefore not an option at all, since we need an
> > allocator to get to the stage 3 relocation.
> 
> These are all technical details that have no bearing on whether it
> could be done, just on how it would be done if it were. The only
> reason TLS can't be used inside musl is that we don't have a need for
> it, and lack of need has informed lack of implementation. If our
> malloc needed thread-local state, it would just be a pointer inside
> our struct __pthread. No need to make a meal of this; it's not the
> issue at hand.
> 
> > I also must correct one thing: because allocations can be shared,
> > even with thread-local arenas the implementation needs locking, since
> > other threads might be concurrently freeing an object in the same
> > arena. But there should be less contention, yes.
> 
> Yes, there is a fundamental need for synchronization unless free is a
> no-op. But beyond that, the amount of synchronization is a tradeoff
> between performance and one or both of memory consumption and
> hardening. At a very basic level, if thread B doesn't know that thread
> A has freed space it could reuse, thread B is going to have to go get
> its own, and this badness scales rapidly with the number of threads,
> especially when you're allocating moderate to large numbers of
> same-size slots together, which is necessary to avoid bad
> fragmentation cases.
> 
> One way the inefficiency of having entire per-thread arenas can be
> mitigated is the glibc fastbins approach -- not actually needing
> separate heaps, but just keeping recently-freed memory slots in an
> intrinsic linked list for the thread to reuse. This is a huge
> hardening/security tradeoff, because these list pointers can be
> corrupted following most UAF and heap-based overflow bugs to seize
> control of execution flow.
> 
> The mallocng allocator was designed to favor very low memory overhead,
> low worst-case fragmentation cost, and strong hardening over
> performance. This is because it's much easier and safer to opt in to
> using a performance-oriented allocator for the few applications that
> do ridiculous things with malloc, making it a performance bottleneck,
> than to opt out of trading safety for performance in every basic
> system utility that doesn't hammer malloc.
> 
> mallocng's design is such that, in theory, it could use thread-local
> active groups for each size class, potentially even activating that
> behavior only when there's high contention on a given one. This
> possibility has not been explored in depth, and it's not clear what
> the gains would be with free still being synchronized (which is
> atomics plus barriers, not a lock).
> 
> Rich
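For reference, a rough sketch of the fastbin-style idea Rich contrasts
against: freed slots of one size class are threaded onto a per-thread
intrusive singly linked list, reusing the freed slot's own first bytes
as the next pointer. The names (tl_cache, cache_free, cache_alloc) are
hypothetical, not glibc's or musl's; the comments mark the hardening
hazard he describes:

    #include <stddef.h>

    /* One freed slot doubles as a list node: the first bytes of the
     * dead object hold the "next" pointer (slots must be at least one
     * pointer in size). */
    struct slot { struct slot *next; };

    /* Per-thread cache head for one size class; no lock is needed
     * because only the owning thread touches it. */
    static _Thread_local struct slot *tl_cache;

    static void cache_free(void *p)
    {
        struct slot *s = p;
        s->next = tl_cache;   /* list pointer lives inside freed memory */
        tl_cache = s;
    }

    static void *cache_alloc(void)
    {
        struct slot *s = tl_cache;
        if (!s) return NULL;  /* empty: caller falls back to real malloc */
        /* Hazard: if a UAF or overflow corrupted s->next, an arbitrary
         * attacker-controlled pointer gets handed out here. */
        tl_cache = s->next;
        return s;
    }

A use-after-free write or heap overflow that lands on a cached slot
overwrites next, so a later allocation of that size can return an
attacker-chosen address -- exactly the control-flow risk weighed above
against the contention savings.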