musl - Re: FYI: some observations when testing next-gen malloc

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200309171227.GY11469@brightrain.aerifal.cx>
Date: Mon, 9 Mar 2020 13:12:27 -0400
From: Rich Felker <dalias@...c.org>
To: Pirmin Walthert <pirmin.walthert@...om.ch>
Cc: musl@...ts.openwall.com
Subject: Re: FYI: some observations when testing next-gen malloc

On Mon, Mar 09, 2020 at 05:49:02PM +0100, Pirmin Walthert wrote:
> Dear Rich,
> 
> First of all many thanks for your brilliant C library.
> 
> As I do not know whether the musl mailinglist is already the right
> place to discuss the next-gen malloc module, I decided to send you
> my observations directly.

It is, so I'm cc'ing the list now.

> I'd like to mention that I am not yet entirely sure whether the
> following is a problem with the new malloc code or with asterisk
> itself but maybe you can already keep the following in the back of
> your head if someone else is reporting similar behavior with a
> different application:
> 
> We use asterisk (16.7) in a musl libc based distribution and for
> some operations asterisk forks (in a thread) the main process to
> execute a system command. When using libmallocng.so (newest version
> with "fix race condition in lock-free path of free" applied, but
> already without that change) some of these forked child processes
> will hang during a call to pthread_mutex_unlock.
> 
> Unfortunatelly the backtrace is not of much help I guess, but the
> child process always seems to hang on pthread_mutex_unlock. So
> something seems to happen with the mutex on fork:
> 
> #0  0x00007f2152a20092 in pthread_mutex_unlock () from
> /lib/ld-musl-x86_64.so.1
> No symbol table info available.
> #1  0x0000000000000008 in ?? ()
> No symbol table info available.
> #2  0x0000000000000000 in ?? ()
> No symbol table info available.
> 
> I will for sure try to dig into this further. For the moment the
> only thing I know is that I did not yet observe this on any of the
> several hundred systems with musl 1.1.23 (same asterisk version),
> not on any of the around 5 with 1.2.0 (same asterisk version, old
> malloc) but quite frequently on the two systems with 1.1.24 and
> libmallocng.so.

This is completely expected and should happen with old or new malloc.
I'm surprised you haven't hit it before. After a multithreaded process
calls fork, the child inherits a state where locks may be permanently
held. See https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html

    - A process shall be created with a single thread. If a
      multi-threaded process calls fork(), the new process shall
      contain a replica of the calling thread and its entire address
      space, possibly including the states of mutexes and other
      resources. Consequently, to avoid errors, the child process may
      only execute async-signal-safe operations until such time as one
      of the exec functions is called.

It's not described very rigorously, but effectively it's in an async
signal context and can only call functions which are AS-safe.

A future version of the standard is expected to drop the requirement
that fork itself be async-signal-safe, and may thereby add
requirements to synchronize against some or all internal locks so that
the child can inherit a working context. But the right solution here is
always to stop using fork without exec.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.