musl - Re: Deadlock in dynamic linker?

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aDXvYJ6S4WBo63N0@voyager>
Date: Tue, 27 May 2025 18:59:12 +0200
From: Markus Wichmann <nullplan@....net>
To: musl@...ts.openwall.com
Subject: Re: Deadlock in dynamic linker?

Am Tue, May 27, 2025 at 11:20:07AM -0400 schrieb Rich Felker:
> On Sat, May 24, 2025 at 07:45:45AM +0200, Markus Wichmann wrote:
> > I'm thinking something like this: Thread A initializes liba.so. liba.so
> > has initializers and finalizers, so thread A adds liba.so to the fini
> > list before calling the initializers. The liba initializer calls
> > dlopen("libb.so"). libb.so also has initializers.
> > 
> > While thread A is not holding the init_fini_lock, thread B calls exit().
> > That progresses until __libc_exit_fini() sets shutting_down to 1. Then
> > it tries to destroy all the libraries, but the loop stops when it comes
> > to liba.
> > 
> > liba.so has a ctor_visitor, namely thread A, so thread B cannot advance.
> > Thread A meanwhile is hanging in the infinite wait loop trying to
> > initialize libb.so. The situation cannot change, and the process hangs
> > indefinitely.
> 
> I see. In particular you're assuming the dlopen of libb happened after
> the exit started.
> 

I had completely neglected to look at the global ldso lock, actually.
But looking at it again, I am actually assuming that the dlopen() is
*starting* before the __libc_exit_fini() (so that thread B hangs waiting
for the lock), but that thread B then overtakes thread A between the
latter's release of the global lock and the taking of the init_fini_lock.

This does mean that taking the init_fini_lock before releasing the
global lock would entirely prevent the issue. Not sure if that's
acceptable, though.

> > A simple way out of this pickle could be to add liba.so to the fini list
> > only after it was initialized. That way, thread B cannot hang on it, or
> > more generally, the finalizing thread cannot be halted by an incomplete
> > initialization in another thread. This might change the order of nodes
> > on the fini list, but only to account for dynamic dependencies. Isn't
> > that a good thing?
> 
> No, I think it's non-conforming, and also unsafe, as it can result in
> failure to run a dtor for something whose ctor already ran but did not
> finish. This is a worse outcome than a deadlock in a situation that's
> arguably undefined to begin with.
> 

But __libc_exit_fini() refuses to destroy libraries that haven't been
constructed completely. If p->constructed is zero, a node is skipped
even if it is on the fini list. And that flag is set in do_init_fini()
only after all constructors have returned.

By adding nodes to the fini list only when construction has finished, we
would also get rid of the need to have __libc_exit_fini() wait for
construction to finish first.

> What might be acceptable, though, is moving the setting of
> shutting_down to take place after the last dtor is peeled off the
> list. However, this probably requires splitting shutting_down into two
> variables, due to lock order issues. The value is needed under the
> global ldso lock in dlopen() to make dlopen return with an error if
> exit has already begun (this one should be kept before the dtor loop,
> I think), and the value is needed in do_init_fini to block execution
> of new ctors (this one should only take effect after all dtors have
> been run).
> 
> Does that sound right?
> 

After __libc_exit_fini() has run its course, there is no need to record
anything, because it keeps the init_fini_lock. Anything that could query
shutting_down at that point would already hang taking that lock.

Ciao,
Markus
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.