Message-ID: <20250527182614.GP1827@brightrain.aerifal.cx>
Date: Tue, 27 May 2025 14:26:15 -0400
From: Rich Felker <dalias@...c.org>
To: Markus Wichmann <nullplan@....net>
Cc: musl@...ts.openwall.com
Subject: Re: Deadlock in dynamic linker?

On Tue, May 27, 2025 at 06:59:12PM +0200, Markus Wichmann wrote:
> On Tue, May 27, 2025 at 11:20:07AM -0400, Rich Felker wrote:
> > On Sat, May 24, 2025 at 07:45:45AM +0200, Markus Wichmann wrote:
> > > I'm thinking something like this: Thread A initializes liba.so. liba.so
> > > has initializers and finalizers, so thread A adds liba.so to the fini
> > > list before calling the initializers. The liba initializer calls
> > > dlopen("libb.so"). libb.so also has initializers.
> > > 
> > > While thread A is not holding the init_fini_lock, thread B calls exit().
> > > That progresses until __libc_exit_fini() sets shutting_down to 1. Then
> > > it tries to destroy all the libraries, but the loop stops when it comes
> > > to liba.
> > > 
> > > liba.so has a ctor_visitor, namely thread A, so thread B cannot advance.
> > > Thread A meanwhile is hanging in the infinite wait loop trying to
> > > initialize libb.so. The situation cannot change, and the process hangs
> > > indefinitely.
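
For concreteness, a reproduction along these lines might look like the
following (file names, the sleep to widen the race window, and the
timing are all illustrative, and whether it actually hangs is
timing-dependent):

/* liba.c: cc -shared -fPIC liba.c -o liba.so */
#include <dlfcn.h>
#include <unistd.h>
__attribute__((constructor))
static void liba_init(void)
{
	sleep(1);                    /* widen the race window */
	dlopen("libb.so", RTLD_NOW); /* thread A now waits to run libb's ctors */
}

/* libb.c: cc -shared -fPIC libb.c -o libb.so */
__attribute__((constructor))
static void libb_init(void)
{
}

/* main.c: cc main.c -o main (add -ldl -lpthread where needed) */
#include <dlfcn.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
static void *thread_a(void *arg)
{
	dlopen("liba.so", RTLD_NOW); /* runs liba's ctor */
	return 0;
}
int main(void)
{
	pthread_t a;
	pthread_create(&a, 0, thread_a, 0);
	usleep(100000); /* let thread A get into liba's ctor */
	exit(0);        /* thread B: __libc_exit_fini() finds liba with a
	                 * ctor_visitor and waits; thread A sees shutting_down
	                 * and waits; neither can proceed */
}
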
> > 
> > I see. In particular you're assuming the dlopen of libb happened after
> > the exit started.
> > 
> 
> I had completely neglected to look at the global ldso lock, actually.
> But looking at it again, I am actually assuming that the dlopen() is
> *starting* before the __libc_exit_fini() (so that thread B hangs waiting
> for the lock), but that thread B then overtakes thread A between
> thread A's release of the global lock and its taking of the
> init_fini_lock.
> 
> This does mean that taking the init_fini_lock before releasing the
> global lock would entirely prevent the issue. Not sure if that's
> acceptable, though.
> 
> > > A simple way out of this pickle could be to add liba.so to the fini list
> > > only after it was initialized. That way, thread B cannot hang on it, or
> > > more generally, the finalizing thread cannot be halted by an incomplete
> > > initialization in another thread. This might change the order of nodes
> > > on the fini list, but only to account for dynamic dependencies. Isn't
> > > that a good thing?
> > 
> > No, I think it's non-conforming, and also unsafe, as it can result in
> > failure to run a dtor for something whose ctor already ran but did not
> > finish. This is a worse outcome than a deadlock in a situation that's
> > arguably undefined to begin with.
> > 
> 
> But __libc_exit_fini() refuses to destroy libraries that haven't been
> constructed completely. If p->constructed is zero, a node is skipped
> even if it is on the fini list. And that flag is set in do_init_fini()
> only after all constructors have returned.

p->constructed being zero can only happen and mean "incompletely
constructed" in the case where visitor is self (call to exit from
your own ctor). It's not a condition you can encounter from
concurrency, since in that case you would not get past the condvar
wait due to there being a visitor.
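
Roughly, the relevant loop in do_init_fini looks like this (a
paraphrase for illustration, not the verbatim source):

pthread_mutex_lock(&init_fini_lock);
for (i=0; (p=queue[i]); i++) {
	/* Anyone but the visitor blocks here, so no concurrent thread
	 * can observe a dso whose ctors have started but not finished. */
	while ((p->ctor_visitor && p->ctor_visitor!=self) || shutting_down)
		pthread_cond_wait(&ctor_cond, &init_fini_lock);
	if (p->ctor_visitor || p->constructed)
		continue;
	p->ctor_visitor = self;
	/* ...add p to the fini list if it has dtors, drop the lock,
	 * run the ctors, retake the lock... */
	p->ctor_visitor = 0;
	p->constructed = 1;
	pthread_cond_broadcast(&ctor_cond);
}
pthread_mutex_unlock(&init_fini_lock);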

> > What might be acceptable, though, is moving the setting of
> > shutting_down to take place after the last dtor is peeled off the
> > list. However, this probably requires splitting shutting_down into two
> > variables, due to lock order issues. The value is needed under the
> > global ldso lock in dlopen() to make dlopen return with an error if
> > exit has already begun (this one should be kept before the dtor loop,
> > I think), and the value is needed in do_init_fini to block execution
> > of new ctors (this one should only take effect after all dtors have
> > been run).
> > 
> > Does that sound right?
> 
> After __libc_exit_fini() has run its course, there is no need to record
> anything, because it keeps the init_fini_lock. Anything that could query
> shutting_down at that point would already hang taking that lock.

Indeed, that sounds right.
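
Concretely, the split could look something like this (hypothetical
names, just to pin the idea down):

/* Sketch only; names are illustrative. */
static int exit_started;  /* set before the dtor loop; read under the
                           * global ldso lock so dlopen() can fail with
                           * an error once exit has begun */
static int ctors_blocked; /* set only after the last dtor is peeled off
                           * the fini list; read in do_init_fini() to
                           * block execution of new ctors */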

However, I'm beginning to doubt this solution really works. The
problem is that, if A depends on B, A's ctor can be waiting to run
until B's ctor finishes. But if a concurrent exit calls B's dtor, then
A's ctor is no longer free to run, because it depends on B being
constructed. We really do need to block execution of any ctor whose
deps may already have been destructed.
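
As a hypothetical illustration (libraries and symbols invented for the
example):

/* libb.c: sets up state in its ctor, tears it down in its dtor */
#include <stdlib.h>
static char *resource;
__attribute__((constructor)) static void b_init(void) { resource = malloc(64); }
__attribute__((destructor))  static void b_fini(void) { free(resource); resource = 0; }
char *b_get(void) { return resource; }

/* liba.c, linked against libb.so */
extern char *b_get(void);
__attribute__((constructor)) static void a_init(void)
{
	char *r = b_get(); /* null if a concurrent exit() already ran
	                    * b_fini(); a_init must not run after that */
	if (r) r[0] = 'x';
}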

Rich
