musl - Possible design for global thread list

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20181010163306.GO17110@brightrain.aerifal.cx>
Date: Wed, 10 Oct 2018 12:33:06 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Possible design for global thread list

This came up as part of the tlsdesc discussion, and I'm documenting it
here in case we end up needing it. Doing this would be a radical
departure from musl's early intentional design of not tracking threads
globally, which theoretically has nice scalability properties but
which all break down when you consider that the kernel has to track
them anyway. It also involves a good bit of work, and possibly some
performance or realtime-property hits in some places (see below), so I
don't necessarily think it's a good idea. But in some ways it would be
a lot cleaner/simpler and would facilitate making TLS access somewhat
faster and making __synccall work independently of /proc and other
hacks.

At the time of the last __synccall overhaul in commit
78a8ef47c4d92b7680c52a85f80a81e29da86bb9, I believed it was impossible
to guarantee a consistent view of the list of still-live threads at
the kernel task level. As a result, __synccall depends on availability
of (and ability to open) /proc/self/task to ensure that it has
actually gotten them all, or waited for them to finish exiting.

It turns out that it's not actually impossible to keep a list, and
depending on your perspective, the way to keep such a list is either
incredibly beautiful or incredibly hideous. :-)

What makes it difficult is that exiting threads removing themselves
from the list cannot unlock the list themselves, or another thread
could obtain the lock and see the list with the exiting thread removed
before it actually exits. The solution is to let the kernel do it.
Instead of having the CLONE_CHILD_CLEARTID/set_tid_address for each
thread point to something in its own thread structure, have it point
to the global thread-list lock. Then the kernel unlocks the list (and
futex waits any waiters on it) atomically with respect to kernel task
termination.

Of course, this futex wake is already used for pthread_join, which
would need another mechanism. This is solved simply: pthread_exit can
FUTEX_REQUEUE a waiting joiner to the thread-list lock. pthread_join
then has to wait on (but need not acquire) the thread-list lock after
waiting on the thread's own exit futex in order to ensure the exit has
actually finished. This is potentially subject to long waits if the
lock is under contention (lots of threads exiting or being created)
and retaken before pthread_join gets to run, but the probability of
collision can be made negligible (only possible under extremely rapid
tid reuse) by using the tid of the exiting thread as the wait value.
Alternatively, the tid of the joiner could be used, making collisions
impossible, but setting up to do this is more complex.

Such a lock is already used in the fallback/nommu implementation of
__unmapself. If we used a global thread list lock, no such lock in
__unmapself would be needed.

In order for __synccall to use a global thread list lock, the lock
would need to be async-signal-safe (since __synccall is used to
implement AS-safe functions like setuid). This means taking the lock
would require signals to be blocked, which imposes new requirements on
the implementation of pthread_create and dlopen. In some ways this is
nice; right now pthread_create has special cases for when the new
thread has to manipulate its signal mask after starting, and it would
be simpler/cleaner to just always do it. This would also allow working
around buggy kernels where the clone flags to set thread pointer, or
set/clear tid address, don't work (was a historic issue on some archs;
we just don't support those broken kernels now) by just not using them
and instead fixing things up in the new thread before unmasking
signals. It would have a slight performance cost though.

Blocking signals in dlopen might be uglier; it's a *long* operation,
and would break any sort of realtime signal handling going on.
However, I think the allocation of new TLS memory (which is all the
lock is needed for) could be moved to *after* all new DSOs are loaded
and relocated, as the last step before committing to success. This
helps with the current design where it's allocation-only, and where
__tls_get_new is responsible for acquiring that allocated memory later
in the thread that needs it, but if we switched to having dlopen
actually dole out the allocated memory to all live threads, there
would be a non-constant time operation that had to take place with the
thread list lock held, and with signals blocked.

One potential solution is making the list lock an AS-safe rwlock,
where only operations that will modify the list (thread creation and
exit) need to take the write lock, and only the write lock requires
signals to be blocked. dlopen and __synccall would then only be
readers. (This is an inversion of the roles with the current
__acquire_ptc/__release_ptc/__inhibit_ptc lock, where dlopen has the
"writer" role and thread creation/exit has the "reader" role.)

In order to implement an AS-safe rwlock without unbounded-time
blocking of signals while waiting for the lock, the wrlock logic would
have to go something like:

	for (;;) {
		futex wait for unlocked state;
		block signals;
		cas lock from unlocked to wr-locked state
		if (success) break;
		unblock signals;
	}

This would be unfortunate if rdlocks were a common operation, since
the "block signals" step gives a large window for an arriving reader
to steal the lock. Fortunately however they're very uncommon
operations that only make sense a small finite number of times in a
process lifetime, so I don't think this matters.

The biggest performance concern here is probably how long the lock
needs to be held for thread creation/exit and whether that interferes
with concurrent creation/exit. Naively I expect the answer is a big
yes. If a separate lock were used (just like now) to prevent dlopen
while the thread's memory is being allocated, the minimum the critical
section for creation could be is just around the call to __clone, and
linking into the list upon success. From what I remember, that can
easily be 15k cycles or so. At exit the critical section needs to
contain the __unmapself for a detached thread or the FUTEX_REQUEUE for
a joinable thread with a waiting joiner, but these are all fairly
quick syscalls.

So it's probably very questionable whether any of this makes sense to
pursue, but at least it's all now published in case we want/need to at
some point.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.