Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 12 Oct 2018 11:56:21 -0400
From: Rich Felker <>
Subject: Re: Possible design for global thread list

On Wed, Oct 10, 2018 at 12:33:06PM -0400, Rich Felker wrote:
> This came up as part of the tlsdesc discussion, and I'm documenting it
> here in case we end up needing it. Doing this would be a radical
> departure from musl's early intentional design of not tracking threads
> globally, which theoretically has nice scalability properties but
> which all break down when you consider that the kernel has to track
> them anyway. It also involves a good bit of work, and possibly some
> performance or realtime-property hits in some places (see below), so I
> don't necessarily think it's a good idea. But in some ways it would be
> a lot cleaner/simpler and would facilitate making TLS access somewhat
> faster and making __synccall work independently of /proc and other
> hacks.
> At the time of the last __synccall overhaul in commit
> 78a8ef47c4d92b7680c52a85f80a81e29da86bb9, I believed it was impossible
> to guarantee a consistent view of the list of still-live threads at
> the kernel task level. As a result, __synccall depends on availability
> of (and ability to open) /proc/self/task to ensure that it has
> actually gotten them all, or waited for them to finish exiting.
> It turns out that it's not actually impossible to keep a list, and
> depending on your perspective, the way to keep such a list is either
> incredibly beautiful or incredibly hideous. :-)
> What makes it difficult is that exiting threads removing themselves
> from the list cannot unlock the list themselves, or another thread
> could obtain the lock and see the list with the exiting thread removed
> before it actually exits. The solution is to let the kernel do it.
> Instead of having the CLONE_CHILD_CLEARTID/set_tid_address for each
> thread point to something in its own thread structure, have it point
> to the global thread-list lock. Then the kernel unlocks the list (and
> futex waits any waiters on it) atomically with respect to kernel task
> termination.
> Of course, this futex wake is already used for pthread_join, which
> would need another mechanism. This is solved simply: pthread_exit can
> FUTEX_REQUEUE a waiting joiner to the thread-list lock. pthread_join
> then has to wait on (but need not acquire) the thread-list lock after
> waiting on the thread's own exit futex in order to ensure the exit has
> actually finished. This is potentially subject to long waits if the
> lock is under contention (lots of threads exiting or being created)
> and retaken before pthread_join gets to run, but the probability of
> collision can be made negligible (only possible under extremely rapid
> tid reuse) by using the tid of the exiting thread as the wait value.
> Alternatively, the tid of the joiner could be used, making collisions
> impossible, but setting up to do this is more complex.
> Such a lock is already used in the fallback/nommu implementation of
> __unmapself. If we used a global thread list lock, no such lock in
> __unmapself would be needed.
> In order for __synccall to use a global thread list lock, the lock
> would need to be async-signal-safe (since __synccall is used to
> implement AS-safe functions like setuid). This means taking the lock
> would require signals to be blocked, which imposes new requirements on
> the implementation of pthread_create and dlopen. In some ways this is
> nice; right now pthread_create has special cases for when the new
> thread has to manipulate its signal mask after starting, and it would
> be simpler/cleaner to just always do it. This would also allow working
> around buggy kernels where the clone flags to set thread pointer, or
> set/clear tid address, don't work (was a historic issue on some archs;
> we just don't support those broken kernels now) by just not using them
> and instead fixing things up in the new thread before unmasking
> signals. It would have a slight performance cost though.
> Blocking signals in dlopen might be uglier; it's a *long* operation,
> and would break any sort of realtime signal handling going on.
> However, I think the allocation of new TLS memory (which is all the
> lock is needed for) could be moved to *after* all new DSOs are loaded
> and relocated, as the last step before committing to success. This
> helps with the current design where it's allocation-only, and where
> __tls_get_new is responsible for acquiring that allocated memory later
> in the thread that needs it, but if we switched to having dlopen
> actually dole out the allocated memory to all live threads, there
> would be a non-constant time operation that had to take place with the
> thread list lock held, and with signals blocked.
> One potential solution is making the list lock an AS-safe rwlock,
> where only operations that will modify the list (thread creation and
> exit) need to take the write lock, and only the write lock requires
> signals to be blocked. dlopen and __synccall would then only be
> readers. (This is an inversion of the roles with the current
> __acquire_ptc/__release_ptc/__inhibit_ptc lock, where dlopen has the
> "writer" role and thread creation/exit has the "reader" role.)
> In order to implement an AS-safe rwlock without unbounded-time
> blocking of signals while waiting for the lock, the wrlock logic would
> have to go something like:
> 	for (;;) {
> 		futex wait for unlocked state;
> 		block signals;
> 		cas lock from unlocked to wr-locked state
> 		if (success) break;
> 		unblock signals;
> 	}
> This would be unfortunate if rdlocks were a common operation, since
> the "block signals" step gives a large window for an arriving reader
> to steal the lock. Fortunately however they're very uncommon
> operations that only make sense a small finite number of times in a
> process lifetime, so I don't think this matters.
> The biggest performance concern here is probably how long the lock
> needs to be held for thread creation/exit and whether that interferes
> with concurrent creation/exit. Naively I expect the answer is a big
> yes. If a separate lock were used (just like now) to prevent dlopen
> while the thread's memory is being allocated, the minimum the critical
> section for creation could be is just around the call to __clone, and
> linking into the list upon success. From what I remember, that can
> easily be 15k cycles or so. At exit the critical section needs to
> contain the __unmapself for a detached thread or the FUTEX_REQUEUE for
> a joinable thread with a waiting joiner, but these are all fairly
> quick syscalls.
> So it's probably very questionable whether any of this makes sense to
> pursue, but at least it's all now published in case we want/need to at
> some point.

It should be noted that all the high costs come from making the list
(1) AS-safe, and (2) inclusive of threads that no longer exist in the
abstract machine, but still have running kernel tasks. These
conditions are necessary and sufficient for __synccall to use it for
enumerating threads, but not at all necessary if we just wanted to be
able to install new dynamic TLS synchronously with dlopen.

It would be kinda disappointing to go to the trouble of adding a
global thread list without getting to use it for fixing __synccall,
but I think it is a legitimate option if installing new TLS at dlopen
time looks like the rigth thing to do but the AS-safe lock looks too
expensive or runs into troubles.


Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.