|
Message-ID: <20181010163306.GO17110@brightrain.aerifal.cx> Date: Wed, 10 Oct 2018 12:33:06 -0400 From: Rich Felker <dalias@...c.org> To: musl@...ts.openwall.com Subject: Possible design for global thread list This came up as part of the tlsdesc discussion, and I'm documenting it here in case we end up needing it. Doing this would be a radical departure from musl's early intentional design of not tracking threads globally, which theoretically has nice scalability properties but which all break down when you consider that the kernel has to track them anyway. It also involves a good bit of work, and possibly some performance or realtime-property hits in some places (see below), so I don't necessarily think it's a good idea. But in some ways it would be a lot cleaner/simpler and would facilitate making TLS access somewhat faster and making __synccall work independently of /proc and other hacks. At the time of the last __synccall overhaul in commit 78a8ef47c4d92b7680c52a85f80a81e29da86bb9, I believed it was impossible to guarantee a consistent view of the list of still-live threads at the kernel task level. As a result, __synccall depends on availability of (and ability to open) /proc/self/task to ensure that it has actually gotten them all, or waited for them to finish exiting. It turns out that it's not actually impossible to keep a list, and depending on your perspective, the way to keep such a list is either incredibly beautiful or incredibly hideous. :-) What makes it difficult is that exiting threads removing themselves from the list cannot unlock the list themselves, or another thread could obtain the lock and see the list with the exiting thread removed before it actually exits. The solution is to let the kernel do it. Instead of having the CLONE_CHILD_CLEARTID/set_tid_address for each thread point to something in its own thread structure, have it point to the global thread-list lock. Then the kernel unlocks the list (and futex waits any waiters on it) atomically with respect to kernel task termination. Of course, this futex wake is already used for pthread_join, which would need another mechanism. This is solved simply: pthread_exit can FUTEX_REQUEUE a waiting joiner to the thread-list lock. pthread_join then has to wait on (but need not acquire) the thread-list lock after waiting on the thread's own exit futex in order to ensure the exit has actually finished. This is potentially subject to long waits if the lock is under contention (lots of threads exiting or being created) and retaken before pthread_join gets to run, but the probability of collision can be made negligible (only possible under extremely rapid tid reuse) by using the tid of the exiting thread as the wait value. Alternatively, the tid of the joiner could be used, making collisions impossible, but setting up to do this is more complex. Such a lock is already used in the fallback/nommu implementation of __unmapself. If we used a global thread list lock, no such lock in __unmapself would be needed. In order for __synccall to use a global thread list lock, the lock would need to be async-signal-safe (since __synccall is used to implement AS-safe functions like setuid). This means taking the lock would require signals to be blocked, which imposes new requirements on the implementation of pthread_create and dlopen. In some ways this is nice; right now pthread_create has special cases for when the new thread has to manipulate its signal mask after starting, and it would be simpler/cleaner to just always do it. This would also allow working around buggy kernels where the clone flags to set thread pointer, or set/clear tid address, don't work (was a historic issue on some archs; we just don't support those broken kernels now) by just not using them and instead fixing things up in the new thread before unmasking signals. It would have a slight performance cost though. Blocking signals in dlopen might be uglier; it's a *long* operation, and would break any sort of realtime signal handling going on. However, I think the allocation of new TLS memory (which is all the lock is needed for) could be moved to *after* all new DSOs are loaded and relocated, as the last step before committing to success. This helps with the current design where it's allocation-only, and where __tls_get_new is responsible for acquiring that allocated memory later in the thread that needs it, but if we switched to having dlopen actually dole out the allocated memory to all live threads, there would be a non-constant time operation that had to take place with the thread list lock held, and with signals blocked. One potential solution is making the list lock an AS-safe rwlock, where only operations that will modify the list (thread creation and exit) need to take the write lock, and only the write lock requires signals to be blocked. dlopen and __synccall would then only be readers. (This is an inversion of the roles with the current __acquire_ptc/__release_ptc/__inhibit_ptc lock, where dlopen has the "writer" role and thread creation/exit has the "reader" role.) In order to implement an AS-safe rwlock without unbounded-time blocking of signals while waiting for the lock, the wrlock logic would have to go something like: for (;;) { futex wait for unlocked state; block signals; cas lock from unlocked to wr-locked state if (success) break; unblock signals; } This would be unfortunate if rdlocks were a common operation, since the "block signals" step gives a large window for an arriving reader to steal the lock. Fortunately however they're very uncommon operations that only make sense a small finite number of times in a process lifetime, so I don't think this matters. The biggest performance concern here is probably how long the lock needs to be held for thread creation/exit and whether that interferes with concurrent creation/exit. Naively I expect the answer is a big yes. If a separate lock were used (just like now) to prevent dlopen while the thread's memory is being allocated, the minimum the critical section for creation could be is just around the call to __clone, and linking into the list upon success. From what I remember, that can easily be 15k cycles or so. At exit the critical section needs to contain the __unmapself for a detached thread or the FUTEX_REQUEUE for a joinable thread with a waiting joiner, but these are all fairly quick syscalls. So it's probably very questionable whether any of this makes sense to pursue, but at least it's all now published in case we want/need to at some point. Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.