musl - Re: Illegal killlock skipping when transitioning to single-threaded state

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ae1b4d72f2de2fbf859ef1df9a2f7ff9@ispras.ru>
Date: Mon, 03 Oct 2022 09:16:03 +0300
From: Alexey Izbyshev <izbyshev@...ras.ru>
To: musl@...ts.openwall.com
Subject: Re: Illegal killlock skipping when transitioning to
 single-threaded state

On 2022-09-19 18:29, Rich Felker wrote:
> On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> Hi,
>> 
>> While reading pthread_exit() implementation I noticed that it can
>> set "libc.need_locks" to -1 while still holding the killlock of the
>> exiting thread:
>> 
>>     if (!--libc.threads_minus_1) libc.need_locks = -1;
>> 
>> If the remaining thread attempts to acquire the same killlock
>> concurrently (which is valid for joinable threads), it works fine
>> because LOCK() resets "libc.need_locks" only after a_cas():
>> 
>>     int need_locks = libc.need_locks;
>>     if (!need_locks) return;
>>     /* fast path: INT_MIN for the lock, +1 for the congestion */
>>     int current = a_cas(l, 0, INT_MIN + 1);
>>     if (need_locks < 0) libc.need_locks = 0;
>>     if (!current) return;
>> 
>> However, because "libc.need_locks" is reset when using LOCK() for
>> any other lock too, the following could happen (E is the exiting
>> thread, R is the remaining thread):
>> 
>> E: libc.need_locks = -1
>> R: LOCK(unrelated_lock)
>> R:   libc.need_locks = 0
>> R: UNLOCK(unrelated_lock)
>> R: LOCK(E->killlock) // skips locking, now both R and E think they
>> are holding the lock
>> R: UNLOCK(E->killlock)
>> E: UNLOCK(E->killlock)
>> 
>> The lack of mutual exclusion means that the tid reuse problem that
>> killlock is supposed to prevent might occur.
>> 
>> Moreover, when UNLOCK(E->killlock) is called concurrently by R and
>> E, a_fetch_add() could be done twice if timing is such that both
>> threads notice that "l[0] < 0":
>> 
>>    /* Check l[0] to see if we are multi-threaded. */
>>    if (l[0] < 0) {
>>        if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
>>            __wake(l, 1, 1);
>>        }
>>    }
>> 
>> In that case E->killlock will be assigned to INT_MAX (i.e. "unlocked
>> with INT_MAX waiters"). This is a bad state because the next LOCK()
>> will wrap it to "unlocked" state instead of locking. Also, if more
>> than two threads attempt to use it, a deadlock will occur if two
>> supposedly-owners execute a_fetch_add() concurrently in UNLOCK()
>> after a third thread registered itself as a waiter because the lock
>> will wrap to INT_MIN.
>> 
>> Reordering the "libc.need_locks = -1" assignment and
>> UNLOCK(E->killlock) and providing a store barrier between them
>> should fix the issue.
> 
> I think this all sounds correct. I'm not sure what you mean by a store
> barrier between them, since all lock and unlock operations are already
> full barriers.
> 

Before sending the report I tried to infer the intended ordering 
semantics of LOCK/UNLOCK by looking at their implementations. For 
AArch64, I didn't see why they would provide a full barrier (my 
reasoning is below), so I concluded that probably acquire/release 
semantics was intended in general and suggested an extra store barrier 
to prevent hoisting of "libc.need_locks = -1" store spelled after 
UNLOCK(E->killlock) back into the critical section.

UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple 
a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are 
implemented via load-acquire/store-release instructions. Therefore, if 
we consider a LOCK/UNLOCK critical section containing only plain loads 
and stores, (a) any such memory access can be reordered with the initial 
ldaxr in UNLOCK, and (b) any plain load following UNLOCK can be 
reordered with stlxr (assuming the processor predicts that stlxr 
succeeds), and further, due to (a), with any memory access inside the 
critical section. Therefore, UNLOCK is not full barrier. Is this right?

As for a store following UNLOCK, I'm not sure. A plain store following 
stlxr can be reordered with it, but here that store is conditional on 
stlxr success. So even if the processor predicts stlxr success, it can't 
make the following store visible before it's sure that stlxr succeeded. 
But I don't know whether the events "the processor knows that stlxr 
succeeded" and "the result of stlxr is globally visible" are separate or 
not, and if they are separate, what comes first. Depending on the 
answer, UNLOCK acts as a store barrier or not.

> I'm still a little bit concerned though that there's no real model for
> synchronization of the access to modify need_locks though.
> Conceptually, the store of -1 from the exiting thread and the store of
> 0 in the surviving thread are not ordered with respect to each other
> except by an unsynchronized causality relationship.
> 
I don't understand implications of the lack of (stronger) ordering of 
these two stores. Could you clarify?

> I suspect what "should be" happening is that, if we observe
> need_locks==-1 (as a relaxed atomic load in our memory model), we take
> the thread list lock (as the lock controlling write access to these
> values) which ensures that the exiting thread has exited, then confirm
> that we are the only thread left (are there async-signal ways without
> UB that another thread could be created in the mean time?) before
> setting it to 0. But this is a rather heavy operation. If there's an
> assurance that no other threads can be created in the middle of LOCK
> (which I think there should be), we could just use __tl_sync(), which
> is much lighter, to conclude that we've synchronized with being the
> only remaining thread and that unsynchronized access is okay.
> 

LOCK is called either with application signals blocked or from 
async-signal-unsafe functions, so interrupting it with a signal handler 
that calls pthread_create() is not possible without UB. Therefore, I 
don't see any issues with using __tl_sync() for synchronization with the 
exiting thread (although, as I said above, I don't understand what 
problem is being solved).

By the way, I noticed a suspicious usage of "libc.need_locks" in fork():

     int need_locks = libc.need_locks > 0;

This is different from how LOCK behaves: LOCK will still perform normal 
locking one last time if it sees -1, but here locking will be skipped 
entirely. Is this intentional?

Thanks,
Alexey
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.