Date: Mon, 3 Oct 2022 08:33:02 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Illegal killlock skipping when transitioning to
 single-threaded state

On Mon, Oct 03, 2022 at 09:16:03AM +0300, Alexey Izbyshev wrote:
> On 2022-09-19 18:29, Rich Felker wrote:
> >On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> >>Hi,
> >>
> >>While reading the pthread_exit() implementation I noticed that it
> >>can set "libc.need_locks" to -1 while still holding the killlock of
> >>the exiting thread:
> >>
> >>    if (!--libc.threads_minus_1) libc.need_locks = -1;
> >>
> >>If the remaining thread attempts to acquire the same killlock
> >>concurrently (which is valid for joinable threads), it works fine
> >>because LOCK() resets "libc.need_locks" only after a_cas():
> >>
> >>    int need_locks = libc.need_locks;
> >>    if (!need_locks) return;
> >>    /* fast path: INT_MIN for the lock, +1 for the congestion */
> >>    int current = a_cas(l, 0, INT_MIN + 1);
> >>    if (need_locks < 0) libc.need_locks = 0;
> >>    if (!current) return;
> >>
> >>However, because "libc.need_locks" is reset when using LOCK() for
> >>any other lock too, the following could happen (E is the exiting
> >>thread, R is the remaining thread):
> >>
> >>E: libc.need_locks = -1
> >>R: LOCK(unrelated_lock)
> >>R:   libc.need_locks = 0
> >>R: UNLOCK(unrelated_lock)
> >>R: LOCK(E->killlock) // skips locking, now both R and E think they
> >>are holding the lock
> >>R: UNLOCK(E->killlock)
> >>E: UNLOCK(E->killlock)
> >>
> >>The lack of mutual exclusion means that the tid reuse problem that
> >>killlock is supposed to prevent might occur.
> >>
> >>Moreover, when UNLOCK(E->killlock) is called concurrently by R and
> >>E, a_fetch_add() could be done twice if timing is such that both
> >>threads notice that "l[0] < 0":
> >>
> >>   /* Check l[0] to see if we are multi-threaded. */
> >>   if (l[0] < 0) {
> >>       if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
> >>           __wake(l, 1, 1);
> >>       }
> >>   }
> >>
> >>In that case E->killlock will end up at INT_MAX (i.e. "unlocked
> >>with INT_MAX waiters"). This is a bad state because the next LOCK()
> >>will wrap it to the "unlocked" state instead of locking. Also, if
> >>more than two threads attempt to use it, a deadlock will occur if
> >>two supposed owners execute a_fetch_add() concurrently in UNLOCK()
> >>after a third thread has registered itself as a waiter, because the
> >>lock will wrap to INT_MIN.
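
(Working through the arithmetic above, since it took me a second:
-(INT_MIN + 1) is INT_MAX, so starting from the held, uncontended
value INT_MIN + 1, the first a_fetch_add() brings the lock to 0 and a
second, erroneous one brings it to INT_MAX. A standalone illustration
of just the arithmetic, not musl code:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        int l = INT_MIN + 1;   /* lock held, no waiters */
        l += -(INT_MIN + 1);   /* first UNLOCK: l becomes 0 */
        l += -(INT_MIN + 1);   /* second UNLOCK: l becomes INT_MAX */
        printf("%d == %d\n", l, INT_MAX);
        return 0;
    }

Neither addition overflows here; the point is just where the value
lands when the decrement runs twice.)
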
> >>
> >>Reordering the "libc.need_locks = -1" assignment and
> >>UNLOCK(E->killlock) and providing a store barrier between them
> >>should fix the issue.
> >
> >I think this all sounds correct. I'm not sure what you mean by a store
> >barrier between them, since all lock and unlock operations are already
> >full barriers.
> >
> 
> Before sending the report I tried to infer the intended ordering
> semantics of LOCK/UNLOCK by looking at their implementations. For
> AArch64, I didn't see why they would provide a full barrier (my
> reasoning is below), so I concluded that probably acquire/release
> semantics was intended in general and suggested an extra store
> barrier to prevent hoisting of the "libc.need_locks = -1" store,
> spelled after UNLOCK(E->killlock), back into the critical section.
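
(If I'm reading that suggestion right, the shape would be roughly the
following; this is only a sketch of the idea, with the surrounding
pthread_exit code elided and "last" being a local I made up for
illustration:

    int last = !--libc.threads_minus_1;  /* still under the killlock */
    /* ... */
    UNLOCK(self->killlock);
    if (last) {
        a_barrier();  /* keep the -1 store from being hoisted back
                         into the critical section */
        libc.need_locks = -1;
    }

If UNLOCK really is a full barrier, the a_barrier() is redundant, but
that's exactly the property in question below.)
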
> 
> UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are
> implemented via load-acquire/store-release instructions. Therefore,
> if we consider a LOCK/UNLOCK critical section containing only plain
> loads and stores, (a) any such memory access can be reordered with
> the initial ldaxr in UNLOCK, and (b) any plain load following UNLOCK
> can be reordered with stlxr (assuming the processor predicts that
> stlxr succeeds), and further, due to (a), with any memory access
> inside the critical section. Therefore, UNLOCK is not a full
> barrier. Is this right?

If so, this is a bug in our implementation of aarch64 atomics. The
intent is that a_* all be seq_cst (practically, full acq+rel probably
suffices). Maybe someone more familiar with aarch64 can chime in.
I thought this was discussed and determined to be okay when the
aarch64 port was being reviewed, but now I'm starting to doubt that.
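
For the record, the kind of strengthening I'd have in mind, if the
ldaxr/stlxr pair really is too weak, is keeping the ll/sc loop but
following it with an explicit full barrier, along these lines (an
untested sketch on top of the existing a_ll/a_sc helpers in
arch/aarch64/atomic_arch.h, not a patch):

    static inline int a_fetch_add_full(volatile int *p, int v)
    {
        int old;
        do old = a_ll(p);
        while (!a_sc(p, old + v));
        /* dmb ish so later plain accesses can't move before the
         * store-release above */
        __asm__ __volatile__ ("dmb ish" ::: "memory");
        return old;
    }

But again, someone who actually knows the architecture should confirm
whether that's both needed and sufficient.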

> As for a store following UNLOCK, I'm not sure. A plain store
> following stlxr can be reordered with it, but here that store is
> conditional on stlxr success. So even if the processor predicts
> stlxr success, it can't make the following store visible before it's
> sure that stlxr succeeded. But I don't know whether the events "the
> processor knows that stlxr succeeded" and "the result of stlxr is
> globally visible" are separate or not, and if they are separate,
> what comes first. Depending on the answer, UNLOCK acts as a store
> barrier or not.
> 
> >I'm still a little bit concerned though that there's no real model
> >for synchronization of the access to modify need_locks.
> >Conceptually, the store of -1 from the exiting thread and the store of
> >0 in the surviving thread are not ordered with respect to each other
> >except by an unsynchronized causality relationship.
> >
> I don't understand the implications of the lack of (stronger)
> ordering of these two stores. Could you clarify?

I'm talking at the level of the memory model, not any implementation
details of the cpu. The store of 0 in the surviving thread is not an
atomic operation, and I was thinking it is possibly not ordered with
respect to the store of the -1 in the exiting thread unless there's
an intervening barrier *after* the -1 is observed.
However, looking back at the code, I was actually careful to ensure
this already. need_locks is loaded only once, before the a_cas, and
the a_cas provides the barrier. If we were instead re-loading
need_locks after the a_cas before testing <0 and assigning 0, there
would be a problem, at least at the abstract memory model level.
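
In other words, the distinction I'm drawing is roughly this
(paraphrasing the LOCK() code quoted earlier, not new code):

    /* what LOCK() does now: one load of need_locks before the a_cas,
     * and both the test and the reset use that cached value */
    int need_locks = libc.need_locks;
    if (!need_locks) return;
    int current = a_cas(l, 0, INT_MIN + 1);
    if (need_locks < 0) libc.need_locks = 0;

    /* the variant that would be questionable at the memory-model
     * level is re-loading after the barrier:
     * if (libc.need_locks < 0) libc.need_locks = 0; */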

> >I suspect what "should be" happening is that, if we observe
> >need_locks==-1 (as a relaxed atomic load in our memory model), we take
> >the thread list lock (as the lock controlling write access to these
> >values) which ensures that the exiting thread has exited, then confirm
> >that we are the only thread left (are there async-signal ways without
> >UB that another thread could be created in the meantime?) before
> >setting it to 0. But this is a rather heavy operation. If there's an
> >assurance that no other threads can be created in the middle of LOCK
> >(which I think there should be), we could just use __tl_sync(), which
> >is much lighter, to conclude that we've synchronized with being the
> >only remaining thread and that unsynchronized access is okay.
> >
> 
> LOCK is called either with application signals blocked or from
> async-signal-unsafe functions, so interrupting it with a signal
> handler that calls pthread_create() is not possible without UB.
> Therefore, I don't see any issues with using __tl_sync() for
> synchronization with the exiting thread (although, as I said above,
> I don't understand what problem is being solved).

Based on the above, I agree there doesn't seem to be an actual problem
here. The reason it works is subtle, but present.

> By the way, I noticed a suspicious usage of "libc.need_locks" in fork():
> 
>     int need_locks = libc.need_locks > 0;
> 
> This is different from how LOCK behaves: LOCK will still perform
> normal locking one last time if it sees -1, but here locking will be
> skipped entirely. Is this intentional?

This is probably subtly incorrect. I think it requires more analysis
to determine if there's anything that can actually go wrong, but it
should probably just be changed anyway.
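
If we do change it, the minimal adjustment I can think of is treating
-1 the same way LOCK does, something along the lines of (sketch only;
I haven't thought through the rest of fork()):

    /* take the locks one last time even when the process has just
     * gone single-threaded, instead of skipping them on -1 */
    int need_locks = !!libc.need_locks;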

Rich
