Date: Mon, 3 Oct 2022 08:33:02 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Illegal killlock skipping when transitioning to
 single-threaded state

On Mon, Oct 03, 2022 at 09:16:03AM +0300, Alexey Izbyshev wrote:
> On 2022-09-19 18:29, Rich Felker wrote:
> >On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> >>Hi,
> >>
> >>While reading the pthread_exit() implementation I noticed that it
> >>can set "libc.need_locks" to -1 while still holding the killlock of
> >>the exiting thread:
> >>
> >>    if (!--libc.threads_minus_1) libc.need_locks = -1;
> >>
> >>If the remaining thread attempts to acquire the same killlock
> >>concurrently (which is valid for joinable threads), it works fine
> >>because LOCK() resets "libc.need_locks" only after a_cas():
> >>
> >>    int need_locks = libc.need_locks;
> >>    if (!need_locks) return;
> >>    /* fast path: INT_MIN for the lock, +1 for the congestion */
> >>    int current = a_cas(l, 0, INT_MIN + 1);
> >>    if (need_locks < 0) libc.need_locks = 0;
> >>    if (!current) return;
> >>
> >>However, because "libc.need_locks" is reset when using LOCK() for
> >>any other lock too, the following could happen (E is the exiting
> >>thread, R is the remaining thread):
> >>
> >>E: libc.need_locks = -1
> >>R: LOCK(unrelated_lock)
> >>R:   libc.need_locks = 0
> >>R: UNLOCK(unrelated_lock)
> >>R: LOCK(E->killlock) // skips locking, now both R and E think they
> >>are holding the lock
> >>R: UNLOCK(E->killlock)
> >>E: UNLOCK(E->killlock)
> >>
> >>The lack of mutual exclusion means that the tid reuse problem that
> >>killlock is supposed to prevent might occur.
> >>
> >>Moreover, when UNLOCK(E->killlock) is called concurrently by R and
> >>E, a_fetch_add() could be done twice if timing is such that both
> >>threads notice that "l[0] < 0":
> >>
> >>   /* Check l[0] to see if we are multi-threaded. */
> >>   if (l[0] < 0) {
> >>       if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
> >>           __wake(l, 1, 1);
> >>       }
> >>   }
> >>
> >>In that case E->killlock will end up at INT_MAX (i.e. "unlocked
> >>with INT_MAX waiters"). This is a bad state because the next LOCK()
> >>will wrap it to the "unlocked" state instead of locking. Also, if
> >>more than two threads attempt to use it, a deadlock will occur if
> >>two supposed owners execute a_fetch_add() concurrently in UNLOCK()
> >>after a third thread has registered itself as a waiter, because the
> >>lock will wrap to INT_MIN.
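
(Working through the arithmetic above, since it took me a second:
-(INT_MIN + 1) is INT_MAX, so starting from the held, uncontended
value INT_MIN + 1, the first a_fetch_add() brings the lock to 0 and a
second, erroneous one brings it to INT_MAX. A standalone illustration
of just the arithmetic, not musl code:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        int l = INT_MIN + 1;   /* lock held, no waiters */
        l += -(INT_MIN + 1);   /* first UNLOCK: l becomes 0 */
        l += -(INT_MIN + 1);   /* second UNLOCK: l becomes INT_MAX */
        printf("%d == %d\n", l, INT_MAX);
        return 0;
    }

Neither addition overflows here; the point is just where the value
lands when the decrement runs twice.)
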
> >>
> >>Reordering the "libc.need_locks = -1" assignment and
> >>UNLOCK(E->killlock) and providing a store barrier between them
> >>should fix the issue.
> >
> >I think this all sounds correct. I'm not sure what you mean by a store
> >barrier between them, since all lock and unlock operations are already
> >full barriers.
> >
> 
> Before sending the report I tried to infer the intended ordering
> semantics of LOCK/UNLOCK by looking at their implementations. For
> AArch64, I didn't see why they would provide a full barrier (my
> reasoning is below), so I concluded that probably acquire/release
> semantics was intended in general and suggested an extra store
> barrier to prevent hoisting of the "libc.need_locks = -1" store,
> spelled after UNLOCK(E->killlock), back into the critical section.
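
(If I'm reading that suggestion right, the shape would be roughly the
following; this is only a sketch of the idea, with the surrounding
pthread_exit code elided and "last" being a local I made up for
illustration:

    int last = !--libc.threads_minus_1;  /* still under the killlock */
    /* ... */
    UNLOCK(self->killlock);
    if (last) {
        a_barrier();  /* keep the -1 store from being hoisted back
                         into the critical section */
        libc.need_locks = -1;
    }

If UNLOCK really is a full barrier, the a_barrier() is redundant, but
that's exactly the property in question below.)
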
> 
> UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are
> implemented via load-acquire/store-release instructions. Therefore,
> if we consider a LOCK/UNLOCK critical section containing only plain
> loads and stores, (a) any such memory access can be reordered with
> the initial ldaxr in UNLOCK, and (b) any plain load following UNLOCK
> can be reordered with stlxr (assuming the processor predicts that
> stlxr succeeds), and further, due to (a), with any memory access
> inside the critical section. Therefore, UNLOCK is not a full
> barrier. Is this right?

If so, this is a bug in our implementation of aarch64 atomics. The
intent is that a_* all be seq_cst (practically, full acq+rel probably
suffices). Maybe someone more familiar with aarch64 can chime in.
I thought this was discussed and determined to be okay when the
aarch64 port was being reviewed, but now I'm starting to doubt that.
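
For the record, the kind of strengthening I'd have in mind, if the
ldaxr/stlxr pair really is too weak, is keeping the ll/sc loop but
following it with an explicit full barrier, along these lines (an
untested sketch on top of the existing a_ll/a_sc helpers in
arch/aarch64/atomic_arch.h, not a patch):

    static inline int a_fetch_add_full(volatile int *p, int v)
    {
        int old;
        do old = a_ll(p);
        while (!a_sc(p, old + v));
        /* dmb ish so later plain accesses can't move before the
         * store-release above */
        __asm__ __volatile__ ("dmb ish" ::: "memory");
        return old;
    }

But again, someone who actually knows the architecture should confirm
whether that's both needed and sufficient.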

> As for a store following UNLOCK, I'm not sure. A plain store
> following stlxr can be reordered with it, but here that store is
> conditional on stlxr success. So even if the processor predicts
> stlxr success, it can't make the following store visible before it's
> sure that stlxr succeeded. But I don't know whether the events "the
> processor knows that stlxr succeeded" and "the result of stlxr is
> globally visible" are separate or not, and if they are separate,
> what comes first. Depending on the answer, UNLOCK acts as a store
> barrier or not.
> 
> >I'm still a little bit concerned though that there's no real model
> >for synchronization of the access to modify need_locks.
> >Conceptually, the store of -1 from the exiting thread and the store of
> >0 in the surviving thread are not ordered with respect to each other
> >except by an unsynchronized causality relationship.
> >
> I don't understand the implications of the lack of (stronger)
> ordering of these two stores. Could you clarify?

I'm talking at the level of the memory model, not any implementation
details of the cpu. The store of 0 in the surviving thread is not an
atomic operation, and I was thinking it is possibly not ordered with
respect to the store of the -1 in the exiting thread unless there's
an intervening barrier *after* the -1 is observed.
However, looking back at the code, I was actually careful to ensure
this already. need_locks is loaded only once, before the a_cas, and
the a_cas provides the barrier. If we were instead re-loading
need_locks after the a_cas before testing <0 and assigning 0, there
would be a problem, at least at the abstract memory model level.
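
In other words, the distinction I'm drawing is roughly this
(paraphrasing the LOCK() code quoted earlier, not new code):

    /* what LOCK() does now: one load of need_locks before the a_cas,
     * and both the test and the reset use that cached value */
    int need_locks = libc.need_locks;
    if (!need_locks) return;
    int current = a_cas(l, 0, INT_MIN + 1);
    if (need_locks < 0) libc.need_locks = 0;

    /* the variant that would be questionable at the memory-model
     * level is re-loading after the barrier:
     * if (libc.need_locks < 0) libc.need_locks = 0; */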

> >I suspect what "should be" happening is that, if we observe
> >need_locks==-1 (as a relaxed atomic load in our memory model), we take
> >the thread list lock (as the lock controlling write access to these
> >values) which ensures that the exiting thread has exited, then confirm
> >that we are the only thread left (are there async-signal ways without
> >UB that another thread could be created in the meantime?) before
> >setting it to 0. But this is a rather heavy operation. If there's an
> >assurance that no other threads can be created in the middle of LOCK
> >(which I think there should be), we could just use __tl_sync(), which
> >is much lighter, to conclude that we've synchronized with being the
> >only remaining thread and that unsynchronized access is okay.
> >
> 
> LOCK is called either with application signals blocked or from
> async-signal-unsafe functions, so interrupting it with a signal
> handler that calls pthread_create() is not possible without UB.
> Therefore, I don't see any issues with using __tl_sync() for
> synchronization with the exiting thread (although, as I said above,
> I don't understand what problem is being solved).

Based on the above, I agree there doesn't seem to be an actual problem
here. The reason it works is subtle, but present.

> By the way, I noticed a suspicious usage of "libc.need_locks" in fork():
> 
>     int need_locks = libc.need_locks > 0;
> 
> This is different from how LOCK behaves: LOCK will still perform
> normal locking one last time if it sees -1, but here locking will be
> skipped entirely. Is this intentional?

This is probably subtly incorrect. I think it requires more analysis
to determine if there's anything that can actually go wrong, but it
should probably just be changed anyway.
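
If we do change it, the minimal adjustment I can think of is treating
-1 the same way LOCK does, something along the lines of (sketch only;
I haven't thought through the rest of fork()):

    /* take the locks one last time even when the process has just
     * gone single-threaded, instead of skipping them on -1 */
    int need_locks = !!libc.need_locks;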

Rich
