musl - Re: Illegal killlock skipping when transitioning to single-threaded state

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20221004141937.GM29905@brightrain.aerifal.cx>
Date: Tue, 4 Oct 2022 10:19:38 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Illegal killlock skipping when transitioning to
 single-threaded state

On Tue, Oct 04, 2022 at 10:12:42AM -0400, Rich Felker wrote:
> On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
> > On 2022-10-04 02:05, Rich Felker wrote:
> > >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
> > >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> > >>> * Szabolcs Nagy <nsz@...t70.net> [2022-10-03 15:26:15 +0200]:
> > >>>
> > >>> > * Alexey Izbyshev <izbyshev@...ras.ru> [2022-10-03 09:16:03 +0300]:
> > >>> > > On 2022-09-19 18:29, Rich Felker wrote:
> > >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> > >>> > ...
> > >>> > > > > Reordering the "libc.need_locks = -1" assignment and
> > >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
> > >>> > > > > should fix the issue.
> > >>> > > >
> > >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
> > >>> > > > barrier between them, since all lock and unlock operations are already
> > >>> > > > full barriers.
> > >>> > > >
> > >>> > >
> > >>> > > Before sending the report I tried to infer the intended ordering semantics
> > >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > >>> > > see why they would provide a full barrier (my reasoning is below), so I
> > >>> > > concluded that probably acquire/release semantics was intended in general
> > >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > >>> > > section.
> > >>> > >
> > >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
> > >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
> > >>> > > any memory access inside the critical section. Therefore, UNLOCK is not full
> > >>> > > barrier. Is this right?
> > >>> >
> > >>> > i dont think this is right.
> > >>>
> > >>>
> > >>> i think i was wrong and you are right.
> > >>>
> > >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> > >>> starting with 'something == 0' the exiting E and remaining R threads:
> > >>>
> > >>> E:something=1      // protected by killlock
> > >>> E:UNLOCK(killlock)
> > >>> E:need_locks=-1
> > >>>
> > >>> R:LOCK(unrelated)  // reads need_locks == -1
> > >>> R:need_locks=0
> > >>> R:UNLOCK(unrelated)
> > >>> R:LOCK(killlock)   // does not lock
> > >>> R:read something   // can it be 0 ?
> > >>>
> > >>> and here something can be 0 (ie. not protected by killlock) on aarch64
> > >>> because
> > >>>
> > >>> T1
> > >>> 	something=1
> > >>> 	ldaxr ... killlock
> > >>> 	stlxr ... killlock
> > >>> 	need_locks=-1
> > >>>
> > >>> T2
> > >>> 	x=need_locks
> > >>> 	ldaxr ... unrelated
> > >>> 	stlxr ... unrelated
> > >>> 	y=something
> > >>>
> > >>> can end with x==-1 and y==0.
> > >>>
> > >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
> > >>>
> > >>> i need to think how to support such lock usage on aarch64
> > >>> without adding too many dmb.
> > >>
> > >>I don't really understand this, but FWIW gcc emits
> > >>
> > >>    ldxr
> > >>    ...
> > >>    stlxr
> > >>    ...
> > >>    dmb ish
> > >>
> > >>for __sync_val_compare_and_swap. So this is probably the right thing
> > >>we should have. And it seems to match what the kernel folks discussed
> > >>here:
> > >>
> > >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
> > >>
> > >>I wondered if there are similar issues for any others archs which need
> > >>review, but it looks like all the other llsc archs have explicit
> > >>pre/post barriers defined.
> > >
> > >Actually I don't understand what's going on with cmpxchg there. The
> > >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
> > >some follow-up in the thread I don't understand, about the case where
> > >the cas fails, but we already handle that by doing an explicit barrier
> > >in that case.
> > >
> > I think in that follow-up[1] they mean the following case (in musl
> > terms):
> > 
> > volatile int x, flag;
> > 
> > T1:
> >     x = 1;
> >     a_store(&flag, 1);
> > 
> > T2:
> >     while (!flag);
> >     a_cas(&x, 0, 1); // can this fail?
> > 
> > They want it to never fail. But if a_cas() is implemented as
> > ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
> > with the load of flag.
> > 
> > Note that musl does *not* handle this now, because a_barrier() in
> > the failure path is after a_ll().
> > 
> > [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
> 
> OK, then indeed this too needs to be fixed -- the a_cas is failing to
> synchronize with the a_store. How do we do that? Based on my
> understanding, my proposed patch doesn't fix it.
> 
> Do we need ldarx/stlrx/dmb? Or perhaps relegate the extra
> synchronization to a retry in the case where the comparison fails?
> 
> If this is actually the case, it's disturbing that GCC does not seem
> to be getting it right either...

One stupid (or maybe not stupid) option would be to always stlrx back
the old value on comparison failure, so we can rely on the stlrx to
fail if anything was reordered wrong. I think this is actually kinda
how the x86 CAS works, or how it historically worked, so maybe it's
not a bad choice. It's also analogous to how the non-CAS operations
work: no branching out of the ll/sc loop, and just defining the result
as old == t ? s : old:

static inline int a_cas(volatile int *p, int t, int s)
{
	int old;
	a_pre_llsc();
	do old = a_ll(p);
	while (!a_sc(p, old == t ? s : old));
	a_post_llsc();
	return old;
}

In fact if we use this as the general pattern for a_cas (in
src/internal/atomic.h) I'm not even sure we'd need a_pre_llsc() at
all, on any archs -- but that probably depends on the ISA's SC
semantics...

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.