Date: Tue, 04 Oct 2022 18:43:49 +0300
From: Alexey Izbyshev <izbyshev@...ras.ru>
To: musl@...ts.openwall.com
Subject: Re: Illegal killlock skipping when transitioning to
 single-threaded state

On 2022-10-04 17:19, Rich Felker wrote:
> On Tue, Oct 04, 2022 at 10:12:42AM -0400, Rich Felker wrote:
>> On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
>> > On 2022-10-04 02:05, Rich Felker wrote:
>> > >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
>> > >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
>> > >>> * Szabolcs Nagy <nsz@...t70.net> [2022-10-03 15:26:15 +0200]:
>> > >>>
>> > >>> > * Alexey Izbyshev <izbyshev@...ras.ru> [2022-10-03 09:16:03 +0300]:
>> > >>> > > On 2022-09-19 18:29, Rich Felker wrote:
>> > >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> > >>> > ...
>> > >>> > > > > Reordering the "libc.need_locks = -1" assignment and
>> > >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
>> > >>> > > > > should fix the issue.
>> > >>> > > >
>> > >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
>> > >>> > > > barrier between them, since all lock and unlock operations are already
>> > >>> > > > full barriers.
>> > >>> > > >
>> > >>> > >
>> > >>> > > Before sending the report I tried to infer the intended ordering semantics
>> > >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
>> > >>> > > see why they would provide a full barrier (my reasoning is below), so I
>> > >>> > > concluded that probably acquire/release semantics was intended in general
>> > >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
>> > >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
>> > >>> > > section.
>> > >>> > >
>> > >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
>> > >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
>> > >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
>> > >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
>> > >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
>> > >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
>> > >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
>> > >>> > > any memory access inside the critical section. Therefore, UNLOCK is not
>> > >>> > > a full barrier. Is this right?
>> > >>> >
>> > >>> > i don't think this is right.
>> > >>>
>> > >>>
>> > >>> i think i was wrong and you are right.
>> > >>>
>> > >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
>> > >>> starting with 'something == 0' the exiting E and remaining R threads:
>> > >>>
>> > >>> E:something=1      // protected by killlock
>> > >>> E:UNLOCK(killlock)
>> > >>> E:need_locks=-1
>> > >>>
>> > >>> R:LOCK(unrelated)  // reads need_locks == -1
>> > >>> R:need_locks=0
>> > >>> R:UNLOCK(unrelated)
>> > >>> R:LOCK(killlock)   // does not lock
>> > >>> R:read something   // can it be 0 ?
>> > >>>
>> > >>> and here something can be 0 (ie. not protected by killlock) on aarch64
>> > >>> because
>> > >>>
>> > >>> T1
>> > >>> 	something=1
>> > >>> 	ldaxr ... killlock
>> > >>> 	stlxr ... killlock
>> > >>> 	need_locks=-1
>> > >>>
>> > >>> T2
>> > >>> 	x=need_locks
>> > >>> 	ldaxr ... unrelated
>> > >>> 	stlxr ... unrelated
>> > >>> 	y=something
>> > >>>
>> > >>> can end with x==-1 and y==0.
>> > >>>
>> > >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
>> > >>>
>> > >>> i need to think how to support such lock usage on aarch64
>> > >>> without adding too many dmb.
>> > >>
>> > >>I don't really understand this, but FWIW gcc emits
>> > >>
>> > >>    ldxr
>> > >>    ...
>> > >>    stlxr
>> > >>    ...
>> > >>    dmb ish
>> > >>
>> > >>for __sync_val_compare_and_swap. So this is probably the right thing
>> > >>we should have. And it seems to match what the kernel folks discussed
>> > >>here:
>> > >>
>> > >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
>> > >>
>> > >>I wondered if there are similar issues for any other archs which need
>> > >>review, but it looks like all the other llsc archs have explicit
>> > >>pre/post barriers defined.
>> > >
>> > >Actually I don't understand what's going on with cmpxchg there. The
>> > >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
>> > >some follow-up in the thread I don't understand, about the case where
>> > >the cas fails, but we already handle that by doing an explicit barrier
>> > >in that case.
>> > >
>> > I think in that follow-up[1] they mean the following case (in musl
>> > terms):
>> >
>> > volatile int x, flag;
>> >
>> > T1:
>> >     x = 1;
>> >     a_store(&flag, 1);
>> >
>> > T2:
>> >     while (!flag);
>> >     a_cas(&x, 1, 2); // can this fail?
>> >
>> > They want it to never fail. But if a_cas() is implemented as
>> > ldxr/stlxr/dmb, this is not guaranteed because the ldxr can be reordered
>> > with the load of flag.
>> >
>> > Note that musl does *not* handle this now, because a_barrier() in
>> > the failure path is after a_ll().
>> >
>> > [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
>> 
>> OK, then indeed this too needs to be fixed -- the a_cas is failing to
>> synchronize with the a_store. How do we do that? Based on my
>> understanding, my proposed patch doesn't fix it.
>> 
>> Do we need ldaxr/stlxr/dmb? Or perhaps relegate the extra
>> synchronization to a retry in the case where the comparison fails?
>> 
ldaxr will not work for the same reason that ldxr doesn't: it can still 
be reordered with the load of flag.
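
To spell it out in the style of the earlier litmus test (a sketch only; 
this assumes a hypothetical a_cas() that merely starts with ldaxr instead 
of ldxr):

T2
	ldr ... flag       // spins until it reads 1
	ldaxr ... x        // acquire orders only the accesses *after* it,
	                   // so it can still be satisfied before the flag load
	...

The acquire keeps later accesses from moving before the ldaxr, but places 
no constraint on the preceding plain load of flag, so the ldaxr can still 
read a stale value of x and the CAS can still spuriously fail.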

>> If this is actually the case, it's disturbing that GCC does not seem
>> to be getting it right either...
> 
This is indeed disturbing, considering that comment[1] claims that 
__sync RMWs were once defined as:

   __sync_synchronize();
   operation...
   __sync_synchronize();

which clearly doesn't hold for the current implementation of 
__sync_val_compare_and_swap() for AArch64.
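
For comparison, an AArch64 CAS that actually matched that old definition 
would have to look roughly like the following (a sketch only; register 
allocation is arbitrary, and this is not what gcc currently emits):

	dmb ish
1:	ldxr	w0, [x1]
	cmp	w0, w2
	b.ne	2f
	stxr	w3, w4, [x1]
	cbnz	w3, 1b
2:	dmb ish

i.e. a full barrier on each side of a plain ldxr/stxr loop, whereas the 
code quoted above relies on stlxr plus a single trailing dmb.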

I can speculate that some of the confusion is due to the fact that 
apparently each __sync builtin originally mapped to a single Itanium 
instruction, so the issues with ordering of "internal" LL/SC accesses 
simply didn't exist.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c24

> One stupid (or maybe not stupid) option would be to always stlxr back
> the old value on comparison failure, so we can rely on the stlxr to
> fail if anything was reordered wrong. I think this is actually kinda
> how the x86 CAS works, or how it historically worked, so maybe it's
> not a bad choice. It's also analogous to how the non-CAS operations
> work: no branching out of the ll/sc loop, and just defining the value
> stored back as old == t ? s : old:
> 
> static inline int a_cas(volatile int *p, int t, int s)
> {
> 	int old;
> 	a_pre_llsc();
> 	do old = a_ll(p);
> 	while (!a_sc(p, old == t ? s : old));
> 	a_post_llsc();
> 	return old;
> }
> 
I can't comment on whether always doing a store (or even many stores, 
until "stabilization") on a failed CAS is better than an extra dmb. 
Maybe if CAS failure normally happens only on contention, and we 
optimize for the non-contended case, it's indeed so. As for correctness, 
I agree that it should work.
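
For concreteness, the dmb-based alternative I have in mind would keep the 
early exit but add a barrier before the LL. A sketch only, and it addresses 
just the failed-CAS reordering discussed here (the success path would still 
need the trailing barrier discussed earlier in the thread):

static inline int a_cas(volatile int *p, int t, int s)
{
	int old;
	a_barrier();          /* orders the preceding flag load before the LL */
	do {
		old = a_ll(p);
		if (old != t) {
			a_barrier();  /* existing failure-path barrier */
			break;
		}
	} while (!a_sc(p, s));
	return old;
}

So a failed CAS would cost two dmbs on AArch64 but would never write to the 
location.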

> In fact if we use this as the general pattern for a_cas (in
> src/internal/atomic.h) I'm not even sure we'd need a_pre_llsc() at
> all, on any archs -- but that probably depends on the ISA's SC
> semantics...
> 
This pattern works on AArch64 without a_pre_llsc() only because a_sc() 
is a store-release. If it were a plain store-exclusive, preceding memory 
accesses could be reordered with it.
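
For reference, a_ll()/a_sc() on aarch64 map to an ldaxr/stlxr pair, roughly 
(quoting from memory, so treat the details as approximate):

static inline int a_ll(volatile int *p)
{
	int v;
	__asm__ __volatile__ ("ldaxr %w0,%1" : "=r"(v) : "Q"(*p));
	return v;
}

static inline int a_sc(volatile int *p, int v)
{
	int r;
	__asm__ __volatile__ ("stlxr %w0,%w2,%1"
	                      : "=&r"(r), "=Q"(*p) : "r"(v) : "memory");
	return !r;
}

It's the release semantics of the stlxr that keep accesses preceding the 
loop from being reordered past it.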

Alexey
