|   | 
| 
 | 
Message-ID: <e4b4f0fbb0bb7d168a4c9c2f50fdb48a@ispras.ru>
Date: Tue, 04 Oct 2022 21:15:12 +0300
From: Alexey Izbyshev <izbyshev@...ras.ru>
To: musl@...ts.openwall.com
Subject: Re: Illegal killlock skipping when transitioning to
 single-threaded state
On 2022-10-04 18:57, Rich Felker wrote:
> On Tue, Oct 04, 2022 at 06:43:49PM +0300, Alexey Izbyshev wrote:
>> On 2022-10-04 17:19, Rich Felker wrote:
>> >On Tue, Oct 04, 2022 at 10:12:42AM -0400, Rich Felker wrote:
>> >>On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
>> >>> On 2022-10-04 02:05, Rich Felker wrote:
>> >>> >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
>> >>> >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
>> >>> >>> * Szabolcs Nagy <nsz@...t70.net> [2022-10-03 15:26:15 +0200]:
>> >>> >>>
>> >>> >>> > * Alexey Izbyshev <izbyshev@...ras.ru> [2022-10-03 09:16:03 +0300]:
>> >>> >>> > > On 2022-09-19 18:29, Rich Felker wrote:
>> >>> >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> >>> >>> > ...
>> >>> >>> > > > > Reordering the "libc.need_locks = -1" assignment and
>> >>> >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
>> >>> >>> > > > > should fix the issue.
>> >>> >>> > > >
>> >>> >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
>> >>> >>> > > > barrier between them, since all lock and unlock operations are already
>> >>> >>> > > > full barriers.
>> >>> >>> > > >
>> >>> >>> > >
>> >>> >>> > > Before sending the report I tried to infer the intended ordering semantics
>> >>> >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
>> >>> >>> > > see why they would provide a full barrier (my reasoning is below), so I
>> >>> >>> > > concluded that probably acquire/release semantics was intended in general
>> >>> >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
>> >>> >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
>> >>> >>> > > section.
>> >>> >>> > >
>> >>> >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
>> >>> >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
>> >>> >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
>> >>> >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
>> >>> >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
>> >>> >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
>> >>> >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
>> >>> >>> > > any memory access inside the critical section. Therefore, UNLOCK is not full
>> >>> >>> > > barrier. Is this right?
>> >>> >>> >
>> >>> >>> > i dont think this is right.
>> >>> >>>
>> >>> >>>
>> >>> >>> i think i was wrong and you are right.
>> >>> >>>
>> >>> >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
>> >>> >>> starting with 'something == 0' the exiting E and remaining R threads:
>> >>> >>>
>> >>> >>> E:something=1      // protected by killlock
>> >>> >>> E:UNLOCK(killlock)
>> >>> >>> E:need_locks=-1
>> >>> >>>
>> >>> >>> R:LOCK(unrelated)  // reads need_locks == -1
>> >>> >>> R:need_locks=0
>> >>> >>> R:UNLOCK(unrelated)
>> >>> >>> R:LOCK(killlock)   // does not lock
>> >>> >>> R:read something   // can it be 0 ?
>> >>> >>>
>> >>> >>> and here something can be 0 (ie. not protected by killlock) on aarch64
>> >>> >>> because
>> >>> >>>
>> >>> >>> T1
>> >>> >>> 	something=1
>> >>> >>> 	ldaxr ... killlock
>> >>> >>> 	stlxr ... killlock
>> >>> >>> 	need_locks=-1
>> >>> >>>
>> >>> >>> T2
>> >>> >>> 	x=need_locks
>> >>> >>> 	ldaxr ... unrelated
>> >>> >>> 	stlxr ... unrelated
>> >>> >>> 	y=something
>> >>> >>>
>> >>> >>> can end with x==-1 and y==0.
>> >>> >>>
>> >>> >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
>> >>> >>>
>> >>> >>> i need to think how to support such lock usage on aarch64
>> >>> >>> without adding too many dmb.
>> >>> >>
>> >>> >>I don't really understand this, but FWIW gcc emits
>> >>> >>
>> >>> >>    ldxr
>> >>> >>    ...
>> >>> >>    stlxr
>> >>> >>    ...
>> >>> >>    dmb ish
>> >>> >>
>> >>> >>for __sync_val_compare_and_swap. So this is probably the right thing
>> >>> >>we should have. And it seems to match what the kernel folks discussed
>> >>> >>here:
>> >>> >>
>> >>> >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
>> >>> >>
>> >>> >>I wondered if there are similar issues for any others archs which need
>> >>> >>review, but it looks like all the other llsc archs have explicit
>> >>> >>pre/post barriers defined.
>> >>> >
>> >>> >Actually I don't understand what's going on with cmpxchg there. The
>> >>> >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
>> >>> >some follow-up in the thread I don't understand, about the case where
>> >>> >the cas fails, but we already handle that by doing an explicit barrier
>> >>> >in that case.
>> >>> >
>> >>> I think in that follow-up[1] they mean the following case (in musl
>> >>> terms):
>> >>>
>> >>> volatile int x, flag;
>> >>>
>> >>> T1:
>> >>>     x = 1;
>> >>>     a_store(&flag, 1);
>> >>>
>> >>> T2:
>> >>>     while (!flag);
>> >>>     a_cas(&x, 0, 1); // can this fail?
>> >>>
>> >>> They want it to never fail. But if a_cas() is implemented as
>> >>> ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
>> >>> with the load of flag.
>> >>>
>> >>> Note that musl does *not* handle this now, because a_barrier() in
>> >>> the failure path is after a_ll().
>> >>>
>> >>> [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
>> >>
>> >>OK, then indeed this too needs to be fixed -- the a_cas is failing to
>> >>synchronize with the a_store. How do we do that? Based on my
>> >>understanding, my proposed patch doesn't fix it.
>> >>
>> >>Do we need ldarx/stlrx/dmb? Or perhaps relegate the extra
>> >>synchronization to a retry in the case where the comparison fails?
>> >>
>> ldarx will not work for the same reason as ldrx doesn't work (it can
>> still be reordered with the load of flag).
>> 
>> >>If this is actually the case, it's disturbing that GCC does not seem
>> >>to be getting it right either...
>> >
>> This is indeed disturbing, considering that comment[1] claims that
>> __sync RMWs were once defined as:
>> 
>>   __sync_synchronize();
>>   operation...
>>   __sync_synchronize();
>> 
>> which clearly doesn't hold for the current implementation of
>> __sync_val_compare_and_swap() for AArch64.
>> 
>> I can speculate that some of confusion is due to the fact that
>> apparently each __sync builtin originally mapped to a single Itanium
>> instruction, so the issues with ordering of "internal" LL/SC
>> accesses simply didn't exist.
>> 
>> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c24
>> 
>> >One stupid (or maybe not stupid) option would be to always stlrx back
>> >the old value on comparison failure, so we can rely on the stlrx to
>> >fail if anything was reordered wrong. I think this is actually kinda
>> >how the x86 CAS works, or how it historically worked, so maybe it's
>> >not a bad choice. It's also analogous to how the non-CAS operations
>> >work: no branching out of the ll/sc loop, and just defining the result
>> >as old == t ? s : old:
>> >
>> >static inline int a_cas(volatile int *p, int t, int s)
>> >{
>> >	int old;
>> >	a_pre_llsc();
>> >	do old = a_ll(p);
>> >	while (!a_sc(p, old == t ? s : old));
>> >	a_post_llsc();
>> >	return old;
>> >}
>> >
>> I can't comment on whether always doing a store (or even many
>> stores, until "stabilization") on a failed CAS is better than an
>> extra dmb. Maybe if CAS failure normally happens only on contention,
>> and we optimize for the non-contended case, it's indeed so. As for
>> correctness, I agree that it should work.
> 
> OK. I think the alternative is something like:
> 
> static inline int a_cas(volatile int *p, int t, int s)
> {
> 	int old;
> 	a_pre_llsc();
> 	do {
> 		old = a_ll(p);
> 		if (old != t) {
> 			a_barrier();
> 			if (*p != old) continue;
> 			break;
> 		}
> 	} while (!a_sc(p, s));
> 	a_post_llsc();
> 	return old;
> }
> 
This also looks correct to me.
> This is more/messier code, and still has the possibility of unbounded
> looping waiting for consistency under contention. It doesn't perform a
> store to the atomic object during such contention, which might be less
> of a burden on other threads, but on the other hand it performs full
> barriers during this contention, which might be more of a burden.
> 
> Conceptually I think I like the version that always stores for
> symmetry with the other atomic operations that makes it "obviously
> correct if they're correct". But I'm not sure if there are cost
> reasons we should prefer a different choice. At least it "shouldn't be
> worse" than the other atomic ops are if we do it that way.
> 
Symmetry is indeed nice, but I'm not sure it's really appropriate 
because at least in my mind a_cas() is different from other atomic RMW 
ops by making the "W" part conditional. I do agree on the correctness 
part though.
Interestingly, the kernel actually removed the full barrier semantics 
from a failed atomic_cmpxchg(), with the reasoning that "nobody needs 
it": 
https://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355981.html
Alexey
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.