Date: Tue, 4 Oct 2022 10:16:24 +0200
From: Szabolcs Nagy <nsz@...t70.net>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com, Alexey Izbyshev <izbyshev@...ras.ru>
Subject: Re: Illegal killlock skipping when transitioning to
 single-threaded state

* Rich Felker <dalias@...c.org> [2022-10-04 00:59:51 -0400]:
> On Mon, Oct 03, 2022 at 11:00:21PM -0400, Rich Felker wrote:
> > On Mon, Oct 03, 2022 at 10:58:32PM -0400, Rich Felker wrote:
> > > On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> > > > * Szabolcs Nagy <nsz@...t70.net> [2022-10-03 15:26:15 +0200]:
> > > > 
> > > > > * Alexey Izbyshev <izbyshev@...ras.ru> [2022-10-03 09:16:03 +0300]:
> > > > > > On 2022-09-19 18:29, Rich Felker wrote:
> > > > > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> > > > > ...
> > > > > > > > Reordering the "libc.need_locks = -1" assignment and
> > > > > > > > UNLOCK(E->killlock) and providing a store barrier between them
> > > > > > > > should fix the issue.
> > > > > > > 
> > > > > > > I think this all sounds correct. I'm not sure what you mean by a store
> > > > > > > barrier between them, since all lock and unlock operations are already
> > > > > > > full barriers.
> > > > > > > 
> > > > > > 
> > > > > > Before sending the report I tried to infer the intended ordering semantics
> > > > > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > > > > > see why they would provide a full barrier (my reasoning is below), so I
> > > > > > concluded that probably acquire/release semantics was intended in general
> > > > > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > > > > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > > > > > section.
> > > > > > 
> > > > > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > > > > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > > > > > via load-acquire/store-release instructions. Therefore, if we consider a
> > > > > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > > > > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > > > > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > > > > > the processor predicts that stlxr succeeds), and further, due to (a), with
> > > > > > any memory access inside the critical section. Therefore, UNLOCK is not a
> > > > > > full barrier. Is this right?
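
(for context, the pre-patch aarch64 ll/sc ops in arch/aarch64/atomic_arch.h
look roughly like this, quoted from memory, so within LOCK/UNLOCK the only
ordering comes from the acquire load and the release store themselves:)

static inline int a_ll(volatile int *p)
{
        int v;
        /* load-acquire exclusive: later accesses cannot be observed
         * before it, but earlier accesses may still move past it */
        __asm__ __volatile__ ("ldaxr %w0,%1" : "=r"(v) : "Q"(*p));
        return v;
}

static inline int a_sc(volatile int *p, int v)
{
        int r;
        /* store-release exclusive: earlier accesses cannot be observed
         * after it, but later accesses may still move before it */
        __asm__ __volatile__ ("stlxr %w0,%w2,%1" : "=&r"(r), "=Q"(*p) : "r"(v) : "memory");
        return !r;
}
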
> > > > > 
> > > > > i dont think this is right.
> > > > 
> > > > 
> > > > i think i was wrong and you are right.
> > > > 
> > > > so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> > > > starting with 'something == 0' the exiting E and remaining R threads:
> > > > 
> > > > E:something=1      // protected by killlock
> > > > E:UNLOCK(killlock)
> > > > E:need_locks=-1
> > > > 
> > > > R:LOCK(unrelated)  // reads need_locks == -1
> > > > R:need_locks=0
> > > > R:UNLOCK(unrelated)
> > > > R:LOCK(killlock)   // does not lock
> > > > R:read something   // can it be 0 ?
> > > > 
> > > > and here something can be 0 (ie. not protected by killlock) on aarch64
> > > > because
> > > > 
> > > > T1
> > > > 	something=1
> > > > 	ldaxr ... killlock
> > > > 	stlxr ... killlock
> > > > 	need_locks=-1
> > > > 
> > > > T2
> > > > 	x=need_locks
> > > > 	ldaxr ... unrelated
> > > > 	stlxr ... unrelated
> > > > 	y=something
> > > > 
> > > > can end with x==-1 and y==0.
> > > > 
> > > > and to fix it, both a_fetch_add and a_cas need an a_barrier.
> > > > 
> > > > i need to think how to support such lock usage on aarch64
> > > > without adding too many dmb.
> > > 
> > > OK, after reading a lot more, I think I'm starting to get what you're
> > > saying. Am I correct in my understanding that the problem is that the
> > > "R:LOCK(unrelated)" as implemented does not synchronize with the
> > > "E:UNLOCK(killlock)" because they're different objects?


yes, if the same lock is used then that solves this reordering.
btw there is a tool where you can check this (though it's a bit tricky to use).
the online version is http://diy.inria.fr/www/?record=aarch64
with some explanation at
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/how-to-use-the-memory-model-tool
e.g. try

AArch64 samelock
{
0:X0=x; 0:X1=y; 0:X2=lock; 0:X9=l2;
1:X0=x; 1:X1=y; 1:X2=lock; 1:X9=l2;
}
P0                  | P1                ;
MOV W3, #1          | MOV W3,#1         ;
STR W3,[X0]         | LDR W4,[X1]       ;
LDAXR W4,[X2]       | LDAXR W5,[X2]     ;
STLXR W5,W3,[X2]    | STLXR W6,W3,[X2]  ;
STR W3,[X1]         | LDR W7,[X0]       ;
exists(0:X5=0 /\ 1:X6=0 /\ 1:X4=1 /\ 1:X7=0)

if you change X2 to X9 on one side then the reordering is observable.

however note that an independent ldaxr cannot move across an stlxr
(in either direction), so normal loads in independent critical
sections won't be observed mixed. the issue here is that we expect
a normal access after the unlock, outside the critical section,
to be ordered.

> > > If so, I think this would be fully solved by using __tl_sync in the
> > > code path that resets need_locks to 0 after observing -1, by virtue of
> > > providing a common object (the thread list lock) to synchronize on.
> > > This is the "weaker memory model friendly" approach we should probably
> > > strive to achieve some day.
> > > 
> > > However, all existing code in musl is written assuming what I call the
> > > "POSIX memory model" where the only operation is "synchronizes memory"
> > > and that underspecified phrase has to be interpreted as "is a full
> > > barrier" to admit any consistent model. Said differently, the code was
> > > written assuming every a_* synchronizes with every other a_*, without
> > > any regard for whether they act on the same objects. This likely even
> > > matters for how our waiter accounting works (which is probably a good
> > > argument for removing it and switching to waiter flags). So I think,
> > > if the issue as I understand it now exists, we do need to fix it. Then
> > > we can revisit this at some later time as part of a big project.
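
(not what the patch below does, but for the record a rough sketch of that
alternative; this is hypothetical code, and i'm using __tl_lock/__tl_unlock
rather than __tl_sync only because i don't remember __tl_sync's exact
interface off hand:)

/* hypothetical sketch of the "common object" approach: the thread that
 * observes need_locks == -1 synchronizes on the thread list lock, which
 * the exiting thread also uses, before clearing the flag, instead of
 * relying on every a_* op being a full barrier. */
static void sync_need_locks_reset(void)
{
        if (libc.need_locks < 0) {
                __tl_lock();
                __tl_unlock();
                libc.need_locks = 0;
        }
}
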
> > 
> > Forgot, I should include links to the material I've been reading:
> > 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
> > https://gcc.gnu.org/git?p=gcc.git;a=commitdiff;h=f70fb3b635f9618c6d2ee3848ba836914f7951c2
> > https://gcc.gnu.org/git?p=gcc.git;a=commitdiff;h=ab876106eb689947cdd8203f8ecc6e8ac38bf5ba
> > 
> > which is where the GCC folks seem to have encountered and fixed their
> > corresponding issue.
> 
> Proposed patch attached.
> 
> Rich

the patch looks good to me.
thanks.
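
(for the record, with the arch specific a_cas/a_cas_p removed and a_post_llsc
defined, aarch64 falls back to the generic ll/sc based ops in
src/internal/atomic.h; from memory the generic a_cas is roughly the
following, so every a_cas now ends with a dmb ish via a_post_llsc, and iirc
a_fetch_add and the other a_* ops are built on the same helpers so they pick
up the trailing barrier too:)

/* sketch of the generic a_cas, quoted from memory:
 * a_pre_llsc/a_post_llsc default to no-ops; the patch maps a_post_llsc
 * to a_barrier on aarch64, so the dmb runs after the ll/sc loop on both
 * the success and the failure path, restoring the full-barrier property
 * the rest of the code assumes. */
static inline int a_cas(volatile int *p, int t, int s)
{
        int old;
        a_pre_llsc();
        do old = a_ll(p);
        while (old==t && !a_sc(p, s));
        a_post_llsc();
        return old;
}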

> diff --git a/arch/aarch64/atomic_arch.h b/arch/aarch64/atomic_arch.h
> index 40fefc25..c01a3ab3 100644
> --- a/arch/aarch64/atomic_arch.h
> +++ b/arch/aarch64/atomic_arch.h
> @@ -2,7 +2,7 @@
>  static inline int a_ll(volatile int *p)
>  {
>  	int v;
> -	__asm__ __volatile__ ("ldaxr %w0,%1" : "=r"(v) : "Q"(*p));
> +	__asm__ __volatile__ ("ldxr %w0,%1" : "=r"(v) : "Q"(*p));
>  	return v;
>  }
>  
> @@ -20,25 +20,13 @@ static inline void a_barrier()
>  	__asm__ __volatile__ ("dmb ish" : : : "memory");
>  }
>  
> -#define a_cas a_cas
> -static inline int a_cas(volatile int *p, int t, int s)
> -{
> -	int old;
> -	do {
> -		old = a_ll(p);
> -		if (old != t) {
> -			a_barrier();
> -			break;
> -		}
> -	} while (!a_sc(p, s));
> -	return old;
> -}
> +#define a_post_llsc a_barrier
>  
>  #define a_ll_p a_ll_p
>  static inline void *a_ll_p(volatile void *p)
>  {
>  	void *v;
> -	__asm__ __volatile__ ("ldaxr %0, %1" : "=r"(v) : "Q"(*(void *volatile *)p));
> +	__asm__ __volatile__ ("ldxr %0, %1" : "=r"(v) : "Q"(*(void *volatile *)p));
>  	return v;
>  }
>  
> @@ -50,20 +38,6 @@ static inline int a_sc_p(volatile int *p, void *v)
>  	return !r;
>  }
>  
> -#define a_cas_p a_cas_p
> -static inline void *a_cas_p(volatile void *p, void *t, void *s)
> -{
> -	void *old;
> -	do {
> -		old = a_ll_p(p);
> -		if (old != t) {
> -			a_barrier();
> -			break;
> -		}
> -	} while (!a_sc_p(p, s));
> -	return old;
> -}
> -
>  #define a_ctz_64 a_ctz_64
>  static inline int a_ctz_64(uint64_t x)
>  {
