musl - Re: [PATCH 1/8] (V2) a new lock algorithm with lock value and CS counts in the same atomic int

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20171220215827.GW1627@brightrain.aerifal.cx>
Date: Wed, 20 Dec 2017 16:58:27 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH 1/8] (V2) a new lock algorithm with lock value and
 CS counts in the same atomic int

On Fri, Jun 16, 2017 at 09:11:35AM +0200, Jens Gustedt wrote:
> A variant of this new lock algorithm has been presented at SAC'16, see
> https://hal.inria.fr/hal-01304108. A full version of that paper is
> available at https://hal.inria.fr/hal-01236734.

Finally getting back to this, and trying to finish it in the current
(1.1.19) release cycle...

> The main motivation of this is to improve on the safety of the basic lock
> implementation in musl. This is achieved by squeezing lock value and
> "waiter" count into a single int. Thereby an unlock operation does
> exactly one memory transfer (a_fetch_add) and never touches the value
> again, but still detects if a waiter has to be woken up.
> 
> Another effect could be improved performance, but don't take this too
> seriously, these lock/unlock functions are probably never used in
> performance critical parts of libc.

Given this, I'm still mildly skeptical of the spin tuning and addition
of the new __futexwait inline with a single caller.

> Depending on the architecture, the performance of the previous lock
> strategy was not always so great under no or medium congestion. Now, in
> case of no congestion, there are exactly two memory operations, one to
> lock an one to unlock. So if we would start to use this same new strategy
> also for user locks such as mutexes, semaphores or the locks used for
> lockfull atomics, we might see a performance improvement.

Indeed, but it would only be usable for some cases (for mutexes, only
those without owner tracking), and seems like it would add some
complexity. I'm not opposed to looking into it at some point, but
let's get the internal lock stuff done first. When updating, can you
trim down the proposed commit message a bit (or do you mind if I do
so?) to focus on the change being made and the motivation without
going into lots of speculation about related changes that could be
made?

> Also, the previous algorithm could be terrible under high load. Key to
> the better performance is, again, the combination of lock value and
> "waiter" count into one atomic entity. This drastically reduces the
> memory transfers that have to be performed under load. In particular our
> main acquisition loop changes this atomic int exactly once, namely when
> the lock is acquired, and so the mutual disturbance between acquiring
> threads is reduced. Again, this is probably not an issue, because this
> lock/unlock algorithm is *not* used in places that arbitrarily hammer
> thousands of requests on the same poor lock. (A throughput test in thread
> creation shows about 50000 threads created per second on my machine, not
> enough that all of this could make a difference.)
> 
> The main price for the improved safety is a little bit larger code.
> 
> Also under high congestion, the scheduling behavior will be different
> compared to the previous algorithm. In that case, a successful
> put-to-sleep may appear out of order compared to the arrival in the
> critical section. I am not sure that this is very different from the
> previous behavior, nor that scheduling order for these lock primitives is
> anything that an application should ever count on.
> 
> For the patch itself, I found only one other place where knowledge of the
> internals of the lock algorithm is used. Here I patch this usage with the
> appropriate CAS, but this should perhaps be extracted as a __trylock
> function or, maybe even better, macro.

The context for this part changed slightly from commit
c1e27367a9b26b9baac0f37a12349fc36567c8b6, but reversion of that patch
could probably be merged with your patch since the old code is valid
once the lock implementation is changed.

> This is version 2 of the patch that takes some of the remarks of
> Alexander and Rich into account:
> 
>  - all lock values are now formulated with respect to INT_MIN
>  - unlock now uses __wake to wake up waiters
>  - a new inline function __futexwait with the same complexity as __wake
>    is use to set threads to sleep
>  - __unlock covers the case that no lock had been recorded
>  - the fast path of __lock is moved outside the loop in a prominent position
> ---
>  src/internal/pthread_impl.h |  6 ++++++
>  src/thread/__lock.c         | 42 +++++++++++++++++++++++++++++++++++++-----
>  src/thread/pthread_detach.c |  2 +-
>  3 files changed, 44 insertions(+), 6 deletions(-)
> 
> diff --git a/src/internal/pthread_impl.h b/src/internal/pthread_impl.h
> index 757b86ad..0622ad52 100644
> --- a/src/internal/pthread_impl.h
> +++ b/src/internal/pthread_impl.h
> @@ -132,6 +132,12 @@ static inline void __wake(volatile void *addr, int cnt, int priv)
>  	__syscall(SYS_futex, addr, FUTEX_WAKE|priv, cnt) != -ENOSYS ||
>  	__syscall(SYS_futex, addr, FUTEX_WAKE, cnt);
>  }
> +static inline void __futexwait(volatile void *addr, int val, int priv)
> +{
> +	if (priv) priv = 128;
> +	__syscall(SYS_futex, addr, FUTEX_WAIT|priv, val) != -ENOSYS ||
> +	__syscall(SYS_futex, addr, FUTEX_WAIT, val);
> +}
>  
>  void __acquire_ptc(void);
>  void __release_ptc(void);
> diff --git a/src/thread/__lock.c b/src/thread/__lock.c
> index 0874c04a..56092240 100644
> --- a/src/thread/__lock.c
> +++ b/src/thread/__lock.c
> @@ -2,14 +2,46 @@
>  
>  void __lock(volatile int *l)
>  {
> -	if (libc.threads_minus_1)
> -		while (a_swap(l, 1)) __wait(l, l+1, 1, 1);
> +	if (!libc.threads_minus_1) return;
> +        /* fast path: INT_MIN for holding the lock, +1 to count this
> +           thread in the critical section. */

There seems to be some mixing of tabs and spaces here and in a few
other lines. Tabs should be used for indentation levels, spaces (after
any initial indent) to column-align. Generally musl style also has
continuation comment lines begin with a * aligned with the * of the
opening /* too.

> +	int current = a_cas(l, 0, INT_MIN + 1);
> +        if (!current) return;
> +	/* A first spin lock acquisition loop, for the case of
> +	   medium congestion. */
> +	for (unsigned i = 0; i < 10; ++i) {
> +		if (current < 0) current -= INT_MIN + 1;
> +		// assertion: current >= 0
> +		int val = a_cas(l, current, INT_MIN + (current + 1));
> +		if (val == current) return;
> +		current = val;
> +	}
> +	// Spinning failed, so mark ourselves as being inside the CS.
> +	current = a_fetch_add(l, 1) + 1;
> +	/* The main lock acquisition loop for heavy congestion. The only
> +	   change to the value performed inside that loop is a successful
> +	   lock via the CAS that acquires the lock. */
> +	for (;;) {
> +		/* We can only go into wait, if we know that somebody holds the
> +		   lock and will eventually wake us up, again. */
> +		if (current < 0) {
> +			__futexwait(l, current, 1);
> +			current -= INT_MIN + 1;
> +		}
> +		/* assertion: current > 0, because the count
> +		   includes us already. */
> +		int val = a_cas(l, current, INT_MIN + current);
> +		if (val == current) return;
> +		current = val;
> +	}
>  }

As long as you're including comments, which I think makes sense given
that the algorithm is somewhat nontrivial, it might be nice to add a
quick exposition on the states. As I understand them:

- v==0 unlocked, no waiters
- v>=0 unlocked, with v waiters contending
- v<0  locked, with v-INT_MIN-1 waiters

One thing I don't recall understanding is why the lock owner is
"included" in the waiter count, i.e. why v<0 indicates v-INT_MIN-1
waiters rather than v-INT_MIN waiters. Is there some race it's
avoiding? Sorry if this is something obvious I knew at one time; it's
just been a long time since I visited this code due to the
very-drawn-out 1.1.17 release cycle.

>  void __unlock(volatile int *l)
>  {
> -	if (l[0]) {
> -		a_store(l, 0);
> -		if (l[1]) __wake(l, 1, 1);
> +	/* We have to check if l[0] had been touched at all. */

I don't understand this comment.

> +	if (l[0] < 0) {
> +		if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
> +			__wake(l, 1, 1);
> +		}
>  	}
>  }

This would be a candidate for single if with &&, but I'm not sure if
it's more or less legible that way. I don't have a strong preference.

> diff --git a/src/thread/pthread_detach.c b/src/thread/pthread_detach.c
> index ed77f74d..818626ed 100644
> --- a/src/thread/pthread_detach.c
> +++ b/src/thread/pthread_detach.c
> @@ -6,7 +6,7 @@ int __pthread_join(pthread_t, void **);
>  static int __pthread_detach(pthread_t t)
>  {
>  	/* Cannot detach a thread that's already exiting */
> -	if (a_swap(t->exitlock, 1))
> +	if (a_cas(t->exitlock, 0, -INT_MAX))
>  		return __pthread_join(t, 0);
>  	t->detached = 2;
>  	__unlock(t->exitlock);

As noted above this needs slight update to revert the commit that
worked around the use-after-free bug.

Overall I think it looks pretty much good to go, aside from minor
style things. I'm open to your ideas on __futexwait, maybe just
writing it inline here since it's only used in one place; while I
don't like duplicating what __wait already does, __wait is mildly big
and using your __futexwait version instead will probably offset the
full size increase of the new lock (and maybe more) in programs that
have no reason other than __lock to pull in __wait.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.