musl - [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <868874$8age42@mail2-relais-roc.national.inria.fr>
Date: Fri, 16 Jun 2017 09:11:35 +0200
From: Jens Gustedt <Jens.Gustedt@...ia.fr>
To: musl@...ts.openwall.com
Subject: [PATCH] a new lock algorithm with lock value and CS counts in the
 same atomic int

A variant of this new lock algorithm has been presented at SAC'16, see
https://hal.inria.fr/hal-01304108. A full version of that paper is
available at https://hal.inria.fr/hal-01236734.

The main motivation of this is to improve on the safety of the basic lock
implementation in musl. This is achieved by squeezing lock value and
"waiter" count into a single int. Thereby an unlock operation does
exactly one memory transfer (a_fetch_add) and never touches the value
again, but still detects if a waiter has to be woken up.

Another effect could be improved performance, but don't take this too
seriously, these lock/unlock functions are probably never used in
performance critical parts of libc.

Depending on the architecture, the performance of the previous lock
strategy was not always so great under no or medium congestion. Now, in
case of no congestion, there are exactly two memory operations, one to
lock an one to unlock. So if we would start to use this same new strategy
also for user locks such as mutexes, semaphores or the locks used for
lockfull atomics, we might see a performance improvement.

Also, the previous algorithm could be terrible under high load. Key to
the better performance is, again, the combination of lock value and
"waiter" count into one atomic entity. This drastically reduces the
memory transfers that have to be performed under load. In particular our
main acquisition loop changes this atomic int exactly once, namely when
the lock is acquired, and so the mutual disturbance between acquiring
threads is reduced. Again, this is probably not an issue, because this
lock/unlock algorithm is *not* used in places that arbitrarily hammer
thousands of requests on the same poor lock. (A throughput test in thread
creation shows about 50000 threads created per second on my machine, not
enough that all of this could make a difference.)

The main price for the improved safety is a little bit larger code.

Also under high congestion, the scheduling behavior will be different
compared to the previous algorithm. In that case, a successful
put-to-sleep may appear out of order compared to the arrival in the
critical section. I am not sure that this is very different from the
previous behavior, nor that scheduling order for these lock primitives is
anything that an application should ever count on.

For the patch itself, I found only one other place where knowledge of the
internals of the lock algorithm is used. Here I patch this usage with the
appropriate CAS, but this should perhaps be extracted as a __trylock
function or, maybe even better, macro.
---
 src/thread/__lock.c         | 50 ++++++++++++++++++++++++++++++++++++++++-----
 src/thread/pthread_detach.c |  2 +-
 2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/src/thread/__lock.c b/src/thread/__lock.c
index 0874c04a..e970a98c 100644
--- a/src/thread/__lock.c
+++ b/src/thread/__lock.c
@@ -1,15 +1,55 @@
 #include "pthread_impl.h"
 
+
+// INT_MIN for the lock, +1 for the count of this thread in the
+// critical section, has it that the difference between locked and
+// unlocked is ±INT_MAX.
+#if (INT_MIN+1) != -INT_MAX
+#error we need -INT_MAX to be INT_MIN+1
+#endif
+
 void __lock(volatile int *l)
 {
-	if (libc.threads_minus_1)
-		while (a_swap(l, 1)) __wait(l, l+1, 1, 1);
+	/* This test is mostly useless, now. Leaving it to the first CAS is
+	   probably just as efficient. */
+	if (libc.threads_minus_1) {
+		int current = 0;
+		/* A first spin lock acquisition loop, for the case of
+		   no or medium congestion. */
+		for (unsigned i = 0; i < 10; ++i) {
+			/* This is purely speculative. */
+			if (current < 0) current += INT_MAX;
+			// assertion: current >= 0
+			int val = a_cas(l, current, current - INT_MAX);
+			if (val == current) return;
+			current = val;
+		}
+		// Spinning failed, so mark ourselves as being inside the CS.
+		current = a_fetch_add(l, 1) + 1;
+		/* The main lock acquisition loop for heavy congestion. The only
+		   change to the value performed inside that loop is a successful
+		   lock via the CAS that acquires the lock. */
+		for (;;) {
+			/* We can only go into wait, if we know that somebody holds the
+			   lock and will eventually wake us up, again. */
+			if (current < 0) {
+				__syscall(SYS_futex, l, FUTEX_WAIT|FUTEX_PRIVATE, current, 0) != -ENOSYS
+					|| __syscall(SYS_futex, l, FUTEX_WAIT, current, 0);
+				/* This is purely speculative. */
+				current += INT_MAX;
+			}
+			/* assertion: current > 0, because the count
+			   includes us already. */
+			int val = a_cas(l, current, INT_MIN + current);
+			if (val == current) return;
+			current = val;
+		}
+	}
 }
 
 void __unlock(volatile int *l)
 {
-	if (l[0]) {
-		a_store(l, 0);
-		if (l[1]) __wake(l, 1, 1);
+	if (a_fetch_add(l, INT_MAX) != -INT_MAX) {
+		__syscall(SYS_futex, l, FUTEX_WAKE|FUTEX_PRIVATE, 1);
 	}
 }
diff --git a/src/thread/pthread_detach.c b/src/thread/pthread_detach.c
index ed77f74d..818626ed 100644
--- a/src/thread/pthread_detach.c
+++ b/src/thread/pthread_detach.c
@@ -6,7 +6,7 @@ int __pthread_join(pthread_t, void **);
 static int __pthread_detach(pthread_t t)
 {
 	/* Cannot detach a thread that's already exiting */
-	if (a_swap(t->exitlock, 1))
+	if (a_cas(t->exitlock, 0, -INT_MAX))
 		return __pthread_join(t, 0);
 	t->detached = 2;
 	__unlock(t->exitlock);
-- 
2.11.0
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.