musl - Re: Resuming work on new semaphore

Message-ID: <20150423160624.GF17573@brightrain.aerifal.cx>
Date: Thu, 23 Apr 2015 12:06:24 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Resuming work on new semaphore

I'm going to try to summarize some of the issues that have been
discussed on IRC since the message quoted below.

On Sun, Apr 12, 2015 at 01:22:34AM +0300, Alexander Monakov wrote:
> On Mon, 6 Apr 2015, Alexander Monakov wrote:
> > One other thing to consider.  In the absence of concurrent operations on the
> > semaphore, return value of sem_getvalue should be equal to the number of times
> > sem_trywait will indicate success when called repeatedly.  So if the
> > implementation performs post-stealing in trywait, it should return the higher
> > bound as semaphore value.  Likewise for timedwait.
> 
> If we accept the above, it follows that in the new implementation getvalue
> should return not max(0, val[0] + val[1]), but rather max(0, val[0]) + val[1].

Indeed. But then max(0, val[0]) + val[1] can exceed SEM_VALUE_MAX
unless we prevent it, which takes some work, but I think it's
possible.
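
A minimal sketch of how the two formulas differ, assuming a quiescent
semaphore (no concurrent operations). The struct and function names
are hypothetical stand-ins for illustration, not musl's sem_t or its
sem_getvalue, and the sketch does nothing to keep the result within
SEM_VALUE_MAX:

```c
#include <limits.h>

/* Hypothetical model of the state discussed in this thread:
 * val[0] = value (negative means waiters), val[1] = pending wakes,
 * val[2] = private flag. Not musl's actual sem_t. */
struct sem_model { int val[3]; };

/* Formula being replaced: max(0, val[0] + val[1]). A negative val[0]
 * can cancel out pending wakes that trywait could still steal. */
static int getvalue_sum_clamped(const struct sem_model *s)
{
	int sum = s->val[0] + s->val[1];
	return sum > 0 ? sum : 0;
}

/* Proposed formula: max(0, val[0]) + val[1]. Only val[0] is clamped,
 * so every pending wake is counted. No overflow guard here. */
static int getvalue_upper_bound(const struct sem_model *s)
{
	int v = s->val[0] > 0 ? s->val[0] : 0;
	return v + s->val[1];
}
```

With val[0] = -2 and val[1] = 1, the first formula reports 0 while
the second reports 1, matching the single sem_trywait that could
succeed by stealing the pending wake.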

> int sem_post(sem_t *sem)
> {
> 	int val;
> 	do val = sem->__val[0];
> 	while (val != a_cas(sem->__val, val, val+!!(val<SEM_VALUE_MAX)));
> 	if (val < 0) {
> 		int priv = sem->__val[2];
> 		a_inc(sem->__val+1);
> 		__wake(sem->__val+1, 1, priv);
> 	}
> 	if (val < SEM_VALUE_MAX) return 0;
> 	errno = EOVERFLOW;
> 	return -1;
> }

The first observation we made was that this checks val<SEM_VALUE_MAX
twice on the success path, which costs a useless extra branch. It can
be fixed by something like this (my1):

int sem_post(sem_t *sem)
{
	int val = sem->__val[0];
	val -= val==SEM_VALUE_MAX;
	while (a_cas(sem->__val, val, val+1) != val) {
		if ((val = sem->__val[0]) == SEM_VALUE_MAX) {
			errno = EOVERFLOW;
			return -1;
		}
	}
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

or this (my1b):

int sem_post(sem_t *sem)
{
	int old, val = sem->__val[0];
	val -= val==SEM_VALUE_MAX;
	while ((old = a_cas(sem->__val, val, val+1)) != val) {
		if ((val = old) == SEM_VALUE_MAX) {
			errno = EOVERFLOW;
			return -1;
		}
	}
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

The latter saves the result of a_cas to avoid an extra load, but I
don't think it makes any significant difference and it might be seen
as uglier.
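
For comparison, here is a hedged sketch of the same retry pattern in
portable C11 atomics. It is not musl code: musl's a_cas returns the
old value, whereas atomic_compare_exchange_strong writes the observed
value back into its "expected" argument on failure, which gives the
my1b behavior without an explicit "old" variable. The names
model_post and MODEL_VALUE_MAX are made up for this example, and the
waiter wakeup is omitted:

```c
#include <errno.h>
#include <limits.h>
#include <stdatomic.h>

#define MODEL_VALUE_MAX INT_MAX  /* assumption: stands in for SEM_VALUE_MAX */

/* Returns 0 on success, or -1 with errno set to EOVERFLOW when the
 * counter is already at its maximum. Waking waiters is left out. */
static int model_post(atomic_int *v)
{
	int val = atomic_load(v);
	/* At the maximum, expect max-1 so the first CAS is sure to fail
	 * and the overflow check in the loop body runs. */
	val -= val == MODEL_VALUE_MAX;
	while (!atomic_compare_exchange_strong(v, &val, val + 1)) {
		/* On failure, val now holds the value the CAS observed. */
		if (val == MODEL_VALUE_MAX) {
			errno = EOVERFLOW;
			return -1;
		}
	}
	return 0;
}
```

As in my1 and my1b, the success path performs only the CAS itself;
the comparison against the maximum runs only after a failed CAS.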

However, neither of those addresses the overflow issue, which I've
tried to address here:

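/* ceil(SEM_VALUE_MAX/2); avoids evaluating SEM_VALUE_MAX+1, which
 * would overflow int if SEM_VALUE_MAX is INT_MAX */
#define VAL0_MAX (SEM_VALUE_MAX - SEM_VALUE_MAX/2)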
#define VAL1_MAX (SEM_VALUE_MAX/2)

int sem_post(sem_t *sem)
{
	int val = sem->__val[0];
	val -= val==VAL0_MAX;
	while (a_cas(sem->__val, val, val+1) != val) {
		if ((val = sem->__val[0]) == VAL0_MAX) {
			int tmp = sem->__val[1];
			if (tmp >= VAL1_MAX) {
				errno = EOVERFLOW;
				return -1;
			}
			if (a_cas(sem->__val+1, tmp, tmp+1) == tmp) {
				return 0;
			}
			val--;
		}
	}
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

This is code whose idea was discussed on IRC but not yet presented, so
it may have significant bugs. The idea is to limit the main sem value
component and the wake count separately to half the max. Once val[0]
hits VAL0_MAX, further posts will be in the form of wakes for
nonexistent waiters (which are ok but more costly). This allows the
total observed value to reach all the way up to SEM_VALUE_MAX.
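
The arithmetic behind the split can be checked at compile time. This
is only a sketch: the MODEL_ names are hypothetical, it assumes
SEM_VALUE_MAX equals INT_MAX, and it writes the rounded-up half as
MAX - MAX/2 so that MAX+1 is never evaluated (which would overflow
int under that assumption):

```c
#include <assert.h>
#include <limits.h>

#define MODEL_SEM_VALUE_MAX INT_MAX  /* assumption, see lead-in */
#define MODEL_VAL0_MAX (MODEL_SEM_VALUE_MAX - MODEL_SEM_VALUE_MAX/2)
#define MODEL_VAL1_MAX (MODEL_SEM_VALUE_MAX/2)

/* The two halves together cover the whole range exactly. */
static_assert(MODEL_VAL0_MAX + MODEL_VAL1_MAX == MODEL_SEM_VALUE_MAX,
              "split limits must sum to the semaphore maximum");

/* Largest value max(0, val[0]) + val[1] can report when both
 * components sit at their individual limits. */
static int model_total_capacity(void)
{
	return MODEL_VAL0_MAX + MODEL_VAL1_MAX;
}
```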

If this happens, waiters will consume all of val[0] first, and the
wakes will all remain pending until val[0] reaches 0. At that point,
new waiters will decrement val[0] to a negative value (indicating a
waiter), attempt a futex wait, fail because there are wakes pending,
consume one of the wakes, and exit.

(Note: this useless futex wait can be optimized out by reordering the
do-while loop body in sem_timedwait.)

During this state, there is a race window where val[1] can exceed
VAL1_MAX: if a post arrives after a new waiter decrements val[0] but
before that waiter consumes a wake from val[1], the post will
increment val[0] back to 0 and increment val[1] unconditionally.
However, the magnitude of such overshoot is bounded by the number of
tasks, which is necessarily bounded by INT_MAX/4, which is less than
VAL1_MAX, so no integer overflow can happen here (except in the case
of async-killed waiters).
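
A worked version of that bound, as a sketch under the assumptions
that SEM_VALUE_MAX equals INT_MAX and that each task contributes at
most one unit of overshoot. The MODEL_ names and the function are
hypothetical:

```c
#include <limits.h>

#define MODEL_VAL1_MAX (INT_MAX/2)  /* assumes SEM_VALUE_MAX == INT_MAX */
#define MODEL_TASK_MAX (INT_MAX/4)  /* task-count bound stated above */

/* Largest val[1] reachable if every task overshoots once. Computed
 * in long long so the check itself cannot overflow. */
static long long model_val1_worst_case(void)
{
	return (long long)MODEL_VAL1_MAX + MODEL_TASK_MAX;
}
```

The worst case is about three quarters of INT_MAX, so val[1] stays
representable as an int under these assumptions.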

Does this all sound correct?

Rich