Date: Wed, 14 Dec 2022 13:29:54 +0300
From: Alexey Izbyshev <izbyshev@...ras.ru>
To: musl@...ts.openwall.com
Subject: Re: sem_post() can miss waiters

On 2022-12-14 04:48, Rich Felker wrote:
> On Tue, Nov 22, 2022 at 08:41:44AM +0300, Alexey Izbyshev wrote:
>> On 2022-11-22 03:09, Rich Felker wrote:
>> >If I understand correctly, the cost of the first option is generally
>> >an extra "last" broadcast wake that shouldn't be needed. Is that
>> >right?
>> >
>> >For example, if sem starts with count 0 and thread A calls wait, then
>> >thread B calls post twice: both posts make a syscall even though only
>> >the first one should have.
>> >
>> Yes, exactly.
>> 
>> >What if you instead perform the broadcast wake when the observed
>> >waiters count is <=1 rather than ==0? This should have no cost in the
>> >common case, but in the race case, I think it forces any hiding
>> >(just-arrived) extra waiters to wake, fail, and re-publish their
>> >waiting status to the waiters bit.
>> >
>> Indeed, I think this solves the overhead issue quite nicely, thanks.
>> So the sem_post wake-up logic would basically boil down to this:
>> 
>> * WAITERS_BIT is set and waiters > 1: don't reset WAITERS_BIT since
>> it's likely that some waiters will remain (and it's always fine to
>> err on the side of preserving the WAITERS_BIT); wake a single
>> waiter.
>> 
>> * WAITERS_BIT is set and waiters <= 1: reset WAITERS_BIT and wake
>> all waiters. In non-racy cases only a single waiter will be woken.
>> 
>> * WAITERS_BIT is unset: don't wake anybody. Even if there are some
>> waiters, another sem_post (that reset the WAITERS_BIT) is
>> responsible for waking them.
>> 
>> In code:
>> 
>> int val, new, waiters, priv = sem->__val[2];
>> do {
>>     val = sem->__val[0];
>>     waiters = sem->__val[1];
>>     if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
>>         errno = EOVERFLOW;
>>         return -1;
>>     }
>>     new = val + 1;
>>     /* 0x80000000 is the waiters bit; reset it unless more than
>>        one waiter is known to remain */
>>     if (waiters <= 1)
>>         new &= ~0x80000000;
>> } while (a_cas(sem->__val, val, new) != val);
>> /* val < 0 means the waiters bit was set: broadcast if we reset
>>    the bit, otherwise wake a single waiter */
>> if (val < 0) __wake(sem->__val, waiters <= 1 ? -1 : 1, priv);
> 
> Yes, this looks good to me. How is the attached patch?
> 
The patch looks good to me.
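
To spell out the waiter side this relies on (a rough sketch in the
style of the snippet above, not the actual patch; timeout and error
handling are omitted, and "at" stands for the deadline argument of
sem_timedwait): a waiter publishes its waiting status before sleeping,
and re-publishes it on the next iteration after every wake that fails
to acquire the semaphore:

while (sem_trywait(sem)) {
    int priv = sem->__val[2];
    a_inc(sem->__val+1);              /* publish: one more waiter */
    a_cas(sem->__val, 0, 0x80000000); /* set waiters bit if value is 0 */
    __timedwait(sem->__val, 0x80000000, CLOCK_REALTIME, at, priv);
    a_dec(sem->__val+1);              /* balance our increment */
}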

>> >I think there's another really stupid solution too that I would even
>> >consider, partly motivated by the fact that, with long-lived
>> >process-shared semaphores, the waiters count can become stale if
>> >waiters are killed. (Note: semaphores aren't required to be robust
>> >against this, but it's ugly that they're not.) This solution is just
>> >to keep the current code, but drop the waiters count entirely, and use
>> >broadcast wakes for all wakes. Then, any still-live waiters are
>> >*always* responsible for re-asserting their waiting status when all
>> >but one fail to acquire the semaphore after a wake. Of course this is
>> >a thundering herd, which is arguably something we should not want, but
>> >we accept it for correctness in several other places like condvars...
>> >
>> I'm not sure that the kind of "partial robustness" that could be
>> achieved by always waking all waiters is worth punishing normal
>> cases. Also, I don't think waiters count being stale is an issue per
>> se because it can be wrong only in one way (greater than the real
>> count of waiters), assuming you don't mean overflow (but overflow
>> could be handled by simply pinning it at INT_MAX permanently). But
>> many other issues are possible if we allow killing processes at
>> arbitrary points, including sem_post not sending a wake up
>> notification after updating the semaphore value, or sem_wait
>> consuming a notification but not updating the value. Granted, some
>> of these issues may "self-heal" on a future sem_wait/sem_post (in
>> the sense that the semaphore will leave a forbidden state), but they
>> might still interfere with the program shutdown logic (e.g. if the
>> program expects to claim the semaphore N times at the end without
>> being aware that some consumers are dead), and, of course, with any
>> logic outside of semaphore manipulation. So it seems to me that
>> either the program already watches for death of processes it cares
>> about, or it's not even clear that allowing further progress (due to
>> a broadcast) is always more desirable than blocking.
> 
> Indeed, this seems like it's just a bad direction to go for the
> reasons you described. One thing I'd like to note here though, but not
> act on at this time:
> 
> When we were designing the semaphore long ago, there was a vague
> proposal to make the post operation responsible for removing waiter
> counts, rather than the waiter. It was thrown out because it didn't
> seem to work. But if we ever did have a reason to want to do this, I
> think it's possible now, since it's essentially just a "countdown to
> broadcast wake" with no constraint that it be an accurate count, and
> the new waiters bit is what really controls whether wakes happen.
> 
Indeed, now the waiters count is just a hint used for resetting the 
waiters bit.
However, its increments/decrements must still be balanced for it to 
remain useful, and I don't see an easy way to achieve that if sem_post 
is responsible for decrementing it.
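
To make the balance problem concrete, here is one hypothetical
interleaving, assuming a scheme where a post decrements the count once
per wake it issues and a timed-out waiter leaves the count alone:

W1, W2: increment the count (0 -> 2), both sleep
W2:     times out and leaves; the count stays at 2
P1:     post; sees count 2 (> 1), keeps the waiters bit, wakes one
        waiter, decrements (2 -> 1); W1 wakes and acquires
P2:     post; sees count 1, resets the bit, broadcasts, decrements
        (1 -> 0); nobody is waiting, so the futex syscall is wasted

Letting the timed-out waiter decrement as well would risk a double
decrement, unless it could somehow tell whether a concurrent post had
already accounted for it.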

One further note: after this bug report I'd been pointed to a semaphore 
redesign proposal from 2015 (the thread starts at 
https://www.openwall.com/lists/musl/2015/02/27/1 after some off-list 
discussion and ends in April). In that design, the waiters count is stored 
in the semaphore word, and sem_post is indeed responsible for adjusting 
it. However, another counter ("wake count") is still needed to make the 
semaphore work.

Thanks,
Alexey
