musl - Re: sem_post() can miss waiters

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87064ce2eef96ffbced93da7812d5358@ispras.ru>
Date: Tue, 22 Nov 2022 08:41:44 +0300
From: Alexey Izbyshev <izbyshev@...ras.ru>
To: musl@...ts.openwall.com
Subject: Re: sem_post() can miss waiters

On 2022-11-22 03:09, Rich Felker wrote:
> On Mon, Nov 21, 2022 at 08:14:50AM +0300, Alexey Izbyshev wrote:
>> Hi,
>> 
>> While reading musl's POSIX semaphore implementation I realized that
>> its accounting of waiters is incorrect (after commit
>> 88c4e720317845a8e01aee03f142ba82674cd23d) due to combination of the
>> following factors:
>> 
>> * sem_post relies on a potentially stale waiters count when deciding
>> whether to wake a waiter.
>> * Presence of waiters is not reflected in the semaphore value when
>> the semaphore is unlocked.
>> 
>> Here is an example scenario with 4 threads:
>> 
>> * initial state
>>   - val = 1, waiters = 0
>> * T1
>>   - enters sem_post
>>   - reads val = 1, waiters = 0, is preempted
>>   - val = 1, waiters = 0
>> * T2
>>   - sem_wait
>>   - another sem_wait (blocks)
>>   - val = -1, waiters = 1
>> * T3
>>   - sem_wait (blocks)
>>   - val = -1, waiters = 2
>> * T4
>>   - sem_post (wakes T2)
>>   - val = 1, waiters = 2
>> * T1
>>   - continues in sem_post
>>   - a_cas(1, 2) succeeds
>>   - T3 is NOT woken because sem_post thinks there are no waiters
>>   - val = 2, waiters = 1
>> * T2
>>   - wakes and completes sem_wait
>>   - val = 1, waiters = 1
>> 
>> So in the end T3 remains blocked despite that the semaphore is 
>> unlocked.
>> 
>> Here is two ideas how this can be fixed without reintroducing the
>> problem with accessing a potentially destroyed semaphore after it's
>> unlocked.
>> 
>> * Idea 1
>>   - use a proper WAITERS_BIT in the semaphore state instead of just
>> a single value and preserve it in sem_post by default
>>   - in sem_post, when we notice that waiters count is zero but
>> WAITERS_BIT is set, reset WAITERS_BIT, but wake all potential
>> waiters afterwards.
>> 
>> This ensures that any new waiters that could arrive after we read
>> waiters count are taken into account (and they will proceed to lock
>> the semaphore and potentially set WAITERS_BIT again after they are
>> woken).
>> 
>> The sem_post body would look like this (the full patch is attached):
>> 
>> int val, new, priv = sem->__val[2], sync = 0;
>> do {
>>     val = sem->__val[0];
>>     waiters = sem->__val[1];
>>     if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
>>         errno = EOVERFLOW;
>>         return -1;
>>     }
>>     new = val + 1;
>>     if (!waiters)
>>         new &= ~0x80000000;
>> } while (a_cas(sem->__val, val, new) != val);
>> if (val<0) __wake(sem->__val, waiters ? 1 : -1, priv);
>> 
>> The main downside of this approach is that a FUTEX_WAKE system call
>> will be performed each time the semaphore transitions from "maybe
>> have waiters" to "no waiters" state, and in practice it will be
>> useless in most cases.
>> 
>> * Idea 2
>>   - use a proper WAITERS_BIT as above
>>   - also add a new DETECT_NEW_WAITERS_BIT
>>   - in sem_post, when we notice that waiters count is zero,
>> WAITERS_BIT is set and DETECT_NEW_WAITERS_BIT is not set, atomically
>> transition into "waiters detection" state (DETECT_NEW_WAITERS_BIT is
>> set, WAITERS_BIT is unset) *without unlocking the semaphore*. Then
>> recheck waiters count again, and if it's still zero, try to
>> atomically transition to "no waiters" state (both bits are unset)
>> while also unlocking the semaphore. If this succeeds, we know that
>> no new waiters arrived before we unlocked the semaphore and hence no
>> wake up is needed. Regardless of whether we succeeded, we always
>> leave "waiters detection" state before exiting the CAS loop.
>>   - in sem_post, wake a single waiter if at least one of WAITERS_BIT
>> or DETECT_NEW_WAITERS_BIT is set. This ensures no waiters are missed
>> in sem_post calls racing with the sem_post currently responsible for
>> waiters detection.
>> 
>> The sem_post body would look like this (the full patch is attached):
>> 
>> int val, new, priv = sem->__val[2], sync = 0;
>> for (;;) {
>>     int cnt;
>>     val = sem->__val[0];
>>     cnt = val & SEM_VALUE_MAX;
>>     if (cnt < SEM_VALUE_MAX) {
>>         if (!sync && val < 0 && !(val & 0x40000000) && !sem->__val[1]) 
>> {
>>             new = cnt | 0x40000000;
>>             if (a_cas(sem->__val, val, new) != val)
>>                 continue;
>>             val = new;
>>             sync = 1;
>>         }
>>         new = val + 1;
>>     } else {
>>         new = val;
>>     }
>>     if (sync) {
>>         if (sem->__val[1])
>>             new |= 0x80000000;
>>         new &= ~0x40000000;
>>     }
>>     if (a_cas(sem->__val, val, new) == val)
>>         break;
>> }
>> if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
>>     errno = EOVERFLOW;
>>     return -1;
>> }
>> if (new < 0 || (new & 0x40000000)) __wake(sem->__val, 1, priv);
>> 
>> Compared to the first approach, there is just an extra a_cas instead
>> of a system call, though we lose one bit in SEM_VALUE_MAX.
>> 
>> Hope this helps,
> 
> Thanks for the detailed analysis and work on fixes. I'm not sure how
> this was overlooked at the time of design.
> 
> If I understand correctly, the cost of the first option is generally
> an extra "last" broadcast wake that shouldn't be needed. Is that
> right?
> 
> For example, if sem starts with count 0 and thread A calls wait, then
> thread B calls post twice, both posts make a syscall even though only
> the first one should have.
> 
Yes, exactly.

> What if you instead perform the broadcast wake when the observed
> waiters count is <=1 rather than ==0? This should have no cost in the
> common case, but in the race case, I think it forces any hiding
> (just-arrived) extra waiters to wake, fail, and re-publish their
> waiting status to the waiters bit.
> 
Indeed, I think this solves the overhead issue quite nicely, thanks. So 
sem_post wake up logic would basically boil down to this:

* WAITERS_BIT is set and waiters > 1: don't reset WAITERS_BIT since it's 
likely that some waiters will remain (and it's always fine to err on the 
side of preserving the WAITERS_BIT); wake a single waiter.

* WAITERS_BIT is set and waiters <= 1: reset WAITERS_BIT and wake all 
waiters. In non-racy cases only a single waiter will be woken.

* WAITERS_BIT is unset: don't wake anybody. Even if there are some 
waiters, another sem_post (that reset the WAITERS_BIT) is responsible 
for waking them.

In code:

int val, new, waiters, priv = sem->__val[2];
do {
     val = sem->__val[0];
     waiters = sem->__val[1];
     if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
         errno = EOVERFLOW;
         return -1;
     }
     new = val + 1;
     if (waiters <= 1)
         new &= ~0x80000000;
} while (a_cas(sem->__val, val, new) != val);
if (val<0) __wake(sem->__val, waiters <= 1 ? -1 : 1, priv);

> Regarding your second solution, I think it's more elegant and
> efficient and would be preferable if we were doing this from scratch,
> but changing SEM_VALUE_MAX is arguably an ABI change we should not
> make.
> 
With your suggested improvement, the first solution is probably even 
more efficient in common cases, assuming there are no hidden costs in 
the kernel for waking a single waiter with "wake all" operation instead 
of "wake single".

> I think there's another really stupid solution too that I would even
> consider, partly motivated by the fact that, with long-lived
> process-shared semaphores, the waiters count can become stale if
> waiters are killed. (Note: semaphores aren't required to be robust
> against this, but it's ugly that they're not.) This solution is just
> to keep the current code, but drop the waiters count entirely, and use
> broadcast wakes for all wakes. Then, any still-live waiters are
> *always* responsible for re-asserting their waiting status when all
> but one fail to acquire the semaphore after a wake. Of course this is
> a thundering herd, which is arguably something we should not want, but
> we accept it for correctness in several other places like condvars...
> 
I'm not sure that the kind of "partial robustness" that could be 
achieved by always waking all waiters is worth punishing normal cases. 
Also, I don't think waiters count being stale is an issue per se because 
it can be wrong only in one way (greater than the real count of 
waiters), assuming you don't mean overflow (but overflow could be 
handled by simply pinning it at INT_MAX permanently). But many other 
issues are possible if we allow killing processes at arbitrary points, 
including sem_post not sending a wake up notification after updating the 
semaphore value, or sem_wait consuming a notification but not updating 
the value. Granted, some of these issues may "self-heal" on a future 
sem_wait/sem_post (in the sense that the semaphore will leave a 
forbidden state), but they might still interfere with the program 
shutdown logic (e.g. if the program expects to claim the semaphore N 
times at the end without being aware that some consumers are dead), and, 
of course, with any logic outside of semaphore manipulation. So it seems 
to me that either the program already watches for death of processes it 
cares about, or it's not even clear that allowing further progress (due 
to a broadcast) is always more desirable than blocking.

Alexey
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.