Message-ID: <20250904194219.GQ1827@brightrain.aerifal.cx>
Date: Thu, 4 Sep 2025 15:42:20 -0400
From: Rich Felker <dalias@...c.org>
To: Markus Wichmann <nullplan@....net>
Cc: musl@...ts.openwall.com
Subject: Re: ABA problem in aio_suspend()

On Wed, Sep 03, 2025 at 08:27:33PM +0200, Markus Wichmann wrote:
> Hi all,
>
> I think the current aio_suspend() implementation is susceptible to an
> ABA problem in __aio_fut, causing missed completions. Currently,
> __aio_fut is set to the ID of the first thread calling aio_suspend()
> and finding more than one aiocb to wait for. Of course, that means
> the *second* thread to do so doesn't change __aio_fut at all, and has
> no confidence __aio_fut hasn't been changed in between checking it
> and waiting for it.
>
> I'm thinking of a scenario like this: Two threads are calling
> aio_suspend() on disjoint sets of aiocbs. Both have more than one
> operation to wait for. Thread 1 is successfully waiting in
> __timedwait_cp(), so it has already set __aio_fut to 1.
>
> Now thread 2 also joins the fray. It sees __aio_fut set to 1, so it
> does not change it. When thread 2 is just about to call
> __timedwait_cp(), it gets suspended. Then one of thread 2's
> operations finishes. The AIO worker sees the nonzero __aio_fut, sets
> it to zero, and causes a broadcast wake on it. However, this only
> wakes thread 1, which quickly sees that none of its operations
> finished, and so thread 1 sets __aio_fut back to 1 and continues to
> wait.
>
> Now thread 2 resumes. __timedwait_cp() sees __aio_fut set to 1, as
> was expected, and so thread 2 goes to wait until timeout, even though
> one of its operations has already finished.
>
> This might be remedied by turning __aio_fut into a counter with a
> waiters bit. All workers completing an operation increment __aio_fut,
> clear the waiters bit, and cause a broadcast wake if the waiters bit
> was set. aio_suspend() reads the current counter *before* checking
> the operations, then sets the waiters bit (immediately rechecking the
> operations if that failed). This is in theory still susceptible to an
> ABA problem, but now a thread would have to sleep through 2 billion
> updates of __aio_fut to be affected. I think that is an acceptable
> risk.

On a quick read, your report sounds accurate. I'm not sure how soon
we'll have a fix ready, though; I don't want to make this a
release-blocker.

We generally don't do generation counters like this that are clearly
logically incorrect and just rely on "it's going to wake before
something that could plausibly happen happens".

I think the "obvious" way to make it correct is not to allow keeping
an existing waiter's tid in the futex, but always putting one's own
tid there. However, this can create high contention on the futex,
perhaps unboundedly so with a lot of threads/cores. That's probably
why I tried to just use an existing waiter's tid.

The obvious way you'd *want* to make it correct is with a condvar,
but this can't be done because aio_suspend is required to be AS-safe,
and thus cannot use locks, at least not without a lot of care. And it
can't do anything like having the waiter join a userspace wait queue,
since the AS-safety of aio_suspend means it's legal to longjmp out of
it.

Ultimately this is another place where we suffer because Linux's
futex interface does not have a good condvar-like primitive, but
makes us do the whole thing inefficiently in userspace.
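
To make the tid-based approach above concrete, a minimal sketch of
what the waiter loop could look like. This is only a sketch of the
idea, not the actual implementation: aiocbs_done() is a hypothetical
stand-in for the scan over the submitted aiocbs, and cbs, cnt, clk,
and at stand in for the arguments; a_store(), __wake(),
__pthread_self(), and __timedwait_cp() are existing musl internals.

	/* Waiter side: always publish our own tid, so a completion
	 * cannot leave us sleeping on a value we never installed. */
	int tid = __pthread_self()->tid;
	for (;;) {
		if (aiocbs_done(cbs, cnt)) return 0;
		a_store(&__aio_fut, tid);            /* claim the futex */
		if (aiocbs_done(cbs, cnt)) return 0; /* recheck after publishing */
		/* If another waiter overwrote the futex with its own tid,
		 * the value no longer matches and this wait returns at
		 * once; with many waiters, that retry loop is the
		 * contention problem described above. */
		int r = __timedwait_cp(&__aio_fut, tid, clk, at, 1);
		if (r && r != EINTR) return r;
	}

	/* Completion side (AIO worker): */
	if (__aio_fut) {
		a_store(&__aio_fut, 0);
		__wake(&__aio_fut, -1, 1);           /* broadcast wake */
	}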
I suspect there may be no "good" solution short of wrapping signal
handling in libc (a topic that's been discussed before because it
solves a number of related problems, but that's rather heavy-handed).

Rich
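
For reference, the counter-with-waiters-bit scheme from Markus's
message could look roughly like the sketch below. It is a sketch
under stated assumptions, not a proposed patch: WAITERS_BIT and
notify_suspenders() are hypothetical names, aiocbs_done() again
stands in for the scan over the submitted aiocbs, and a_cas(),
__wake(), and __timedwait_cp() are existing musl internals.

	#define WAITERS_BIT 0x80000000

	/* Completion side: bump the counter, clear the waiters bit, and
	 * broadcast only if someone was actually waiting. */
	static void notify_suspenders(void)
	{
		int old, new;
		do {
			old = __aio_fut;
			new = (old + 1) & ~WAITERS_BIT;
		} while (a_cas(&__aio_fut, old, new) != old);
		if (old & WAITERS_BIT) __wake(&__aio_fut, -1, 1);
	}

	/* Waiter side: snapshot the counter *before* scanning, then set
	 * the waiters bit; if the counter moved in the meantime, rescan
	 * instead of sleeping. Timeout handling is elided. */
	int seq = __aio_fut & ~WAITERS_BIT;
	for (;;) {
		if (aiocbs_done(cbs, cnt)) return 0;
		int cur = a_cas(&__aio_fut, seq, seq | WAITERS_BIT);
		if (cur != seq && cur != (seq | WAITERS_BIT)) {
			seq = cur & ~WAITERS_BIT;  /* counter moved: rescan */
			continue;
		}
		__timedwait_cp(&__aio_fut, seq | WAITERS_BIT, clk, at, 1);
		seq = __aio_fut & ~WAITERS_BIT;
	}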