musl - Re: My current understanding of cond var access restrictions

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140814182314.GI12888@brightrain.aerifal.cx>
Date: Thu, 14 Aug 2014 14:23:14 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: My current understanding of cond var access restrictions

On Thu, Aug 14, 2014 at 08:12:46PM +0200, Jens Gustedt wrote:
> Am Donnerstag, den 14.08.2014, 12:58 -0400 schrieb Rich Felker:
> > On Thu, Aug 14, 2014 at 06:27:21PM +0200, Jens Gustedt wrote:
> > > Am Donnerstag, den 14.08.2014, 10:41 -0400 schrieb Rich Felker:
> > > > Thus I'm skeptical of trying an approach like this when it would be
> > > > easier, and likely less costly on the common usage cases, just to
> > > > remove requeue and always use broadcast wakes. I modified your test
> > > > case for the bug to use a process-shared cv (using broadcast wake),
> > > > and as expected, the test runs with no failure.
> > > 
> > > You shouldn't draw much conclusion from the fact that it works in that
> > > case. This still heavily interacts with the waiters count and thus a
> > > signaling thread that comes after such a broadcast might wake up a
> > > thread that it shouldn't.
> > > 
> > > (But I didn't do a full analysis of that situation.)
> > 
> > In the process-shared case, broadcast just increments the sequence
> > number and wakes all futex waiters. It's very simple.
> > 
> > Formally, there's no such thing as waking up a thread you shouldn't,
> > since spurious wakes are always allowed.
> 
> I meant an internal wake, not one that returns to user space, and that
> might wake up a thread that came into waiting after the corresponding
> signaling thread entered his call. But sequence numbers here probably
> are sufficient to ensure that at that point this is already a thread
> that has the right to wakeup.
> 
> I am just getting sort of paranoid on that stuff :)

Indeed, wrongly waking a waiter that arrives _after_ the signal is a
potential serious bug; glibc actually has this bug and it's remained
unfixed for years as far as I know.

However, without holding the mutex while signaling, or perhaps other
equivalent synchronization, threre is no "happens-before" relationship
between new waiters and signal, so it can't matter. And if the mutex
is held, or other synchronization happens, a new waiter cannot arrive
during the signal operation.

I'm pretty confident musl is fully correct here, since I've both
analyzed the issue before and run the glibc test case that reproduces
the error against musl, with no failure on musl.

> > The current implementation
> > has a lot of potential for spurious wakes but they don't happen except
> > in certain situations:
> > 
> > - If a futex wait gets interrupted by a signal, the wait will always
> >   terminate after the signal handler returns if any intervening
> >   signals or broadcasts happened (except in the case of a full
> >   wraparound of the sequence number, i.e. exactly 2<<32 cv signals
> >   while stuck in a signal handler, which I don't know how to fix, but
> >   it would be easy to write a test for this) even if the signal was
> >   already received by another waiter.
> > 
> > - If the sequence number gets incremented by a signal before the
> >   initial futex wait, the waiter will return immediately; this can
> >   happen to multiple waiters even for just one signal.
> > 
> > Really sequence numbers are the wrong tool here, but they were
> > introduced because the previous approach (having each waiter write its
> > own tid, and futex wait comparing that tid) lead to pathologically bad
> > performance under heavy waiter arrival where waiters were constantly
> > returning because another waiter was almost always able to write its
> > tid before the first one could block on the futex. I'd like to have a
> > better solution, but I can't think of any.
> 
> I don't think they are too bad, actually. They help to distinguish two
> phases for a waiting thread. In the first, he has released the mutex
> and no signal or broadcast has been issued. A thread should never
> attempt to relock the mutex and/or return to user space during that
> phase.
> 
> And then the second phase after such a signal or broadcast, where any
> wakeup could be legitimate and in the worst case just be spurious.

Yes. Really the only reason I dislike sequence numbers is the
theoretical possibility of wrapping after 2<<32 signals. This would
require extreme scheduling delays to realize without a signal handler
intentionally preventing the waiter from making forward progress, so
it's unlikely to impact anything, but it still seems wrong.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.