Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Sat, 12 Mar 2022 10:01:32 -0500
From: Rich Felker <dalias@...c.org>
To: Gavin Howard <gavin.d.howard@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: Possible PEBKAC or Bug in musl Semaphores and/or Mutexes

On Fri, Mar 11, 2022 at 10:45:52PM -0700, Gavin Howard wrote:
> > I don't think there's enough detail to debug your issue based on the
> > high-level description you gave. The bug is almost surely in something
> > we can't see. As long as all modification to and reading from the flag
> > is done with the mutex held, you don't need to worry about data races
> > on it.
> 
> While I can't show a lot of the code due to licensing issues, I will
> show what I can.
> 
> Main Thread #1 (after joining all other threads:
> 
> ```
> s = y_strucon_mutex_lock(&procs_lock);
> if (y_err(s != y_STATUS_SUCCESS)) return;
> done = 1;
> y_strucon_sem_post(&procs_sem, 1);
> y_strucon_mutex_unlock(&procs_lock);
> 
> r = pthread_join(childthread->sysid, &data);
> if (y_err(r != 0))
> {
>     y_panica("Deadlock in thread join");
> }
> ```
> 
> When you see a `y_` prefix, that is my own code.
> 
> `y_strucon_mutex_*` functions are wrappers around the `pthread_mutex_*`
> functions, and same with `y_strucon_sem_*` for semaphores. The variable
> `s` is the standard return type of most functions; it's a status code.
> 
> Main Thread #2:
> 
> ```
> do
> {
>     int status;
> 
>     s = y_strucon_sem_wait(&procs_sem);
>     if (y_err(s != y_STATUS_SUCCESS)) return s;
> 
>     s = y_strucon_mutex_lock(&procs_lock);
>     if (y_err(s != y_STATUS_SUCCESS)) return s;
> 
>     if (y_unlikely(done != 0))
>     {
>         y_strucon_mutex_unlock(&procs_lock);
>         pthread_exit((void*) (uintptr_t) s);
>     }
> 
>     y_strucon_mutex_unlock(&procs_lock);
> 
>     do
>     {
>         r = waitpid(-1, &status, 0);
>     }
>     while (r < 0 && errno == EINTR);
> 
>     y_assert(r > 0, "Child process does not exist");
> 
>     y_procchildinfo info;
> 
>     info.pid = r;
>     info.code = status;
> 
>     s = y_strucon_mutex_lock(&procs_lock);
>     if (y_err(s != y_STATUS_SUCCESS)) return s;
> 
>     y_fd* fdptr;
> 
>     // This function returns a pointer to the fd in fdptr, if the
>     // process has been registered before.
>     if (y_map_exists_v(procs, &info.pid, &fdptr))
>     {
>         yc_assert(fdptr != NULL, YC_ASSERT_NULL);
> 
>         s = y_io_bareio_write(*fdptr, (y_uchar*) &info,
>                               sizeof(y_procchildinfo));
>         if (y_err(s != y_STATUS_SUCCESS))
>         {
>             y_panicva("Could not generate child status notification "
>                       "using fd %d", *fdptr);
>         }
> 
>         s = y_map_remove(procs, &info.pid);
>         if (y_err(s != y_STATUS_SUCCESS)) return s;
>     }
>     else
>     {
>         s = y_vec_push(&unregprocs, &info);
>         if (y_err(s != y_STATUS_SUCCESS)) return s;
>     }
> 
>     y_strucon_mutex_unlock(&procs_lock);
> }
> while (s == y_STATUS_SUCCESS);
> ```
> 
> What a worker thread does to start and register a process:
> 
> ```
> pid_t chpid = fork();
> 
> if (chpid < 0) return y_STATUS_CHILD_ERR;
> 
> if (chpid == 0)
> {
>     // Setup and exec() child. Abort if exec() fails.
>     abort();
> }
> 
> s = y_strucon_mutex_lock(&procs_lock);
> if (y_err(s != y_STATUS_SUCCESS)) return s;
> 
> size_t i;
> size_t len = y_vec_len(&unregprocs);
> 
> for (i = 0; i < len; ++i)
> {
>     y_procchildinfo* ptr = (y_procchildinfo*) y_vec_get(&unregprocs, i);
> 
>     if (ptr->pid == chpid)
>     {
>         s = y_io_bareio_write(fd, (y_uchar*) ptr, sizeof(y_procchildinfo));
>         if (y_err(s != y_STATUS_SUCCESS))
>         {
>             y_panica("Could not generate child status notification");
>         }
> 
>         s = y_vec_popAt(&unregprocs, i);
> 
>         y_strucon_sem_post(&procs_sem, 1);
> 
>         y_strucon_mutex_unlock(&procs_lock);
> 
>         return s;
>     }
> }
> 
> s = y_map_insert(procs, &chpid, &fd);
> 
> y_strucon_sem_post(&procs_sem, 1);
> 
> y_strucon_mutex_unlock(&procs_lock);
> ```
> 
> As you can see, except for the presence of a vector (resizable array)
> for the possibility of a waitpid() returning a child process before it
> is "registered," the high-level description I gave wasn't very
> high-level but matches pretty well.
> 
> Notice that all accesses of the map and vector are protected by the
> mutex, so there's no data race on them. (I have another problem with
> musl's rwlocks on another map, but I'll send another message to the
> mailing list about that.)
> 
> > You should probably look at the strace of your program to see what's
> > going on with the ECHILD. It's not clear whether you're calling
> > waidpid with an argument of -1 or a specific pid, but in the latter
> > case you could look for whether the child was already reaped and
> > whether the pid was ever valid (from the strace log). Also, be aware
> > that ignoring SIGCHLD may break waiting for it, in case you're doing
> > that.
> 
> As shown above, I'm always calling it with waitpid(-1, ...), mostly
> because I don't know what order child processes will finish in. And this
> thread is the only one calling waitpid(), so it couldn't have been
> reaped by another thread by accident.
> 
> SIGCHLD is not blocked. In fact, a signal handler is explicitly
> registered for it is Main Thread #2, which, if it gets a SIGCHLD, does
> nothing. I do this because I want to be sure that the problem you
> mention does not happen, but also because after attempting to make
> SIGCHLD work for this same purpose, I found out that Linux will drop
> SIGCHLD's without a second thought. I know this because the code using
> them would deadlock nearly every time.
> 
> > As an aside, overall, the whole setup looks like you were looking for
> > condition variables but skipped over them for some reason.
> 
> I skipped over condition variables for a *very* important reason: they
> would cause deadlock.
> 
> When a condition variable is signalled, it will do nothing if there are
> no threads waiting on it. At least, that's what POSIX condition
> variables are guaranteed to do. ("The pthread_cond_broadcast() and
> pthread_cond_signal() functions shall have no effect if there are no
> threads currently blocked on cond.")

This is why condition variables necessarily have an associated
predicate (in your case, involving your flag and possibly other
state). You can *never* just do pthread_cond_wait. The only correct
usage is:

	while (!predicate(...)) pthread_cond_wait(mtx,cnd);

Correct use of condition variables ensures that you see any relevant
changes to the state the predicate is evaluated on, without requiring
you to explicitly work out a way to do the same thing with semaphores
or other lower-level primitives. It does not matter whatsoever whether
there are already waiters when you call signal. If not, they'll
necessarily see the state change *before* they wait. The state change
cannot happen between evaluation of the predicate and the wait because
the calling thread holds the mutex.

> What this means is that if Main Thread #2 is blocked on waitpid(), then
> if another thread creates a child and signals the condition variable,
> then after Main Thread #2 returns from waitpid(), it will block on the
                                                    ^^^^^^^^^^^^^^^^^^^^
> condition variable. If another thread creates another child, sure, it
  ^^^^^^^^^^^^^^^^^^

No it won't, because it evaluates the predicate that tells it there's
something to do before calling wait. If you're not doing that, you're
using cond vars in a fundamentally wrong way.

> [...]
> There is one thing I can think of that might cause the problem: the
> processor is reordering the syscall to be before the if (done != 0)
> check.

No, this is absolutely not what's happening. Neither the processor nor
the compiler (in general the latter is at least as much of a concern
if not more when you have missing/incorrect synchronization) can
reorder things like that.

> This doesn't make sense to me because 1) the arch I'm on is the mostly
> TSO x86_64, and 2) while a mutex unlock may allow some memory
> operations to be reordered from after to before the lock, I don't know
> that a syscall would be reordered. In fact, I don't know how syscalls
> are treated in such cases; are they treated as memory operations at all?
> I have no clue.

On some level they're like any other external function call to a
function whose definition the layer optimizing the caller can't see:
they must be assumed to be able to store to or load from any memory
whose address has been exposed, and thus cannot be reordered
whatsoever.

Your problem is not mysterious reordering. Your problem is in your
program's logic somewhere. Please use the debugging tools at your
disposal, especially strace which will probably reveal the problem to
you right away (by letting you see the exact sequence of events that
happened and trace through it to figure out where it mismatches your
expectation).

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.