musl - Re: Possible PEBKAC or Bug in musl Semaphores and/or Mutexes

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAF=dzRNjB2dRZA1u_a=guZ6MHNUrmcArOfi6F37o2HxEfUcD1A@mail.gmail.com>
Date: Fri, 11 Mar 2022 22:45:52 -0700
From: Gavin Howard <gavin.d.howard@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com
Subject: Re: Possible PEBKAC or Bug in musl Semaphores and/or Mutexes

> I don't think there's enough detail to debug your issue based on the
> high-level description you gave. The bug is almost surely in something
> we can't see. As long as all modification to and reading from the flag
> is done with the mutex held, you don't need to worry about data races
> on it.

While I can't show a lot of the code due to licensing issues, I will
show what I can.

Main Thread #1 (after joining all other threads:

```
s = y_strucon_mutex_lock(&procs_lock);
if (y_err(s != y_STATUS_SUCCESS)) return;
done = 1;
y_strucon_sem_post(&procs_sem, 1);
y_strucon_mutex_unlock(&procs_lock);

r = pthread_join(childthread->sysid, &data);
if (y_err(r != 0))
{
    y_panica("Deadlock in thread join");
}
```

When you see a `y_` prefix, that is my own code.

`y_strucon_mutex_*` functions are wrappers around the `pthread_mutex_*`
functions, and same with `y_strucon_sem_*` for semaphores. The variable
`s` is the standard return type of most functions; it's a status code.

Main Thread #2:

```
do
{
    int status;

    s = y_strucon_sem_wait(&procs_sem);
    if (y_err(s != y_STATUS_SUCCESS)) return s;

    s = y_strucon_mutex_lock(&procs_lock);
    if (y_err(s != y_STATUS_SUCCESS)) return s;

    if (y_unlikely(done != 0))
    {
        y_strucon_mutex_unlock(&procs_lock);
        pthread_exit((void*) (uintptr_t) s);
    }

    y_strucon_mutex_unlock(&procs_lock);

    do
    {
        r = waitpid(-1, &status, 0);
    }
    while (r < 0 && errno == EINTR);

    y_assert(r > 0, "Child process does not exist");

    y_procchildinfo info;

    info.pid = r;
    info.code = status;

    s = y_strucon_mutex_lock(&procs_lock);
    if (y_err(s != y_STATUS_SUCCESS)) return s;

    y_fd* fdptr;

    // This function returns a pointer to the fd in fdptr, if the
    // process has been registered before.
    if (y_map_exists_v(procs, &info.pid, &fdptr))
    {
        yc_assert(fdptr != NULL, YC_ASSERT_NULL);

        s = y_io_bareio_write(*fdptr, (y_uchar*) &info,
                              sizeof(y_procchildinfo));
        if (y_err(s != y_STATUS_SUCCESS))
        {
            y_panicva("Could not generate child status notification "
                      "using fd %d", *fdptr);
        }

        s = y_map_remove(procs, &info.pid);
        if (y_err(s != y_STATUS_SUCCESS)) return s;
    }
    else
    {
        s = y_vec_push(&unregprocs, &info);
        if (y_err(s != y_STATUS_SUCCESS)) return s;
    }

    y_strucon_mutex_unlock(&procs_lock);
}
while (s == y_STATUS_SUCCESS);
```

What a worker thread does to start and register a process:

```
pid_t chpid = fork();

if (chpid < 0) return y_STATUS_CHILD_ERR;

if (chpid == 0)
{
    // Setup and exec() child. Abort if exec() fails.
    abort();
}

s = y_strucon_mutex_lock(&procs_lock);
if (y_err(s != y_STATUS_SUCCESS)) return s;

size_t i;
size_t len = y_vec_len(&unregprocs);

for (i = 0; i < len; ++i)
{
    y_procchildinfo* ptr = (y_procchildinfo*) y_vec_get(&unregprocs, i);

    if (ptr->pid == chpid)
    {
        s = y_io_bareio_write(fd, (y_uchar*) ptr, sizeof(y_procchildinfo));
        if (y_err(s != y_STATUS_SUCCESS))
        {
            y_panica("Could not generate child status notification");
        }

        s = y_vec_popAt(&unregprocs, i);

        y_strucon_sem_post(&procs_sem, 1);

        y_strucon_mutex_unlock(&procs_lock);

        return s;
    }
}

s = y_map_insert(procs, &chpid, &fd);

y_strucon_sem_post(&procs_sem, 1);

y_strucon_mutex_unlock(&procs_lock);
```

As you can see, except for the presence of a vector (resizable array)
for the possibility of a waitpid() returning a child process before it
is "registered," the high-level description I gave wasn't very
high-level but matches pretty well.

Notice that all accesses of the map and vector are protected by the
mutex, so there's no data race on them. (I have another problem with
musl's rwlocks on another map, but I'll send another message to the
mailing list about that.)

> You should probably look at the strace of your program to see what's
> going on with the ECHILD. It's not clear whether you're calling
> waidpid with an argument of -1 or a specific pid, but in the latter
> case you could look for whether the child was already reaped and
> whether the pid was ever valid (from the strace log). Also, be aware
> that ignoring SIGCHLD may break waiting for it, in case you're doing
> that.

As shown above, I'm always calling it with waitpid(-1, ...), mostly
because I don't know what order child processes will finish in. And this
thread is the only one calling waitpid(), so it couldn't have been
reaped by another thread by accident.

SIGCHLD is not blocked. In fact, a signal handler is explicitly
registered for it is Main Thread #2, which, if it gets a SIGCHLD, does
nothing. I do this because I want to be sure that the problem you
mention does not happen, but also because after attempting to make
SIGCHLD work for this same purpose, I found out that Linux will drop
SIGCHLD's without a second thought. I know this because the code using
them would deadlock nearly every time.

> As an aside, overall, the whole setup looks like you were looking for
> condition variables but skipped over them for some reason.

I skipped over condition variables for a *very* important reason: they
would cause deadlock.

When a condition variable is signalled, it will do nothing if there are
no threads waiting on it. At least, that's what POSIX condition
variables are guaranteed to do. ("The pthread_cond_broadcast() and
pthread_cond_signal() functions shall have no effect if there are no
threads currently blocked on cond.")

What this means is that if Main Thread #2 is blocked on waitpid(), then
if another thread creates a child and signals the condition variable,
then after Main Thread #2 returns from waitpid(), it will block on the
condition variable. If another thread creates another child, sure, it
will unblock, but eventually, when the build is near done, there will be
no more child processes created, and Main Thread #2 will permanently
block on the condition variable. In turn, this will cause a deadlock
because the worker threads all poll() on the read end of the pipe (the
write end was put into the map for Main Thread #2), so if Main Thread #2
blocks on the condition variable, they will all block waiting for their
child(ren) to be reaped, and the build system will deadlock.

Likewise, having Main Thread #2 hold a mutex except when on a condition
variable means that worker threads won't be able to register children
except when Main Thread #2 is blocked on the condition variable. If Main
Thread #2 holds a lock on a mutex while in waitpid(), this means that
the build system can only run one command at a time. If you release the
lock just before calling waitpid(), allowing worker threads to make
progress, you still have the problem I mentioned above should the
condition variable be signalled during that call.

A semaphore is the right choice because it can be signalled while Main
Thread #2 is not blocked on it, and Main Thread #2 will get that signal
when it tries to wait on it. And it can be signalled many times, which
will tell Main Thread #2 how many children it needs to reap.

There is one thing I can think of that might cause the problem: the
processor is reordering the syscall to be before the if (done != 0)
check.

This doesn't make sense to me because 1) the arch I'm on is the mostly
TSO x86_64, and 2) while a mutex unlock may allow some memory
operations to be reordered from after to before the lock, I don't know
that a syscall would be reordered. In fact, I don't know how syscalls
are treated in such cases; are they treated as memory operations at all?
I have no clue.

Thank you for your time.

Gavin Howard
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.