Date: Wed, 6 Aug 2014 05:35:29 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: PATCH: don't call cleanup handlers after a regular return
 from the thread start function

On Wed, Aug 06, 2014 at 09:15:26AM +0200, Jens Gustedt wrote:
> > > > - In the case of longjmp, it's already dangerous to longjmp across
> > > >   arbitrary code. Unless you know what code you're jumping over is
> > > >   doing or have a contract that it's valid to jump over it, you can't
> > > >   use longjmp.
> > > 
> > > sure, but just consider the nice tool-library that implements
> > > try-catch sort of things with setjmp/longjmp.
> > 
> > It's not possible to mix this with cancellation (or with C++
> > exceptions, for that matter). But as long as neither "unwinds" across
> > the other, you're fine. This is a restriction you need to follow and
> > be aware of anyway. It's not something we can "work around" except by
> > using the same underlying implementation for all three, and that's
> > undesirable for various reasons (one of which is that people should
> > not be relying on it).
> > 
> > Even if you did want to support this, making it work is a lot harder
> > than allowing longjmp out of pthread_once/call_once. The latter are
> > special cases where there's at most one cancellation cleanup layer and
> > it's implementation-internal. The general case has arbitrary
> > cancellation cleanup buffers at different call frame contexts, and
> > they're most certainly not static.
> 
> Perhaps I wouldn't want to support this, but rather make the library
> more robust against such issues, as said above. Give diagnostics
> ("cleanup handler stack not empty after thread termination") and ease
> debugging.
> 
> If, as we both seem to intend, all this ends up being clearly UB,
> nothing prevents us from being nice when it happens. What do you think
> of my initial patch (which is not very intrusive), but calling abort
> instead of clearing `cancelbuf`?

Your initial patch only covered the case where the illegal return is
from the start function itself, which I think is probably the least
likely to happen in practice. It doesn't cover the case where it
happens from other functions. If this is deemed useful to catch and
translate into a predictable crash, I would prefer a check for the
cancellation buffer pointer against the stack pointer, crashing (via
a_crash() like other code that does light checks for UB such as the
code in malloc) if the cancellation buffer is below the stack pointer.
However some consideration is needed (like: do we want to preclude
cancellation from crossing signal handler boundaries when it's done
"safely", e.g. by pushing a cleanup handler before unblocking signals,
when the signal runs on an alternate signal stack?) so I'm a bit
hesitant to add such checks that might not always be valid. Of course
split-stack would also invalidate this strategy, but I think
split-stack is already hopelessly broken anyway (there's a good past
mailing list thread on the topic of split-stack and why it's a
solution in search of a problem :).
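
The comparison described above can be sketched roughly as follows. This
is a hypothetical model, not musl's code: it assumes a downward-growing
stack with no split-stack or alternate signal stack in play, the
function names are made up, and musl's actual crash primitive a_crash()
is approximated by abort().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* A cleanup buffer whose address is below the current stack pointer
 * (on a downward-growing stack) lives in a frame that has already
 * returned: the matching pthread_cleanup_pop was skipped via an
 * illegal return, longjmp, or similar. */
static int cleanup_buf_is_stale(uintptr_t cb, uintptr_t sp)
{
    return cb < sp;
}

static void check_cleanup_buf(uintptr_t cb, uintptr_t sp)
{
    if (cleanup_buf_is_stale(cb, sp))
        abort();   /* translate the UB into a predictable crash */
}
```

The addresses are passed explicitly here to keep the sketch
deterministic; real code would compare the buffer against the live
stack pointer, which is exactly what breaks under split-stack or an
alternate signal stack.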

BTW one "safety" we currently have is that pthread_cleanup_pop does
not just "pop one cleanup context"; it resets the context to the one
that was in effect at the time the corresponding push was called,
potentially popping multiple contexts if one or more pop was skipped
via illegal returns or similar. In principle a check-and-crash could
be added asserting that self->cancelbuf == cb before doing this, but
I'm mildly hesitant to add conditionals to something that should be a
fast-path.
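
A minimal model of that multi-pop behavior, with the optional
check-and-crash under an #ifdef. Names here are hypothetical (musl's
real structure is __ptcb and the list head lives in the thread
descriptor); the point is only that pop restores the head saved at push
time rather than popping exactly one node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct cb { struct cb *prev; };

static struct cb *cancelbuf;   /* stand-in for self->cancelbuf */

static void do_push(struct cb *cb)
{
    cb->prev = cancelbuf;      /* remember the context in effect now */
    cancelbuf = cb;
}

static void do_pop(struct cb *cb)
{
#ifdef STRICT_CHECK
    /* The check-and-crash variant mused about above: one extra
     * fast-path conditional that catches skipped pops immediately. */
    if (cancelbuf != cb) abort();
#endif
    cancelbuf = cb->prev;      /* may pop several skipped contexts */
}
```

Pushing two contexts and then popping only the outer one leaves the
list empty, silently unwinding the skipped inner context.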

> > > (There is even less interaction between calls to call_once with
> > > different objects, than we have for the current implementation of
> > > pthread_once. That one uses a `static int waiters` that is shared
> > > between all calls.)
> > 
> > Yes, that's an unfortunate issue which I should fix. But making the
> > once_flag a big structure rather than a single int (which will surely
> > disagree with the glibc ABI) is not the solution. The solution is just
> > to have a potential-waiters flag in the single int, rather than a
> > waiter count. The incidence of contention is so low that it doesn't
> > matter if it results in a rare excess syscall, especially since once
> > calls are once-per-program-invocation.
> 
> Sure, as I said changing the ABI for pthread_once is out of the
> question.
> 
> Yes, I agree that this is a minor "problem", if at all.
> 
> You could even have a counter in there; the state only needs 2 bits,
> and the rest could be used as a wait counter.
> 
> One idea would be to have that int as a counter, and to have special
> values of the counter reflect the state. 0 would be uninitialized, -1
> completely initialized, and all other values would be the number of
> waiters + 1.

Yes, that's probably a slightly better design. As long as our target
is just a basically-linux-compatible system, we have at least 2 bits
free anyway (tids are limited below 1<<30 since the upper bits are
used for special purposes in the robust futex API) but of course a
design that doesn't encode this knowledge is more general and cleaner.
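
The counter encoding being discussed (0 = uninitialized, -1 = complete,
n = waiters + 1 while in progress) can be sketched with C11 atomics.
This is illustrative only: the function name is made up, and the futex
wait/wake that a real implementation would use on contention is elided,
leaving only the state transitions.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical once-flag states:
 *   0  -> uninitialized
 *  -1  -> initialization complete
 *   n  -> in progress, with n-1 threads waiting (futex-wait elided) */
static void sketch_call_once(atomic_int *flag, void (*func)(void))
{
    int expected = 0;
    if (atomic_compare_exchange_strong(flag, &expected, 1)) {
        /* Won the race: run the init function, then mark done. The
         * old value returned by the exchange is waiters+1; a real
         * implementation would futex-wake that many threads. */
        func();
        int prev = atomic_exchange(flag, -1);
        (void)prev;
        return;
    }
    /* Lost the race (or init already done). A real implementation
     * would increment the counter to register as a waiter and
     * futex-wait; this sketch just polls. */
    for (;;) {
        if (atomic_load(flag) == -1) return;
    }
}

static int once_count;
static void once_init(void) { once_count++; }
```

Note this encoding keeps all the bookkeeping in one int, so it needs no
knowledge of how many tid bits the kernel reserves.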

> > > > call_once does not have specified behavior if the callback is
> > > > cancelled (because, as far as C11 is concerned, cancellation does not
> > > > even exist). Thus there is no need for a cleanup function at all.
> > > 
> > > After digging into it I don't agree. In a mixed pthread/C thread
> > > environment cancellation will eventually happen, so we need a cleanup
> > > function to avoid deadlocks.
> > 
> > No it won't. Cancellation is unusable with almost all existing code
> > (and almost all new code). To use cancellation, the target thread must
> > be running code which was written to be aware of the possibility of
> > being cancelled. Thus, it's not going to "just happen".
> 
> With "will happen" I mean that code will just do it, perhaps by some
> tool implemented with pthreads shooting at a C thread.

Yes. But my point is that crashing in libc code is probably the least
of your worries here. It's more likely that the crash (or at least
deadlock) occurs in application code from cancelling a thread while
it's in an inconsistent state, holds locks, etc. without a chance to
do cleanup (since the target thread was not designed for
cancellation).

One "safe" approach might be for call_once to simply block
cancellation for the duration of the call. In fact C11 threads could
even start with cancellation blocked, unless of course POSIX mandates
otherwise in the next issue. This should really all be a topic for
discussion with the Austin Group for how POSIX should be aligned with
C11 threads, though, and I'm hesitant to make any public behaviors for
musl that look like API guarantees (rather than "this is the way it
works for now, but don't depend on it, since it's going to change when
the next issue of POSIX is published") at this time.
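
The "block cancellation for the duration of the call" idea could look
something like the following wrapper. This is a sketch of the idea
using the public POSIX and C11 APIs, not musl's implementation, and the
wrapper name is made up.

```c
#include <assert.h>
#include <pthread.h>
#include <threads.h>

/* Run call_once with cancellation disabled around the user callback,
 * so a pthread-side cancel cannot fire inside it and leave the
 * once_flag permanently stuck in the "in progress" state. */
static void call_once_nocancel(once_flag *flag, void (*func)(void))
{
    int old;
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);
    call_once(flag, func);
    pthread_setcancelstate(old, NULL);
}

static once_flag fl = ONCE_FLAG_INIT;
static int init_count;
static void run_init(void) { init_count++; }
```

Restoring the caller's previous cancel state (rather than forcing it
back to enabled) keeps the wrapper transparent to threads that had
already disabled cancellation themselves.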

Rich
