Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Fri, 2 Jun 2023 10:16:55 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] [RFC] make clone() usable

On Fri, Jun 02, 2023 at 10:28:09AM +0300, Alexey Izbyshev wrote:
> On 2023-06-01 23:08, Rich Felker wrote:
> >On Thu, Jun 01, 2023 at 01:12:57PM +0300, Alexey Izbyshev wrote:
> >>On 2023-05-31 02:35, Rich Felker wrote:
> >>>As discussed before (see the 2021 thread "Incorrect thread TID
> >>>caching") clone() has been effectively unusable because it produces a
> >>>child process in invalid state, with wrong tid in its thread
> >>>structure, among other problems.
> >>>
> >>>The attached proposed patch attempts to make clone() usable by having
> >>>it share the _Fork logic for establishing a consistent process state
> >>>after forking, and also blocks use of flags which produce invalid
> >>>state.
> >>>
> >>Tangentially, while thinking about this, I realized that because the
> >>kernel clears the address of the exit futex on clone and _Fork
> >>doesn't re-establish it, the following blocks forever:
> >>
> >>void *thr(void *arg) {
> >>        // Blocks in __tl_sync because __thread_list_lock is never
> >>unlocked
> >>        pthread_join(*(pthread_t *)arg, NULL));
> >>        return NULL;
> >>}
> >>
> >>int main() {
> >>        if (!fork()) {
> >>                static pthread_t pt;
> >>                pt = pthread_self();
> >>                pthread_create(&(pthread_t){0}, NULL, thr, &pt);
> >>                pthread_exit(NULL);
> >>        }
> >>        wait(NULL);
> >>}
> >>
> >>This could be fixed by the following change in _Fork.c:
> >>
> >>-               self->tid = __syscall(SYS_gettid);
> >>+               self->tid = __syscall(SYS_set_tid_address,
> >>&__thread_list_lock);
> >
> >Wow, I had no idea Linux's fork cleared the exit futex address. That's
> >a big bug we've had around for nobody to have noticed...
> >
> >>>With CLONE_VM, the old behavior is kept, with a caveat that the child
> >>>context is extremely restrictive, ala vfork.
> >>>
> >>>It was raised previously (pardon the pun) that raise() should perhaps
> >>>be modified to make a SYS_gettid syscall rather than using the thread
> >>>structure tid, so that it's possible to call raise() in a vfork
> >>>context without signaling the wrong process. It might make sense to do
> >>>that now too.
> >>>
> >>Yes, it would be nice to have raise and abort (which also reads the
> >>cached tid separately) working in a vfork child.
> >>
> >>>diff --git a/src/linux/clone.c b/src/linux/clone.c
> >>>index 8c1af7d3..e8f76bbc 100644
> >>>--- a/src/linux/clone.c
> >>>+++ b/src/linux/clone.c
> >>>@@ -4,18 +4,55 @@
> >>> #include <sched.h>
> >>> #include "pthread_impl.h"
> >>> #include "syscall.h"
> >>>+#include "lock.h"
> >>>+#include "fork_impl.h"
> >>>+
> >>>+struct clone_start_args {
> >>>+	int (*func)(void *);
> >>>+	void *arg;
> >>>+	sigset_t sigmask;
> >>>+};
> >>>+
> >>>+static int clone_start(void *arg)
> >>>+{
> >>>+	struct clone_start_args *csa = arg;
> >>>+	__post_Fork(0);
> >>>+	__restore_sigs(&csa->sigmask);
> >>>+	return csa->func(csa->arg);
> >>>+}
> >>>
> >>> int clone(int (*func)(void *), void *stack, int flags, void *arg, ...)
> >>> {
> >>>+	struct clone_start_args csa;
> >>> 	va_list ap;
> >>>-	pid_t *ptid, *ctid;
> >>>-	void  *tls;
> >>>+	pid_t *ptid = 0, *ctid = 0;
> >>>+	void  *tls = 0;
> >>>+
> >>>+	/* Flags that produce an invalid thread/TLS state are disallowed. */
> >>>+	int badflags = CLONE_THREAD | CLONE_SETTLS | CLONE_CHILD_CLEARTID;
> >>>+	if (flags & badflags)
> >>>+		return __syscall_ret(-EINVAL);
> >>>
> >>Maybe also check for NULL stack, as proposed in
> >>https://www.openwall.com/lists/musl/2022/08/02/1 ?
> >
> >I guess that can be done. Normally we don't check for null pointers,
> >but I think the syscall would silently keep the existing stack pointer
> >if passed null rather than blowing up, which we probably don't want --
> >even without CLONE_VM, where it'd blow up, we'd be giving
> >appears-to-work behavior without a contract for it to work.
> >
> Yes, and this is not only about CLONE_VM. On some arches (e.g. i386,
> x86_64, aarch64) musl's __clone stores things on "stack" and hence
> segfaults if "stack" is NULL, but on other arches it doesn't (e.g.
> arm). Exposing this is inconsistent and confusing. If support for
> passing NULL "stack" is ever desired, __clone should be changed
> first.

OK. Crashing here would actually be okay, but having it appear to work
on some archs while crashing on others is not a good situation. So I
think just reporting it as an error is okay for now. We can revisit
this later if it would be useful.

> >>> 	va_start(ap, arg);
> >>>-	ptid = va_arg(ap, pid_t *);
> >>>-	tls  = va_arg(ap, void *);
> >>>-	ctid = va_arg(ap, pid_t *);
> >>>+	if (flags & (CLONE_PIDFD | CLONE_PARENT_SETTID | CLONE_CHILD_SETTID))
> >>>+	 	ptid = va_arg(ap, pid_t *);
> >>>+	if (flags & CLONE_CHILD_SETTID) {
> >>>+		tls = va_arg(ap, void *);
> >>>+		ctid = va_arg(ap, pid_t *);
> >>>+	}
> >>> 	va_end(ap);
> >>>
> >>>-	return __syscall_ret(__clone(func, stack, flags, arg, ptid, tls,
> >>>ctid));
> >>>+	if (flags & CLONE_VM)
> >>>+		return __syscall_ret(__clone(func, stack, flags, arg, ptid,
> >>>tls, ctid));
> >>>+
> >>>+	__block_all_sigs(&csa.sigmask);
> >>>+	LOCK(__abort_lock);
> >>>+
> >>>+	csa.func = func;
> >>>+	csa.arg = arg;
> >>>+	int ret = __clone(clone_start, stack, flags, &csa, ptid, tls, ctid);
> >>>+
> >>>+	__post_Fork(ret);
> >>>+	__restore_sigs(&csa.sigmask);
> >>>+	return __syscall_ret(ret);
> >>> }
> >>
> >>Another thing that might be somewhat relevant if we expect to see
> >>wider usage of clone is that syscall (the public function) currently
> >>blindly extracts six arguments via va_arg, which on some
> >>architectures requires sufficient stack space not to crash. So on
> >>i386 the following silly function will crash if passed to clone,
> >>provided that "stack" points to the beginning of unmapped space past
> >>the stack page end.
> >
> >That's changed in the patch. It only calls va_arg for arguments whose
> >existence the flags implies.
> >
> I was talking about va_arg usage in syscall() (in misc/syscall.c).
> __clone only aligns stack on 16 bytes (at least on arches that I
> checked), but doesn't reserve space. So a sufficiently unlucky
> user-provided function will crash when called from __clone in the
> child, as in my example.

OK, this is really a bug in the syscall() function, albeit one that
doesn't seem to admit any reasonable fix. Note that this function is
not used in libc; it's only present for use by applications.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.