Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Thu, 1 Jun 2023 16:08:07 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] [RFC] make clone() usable

On Thu, Jun 01, 2023 at 01:12:57PM +0300, Alexey Izbyshev wrote:
> On 2023-05-31 02:35, Rich Felker wrote:
> >As discussed before (see the 2021 thread "Incorrect thread TID
> >caching") clone() has been effectively unusable because it produces a
> >child process in invalid state, with wrong tid in its thread
> >structure, among other problems.
> >
> >The attached proposed patch attempts to make clone() usable by having
> >it share the _Fork logic for establishing a consistent process state
> >after forking, and also blocks use of flags which produce invalid
> >state.
> >
> Tangentially, while thinking about this, I realized that because the
> kernel clears the address of the exit futex on clone and _Fork
> doesn't re-establish it, the following blocks forever:
> 
> void *thr(void *arg) {
>         // Blocks in __tl_sync because __thread_list_lock is never
> unlocked
>         pthread_join(*(pthread_t *)arg, NULL));
>         return NULL;
> }
> 
> int main() {
>         if (!fork()) {
>                 static pthread_t pt;
>                 pt = pthread_self();
>                 pthread_create(&(pthread_t){0}, NULL, thr, &pt);
>                 pthread_exit(NULL);
>         }
>         wait(NULL);
> }
> 
> This could be fixed by the following change in _Fork.c:
> 
> -               self->tid = __syscall(SYS_gettid);
> +               self->tid = __syscall(SYS_set_tid_address,
> &__thread_list_lock);

Wow, I had no idea Linux's fork cleared the exit futex address. That's
a big bug we've had around for nobody to have noticed...

> >With CLONE_VM, the old behavior is kept, with a caveat that the child
> >context is extremely restrictive, ala vfork.
> >
> >It was raised previously (pardon the pun) that raise() should perhaps
> >be modified to make a SYS_gettid syscall rather than using the thread
> >structure tid, so that it's possible to call raise() in a vfork
> >context without signaling the wrong process. It might make sense to do
> >that now too.
> >
> Yes, it would be nice to have raise and abort (which also reads the
> cached tid separately) working in a vfork child.
> 
> >diff --git a/src/linux/clone.c b/src/linux/clone.c
> >index 8c1af7d3..e8f76bbc 100644
> >--- a/src/linux/clone.c
> >+++ b/src/linux/clone.c
> >@@ -4,18 +4,55 @@
> > #include <sched.h>
> > #include "pthread_impl.h"
> > #include "syscall.h"
> >+#include "lock.h"
> >+#include "fork_impl.h"
> >+
> >+struct clone_start_args {
> >+	int (*func)(void *);
> >+	void *arg;
> >+	sigset_t sigmask;
> >+};
> >+
> >+static int clone_start(void *arg)
> >+{
> >+	struct clone_start_args *csa = arg;
> >+	__post_Fork(0);
> >+	__restore_sigs(&csa->sigmask);
> >+	return csa->func(csa->arg);
> >+}
> >
> > int clone(int (*func)(void *), void *stack, int flags, void *arg, ...)
> > {
> >+	struct clone_start_args csa;
> > 	va_list ap;
> >-	pid_t *ptid, *ctid;
> >-	void  *tls;
> >+	pid_t *ptid = 0, *ctid = 0;
> >+	void  *tls = 0;
> >+
> >+	/* Flags that produce an invalid thread/TLS state are disallowed. */
> >+	int badflags = CLONE_THREAD | CLONE_SETTLS | CLONE_CHILD_CLEARTID;
> >+	if (flags & badflags)
> >+		return __syscall_ret(-EINVAL);
> >
> Maybe also check for NULL stack, as proposed in
> https://www.openwall.com/lists/musl/2022/08/02/1 ?

I guess that can be done. Normally we don't check for null pointers,
but I think the syscall would silently keep the existing stack pointer
if passed null rather than blowing up, which we probably don't want --
even without CLONE_VM, where it'd blow up, we'd be giving
appears-to-work behavior without a contract for it to work.

> > 	va_start(ap, arg);
> >-	ptid = va_arg(ap, pid_t *);
> >-	tls  = va_arg(ap, void *);
> >-	ctid = va_arg(ap, pid_t *);
> >+	if (flags & (CLONE_PIDFD | CLONE_PARENT_SETTID | CLONE_CHILD_SETTID))
> >+	 	ptid = va_arg(ap, pid_t *);
> >+	if (flags & CLONE_CHILD_SETTID) {
> >+		tls = va_arg(ap, void *);
> >+		ctid = va_arg(ap, pid_t *);
> >+	}
> > 	va_end(ap);
> >
> >-	return __syscall_ret(__clone(func, stack, flags, arg, ptid, tls,
> >ctid));
> >+	if (flags & CLONE_VM)
> >+		return __syscall_ret(__clone(func, stack, flags, arg, ptid,
> >tls, ctid));
> >+
> >+	__block_all_sigs(&csa.sigmask);
> >+	LOCK(__abort_lock);
> >+
> >+	csa.func = func;
> >+	csa.arg = arg;
> >+	int ret = __clone(clone_start, stack, flags, &csa, ptid, tls, ctid);
> >+
> >+	__post_Fork(ret);
> >+	__restore_sigs(&csa.sigmask);
> >+	return __syscall_ret(ret);
> > }
> 
> Another thing that might be somewhat relevant if we expect to see
> wider usage of clone is that syscall (the public function) currently
> blindly extracts six arguments via va_arg, which on some
> architectures requires sufficient stack space not to crash. So on
> i386 the following silly function will crash if passed to clone,
> provided that "stack" points to the beginning of unmapped space past
> the stack page end.

That's changed in the patch. It only calls va_arg for arguments whose
existence the flags implies.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.