musl - Re: Running code on all other threads (for sandboxing)

Message-ID: <20250824155215.GL1827@brightrain.aerifal.cx>
Date: Sun, 24 Aug 2025 11:52:16 -0400
From: Rich Felker <dalias@...c.org>
To: Demi Marie Obenour <demiobenour@...il.com>
Cc: musl@...ts.openwall.com, libc-alpha@...rceware.org
Subject: Re: Running code on all other threads (for sandboxing)

On Sun, Aug 24, 2025 at 08:11:30AM -0400, Demi Marie Obenour wrote:
> On 8/23/25 22:18, Rich Felker wrote:
> > On Fri, Aug 22, 2025 at 09:34:55PM -0400, Demi Marie Obenour wrote:
> >> There are cases where it is highly desirable for a process
> >> to start out with full user rights (or at least close to them),
> >> initialize, and then drop these privileges using Linux kernel
> >> features like seccomp.  Unfortunately, this breaks if the
> >> process uses third-party libraries that create threads during
> >> initialization.  In particular, Mesa can do this, and there is
> >> no realistic alternative to it as Mesa is ~2 million lines of
> >> GPU compiler and driver code.  Loading Mesa later is undesirable
> >> as it prevents removing all filesystem access.
> >>
> >> There are two ways to fix this problem:
> >>
> >> 1. Fix the problem in the Linux kernel.
> >> 2. Work around it in userspace, as is already done for setuid()
> >>    and friends.
> >>
> >> For the second, it should be sufficient to provide a function
> >> that runs a caller-provided function on each thread, while
> >> ensuring that the operation is atomic with respect to other
> >> threads in the process.  This function only needs to make
> >> system calls and crashes the process if there is an error.
> >> If the function uses anything that isn't a syscall or
> >> compiler builtin, it gets to keep both pieces.
> >>
> >> Is this something that would make sense to implement?  I know
> >> that this problem has been an issue for Chromium on Linux.
> > 
> > I'm not sure what the right solution to this specific problem is, but
> > I don't think exposing a "run arbitrary code in each thread" as a
> > public API is a good choice. Such code would run in a context which is
> > worse/more-restrictive even than "async signal" context, making it
> > really difficult to define any reasonable class of "what you're
> > allowed to do here". I know you said "syscalls", but even that
> > requires defining what you mean by syscalls (raw via asm? via
> > syscall()? any function that's "traditionally just a syscall"?) and
> > further specifying which syscalls are actually allowed (any which
> > break the __synccall context assumptions would need to be forbidden).
> 
> I think just seccomp() and compiler-inserted calls to functions
> like memcpy().  memcpy() should only depend on a valid stack (which
> *is* guaranteed unless I am greatly mistaken) and seccomp() is just
> a wrapper around syscall().

Yes, I agree you can make a set that meets just the needs for this
particular problem, but if it's not meant as a general API where folks
can do other things (which needs some specification of the limits on
that) with it, rather than just for seccomp, then it would be much
better to just have a libc "seccomp_all_threads()" function that does
__synccall internally and applies a filter to all threads. This avoids
introducing supported behavior that might have really bad unforeseen
consequences.

However:

> > I think there are potentially semi-portable solutions to your problem
> > that don't require such a big hammer as arbitrary __synccall.
> > 
> > One that comes to mind is installing a SECCOMP_RET_USER_NOTIF or
> > SECCOMP_RET_TRAP filter before loading Mesa. This could allow the
> > filesystem access to load Mesa libraries only until you set a flag
> > that loading has finished, then cause filesystem access syscalls to
> > fail once the flag has been set.
> 
> Would this involve emulating all the filesystem syscalls?

No, the notify handler just has to reply that it wants to allow the
syscall.

> The problem
> is that the flag would need to be set in a way that it can’t be unset.

If you close the notify fd, all future syscalls that would go to it
fail with ENOSYS and there's no way to undo that. I think this meets
your needs.
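
To illustrate the reply-and-close pattern, here is a minimal sketch in C.
It assumes a filter returning SECCOMP_RET_USER_NOTIF for the filesystem
syscalls was already installed with SECCOMP_FILTER_FLAG_NEW_LISTENER,
that notify_fd is the resulting listener fd, and that kernel headers new
enough to define SECCOMP_USER_NOTIF_FLAG_CONTINUE (Linux 5.5 or later)
are available. The function names are hypothetical, and the fixed-size
structs are a simplification (a robust handler would query
SECCOMP_GET_NOTIF_SIZES first).

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Build a reply telling the kernel to run the intercepted syscall
 * unmodified. Nothing is emulated; the handler only says "go ahead". */
void make_continue_reply(const struct seccomp_notif *req,
                         struct seccomp_notif_resp *resp)
{
    assert(req != NULL && resp != NULL);
    memset(resp, 0, sizeof *resp);   /* error and val must be 0 with CONTINUE */
    resp->id = req->id;              /* echo the request's cookie */
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Receive one notification on notify_fd and allow it.
 * Returns 0 on success, -1 on error (errno set by ioctl). */
int allow_one(int notify_fd)
{
    struct seccomp_notif req;
    struct seccomp_notif_resp resp;

    memset(&req, 0, sizeof req);     /* the kernel rejects a non-zeroed buffer */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
        return -1;
    make_continue_reply(&req, &resp);
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0)
        return -1;
    return 0;
}

/* Hypothetical helper: end the loading phase. Once the listener fd is
 * closed, syscalls matched by the filter fail and this cannot be undone. */
int end_loading_phase(int notify_fd)
{
    return close(notify_fd);
}
```

The handler would need to run in a thread or process that does not
itself issue the filtered syscalls, calling allow_one() in a loop while
loading is in progress and end_loading_phase() once it has finished.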

> > Another approach is doing what I'd call "manual __synccall" with your
> > own signal, which is better than exposing actual __synccall because
> > the application code does not run in an invalid-libc context, but this
> > would only work if Mesa's hidden threads don't mask signals. A library
> > creating its own threads behind the scenes *should* be masking all
> > signals, so this probably doesn't work. Even if Mesa botched it, you
> > wouldn't want to preclude them fixing it.
> 
> Also, from my reading of past mailing list posts, this is inherently
> racy against thread creation.
> 
> > There is probably also a way to do this with ptrace, which blocked
> > signals wouldn't interfere with, but that gets really nasty really
> > quick.
> > 
> > Unfortunately there don't seem to be any ways to inject new seccomp
> > filters into another task (even a thread of your own process)
> > directly. This is what Linux really should be offering here.
> 
> Actually, it already supports this (SECCOMP_FILTER_FLAG_TSYNC).
> I don't think this is supported for Landlock, though.

Oh, that's even better then.
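
For reference, a minimal sketch in C of applying one filter to every
thread with SECCOMP_FILTER_FLAG_TSYNC. The function name is
hypothetical, the BPF program is assumed to be supplied by the caller,
and the sketch assumes the process is unprivileged and therefore sets
no_new_privs first.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Attach prog to all threads of the calling process.
 * Returns 0 on success, -1 on error (errno set), or the positive ID of
 * a thread that could not be synchronized (no filter is attached then). */
long seccomp_all_threads_sketch(const struct sock_fprog *prog)
{
    assert(prog != NULL);

    /* Required for installing a filter without CAP_SYS_ADMIN. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;

    /* Raw syscall: seccomp(2) is invoked via syscall() here. */
    return syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                   SECCOMP_FILTER_FLAG_TSYNC, prog);
}
```

A caller would treat any nonzero return as failure, since a positive
value means the filter was not attached to any thread.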

Is there a separate need for Landlock here though?

Rich