kernel-hardening - Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

Message-ID: <CALCETrUwkV4_65y7UjSgrq5WHOcZZ=+znKArehvhb1xEGG9HXw@mail.gmail.com>
Date: Thu, 8 Mar 2018 23:53:52 +0000
From: Andy Lutomirski <luto@...nel.org>
To: Mickaël Salaün <mic@...ikod.net>
Cc: Tycho Andersen <tycho@...ho.ws>, LKML <linux-kernel@...r.kernel.org>, 
	Alexei Starovoitov <ast@...nel.org>, Arnaldo Carvalho de Melo <acme@...nel.org>, 
	Casey Schaufler <casey@...aufler-ca.com>, Daniel Borkmann <daniel@...earbox.net>, 
	David Drysdale <drysdale@...gle.com>, "David S . Miller" <davem@...emloft.net>, 
	"Eric W . Biederman" <ebiederm@...ssion.com>, James Morris <james.l.morris@...cle.com>, 
	Jann Horn <jann@...jh.net>, Jonathan Corbet <corbet@....net>, 
	Michael Kerrisk <mtk.manpages@...il.com>, Kees Cook <keescook@...omium.org>, 
	Paul Moore <paul@...l-moore.com>, Sargun Dhillon <sargun@...gun.me>, 
	"Serge E . Hallyn" <serge@...lyn.com>, Shuah Khan <shuah@...nel.org>, Tejun Heo <tj@...nel.org>, 
	Thomas Graf <tgraf@...g.ch>, Will Drewry <wad@...omium.org>, 
	Kernel Hardening <kernel-hardening@...ts.openwall.com>, Linux API <linux-api@...r.kernel.org>, 
	LSM List <linux-security-module@...r.kernel.org>, 
	Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

On Thu, Mar 8, 2018 at 11:51 PM, Mickaël Salaün <mic@...ikod.net> wrote:
>
> On 07/03/2018 02:21, Andy Lutomirski wrote:
>> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün <mic@...ikod.net> wrote:
>>>
>>> On 06/03/2018 23:46, Tycho Andersen wrote:
>>>> On Tue, Mar 06, 2018 at 10:33:17PM +0000, Andy Lutomirski wrote:
>>>>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>>>>> container, but I don't want to allow mount() in general and I want to
>>>>>>> emulate certain mount() actions.  I can write a filter that catches
>>>>>>> mount using seccomp and calls out to the container manager for help.
>>>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>>>> supported.
>>>>>>
>>>>>> Well, I think this use case should be handled with something like
>>>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>>>> https://github.com/stemjail/stemshim
>>>>>
>>>>> I doubt that will work for containers.  Containers that use user
>>>>> namespaces and, for example, setuid programs aren't going to honor
>>>>> LD_PRELOAD.
>>>>
>>>> Or anything that calls syscalls directly, like go programs.
>>>
>>> That's why I suggested the vDSO-like approach. Enforcing an access
>>> control is not the issue here; patching a buggy userland (without
>>> patching its code) is the issue, isn't it?
>>>
>>> As far as I remember, the main problem is to handle file descriptors
>>> while "emulating" the kernel behavior. This can be done with "shim"
>>> code mapped in every process. Chrome used something like this (in a
>>> previous sandbox mechanism) as a kind of emulation (with the current
>>> seccomp-bpf). I think it should be doable to replace the (userland)
>>> emulation code with an IPC wrapper receiving file descriptors through
>>> a UNIX socket.
>>>
>>
>> Can you explain exactly what you mean by "vDSO-like"?
>>
>> When a 64-bit program does a syscall, it just executes the SYSCALL
>> instruction.  The vDSO isn't involved at all.  32-bit programs usually
>> go through the vDSO, but not always.
>>
>> It could be possible to force-load a DSO into an entire container and
>> rig up seccomp to intercept all SYSCALLs not originating from the DSO
>> such that they merely redirect control to the DSO, but that seems
>> quite messy.
>
> The vDSO is code mapped for all processes. As you said, these processes
> may use it or not. What I was thinking about is to use the same concept,
> i.e. map "shim" code into each process pertaining to a particular
> hierarchy (the same way seccomp filters are inherited across processes).
> With a seccomp filter matching some syscall (e.g. mount, open), it is
> possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
> shim code should then be able to emulate/patch what is needed, even
> faking a file opening by receiving a file descriptor through a UNIX
> socket. As the Chrome sandbox did, the seccomp filter may look at the
> calling address to allow the shim code to call syscalls without being
> caught, if needed. However, relying on SIGSYS may not fit with
> arbitrary code. A new SECCOMP_RET_EMULATE (?) could be used to jump
> to a specific process address, to emulate the syscall in an easier way
> than only relying on a {c,e}BPF program.
>

This could indeed be done, but I think that Tycho's approach is much
cleaner and probably faster.
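
For reference, the descriptor hand-off in the quoted proposal (a shim
"faking a file opening by receiving a file descriptor through a UNIX
socket") rests on SCM_RIGHTS ancillary data. Below is a minimal sketch of
just that step, assuming a single process with a socketpair() standing in
for the manager/shim connection. The names send_fd, recv_fd and
demo_fd_roundtrip are hypothetical and come from neither Landlock, seccomp
nor stemshim; the seccomp filter and SIGSYS handling are not shown.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* "Manager" side: pass one open descriptor over a connected UNIX socket. */
int send_fd(int sock, int fd)
{
    char byte = 'F'; /* one byte of ordinary data carries the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* "Shim" side: receive a descriptor; returns the new fd or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/*
 * Demo: the read end of a pipe plays the role of the "opened file".
 * It is sent across a socketpair and read back through the received
 * descriptor. Returns 0 if the data arrives intact, -1 otherwise.
 */
int demo_fd_roundtrip(void)
{
    int sv[2], pfd[2], got, rc = -1;
    char buf[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (pipe(pfd) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (write(pfd[1], "ok", 2) == 2 && send_fd(sv[0], pfd[0]) == 0) {
        got = recv_fd(sv[1]);
        if (got >= 0) {
            if (read(got, buf, 2) == 2 && memcmp(buf, "ok", 2) == 0)
                rc = 0;
            close(got);
        }
    }

    close(pfd[0]);
    close(pfd[1]);
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

In a real setup the two ends would live in different processes (the
container manager and the trapped task), and the manager would decide
which file to open before calling something like send_fd(); calling
demo_fd_roundtrip() from a small main() is enough to try the transfer in
isolation.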