Message-ID: <2025-11-05-remember-remember-the-fifth-of-november-3EtRdS@cyphar.com>
Date: Wed, 5 Nov 2025 20:53:08 +1100
From: Aleksa Sarai <cyphar@...har.com>
To: oss-security@...ts.openwall.com, fulldisclosure@...lists.org
Subject: runc container breakouts via procfs writes: CVE-2025-31133,
 CVE-2025-52565, and CVE-2025-52881

| NOTE: This advisory was sent to <security-announce@...ncontainers.org>
| on 2025-10-16. If you ship any Open Container Initiative software, we
| highly recommend that you subscribe to our security-announce list in
| order to receive more timely disclosures of future security issues.
| The procedure for subscribing to security-announce is outlined here:
| <https://github.com/opencontainers/.github/blob/main/SECURITY.md#disclosure-distribution-list>

Hello,

This is a notification to vendors that use or ship runc about THREE (3)
high-severity vulnerabilities (CVE-2025-31133, CVE-2025-52565, and
CVE-2025-52881). All three vulnerabilities ultimately allow (through
different methods) for full container breakouts by bypassing runc's
restrictions for writing to arbitrary /proc files.

Today we have released the following runc releases which include more
than 20 patches to resolve these issues:

 * runc v1.4.0-rc.3 <https://github.com/opencontainers/runc/releases/tag/v1.4.0-rc.3>
 * runc v1.3.3 <https://github.com/opencontainers/runc/releases/tag/v1.3.3>
 * runc v1.2.8 <https://github.com/opencontainers/runc/releases/tag/v1.2.8>
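
For anyone auditing a fleet, the release list above can be turned into a
quick version check. The following is a minimal Python sketch, not an
official tool. The function name is made up for illustration. It assumes
plain upstream version strings (distribution suffixes are rejected), that
release series older than 1.2 have no fixed release (none is listed
here), and that any series newer than 1.4 already carries the fixes.

```python
import re

# First fixed release per series, taken from the release list above.
# Each value is (patch, rc); rc is None for a final release.
FIRST_FIXED = {
    (1, 2): (8, None),  # v1.2.8
    (1, 3): (3, None),  # v1.3.3
    (1, 4): (0, 3),     # v1.4.0-rc.3
}

_VERSION_RE = re.compile(r"v?(\d+)\.(\d+)\.(\d+)(?:-rc\.?(\d+))?")


def _rank(patch, rc):
    # A final release sorts after every release candidate of the same patch.
    return (patch, float("inf") if rc is None else rc)


def has_procfs_fixes(version):
    """Hypothetical helper: does this runc version include the fixes?"""
    m = _VERSION_RE.fullmatch(version.strip())
    if m is None:
        raise ValueError(f"unrecognised runc version: {version!r}")
    major, minor, patch = (int(m.group(i)) for i in (1, 2, 3))
    rc = int(m.group(4)) if m.group(4) is not None else None
    series = (major, minor)
    if series in FIRST_FIXED:
        return _rank(patch, rc) >= _rank(*FIRST_FIXED[series])
    # Assumption: newer series carry the fixes, older series do not.
    return series > max(FIRST_FIXED)
```

For example, has_procfs_fixes("1.2.7") is False while
has_procfs_fixes("v1.4.0-rc.3") is True.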

We strongly recommend you update as soon as possible. For your own
reference I have attached a tarball of the patches (which apply cleanly
on top of runc v1.2.7, v1.3.2 and v1.4.0-rc.2).

Unfortunately the patches are quite large as they required a lot of
development work in github.com/cyphar/filepath-securejoin along with
quite deep changes to runc. I would recommend just going with the
released versions.

Note that these patches have not been split into per-CVE patches, as the
resolutions for each issue overlap and so some patches help resolve more
than one CVE on the list. We strongly recommend simply applying all of
the provided patches (we have included a squashed single-patch version
for your convenience -- see v1.[234].patch).

| **NOTE**:
| Some vendors were given a pre-release version of this release.
| These public releases include two extra patches to fix regressions
| discovered very late during the embargo period and were thus not
| included in the pre-release versions. Please update to this version.
| The above tarball includes these extra patches as well.

/*** Vulnerabilities ***/

Below is a break-down of the key points of each issue. Once these
vulnerabilities are made public on the embargo date, the linked advisory
pages will contain some more information about the issues.

Please note that while these issues are generally related, the available
mitigations (if any) vary from issue to issue. However, all of these
attacks rely on starting containers with custom mount configurations --
if you do not run untrusted container images from unknown or unverified
sources then these attacks would not be possible to exploit. Note that
Dockerfiles support custom mount configurations (with RUN --mount=...)
and so these issues are also exploitable from Dockerfiles.

Also please note that the below CVSS scores are based on the threat
model from *runc's point of view*. If you were to analyse the same
vulnerability from the perspective of network-enabled systems like
Docker or Kubernetes you would likely end up with a much higher
severity.

/* CVE-2025-31133 */

"container escape via 'masked path' abuse due to mount race conditions"

CVSS:4.0/AV:L/AC:L/AT:P/PR:L/UI:A/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H (7.3)

<https://github.com/opencontainers/runc/security/advisories/GHSA-9493-h29p-rfm2>

CVE-2025-31133 exploits an issue with how masked paths are implemented
in runc. When masking files, runc will bind-mount the container's
/dev/null inode on top of the file. However, if an attacker can replace
/dev/null with a symlink to some other procfs file, runc will instead
bind-mount the symlink target read-write. This issue affects all known
runc versions.

This stage happens after pivot_root(2) and so cannot be used to
bind-mount host files directly. However, it can be used to bind-mount
paths like /proc/sys/kernel/core_pattern, which can in turn be used to
break out of a container entirely (coredump helpers are spawned as
upcalls, which are not namespaced and have full host privileges).
/proc/sysrq-trigger can also be used by an attacker to cause the host
system to crash or halt. (This is "Attack 1".)

While developing a fix for this issue, we also discovered that if the
attacker instead deleted /dev/null, runc would purposefully ignore the
error and thus make maskedPath a no-op. This is slightly less serious,
but it would permit some information disclosure through masked files
like /proc/kcore and /proc/timer_list. (This is "Attack 2".)
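
Both attacks come down to trusting whatever sits at the container's
/dev/null. As an illustration only (this is not runc's code, and the
function names are hypothetical), a check along these lines rejects a
symlink or a missing file before it is used as a mask source. It assumes
the Linux device numbers for the null device (major 1, minor 3).

```python
import os
import stat

# Linux device numbers for the null character device (/dev/null).
NULL_MAJOR, NULL_MINOR = 1, 3


def is_null_device(st):
    """True only if a stat result describes the Linux null character device."""
    return (
        stat.S_ISCHR(st.st_mode)
        and os.major(st.st_rdev) == NULL_MAJOR
        and os.minor(st.st_rdev) == NULL_MINOR
    )


def check_mask_source(path):
    """Raise unless `path` itself (not a symlink target) is the null device."""
    try:
        # lstat() does not follow a trailing symlink, so a swapped-in symlink
        # is reported as a symlink and fails the device check below.
        st = os.lstat(path)
    except FileNotFoundError as exc:
        # Attack 2: a missing source must be an error, never a silent skip.
        raise RuntimeError(f"mask source {path} is missing") from exc
    if not is_null_device(st):
        raise RuntimeError(f"mask source {path} is not the null device")
```

Note that a path-based check like this is still racy on its own: a real
fix has to open the file once and use that same file descriptor for both
the check and the mount, so that the inode cannot be swapped in between.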

Potential mitigations for this issue include:

 * Using user namespaces, with the host root user not mapped into the
   container's namespace. procfs file permissions are managed using Unix
   DAC and thus user namespaces stop a container process from being able
   to write to them.

 * Not running as a root user in the container (this includes disabling
   setuid binaries with noNewPrivileges). As above, procfs file
   permissions are managed using Unix DAC and thus non-root users cannot
   write to them.

 * Depending on the maskedPath configuration (the default configuration
   only masks paths in /proc and /sys), using an AppArmor profile that
   blocks unexpected writes to any maskedPaths (as is the case with the
   default profile used by Docker and Podman) will block attempts to
   exploit this issue. However, CVE-2025-52881 allows an attacker to
   bypass LSM labels, and so this mitigation is not helpful when
   considered in combination with CVE-2025-52881.

 * Based on our analysis, SELinux will NOT help mitigate this issue --
   the /dev/null bind-mount used for maskedPaths gets re-labeled to the
   container context and thus the container will have access to them.
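
The first two mitigations above can be checked mechanically against a
container's OCI runtime configuration. The sketch below is a rough
illustration, not a complete audit: the function name and the finding
strings are made up, and it only looks at the linux.namespaces,
linux.uidMappings, process.user.uid and process.noNewPrivileges fields
of an already-parsed config.json.

```python
def audit_oci_config(config):
    """Hypothetical helper: list missing mitigations in a parsed config.json."""
    findings = []
    linux = config.get("linux", {})
    process = config.get("process", {})

    has_userns = any(ns.get("type") == "user" for ns in linux.get("namespaces", []))
    if not has_userns:
        findings.append("no user namespace configured")
    else:
        for mapping in linux.get("uidMappings", []):
            # A mapping covers host uid 0 when its host range starts at 0.
            if mapping.get("hostID", 0) == 0 and mapping.get("size", 0) > 0:
                findings.append("host root (uid 0) is mapped into the container")

    # The OCI default uid is 0 when process.user.uid is not given.
    if process.get("user", {}).get("uid", 0) == 0:
        findings.append("container process runs as uid 0")
    if not process.get("noNewPrivileges", False):
        findings.append("noNewPrivileges is not set")
    return findings
```

Each finding corresponds to one of the mitigations not being in place.
The mitigations are alternatives, so a non-empty list is a prompt for
review rather than proof that a container is exploitable.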

Thanks to Lei Wang (@ssst0n3 from Huawei) for finding and reporting the
original vulnerability (Attack 1), and Li Fubang (@lifubang from
acmcoder.com, CIIC) for discovering another attack vector (Attack 2)
based on @ssst0n3's initial findings.

/* CVE-2025-52565 */

"container escape with malicious config due to /dev/console mount and related races"

CVSS:4.0/AV:L/AC:L/AT:P/PR:L/UI:A/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H (7.3)

<https://github.com/opencontainers/runc/security/advisories/GHSA-qw9x-cqr3-wc7r>

CVE-2025-52565 is very similar in concept and application to
CVE-2025-31133, except that it exploits a flaw in /dev/console
bind-mounts. When creating the /dev/console bind-mount (to /dev/pts/$n),
if an attacker replaces /dev/pts/$n with a symlink then runc will
bind-mount the symlink target over /dev/console. This issue affects all
versions of runc >= 1.0.0-rc3.

As with CVE-2025-31133, this happens after pivot_root(2) and so cannot
be used to bind-mount host files directly, but an attacker can trick
runc into creating a read-write bind-mount of
/proc/sys/kernel/core_pattern or /proc/sysrq-trigger, leading to a
complete container breakout (as with CVE-2025-31133).

While developing a fix for this issue, we also found some potentially
concerning issues with os.Create usage (which may have allowed for host
files to be truncated by an attacker) -- though we deemed these issues
to not be exploitable, we have provided fixes for them. In addition,
some previously known issues with /dev/pts/$n race conditions were
re-analysed and we have included mitigations for them too (even though
we still feel these are mostly hypothetical issues).

Potential mitigations for this issue include:

 * Using user namespaces, with the host root user not mapped into the
   container's namespace. procfs file permissions are managed using Unix
   DAC and thus user namespaces stop a container process from being able
   to write to them.

 * Not running as a root user in the container (this includes disabling
   setuid binaries with noNewPrivileges). As above, procfs file
   permissions are managed using Unix DAC and thus non-root users cannot
   write to them.

 * The default SELinux policy should mitigate this issue, as the
   /dev/console bind-mount does not re-label the mount and so the
   container process should not be able to write to unsafe procfs files.
   However, CVE-2025-52881 allows an attacker to bypass LSM labels, and
   so this mitigation is not helpful when considered in combination with
   CVE-2025-52881.

 * The default AppArmor profile used by most runtimes will NOT help
   mitigate this issue, as /dev/console access is permitted. You could
   create a custom profile that blocks access to /dev/console, but such
   a profile might break regular containers. In addition, CVE-2025-52881
   allows an attacker to bypass LSM labels, and so that mitigation is
   not helpful when considered in combination with CVE-2025-52881.

Known Issues:

 * We are aware of an issue with our mitigation for this attack and certain configurations

Thanks to Lei Wang (@ssst0n3 from Huawei) and Li Fubang (@lifubang from
acmcoder.com, CIIC) for discovering and reporting the main /dev/console
bind-mount vulnerability, as well as Aleksa Sarai (@cyphar from SUSE)
for discovering the related issues mentioned above as well as the
original research into these classes of issues several years ago.

/* CVE-2025-52881 */

"container escape and denial of service due to arbitrary write gadgets and procfs write redirects"

CVSS:4.0/AV:L/AC:L/AT:P/PR:L/UI:A/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H (7.3)

<https://github.com/opencontainers/runc/security/advisories/GHSA-cgrx-mc8f-2prm>

This attack is a more sophisticated variant of CVE-2019-19921, which was
a flaw that allowed an attacker to trick runc into writing the LSM
process labels for a container process into a dummy tmpfs file and thus
not apply the correct LSM labels to the container process. The
mitigation we applied for CVE-2019-19921 was fairly limited and
effectively only caused runc to verify, when writing LSM labels, that
the files being written to are actual procfs files. This issue affects
all known runc versions.

Rather than using a fake tmpfs file for /proc/self/attr/<label>, an
attacker could instead (through various means) make
/proc/self/attr/<label> reference a real procfs file, but one that would
still be a no-op (such as /proc/self/sched). This would have the same
effect but would clear the "is a procfs file" check.
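
To make the gap concrete, here is a deliberately simplified Python model
of the two checks. It is not runc's code: the function names are
hypothetical, and how a runtime reliably learns which file it really
opened is the hard part and is not shown. The only real-world constants
are the filesystem magic numbers from <linux/magic.h>.

```python
PROC_SUPER_MAGIC = 0x9FA0   # procfs, from <linux/magic.h>
TMPFS_MAGIC = 0x01021994    # tmpfs, from <linux/magic.h>


def passes_old_check(fs_magic):
    """Model of the limited check: is the file on procfs at all?"""
    return fs_magic == PROC_SUPER_MAGIC


def passes_stricter_check(fs_magic, opened_path, intended_path):
    """Also require that the opened file is the one that was intended."""
    return fs_magic == PROC_SUPER_MAGIC and opened_path == intended_path


intended = "/proc/self/attr/exec"

# The original attack: a fake tmpfs file is caught by the old check.
assert not passes_old_check(TMPFS_MAGIC)

# The variant: /proc/self/sched is real procfs, so the old check passes
# even though the label write lands in the wrong file.
assert passes_old_check(PROC_SUPER_MAGIC)
assert not passes_stricter_check(PROC_SUPER_MAGIC, "/proc/self/sched", intended)

# A genuine label file passes both checks.
assert passes_stricter_check(PROC_SUPER_MAGIC, intended, intended)
```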

We were aware that this kind of attack would be possible (even going so
far as to discuss this publicly as "future work" at conferences), and we
were working on a far more comprehensive mitigation of this attack, but
this security issue was disclosed before we could complete this work.

This attack pairs well with CVE-2025-31133 and CVE-2025-52565, as the
most basic version described above acts as an LSM bypass that makes it
easy for an attacker to write to procfs files and break out of a
container.

However, rather than just making the write a no-op, the attacker could
instead redirect the write to a more malicious target (such as
/proc/sysrq-trigger to crash the host machine). In addition, sysctl
writes could be similarly redirected, so it is plausible an attacker
would be able to provide a custom payload to write, allowing for a
/proc/sys/kernel/core_pattern-based full container breakout.

This led us to do a complete audit for all write operations in runc, as
any write operation could potentially be redirected in a similar way --
we did not find any more problematic writes in our analysis but we are
still investigating the possibility of using lints or static analysis to
detect this kind of issue.

Potential mitigations for this issue include:

 * Using rootless containers, as doing so will block most of the
   inadvertent writes (runc would run with reduced privileges, making
   attempts to write to procfs files ineffective).

 * Based on our analysis, neither AppArmor nor SELinux can protect
   against the full version of the redirected write attack. The
   container runtime is generally privileged enough to write to
   arbitrary procfs files, which is more than sufficient to cause a
   container breakout.

   With SELinux, it is *possible* that the container_runtime_t label
   applied to runc will restrict how much runc can do with the no-op
   variant of the attack, but it seems to us that the
   /proc/sysrq-trigger host crash and /proc/sys/kernel/core_pattern
   container breakout attacks would still work.

Thanks to Li Fubang (@lifubang from acmcoder.com, CIIC) and Tõnis Tiigi
(@tonistiigi from Docker) for both independently discovering this
vulnerability, as well as Aleksa Sarai (@cyphar from SUSE) for the
original research into this class of security issues and solutions over
the past few years.

/*** Other Container Runtimes ***/

These issues are all very easy-to-make logic flaws, and as such we
contacted several other container runtimes to alert them of these issues
and provide them our analysis.

Our current understanding is that youki and crun have similar flaws and
are working on patches to be released in co-ordination with this
advisory. LXC appears to have some similar bugs but their security
policy is (understandably) that non-user-namespaced containers are
fundamentally insecure and thus such exploits are not security issues.

If you use a container runtime other than runc, please check whether
upstream has released a security update addressing these (or similar)
issues once this issue becomes public.

If you are a container runtime author that we did not contact, please
get in touch with me at <cyphar@...har.com> to get added to the
cross-runtime security group. Please note that this group is intended
for *low-level* container runtime *upstream maintainers* only.

/*** Extra Patches ***/

There were three issues with these patches which we became aware of
quite late in the embargo process. We have included new patches in the
released versions linked above to address two of them, but these patches
were not included in the pre-release tarballs provided to vendors:

 * *00*-openat2-improve-resilience-on-busy-systems.patch
 * *00*-rootfs-re-allow-dangling-symlinks-in-mount-targets.patch

Note that these are *NOT* security issues, they are usability
regressions that may affect some users depending on what images they use
and what kind of systems they run their containers on.

Below is the description provided to vendors, for your own reference,
but the issues listed have been fixed (with the exception of the last
issue, which is still being investigated).

/* openat2 EAGAIN Retry Failures */

openat2 will return -EAGAIN if there was a racing rename or mount when
trying to walk into ".." during a scoped lookup. On systems with heavy
load, this can happen fairly frequently. In the version of the patches
we merged, runc would retry every openat2 operation up to 32 times
before failing with an error in order to mitigate this while also
avoiding denial-of-service attacks.

Unfortunately, it seems this number was too conservative and some
vendors have reported seeing this error:

  runc run failed: unable to start container process: error during container init: error mounting "$source" to rootfs at "$destination": create mountpoint for $destination mount: lookup mountpoint target: securejoin.OpenInRoot $destination: openat2 $destination: possible attack detected

Based on my testing, the worst-case failure rate for this is probably
around 3% (this is based on figures from me running very aggressive
rename loops on all 16 cores of my laptop). It is probably lower for
production deployments that have less aggressive rename and mount churn,
but it was a detectable regression for some downstreams.

*00*-openat2-improve-resilience-on-busy-systems.patch is a patch that
resolves this issue. The simplest mitigation is to just bump the retry
number (which this patch does), but I have also included some additional
retries with a time-based deadline that in my testing should be
virtually impossible to hit even in very high load scenarios (I was
unable to hit the error even after running >50k tests in a tight loop).
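
The general shape of "a retry count plus a time-based deadline" can be
sketched in Python as follows. This is an illustration of the idea, not
the patch itself: the function name, the default values and the exact
way the two limits are combined are assumptions made for the example.

```python
import errno
import time


def retry_on_eagain(op, min_attempts=128, deadline_seconds=0.001,
                    clock=time.monotonic):
    """Call op() until it stops failing with EAGAIN.

    Gives up only once BOTH limits are exhausted: at least min_attempts
    tries have been made and deadline_seconds have passed since the first.
    """
    start = clock()
    attempts = 0
    while True:
        try:
            return op()
        except OSError as exc:
            if exc.errno != errno.EAGAIN:
                raise  # unrelated errors are not retried
        attempts += 1
        if attempts >= min_attempts and clock() - start >= deadline_seconds:
            raise RuntimeError(
                f"possible attack detected: still EAGAIN after {attempts} attempts"
            )
```

Keeping a hard stop matters: an attacker who can trigger renames
indefinitely must not be able to make the runtime spin forever, which is
why the retries stay bounded instead of looping until success.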

Some vendors have reported that this reduced the failure rate to
effectively 0 after 3-4 days of heavy load testing.

/* Dangling Symlink Mount Targets */

Due to the hardening work done for mounts in the provided patchsets, it
was necessary to block certain configurations that could not be done
safely in a reasonable way. One of these configurations is mount targets
that contain symlinks to non-existent paths (otherwise known as
"dangling symlinks"). With these patches, such configurations will
result in the following error:

  runc create failed: unable to start container process: error during container init: error mounting "$source" to rootfs at "$destination": create mountpoint for $destination mount: make mountpoint "$destination": file exists

The workaround is to either change the symlink to point to a real path
or create the target of the dangling symlink (previously, runc would do
this for you). A survey of public images indicates that this pattern is
incredibly rare (the one example I've been given is of a broken
/etc/resolv.conf symlink), and in addition these kinds of symlinks are
quite hard to deal with in a sane and safe manner.
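
If you want to check your own images for this pattern, something like
the following Python sketch can list dangling symlinks in an unpacked
root filesystem. It is a rough approximation under stated assumptions:
the function name is made up, absolute link targets are interpreted
relative to the given rootfs directory, and no attempt is made at safe
path resolution, so only run it against trees you already trust.

```python
import os


def find_dangling_symlinks(rootfs):
    """Return rootfs-relative paths of symlinks whose target does not exist."""
    dangling = []
    # os.walk() does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(rootfs):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.readlink(path)
            if os.path.isabs(target):
                # Approximation: treat the rootfs directory as "/".
                resolved = os.path.join(rootfs, target.lstrip("/"))
            else:
                resolved = os.path.join(dirpath, target)
            if not os.path.exists(resolved):
                dangling.append(os.path.relpath(path, rootfs))
    return sorted(dangling)
```

For the example mentioned above, an image whose /etc/resolv.conf points
at a path that is not present in the image would be reported as
"etc/resolv.conf".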

This change in behaviour was intentional, but after receiving reports
from more than one downstream, I took another look and wrote a hotfix
that should allow us to continue to support these broken symlinks.
*00*-rootfs-re-allow-dangling-symlinks-in-mount-targets.patch is that
patch.

However, we still strongly suggest users refrain from creating images
with such broken symlinks.

/* Issues with "-v /dev:/dev" */

At SUSE, we found an example of a developer tool creating a bind-mount
of the host /dev into the container. For reasons that are not entirely
clear to me yet, this setup appears to have worked previously but can
now lead to permission issues with rootless containers with our
mitigating patches, with typical errors looking like:

  exec failed: unable to start container process: reopen ptmx to get new pty pair: reopen fd 11: permission denied

I have not yet been able to root-cause this issue (I suspect that
ptmxmode=000 has some part to play here), but I would argue that such
setups are not particularly safe nor recommended, and users should
instead be doing --mount type=devpts,... if they have a strong need to
configure the /dev/pts mount (which is what our tool was trying to do
and had already been patched in newer versions to do properly).

If you have seen this issue or have any other information, feel free to
open a bug report.

/*** Credits ***/

Thanks again to the following researchers for helping discover and
report these vulnerabilities:

 * Lei Wang (@ssst0n3 from Huawei)
 * Li Fubang (@lifubang from acmcoder.com, CIIC)
 * Tõnis Tiigi (@tonistiigi from Docker)
 * Aleksa Sarai (@cyphar from SUSE)

Additional thanks go to Tõnis Tiigi for showing that Dockerfiles can be
used to exploit these issues, and thus providing us with some very
useful exploit templates for these kinds of race attacks.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

Download attachment "runc-patches-2025-11-05.tar.xz" of type "application/x-xz" (109568 bytes)

Download attachment "signature.asc" of type "application/pgp-signature" (266 bytes)
