lkrg-users - user-triggerable Oops on Linux 4.17+ (64-bit only)

Message-ID: <20200707154738.GA2846@openwall.com>
Date: Tue, 7 Jul 2020 17:47:38 +0200
From: Solar Designer <solar@...nwall.com>
To: lkrg-users@...ts.openwall.com
Subject: user-triggerable Oops on Linux 4.17+ (64-bit only)

Hi,

This is a heads-up that there's an important bug fix commit now in the
LKRG repo:

commit b3a499e7f6071e00338f173b2c614227e810397e
Author: Adam_pi3 <pi3@....com.pl>
Date:   Sat Jul 4 16:11:17 2020 -0400

    Fix user-triggerable Oops (dereference of a near-NULL pointer) on newer kernels with new syscall implementation.  Found by Jason A.  Donenfeld.

We recommend all users of LKRG on Linux 4.17 or newer on x86_64 or arm64
to update to a revision of LKRG with the above fix included.

We intend to release LKRG 0.8.1 including this fix shortly.

Bug impact:

As long as the kernel's mmap_min_addr works as intended, the impact of
near-NULL pointer dereference bugs is limited.  On systems that don't
set panic_on_oops and don't use lockdown, this is just a nuisance and
some information getting in the logs.  On systems that set panic_on_oops
(most notably, RHEL and its clones), this is a DoS (kernel panic).

Finally, as Jason A. Donenfeld pointed out, there's a shortcoming in the
kernel's lockdown mechanism where root may disable mmap_min_addr thus
making many of the (near-)NULL pointer dereference bugs exploitable into
lockdown bypasses by root (thus, for escalation from root to ring 0).
We didn't evaluate whether this particular bug is usable as a lockdown
bypass or not.  For this to matter, LKRG would need to be signed and
used along with lockdown, which we think is currently unusual.

Bug origin:

Linux 4.17+ includes a major change to how syscalls are handled within
the kernel (see the patch series starting with "[PATCH 000/109] remove
in-kernel calls to syscalls"), in particular introducing
CONFIG_ARCH_HAS_SYSCALL_WRAPPER and enabling it on x86_64 and arm64.

This change matters to modules like LKRG where we hook syscalls and need
to retrieve their arguments.  Thus, LKRG needed to be updated to
support Linux 4.17+ on those architectures, which Adam did with the
corresponding major update on August 14, 2018.  Unfortunately, the
delete_module() syscall hooks were overlooked, and continued to use the
old convention, which Linux 4.17+ on those architectures no longer uses.

Bug detail:

The affected code in LKRG is only reached when the delete_module()
syscall fails, which it normally does not.  This is what enabled the bug
to stay unnoticed for this long.
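For reference, a failing delete_module() call is easy to produce from user space.  The sketch below is only an illustration of such a call, not the reproducer used in our testing; the module name is a made-up placeholder, and the function name is hypothetical.  The call is expected to fail either for lack of privileges or because no such module is loaded.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Attempt to unload a module that should not exist.
 * Returns 0 if the call unexpectedly succeeded, otherwise the errno value. */
int try_failing_delete_module(void)
{
    long ret;

    errno = 0;
    /* "no_such_module_example" is a placeholder name (assumption). */
    ret = syscall(SYS_delete_module, "no_such_module_example", O_NONBLOCK);
    if (ret == 0)
        return 0;
    return errno;
}
```

On a system running an unfixed LKRG on an affected kernel, a call like this may reach the buggy path, so it is best tried only on a test machine.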

The specific discrepancy in calling conventions results in LKRG setting
an unintended register to -1, which the kernel later uses as a pointer
and tries to read from an offset relative to that pointer, resulting in
a read from a near-NULL address (in our testing, from address 0x6f).
Since nothing can normally be mapped at that address due to
mmap_min_addr, this results in an instant kernel Oops, killing the
process that attempted the failed delete_module() call.
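The address arithmetic can be sketched as follows.  This is one reading that happens to be consistent with the 0x6f address, not a statement of which field the kernel actually read: it assumes the clobbered register held the pointer to the saved registers and that the field read was "di", which sits at offset 0x70 in the x86_64 struct pt_regs.  The struct below is a user-space mock that mirrors the order of the leading fields; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space mock of the leading fields of x86_64 struct pt_regs. */
struct mock_pt_regs {
    uint64_t r15, r14, r13, r12, bp, bx;
    uint64_t r11, r10, r9, r8;
    uint64_t ax, cx, dx, si, di;
    uint64_t orig_ax;
};

/* Address accessed when a field at field_offset is read through a bogus
 * pointer.  Unsigned addition wraps modulo 2^64, as pointer math would. */
uint64_t faulting_address(uint64_t bogus_pointer, size_t field_offset)
{
    return bogus_pointer + (uint64_t)field_offset;
}

/* Pointer value -1 plus the assumed field offset. */
uint64_t example_fault(void)
{
    return faulting_address((uint64_t)-1,
                            offsetof(struct mock_pt_regs, di));
}
```

Under these assumptions, example_fault() evaluates to 0x6f: the -1 wraps around, leaving the field offset minus one.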

Reminder to users:

As we have written on the LKRG homepage from the very beginning, and now
also in CONCEPTS since LKRG 0.8:

"Like any software, LKRG may contain bugs and some of those might even
be new security vulnerabilities.  You need to weigh the benefits vs.
risks of using LKRG, considering that LKRG is most useful on systems
that realistically, despite of this being a best practice for security,
won't be promptly rebooted into new kernels (nor live-patched) whenever
a new kernel vulnerability is discovered.

LKRG is currently in an experimental stage.  We expect occasional false
positives [...]"

Luckily, the bug's impact is typically limited to what could have been
the impact of some LKRG false positives (kernel panic if that response
to certain issues is enabled in the configuration), which are
unfortunately the expected kind of occasional issues when using LKRG.

The only additional impact we're currently aware this bug might have is
lockdown bypass by root.

Thus, this is more of a near-miss (or near-hit if you like) than a
full-blown LKRG vulnerability.  Regardless, this is a reminder to LKRG
users of the risks associated with its use, and of the need to weigh the
benefits against such risks.

Lessons to learn for developers:

This is also an opportunity for us to try and see what we could possibly
have done to avoid this bug or to detect it promptly, so that we're more
likely to avoid or promptly detect other bugs.

Both the bug and the fact that it was overlooked are in part a result of
LKRG trying to support multiple, changing kernel versions while needing
to be aware of those kernels' specifics.  This is unavoidable without
hurting LKRG's usefulness.

Nevertheless, here are some points we identified:

1. Fuzzing.  So far, we've been stress-testing and benchmarking LKRG
with valid inputs, and we've been testing kernel vulnerability exploits.
However, we haven't been deliberately throwing arbitrary invalid inputs
at systems with LKRG loaded.  We should.  Simply running Trinity as
non-root might have caught this bug.  Does anyone in the community
want to help with this going forward?

2. Limit symbol visibility.  If a symbol isn't currently used from
outside of a source file, we should actively break such unexpected uses.
For .c files, this means using the "static" keyword where possible
(something I've told Adam before).  For .h files (as in this case),
this means either moving code to .c files or adopting a (re)naming
convention that marks header-internal symbol names, e.g. with a
new "ph_" prefix instead of Adam's usual "p_".

3. Reduce source code duplication.  Mariusz Zaborski started work on
this, with some changes already included in 0.8, and we should do more.

4. Reduce source code size by other means as well, so that we'd have a
better chance to notice issues in what's left.  I am suggesting to Adam
what functionality we might better drop from LKRG while only minimally
reducing its usefulness and effectiveness against attacks.  (I think a
primary candidate for dropping is validation of waking-up tasks.  Such
validation was originally an idea I shared while we were brainstorming,
but I no longer liked it when Adam started to implement it and ran into
some complications.)

5. Knowledge transfer on LKRG internals and development conventions from
Adam to another capable developer, so that Adam wouldn't be the only one
who could have noticed a bug like this from a look at the buggy code
without needing further context.
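As an illustration of point 2, here is a minimal sketch of both approaches.  The function names are hypothetical and are not taken from the LKRG source; only the "p_" and "ph_" prefixes come from the text above.

```c
/* Header-internal helper: "static inline" alone cannot stop every file
 * that includes the header from calling it, so the "ph_" prefix signals
 * that it is not meant for use outside the header. */
static inline int ph_len_in_range(int len)
{
    return len >= 0 && len < 64;
}

/* Source-file-internal helper: "static" gives it internal linkage, so a
 * use from another .c file fails at link time rather than silently
 * working. */
static int p_validate_len(int len)
{
    return ph_len_in_range(len);
}

/* The only symbol intended to be visible to other source files. */
int p_check_name_len(int len)
{
    return p_validate_len(len) ? 0 : -1;
}
```

With this layout, p_check_name_len(10) returns 0 and p_check_name_len(64) returns -1, while any attempt to call p_validate_len() from another source file is rejected by the linker.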

We'd appreciate any comments from the lkrg-users community.

Thanks,

Alexander