Date: Tue, 9 May 2017 06:00:01 -0700
From: Andy Lutomirski <>
To: Christoph Hellwig <>
Cc: Ingo Molnar <>, Greg KH <>, 
	Thomas Garnier <>, Martin Schwidefsky <>, 
	Heiko Carstens <>, Dave Hansen <>, 
	Arnd Bergmann <>, Thomas Gleixner <>, David Howells <>, 
	René Nyffenegger <>, 
	Andrew Morton <>, 
	"Paul E . McKenney" <>, "Eric W . Biederman" <>, 
	Oleg Nesterov <>, Pavel Tikhomirov <>, 
	Ingo Molnar <>, "H . Peter Anvin" <>, Andy Lutomirski <>, 
	Paolo Bonzini <>, Rik van Riel <>, Kees Cook <>, 
	Josh Poimboeuf <>, Borislav Petkov <>, Brian Gerst <>, 
	"Kirill A . Shutemov" <>, 
	Christian Borntraeger <>, Russell King <>, 
	Will Deacon <>, Catalin Marinas <>, 
	Mark Rutland <>, James Morse <>, 
	linux-s390 <>, LKML <>, 
	Linux API <>, "the arch/x86 maintainers" <>, 
	Kernel Hardening <>, 
	Linus Torvalds <>, Peter Zijlstra <>
Subject: Re: Re: [PATCH v9 1/4] syscalls: Verify address
 limit before returning to user-mode

On Tue, May 9, 2017 at 1:56 AM, Christoph Hellwig <> wrote:
> On Tue, May 09, 2017 at 08:45:22AM +0200, Ingo Molnar wrote:
>> We only have ~115 code blocks in the kernel that set/restore KERNEL_DS, it would
>> be a pity to add a runtime check to every system call ...
> I think we should simply strive to remove all of them that aren't
> in core scheduler / arch code.  Basically every time we do the
>         oldfs = get_fs();
>         set_fs(KERNEL_DS);
>         ..
>         set_fs(oldfs);
> trick we're doing something wrong, and there should always be better
> ways to achieve it.  E.g. using iov_iter with an ITER_KVEC type
> consistently would already remove most of them.

How about trying to remove all of them?  If we could actually get rid
of all of them, we could drop the arch support, and we'd get faster,
simpler, shorter uaccess code throughout the kernel.

The ones in kernel/compat.c are generally garbage.  They should be
using compat_alloc_user_space().  Ditto for kernel/power/user.c.

flush_module_icache() is a potentially silly arch thing.  Does the
code in kernel/module.c that uses set_fs() actually work?

kernel/signal.c's set_fs() is laziness.

__probe_kernel_read() and __probe_kernel_write() use set_fs(), but
that usage only matters on sane arches* like s390x.  We should
arguably have a set_uaccess_address_space() or similar for this
purpose that's a nop on normal arches like x86.

fs/splice.c has some, ahem, interesting uses that have been the source
of nasty exploits in the past.  Converting them to use iov_iter
properly would be really, really nice.  Christoph, I don't suppose
you'd like to do that?
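
To make the iov_iter point concrete, here is a rough user-space sketch, not kernel code.  Every name in it (toy_iter, ITER_USER, ITER_KERNEL, toy_user_mem, toy_is_user_ptr, toy_copy_to_iter) is made up for illustration and only loosely modelled on the kernel's iov_iter.  The idea it tries to show: if the iterator itself says whether the buffer is a user or a kernel buffer, the copy routine can pick the right path per call and nobody has to widen a task-wide address limit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kernel's iov_iter types. */
enum iter_type { ITER_USER, ITER_KERNEL };

struct toy_iter {
    enum iter_type type;  /* states explicitly where the buffer lives */
    char *base;           /* current position in the destination */
    size_t len;           /* bytes remaining in the destination */
};

/* Pretend user address space: only this array counts as "user memory". */
static char toy_user_mem[64];

/* Range check for user-tagged buffers (integer compare avoids comparing
 * pointers into unrelated objects). */
static bool toy_is_user_ptr(const void *p, size_t len)
{
    uintptr_t start = (uintptr_t)toy_user_mem;
    uintptr_t addr = (uintptr_t)p;

    return len <= sizeof(toy_user_mem) && addr >= start &&
           addr - start <= sizeof(toy_user_mem) - len;
}

/* Copy up to n bytes into the iterator.  Returns the number of bytes
 * copied, or 0 if a user-tagged destination fails the range check. */
static size_t toy_copy_to_iter(const void *src, size_t n, struct toy_iter *it)
{
    if (n > it->len)
        n = it->len;
    if (it->type == ITER_USER && !toy_is_user_ptr(it->base, n))
        return 0;  /* user-tagged buffers are always range-checked */
    memcpy(it->base, src, n);
    it->base += n;
    it->len -= n;
    return n;
}
```

In this model a kernel buffer tagged ITER_KERNEL is copied directly, while the same buffer tagged ITER_USER is rejected, because the check depends on the tag carried by the iterator rather than on any global state.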

The others seem to mostly be fixable, but I haven't looked that closely.

Overall, I suspect that a big part of why mitigations like the one
being discussed in this thread were developed is that addr_limit
used to be on the stack, making it (along with restart_block) a really
nice target.  This is fixed now on x86, arm64, and s390x, I believe,
and other arches can easily opt in to the fix.
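
For anyone following along, here is a rough user-space model of why a stale addr_limit matters, not kernel code.  get_fs, set_fs, access_ok, USER_DS and KERNEL_DS borrow the kernel's names but are re-defined here as toys; struct task, buggy_helper() and addr_limit_ok_on_exit() are made up for illustration, and the constants are arbitrary.  It shows a helper that forgets to restore the limit on an error path, and the kind of check on the way back to user mode that would catch it.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t addr_t;

/* Toy values: highest "user" address, and "no limit at all". */
#define USER_DS   ((addr_t)0x00007fffffffffffULL)
#define KERNEL_DS ((addr_t)UINT64_MAX)

/* Hypothetical per-task state holding the address limit. */
struct task {
    addr_t addr_limit;
};

static addr_t get_fs(const struct task *t) { return t->addr_limit; }
static void set_fs(struct task *t, addr_t limit) { t->addr_limit = limit; }

/* Toy access_ok(): the range [addr, addr + size) must not exceed the limit. */
static bool access_ok(const struct task *t, addr_t addr, addr_t size)
{
    if (size == 0)
        return true;
    return addr <= t->addr_limit && size - 1 <= t->addr_limit - addr;
}

/* A buggy helper: the error path returns without restoring the old limit. */
static int buggy_helper(struct task *t, bool fail)
{
    addr_t oldfs = get_fs(t);

    set_fs(t, KERNEL_DS);
    if (fail)
        return -1;  /* bug: addr_limit stays at KERNEL_DS */
    set_fs(t, oldfs);
    return 0;
}

/* Toy version of the exit check: true if the limit is sane for user mode. */
static bool addr_limit_ok_on_exit(const struct task *t)
{
    return t->addr_limit == USER_DS;
}
```

After the failing call, access_ok() in this model accepts kernel addresses on behalf of user space, which is exactly the state the exit check is meant to detect.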

* I'm strongly in favor of arches that have totally separate user and
kernel address spaces.  Sadly, the most common arches don't do this.
