Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 24 Mar 2020 13:32:29 -0700
From: Kees Cook <>
To: Thomas Gleixner <>
Cc: Kees Cook <>,
	Elena Reshetova <>,,
	Andy Lutomirski <>,
	Peter Zijlstra <>,
	Catalin Marinas <>,
	Will Deacon <>,
	Mark Rutland <>,
	Alexander Potapenko <>,
	Ard Biesheuvel <>,
	Jann Horn <>,
	"Perla, Enrico" <>,,,,
Subject: [PATCH v2 3/5] stack: Optionally randomize kernel stack offset each syscall

This provides the ability for architectures to enable kernel stack base
address offset randomization. This feature is controlled by the boot
param "randomize_kstack_offset=on/off", with its default value set by

This feature is based on the original idea from the last public release
of PaX's RANDKSTACK feature:
All the credit for the original idea goes to the PaX team. Note that
the design and implementation of this upstream randomize_kstack_offset
feature differs greatly from the RANDKSTACK feature (see below).

Reasoning for the feature:

This feature aims to make harder the various stack-based attacks that
rely on deterministic stack structure. We have had many such attacks in
past (just to name few):

As Linux kernel stack protections have been constantly improving
(vmap-based stack allocation with guard pages, removal of thread_info,
STACKLEAK), attackers have to find new ways for their exploits to work.
They have done so, continuing to rely on the kernel's stack determinism,
in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT were not
relevant. For example, the following recent attacks would have been
hampered if the stack offset was non-deterministic between syscalls:

The main idea is that since the stack offset is randomized upon each
system call, it is hard for an attack to reliably land in any particular
place on the thread stack, even with address exposures, as the stack base
will change on the next syscall. Also, since randomization is performed
after placing pt_regs, the ptrace-based approach[1] to discover the
randomized offset during a long-running syscall should not be possible.

Design description:

During most of the kernel's execution, it runs on the "thread stack",
which is allocated at fork.c/dup_task_struct() and stored in a per-task
variable (tsk->stack). Since stack is growing downward, the stack
top can be always calculated using task_top_of_stack(tsk) function,
which essentially returns an address of tsk->stack + stack size. When
VMAP_STACK is enabled, the thread stack is allocated from vmalloc space.

The thread stack is pretty deterministic in its structure -- fixed in
size, and upon every entry from a userspace to kernel on a syscall the
thread stack is started to be constructed from an address fetched from a
per-cpu cpu_current_top_of_stack variable. The first element to be pushed
to the thread stack is the pt_regs struct that stores all required CPU
registers and syscall parameters.

The goal of randomize_kstack_offset feature is to add a random offset
after the pt_regs has been pushed to the stack and the rest of thread
stack (used during the syscall processing) every time a process issues
a syscall. The source of randomness is currently arch-defined (but x86
is using the low byte of rdtsc()). Future improvements for different
entropy sources is possible, but out of scope for this patch. The offset
is added using alloca() call since it helps avoiding changes in assembly
syscall entry code and unwinder, and provides correct stack alignment
as defined by the compiler.

In order to make this available by default with zero performance impact
for those that don't want it, now it is selectable with static branches.
This way, if the overhead is not wanted, it can just be turned off.

Using the per-cpu variable as the entropy source and __builtin_alloc()
for stack adjustment and alignment, the generated assembly for x86_64
with GCC looks like this:

ffffffff81003977: 65 8b 05 02 ea 00 7f  mov %gs:0x7f00ea02(%rip),%eax
					    # 12380 <kstack_offset>
ffffffff8100397e: 25 ff 03 00 00        and $0x3ff,%eax
ffffffff81003983: 48 83 c0 0f           add $0xf,%rax
ffffffff81003987: 25 f8 07 00 00        and $0x7f8,%eax
ffffffff8100398c: 48 29 c4              sub %rax,%rsp
ffffffff8100398f: 48 8d 44 24 0f        lea 0xf(%rsp),%rax
ffffffff81003994: 48 83 e0 f0           and $0xfffffffffffffff0,%rax

As a result of the above stack alignment, this patch introduces about
5 bits of randomness after pt_regs is spilled to the thread stack on
x86_64, and 6 bits on x86_32 (since its has 1 fewer bits required for
stack alignment). The amount of entropy could be adjusted based on how
much of the stack space we wish to trade for security.

My measure of syscall performance overhead (on x86_64):

lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null
    randomize_kstack_offset=y	Simple syscall: 0.7082 microseconds
    randomize_kstack_offset=n	Simple syscall: 0.7016 microseconds

So, roughly 0.9% overhead growth for a no-op syscall, which is very
manageable. And for people that don't want this, it's off by default.

Comparison to PaX RANDKSTACK feature:

The RANDKSTACK feature randomizes the location of the stack start
(cpu_current_top_of_stack), i.e. including the location of pt_regs
structure itself on the stack. Initially this patch followed the same
approach, but during the recent discussions[2], it has been determined
to be of a little value since, if ptrace functionality is available for
an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write
different offsets in the pt_regs struct, observe the cache behavior of
the pt_regs accesses, and figure out the random stack offset. Another
difference is that the random offset is stored in a per-cpu variable,
rather than having it be per-thread. As a result, these implementations
differ a fair bit in their implementation details and results, though
obviously the intent is similar.


Co-developed-by: Elena Reshetova <>
Signed-off-by: Elena Reshetova <>
Signed-off-by: Kees Cook <>
- move to per-cpu rdtsc() saved on syscall exit
- add static branches for zero-cost dynamic enabling
- Kconfig just selects the default state of static branch
- __builtin_alloca() produces ugly asm without -fno-stack-clash-protection
- made arch agnostic
 Makefile                         |  4 ++++
 arch/Kconfig                     | 19 +++++++++++++++
 include/linux/randomize_kstack.h | 40 ++++++++++++++++++++++++++++++++
 init/main.c                      | 23 ++++++++++++++++++
 4 files changed, 86 insertions(+)
 create mode 100644 include/linux/randomize_kstack.h

diff --git a/Makefile b/Makefile
index 171f2b004c8a..c99463406522 100644
--- a/Makefile
+++ b/Makefile
@@ -779,6 +779,10 @@ ifdef CONFIG_INIT_STACK_ALL
 KBUILD_CFLAGS	+= -ftrivial-auto-var-init=pattern
+# While VLAs have been removed, GCC produces unreachable stack probes
+# for the random_kstack_offset feature. Disable it for all compilers.
+KBUILD_CFLAGS	+= $(call cc-option,-fno-stack-clash-protection,)
 DEBUG_CFLAGS	:= $(call cc-option, -fno-var-tracking-assignments)
diff --git a/arch/Kconfig b/arch/Kconfig
index 17fe351cdde0..619a56da4b76 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -854,6 +854,25 @@ config VMAP_STACK
 	  virtual mappings with real shadow memory, and KASAN_VMALLOC must
 	  be enabled.
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stack
+	  offset randomization with calls to add_random_kstack_offset()
+	  during syscall entry and choose_random_kstack_offset() during
+	  syscall exit.
+	bool "Randomize kernel stack offset on syscall entry"
+	help
+	  The kernel stack offset can be randomized (after pt_regs) by
+	  roughly 5 bits of entropy, frustrating memory corruption
+	  attacks that depend on stack address determinism or
+	  cross-syscall address exposures. This feature is controlled
+	  by kernel boot param "randomize_kstack_offset=on/off", and this
+	  config chooses the default boot state.
 	def_bool n
diff --git a/include/linux/randomize_kstack.h b/include/linux/randomize_kstack.h
new file mode 100644
index 000000000000..651ba9504568
--- /dev/null
+++ b/include/linux/randomize_kstack.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#include <linux/kernel.h>
+#include <linux/jump_label.h>
+#include <linux/percpu-defs.h>
+			 randomize_kstack_offset);
+DECLARE_PER_CPU(u32, kstack_offset);
+ * Do not use this anywhere else in the kernel. This is used here because
+ * it provides an arch-agnostic way to grow the stack with correct
+ * alignment. Also, since this use is being explicitly masked to a max of
+ * 10 bits, stack-clash style attacks are unlikely. For more details see
+ * "VLAs" in Documentation/process/deprecated.rst
+ */
+void *__builtin_alloca(size_t size);
+#define add_random_kstack_offset() do {					\
+				&randomize_kstack_offset)) {		\
+		u32 offset = this_cpu_read(kstack_offset);		\
+		char *ptr = __builtin_alloca(offset & 0x3FF);		\
+		asm volatile("" : "=m"(*ptr));				\
+	}								\
+} while (0)
+#define choose_random_kstack_offset(rand) do {				\
+				&randomize_kstack_offset)) {		\
+		u32 offset = this_cpu_read(kstack_offset);		\
+		offset ^= (rand);					\
+		this_cpu_write(kstack_offset, offset);			\
+	}								\
+} while (0)
diff --git a/init/main.c b/init/main.c
index ee4947af823f..78fe3aea00b0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -777,6 +777,29 @@ static void __init mm_init(void)
+			   randomize_kstack_offset);
+DEFINE_PER_CPU(u32, kstack_offset);
+static int __init early_randomize_kstack_offset(char *buf)
+	int ret;
+	bool bool_result;
+	ret = kstrtobool(buf, &bool_result);
+	if (ret)
+		return ret;
+	if (bool_result)
+		static_branch_enable(&randomize_kstack_offset);
+	else
+		static_branch_disable(&randomize_kstack_offset);
+	return 0;
+early_param("randomize_kstack_offset", early_randomize_kstack_offset);
 void __init __weak arch_call_rest_init(void)

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.