Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 26 Jun 2018 21:13:43 +0200
From: Solar Designer <solar@...nwall.com>
To: owl-dev@...ts.openwall.com
Cc: Vasily Averin <vvs@...tuozzo.com>
Subject: Re: 32-bit syscall breakage in -431 kernel with KAISER

On Tue, Jun 26, 2018 at 08:00:49PM +0200, Pavel Kankovsky wrote:
> It turns out the handler is invoked with an unexpected stack frame in the 
> "trampoline stack" (the top of the "trampoline stack", as stored in 
> TSS.rsp0, is 0xffff810001a36278 in this example):
> 
> (gdb) x/6g $rsp
> 0xffff810001a36248:     0x00000000b7f81ea1      0x0000000000000073
> 0xffff810001a36258:     0x0000000000000246      0x00000000bfa78274
> 0xffff810001a36268:     0x000000000000007b      0x000000000000007b
> 
> There should be five entries (rip at the bottom, cs, eflags, rsp, and ss 
> at the top) only but the actual stack contains one mysterious and bogus 
> extra entry (the other 0x000000000000007b) at the top, shifting everything 
> else down by 8 bytes.
> 
> This stack frame (including the bogus extra entry) is copied to the 
> regular kernel stack. As far as I can tell, the same extra entry appears 
> in other interrupt handlers, e.g. common_interrupt.
> 
> Functions that get an explicit pointer to struct pt_regs are not affected 
> but anything that expects to find pt_regs at the top of the stack (e.g. 
> compat_alloc_user_space via task_pt_regs(current)) breaks.
> 
> It seems the extra entry is added because the top of the trampoline stack 
> is not aligned to 16 bytes and the CPU does not like it and enforces the 
> alignment, shifting the whole stack down wrt. the expected layout as a 
> result.

AMD64 Architecture Programmer's Manual, Volume 2, Section 8.9 mentions
16-byte alignment:

"Interrupt-Stack Alignment.  In legacy mode, the interrupt-stack pointer
can be aligned at any address boundary.  Long mode, however, aligns the
stack on a 16-byte boundary.  This alignment is performed by the
processor in hardware before pushing items onto the stack frame.  The
previous RSP is saved unconditionally on the new stack by the interrupt
mechanism.  A subsequent IRET instruction automatically restores the
previous RSP."

"Although the RSP alignment is always performed in long mode, it is only
of consequence when the interrupted program is already running at CPL=0,
and it is generally used only within the operating-system kernel.  The
operating system should put 16-byte aligned RSP values in the TSS for
interrupts that change privilege levels."

> I have modified struct tss_struct in include/asm-x86_64/processor.h to
> include an extra "unsigned long stack_padding" between stack_canary and 
> stack to make stack 16-byte aligned...
> 
> 
> ---snip---
> --- include/asm-x86_64/processor.h.orig	2018-06-26 12:11:17 +0000
> +++ include/asm-x86_64/processor.h	2018-06-26 16:50:55 +0000
> @@ -269,6 +269,7 @@
>  	 */
>  #ifndef __GENKSYMS__
>  	unsigned long		stack_canary;
> +	unsigned long		stack_padding;
>  	unsigned long           stack[64];
>  #endif
>  } __attribute__((packed, __aligned__(PAGE_SIZE)));
> ---snip---
> 
> 
> ...and lo! 32-bit ifconfig works as expected.

Wow.  But per my review of the full struct tss_struct, the stack[] field
offset is:

4+8*5+4*2+2*2+1025*8+8 = 8264

So it's not 16-byte aligned.  But that's the start of stack[], and we
need the stack pointer to be aligned.  Yet I suppose aligning stack[]
itself is in fact the most appropriate fix.

Reviewing the diffs between e.g. -423 and -431, I see this stack[] in
struct tss is in fact newly added:

@@ -245,7 +248,29 @@
         * 8 bytes, for an extra "long" of ~0UL
         */
        unsigned long io_bitmap[IO_BITMAP_LONGS + 1];
-} __attribute__((packed)) ____cacheline_aligned;
+       /*
+        *
+        * The Intel SDM says (Volume 3, 7.2.1):
+        *
+        *  Avoid placing a page boundary in the part of the TSS that the
+        *  processor reads during a task switch (the first 104 bytes). The
+        *  processor may not correctly perform address translations if a
+        *  boundary occurs in this area. During a task switch, the processor
+        *  reads and writes into the first 104 bytes of each TSS (using
+        *  contiguous physical addresses beginning with the physical address
+        *  of the first byte of the TSS). So, after TSS access begins, if
+        *  part of the 104 bytes is not physically contiguous, the processor
+        *  will access incorrect information without generating a page-fault
+        *  exception.
+        *
+        * There are also a lot of errata involving the TSS spanning a page
+        * boundary.  Assert that we're not doing that.
+        */
+#ifndef __GENKSYMS__
+       unsigned long           stack_canary;
+       unsigned long           stack[64];
+#endif
+} __attribute__((packed, __aligned__(PAGE_SIZE)));

Do you think this is a RHEL5 bug?

Thanks,

Alexander

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.