Date: Thu, 16 Nov 2017 16:54:10 -0800
From: Kees Cook <keescook@...omium.org>
To: Patrick McLean <chutzpah@...too.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
    Emese Revfy <re.emese@...il.com>,
    Al Viro <viro@...iv.linux.org.uk>,
    Bruce Fields <bfields@...hat.com>,
    "Darrick J. Wong" <darrick.wong@...cle.com>,
    Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
    Linux NFS Mailing List <linux-nfs@...r.kernel.org>,
    stable <stable@...r.kernel.org>,
    Thorsten Leemhuis <regressions@...mhuis.info>,
    "kernel-hardening@...ts.openwall.com" <kernel-hardening@...ts.openwall.com>
Subject: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11

On Mon, Nov 13, 2017 at 2:48 PM, Patrick McLean <chutzpah@...too.org> wrote:
> On 2017-11-11 09:31 AM, Linus Torvalds wrote:
>> Boris Lukashev points out that Patrick should probably check a newer
>> version of gcc.
>>
>> I looked around, and in one of the emails, Patrick said:
>>
>> "No changes, both the working and broken kernels were built with
>> distro-provided gcc 5.4.0 and binutils 2.28.1"
>>
>> and gcc-5.4.0 is certainly not very recent. It's not _ancient_, but
>> it's a bug-fix release to a pretty old branch that is not exactly new.
>>
>> It would probably be good to check if the problems persist with gcc
>> 6.x or 7.x.. I have no idea which gcc version the randstruct people
>> tend to use themselves.
>
> I just tested it with gcc 7.2, and was able to reproduce the NULL
> pointer dereference, the backtrace looks slightly different this time.
>
> I will also test with binutils 2.29, though I doubt that will make any
> difference.
>
>> [ 56.165181] BUG: unable to handle kernel NULL pointer dereference at 0000000000000560
>> [ 56.166563] IP: vfs_statfs+0x7c/0xc0
>> [ 56.167249] PGD 0 P4D 0
>> [ 56.167860] Oops: 0000 [#1] SMP
>> [ 56.176478] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_multiport xt_addrtype iptable_mangle iptable>
>> [ 56.180227] CPU: 0 PID: 3985 Comm: nfsd Tainted: G O 4.14.0-git-kratos-1 #1
>> [ 56.181728] Hardware name: TYAN S5510/S5510, BIOS V2.02 03/12/2013
>> [ 56.182729] task: ffff88040c412a00 task.stack: ffffc90002c18000
>> [ 56.183629] RIP: 0010:vfs_statfs+0x7c/0xc0
>> [ 56.184341] RSP: 0018:ffffc90002c1bb28 EFLAGS: 00010202
>> [ 56.185143] RAX: 0000000000000000 RBX: ffffc90002c1bbf0 RCX: 0000000000000020
>> [ 56.186085] RDX: 0000000000001801 RSI: 0000000000001801 RDI: 0000000000000000
>> [ 56.187066] RBP: ffffc90002c1bbc0 R08: ffffffffffffff00 R09: 00000000000000ff
>> [ 56.188268] R10: 000000000038be3a R11: ffff880408b18258 R12: 0000000000000000
>> [ 56.189336] R13: ffff88040c23ad00 R14: ffff88040b874000 R15: ffffc90002c1bbf0
>> [ 56.190444] FS: 0000000000000000(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000
>> [ 56.191876] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 56.192843] CR2: 0000000000000560 CR3: 0000000001e0a002 CR4: 00000000001606f0
>> [ 56.193898] Call Trace:
>> [ 56.194510] nfsd4_encode_fattr+0x201/0x1f90
>> [ 56.195267] ? generic_permission+0x12c/0x1a0
>> [ 56.196025] nfsd4_encode_getattr+0x25/0x30
>> [ 56.196753] nfsd4_encode_operation+0x98/0x1b0
>> [ 56.197526] nfsd4_proc_compound+0x2a0/0x5e0
>> [ 56.198268] nfsd_dispatch+0xe8/0x220
>> [ 56.198968] svc_process_common+0x475/0x640
>> [ 56.199696] ? nfsd_destroy+0x60/0x60
>> [ 56.200404] svc_process+0xf2/0x1a0
>> [ 56.201079] nfsd+0xe3/0x150
>> [ 56.201706] kthread+0x117/0x130
>> [ 56.202354] ? kthread_create_on_node+0x40/0x40
>> [ 56.203100] ret_from_fork+0x25/0x30
>> [ 56.203774] Code: d6 89 d6 81 ce 00 04 00 00 f6 c1 08 0f 45 d6 89 d6 81 ce 00 08 00 00 f6 c1 10 0f 45 d6 89 d6 81 ce>
>> [ 56.206289] RIP: vfs_statfs+0x7c/0xc0 RSP: ffffc90002c1bb28
>> [ 56.207110] CR2: 0000000000000560
>> [ 56.207763] ---[ end trace d452986a80f64aaa ]---
>
>> On Sat, Nov 11, 2017 at 8:13 AM, Kees Cook <keescook@...omium.org> wrote:
>>>
>>> I'll take a closer look at this and see if I can provide something to
>>> narrow it down.

How reliable is this crash?

The best idea I have to isolate it would be to bisect the additions of
the __randomize_layout markings on various structures. I would start
with the ones Al is most upset to see randomized. ;)

All that said, I'd like to better understand the BIOS side of this a
little better. In the first email in this thread, you showed two BUGs
separated by a little time, which implies to me that the NULL deref
and the BIOS no longer POSTing are separate (though seemingly related)
issues. Have you had machines survive the BUG without blowing up the
BIOS?

I'm still trying to wrap my head around how the BIOS could be blowing
up. I assume there's some magic memory address that is getting poked
as a result of some struct randomization bug, so tracking that down
should be possible assuming you can stand reflashing your BIOS across
the bisects.

For the first step, I'd try a revert of
9225331b310821760f39ba55b00b8973602adbb5, which enables a large
portion of struct randomization. If that doesn't change things, I can
provide a series that reverts 3859a271a003aba01e45b85c9d8b355eb7bf25f9
and then re-applies __randomize_layout one structure per patch, and
you could bisect that?

-Kees

-- 
Kees Cook
Pixel Security
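
The bisect proposed above, over a series that re-applies __randomize_layout one structure per patch, comes down to a binary search for the first patch that brings the crash back. What follows is only a minimal sketch of that search logic, not the actual series or any real tooling. The function name `first_bad_patch`, the `crashes_with` callback, and the patch labels are all hypothetical. The sketch assumes the crash reproduces reliably, is absent with no patches applied, is present with all of them applied, and persists once introduced.

```python
from typing import Callable, Sequence


def first_bad_patch(patches: Sequence[str],
                    crashes_with: Callable[[int], bool]) -> int:
    """Return the index of the first patch that makes the crash appear.

    crashes_with(n) is a hypothetical callback: it should build and boot
    a kernel with the first n patches applied and report whether the
    crash reproduces. Assumed: crashes_with(0) is False and
    crashes_with(len(patches)) is True.
    """
    if not patches:
        raise ValueError("need at least one patch to bisect")

    good = 0             # largest patch count known not to crash
    bad = len(patches)   # smallest patch count known to crash

    # Halve the search interval until good and bad are adjacent.
    while bad - good > 1:
        mid = (good + bad) // 2
        if crashes_with(mid):
            bad = mid
        else:
            good = mid

    # Applying `bad` patches crashes and `bad - 1` does not, so the
    # culprit is the patch at index bad - 1.
    return bad - 1


if __name__ == "__main__":
    # Placeholder labels, not real structure names from the kernel.
    series = ["struct_a", "struct_b", "struct_c", "struct_d", "struct_e"]
    # Pretend the crash appears once the fourth patch is applied.
    culprit = first_bad_patch(series, lambda n: n >= 4)
    print(series[culprit])
```

Each call to `crashes_with` stands for one build-and-boot cycle, so a series of N patches needs roughly log2(N) test boots. That matters here if a bad boot can mean reflashing the BIOS.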