oss-security - cve request: local DoS by overflowing kernel mount table using shared bind mount

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <1929364718.4484556.1468421564523.JavaMail.zimbra@redhat.com>
Date: Wed, 13 Jul 2016 10:52:44 -0400 (EDT)
From: CAI Qian <caiqian@...hat.com>
To: oss-security@...ts.openwall.com
Cc: cve-assign@...re.org
Subject: cve request: local DoS by overflowing kernel mount table using
 shared bind mount

Below is the discussion between myself and staffs from security@...nel.org.
   CAI Qian

=== Initial Report ===
It was reported that the mount table expands by a power-of-two
with each bind mount command. This is a change of behavior
against the older kernel (i.e., 2.6.18). Hence, the older kernel
won't be affected or harder to exploit.

If the system is configured in the way that a non-root user
allows bind mount even if with limit number of bind mount
allowed, a non-root user could cause a local DoS by quickly
overflow the mount table. For example,

$ cat /etc/fstab
...
/tmp/1        /tmp/2        none        user,bind        0         0

Even if the application could say only allow 30 bind mounts to
increase the security.

$ for i in `seq 1 20`; do mount /tmp/1; done
$ mount | wc -l
1048606

Once this happened, it will cause a deadlock for the whole
system,

[  361.301885] NMI backtrace for cpu 0
[  361.302352] CPU: 0 PID: 29 Comm: kworker/0:1 Not tainted 3.10.0-327.10.1.el7.x86_64 #1
[  361.303062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
[  361.303882] Workqueue: events qxl_fb_work [qxl]
[  361.304379] task: ffff88013943d080 ti: ffff8801395ec000 task.ti: ffff8801395ec000
[  361.305057] RIP: 0010:[<ffffffff811e0b33>]  [<ffffffff811e0b33>] prune_super+0x23/0x170
[  361.305784] RSP: 0000:ffff8801395ef5c0  EFLAGS: 00000206
[  361.306318] RAX: 0000000000000080 RBX: ffff8801394243b0 RCX: 0000000000000000
[  361.306973] RDX: 0000000000000000 RSI: ffff8801395ef710 RDI: ffff8801394243b0
[  361.307628] RBP: ffff8801395ef5e8 R08: 0000000000000000 R09: 0000000000000040
[  361.308282] R10: 0000000000000000 R11: 0000000000000220 R12: ffff8801395ef710
[  361.308939] R13: ffff880139424000 R14: ffff8801395ef710 R15: 0000000000000000
[  361.309599] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[  361.310320] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  361.310893] CR2: 00007fea0c55dc3d CR3: 00000000b770d000 CR4: 00000000003406f0
[  361.311561] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  361.312227] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  361.312894] Stack:
[  361.313223]  0000000000000400 ffff8801395ef710 ffff8801394243b0 0000000000000258
[  361.313950]  0000000000000000 ffff8801395ef688 ffffffff8117c46b 0000000000000000
[  361.314858]  ffff8801395ef630 ffffffff811d5e21 ffff8801395ef720 0000000000000036
[  361.315615] Call Trace:
[  361.315995]  [<ffffffff8117c46b>] shrink_slab+0xab/0x300
[  361.316559]  [<ffffffff811d5e21>] ? vmpressure+0x21/0x90
[  361.317114]  [<ffffffff8117f6a2>] do_try_to_free_pages+0x3c2/0x4e0
[  361.317727]  [<ffffffff8117f8bc>] try_to_free_pages+0xfc/0x180
[  361.318309]  [<ffffffff811735bd>] __alloc_pages_nodemask+0x7fd/0xb90
[  361.318927]  [<ffffffff811b4429>] alloc_pages_current+0xa9/0x170
[  361.319513]  [<ffffffff811be9ec>] new_slab+0x2ec/0x300
[  361.320045]  [<ffffffff8163220f>] __slab_alloc+0x315/0x48f
[  361.320596]  [<ffffffff811e064c>] ? get_empty_filp+0x5c/0x1a0
[  361.321161]  [<ffffffff811c0fb3>] kmem_cache_alloc+0x193/0x1d0
[  361.321735]  [<ffffffff811e064c>] ? get_empty_filp+0x5c/0x1a0
[  361.322295]  [<ffffffff811e064c>] get_empty_filp+0x5c/0x1a0
[  361.322845]  [<ffffffff811e07ae>] alloc_file+0x1e/0xf0
[  361.323362]  [<ffffffff81182773>] __shmem_file_setup+0x113/0x1f0
[  361.323940]  [<ffffffff81182860>] shmem_file_setup+0x10/0x20
[  361.324496]  [<ffffffffa039f5ab>] drm_gem_object_init+0x2b/0x40 [drm]
[  361.325103]  [<ffffffffa0422c3d>] qxl_bo_create+0x7d/0x190 [qxl]
[  361.325680]  [<ffffffffa042798c>] ? qxl_release_list_add+0x5c/0xc0 [qxl]
[  361.326299]  [<ffffffffa0424066>] qxl_alloc_bo_reserved+0x46/0xb0 [qxl]
[  361.326912]  [<ffffffffa0424fde>] qxl_image_alloc_objects+0xae/0x140 [qxl]
[  361.327544]  [<ffffffffa042556e>] qxl_draw_opaque_fb+0xce/0x3c0 [qxl]
[  361.328145]  [<ffffffffa0421ee2>] qxl_fb_dirty_flush+0x1a2/0x260 [qxl]
[  361.328754]  [<ffffffffa0421fb9>] qxl_fb_work+0x19/0x20 [qxl]
[  361.329306]  [<ffffffff8109d5db>] process_one_work+0x17b/0x470
[  361.329865]  [<ffffffff8109e3ab>] worker_thread+0x11b/0x400
[  361.330393]  [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
[  361.330940]  [<ffffffff810a5acf>] kthread+0xcf/0xe0
[  361.331419]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[  361.332005]  [<ffffffff81645998>] ret_from_fork+0x58/0x90
[  361.332513]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[  361.333093] Code: 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 f6 41 55 4c 8d af 50 fc ff ff 41 54 53 4c 8b 46 08 48 89 fb <4d> 85 c0 74 09 f6 06 80 0f 84 2f 01 00 00 48 8b 83 80 fc ff ff
[  361.335906] Kernel panic - not syncing: hung_task: blocked tasks

=== From Al Viro ===
the number of vfsmounts getting propagation from /tmp is doubling on
each step.  You are asking to take a subtree at /tmp/2 and attach it to
/tmp/1 *and* *all* *existing* peers.  Turning all copies into peers of
what was on /tmp/2.

So after the first mount --bind you get two vfsmounts - /tmp and /tmp/1.
And the damn things are peers - you mount anything on /tmp/1/shit, you get its
clone attached to the matching directory (/2/shit) in /tmp.

After the second mount --bind you've got two more vfsmounts - one overmounting
/tmp/1 and another - /tmp/2.  And again, all of them constitute one peer group.
With 4 elements now.  Etc.

=== From Eric W. Biederman ===
First let's be clear, it is systemd that calls MS_SHARED|MS_REC on /.

Furthermore the configuration interface for mount propagation is error
prone and pretty much requires the most problematic cases be the default
makine the interfaces very easy to use incorrectly.  Especially for code
that was only tested on systems prior to systemd.


All of this is currently allowed if a user namespace creates a mount
namespace so the concern is real.  Of course we also allow using as much
memory as we want with virtual addresses as well.  It is slightly worse
on 32bit in that it is kernel memory we are consuming and not user
memory.  Still it is just another form of unlimited memory consumption
that is causing the problem.  We have cases where in typical deployments
we allow users to consume all of the memory on the system.


That said this definitely is a case where we could set a reasonable
upper limits on the number mount objects and catch it when people do
crazy things by accident before the system gets stuck in an OOM.


I am in the process of cooking up a number up limits of that kind and I
will see about adding a limit on the number of mounts as well.
Please check out the Open Source Software Security Wiki, which is counterpart to this mailing list.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.