kernel-hardening - Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances

Message-ID: <87tv3tde1r.fsf@x220.int.ebiederm.org>
Date: Thu, 13 Feb 2020 21:48:48 -0600
From: ebiederm@...ssion.com (Eric W. Biederman)
To: Al Viro <viro@...iv.linux.org.uk>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,  LKML <linux-kernel@...r.kernel.org>,  Kernel Hardening <kernel-hardening@...ts.openwall.com>,  Linux API <linux-api@...r.kernel.org>,  Linux FS Devel <linux-fsdevel@...r.kernel.org>,  Linux Security Module <linux-security-module@...r.kernel.org>,  Akinobu Mita <akinobu.mita@...il.com>,  Alexey Dobriyan <adobriyan@...il.com>,  Andrew Morton <akpm@...ux-foundation.org>,  Andy Lutomirski <luto@...nel.org>,  Daniel Micay <danielmicay@...il.com>,  Djalal Harouni <tixxdz@...il.com>,  "Dmitry V . Levin" <ldv@...linux.org>,  Greg Kroah-Hartman <gregkh@...uxfoundation.org>,  Ingo Molnar <mingo@...nel.org>,  "J . Bruce Fields" <bfields@...ldses.org>,  Jeff Layton <jlayton@...chiereds.net>,  Jonathan Corbet <corbet@....net>,  Kees Cook <keescook@...omium.org>,  Oleg Nesterov <oleg@...hat.com>,  Solar Designer <solar@...nwall.com>
Subject: Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances

Al Viro <viro@...iv.linux.org.uk> writes:

> On Wed, Feb 12, 2020 at 10:37:52PM -0600, Eric W. Biederman wrote:
>
>> I think I have an alternate idea that could work.  Add some extra code
>> into proc_task_readdir, that would look for dentries that no longer
>> point to tasks and d_invalidate them.  With the same logic probably
>> being called from a few more places as well like proc_pid_readdir,
>> proc_task_lookup, and proc_pid_lookup.
>> 
>> We could even optimize it and have a process died flag we set in the
>> superblock.
>> 
>> That would batch up the freeing work until the next time someone
>> reads from proc in a way that would create more dentries.  So it would
>> prevent dentries from reaped zombies from growing without bound.
>> 
>> Hmm.  Given the existence of proc_fill_cache it would really be a good
>> idea if readdir and lookup performed some of the freeing work as well.
>> As on readdir we always populate the dcache for all of the directory
>> entries.
>
> First of all, that won't do a damn thing when nobody is accessing
> given superblock.  What's more, readdir in root of that procfs instance
> is not enough - you need it in task/ of group leader.

It should give a rough bound on the number of stale dentries a
superblock can have.  The same basic concept has been used very
successfully in many incremental garbage collectors.  In those, malloc
(or the equivalent) does a finite amount of garbage collection work to
roughly balance out the amount of memory allocated.  I am proposing
something similar for proc instances.

Further, if no one is accessing a superblock, we don't have a problem
either.
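
To make the garbage-collector analogy concrete, here is a minimal
userspace sketch, not kernel code.  Everything in it is an assumption
made for illustration: struct model_sb, REAP_BUDGET, model_task_exit
and model_lookup are hypothetical names, and the "stale" counter only
stands in for dentries whose task is gone.  The point is that each
operation that adds an entry first frees a bounded number of stale
ones, so stale entries cannot outgrow the access rate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical budget: stale entries freed per entry-creating access. */
enum { REAP_BUDGET = 2 };

/* Hypothetical stand-in for one procfs instance's dentry accounting. */
struct model_sb {
	size_t live;	/* entries that still point at a task */
	size_t stale;	/* entries whose task has been reaped */
};

/* A task exit turns its cached entries stale without freeing them. */
static void model_task_exit(struct model_sb *sb, size_t entries)
{
	assert(entries <= sb->live);
	sb->live -= entries;
	sb->stale += entries;
}

/* An access that creates one entry first frees up to REAP_BUDGET stale ones. */
static void model_lookup(struct model_sb *sb)
{
	size_t reap = sb->stale < REAP_BUDGET ? sb->stale : REAP_BUDGET;

	sb->stale -= reap;
	sb->live += 1;
}
```

With a budget larger than the number of entries each access creates,
the stale count shrinks whenever the instance is used, and stays
constant when it is not.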


> What I don't understand is the insistence on getting those dentries
> via dcache lookups.  _IF_ we are willing to live with cacheline
> contention (on ->d_lock of root dentry, if nothing else), why not
> do the following:

No insistence from this side.

I was not seeing atomic_inc_not_zero(sb->s_active) from RCU
context as an option earlier.  But it is an option.
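
For readers unfamiliar with the primitive, the "bump the count only if
it is non-zero" semantics can be modeled in userspace with C11
atomics.  This is a sketch of the semantics only, under the assumption
that a zero count means the object is already being torn down;
inc_not_zero is a hypothetical name and this is not the kernel's
implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model: take a reference only if the count is non-zero. */
static bool inc_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;	/* count was zero: too late to take a reference */
}
```

A caller that gets false back must not touch the object any further,
which is what makes this usable when all that is held is a pointer
that may be going away.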

> 	* put all dentries of such directories ([0-9]* and [0-9]*/task/*)
> into a list anchored in task_struct; have non-counting reference to
> task_struct stored in them (might simplify part of get_proc_task() users,
> BTW - avoids pid-to-task_struct lookups if we have a dentry and not just
> the inode; many callers do)
> 	* have ->d_release() remove from it (protecting per-task_struct lock
> nested outside of all ->d_lock)
> 	* on exit:
> 	lock the (per-task_struct) list
> 	while list is non-empty
> 		pick the first dentry
> 		remove from the list
> 		sb = dentry->d_sb
> 		try to bump sb->s_active (if non-zero, that is).
> 		if failed
> 			continue // move on to the next one - nothing to do here
> 		grab ->d_lock
> 		res = handle_it(dentry, &temp_list)
> 		drop ->d_lock
> 		unlock the list
> 		if (!list_empty(&temp_list))
> 			shrink_dentry_list(&temp_list)
> 		if (res)
> 			d_invalidate(dentry)
> 			dput(dentry)
> 		deactivate_super(sb)
> 		lock the list
> 	unlock the list
>
> handle_it(dentry, temp_list) // ->d_lock held; that one should be in dcache.c
> 	if ->d_count is negative // unlikely
> 		return 0;
> 	if ->d_count is positive,
> 		increment ->d_count
> 		return 1;
> 	// OK, it's still alive, but ->d_count is 0
> 	__d_drop	// equivalent of d_invalidate in this case
> 	if not on a shrink list // otherwise it's not our headache
> 		if on lru list
> 			d_lru_del
> 		d_shrink_add dentry to temp_list
> 	return 0;
>
> And yeah, that'll dirty ->s_active for each procfs superblock that
> has dentry for our process present in dcache.  On exit()...
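
The three cases in the quoted handle_it() can be modeled in userspace
as follows.  This is only a sketch of the decision logic: struct
model_dentry, struct temp_list and model_handle_it are hypothetical
stand-ins, the boolean fields merely imitate the dcache state the
pseudocode tests, and no locking is modeled (the pseudocode holds
->d_lock around the call).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dentry state the pseudocode inspects. */
struct model_dentry {
	int count;		/* models ->d_count; negative means dying */
	bool hashed;		/* models "still reachable by lookup" */
	bool on_shrink_list;
	bool on_lru;
};

enum { MAX_TEMP = 8 };

/* Hypothetical fixed-size stand-in for the caller's temp_list. */
struct temp_list {
	struct model_dentry *item[MAX_TEMP];
	int n;
};

/* Returns 1 if the caller must invalidate the dentry and drop the new ref. */
static int model_handle_it(struct model_dentry *d, struct temp_list *tmp)
{
	if (d->count < 0)
		return 0;		/* already being killed elsewhere */
	if (d->count > 0) {
		d->count++;		/* pin it for the caller */
		return 1;
	}
	/* Still alive but unreferenced: unhash and queue for shrinking. */
	d->hashed = false;		/* models __d_drop */
	if (!d->on_shrink_list) {	/* otherwise someone else owns it */
		d->on_lru = false;	/* models d_lru_del */
		assert(tmp->n < MAX_TEMP);
		tmp->item[tmp->n++] = d;
		d->on_shrink_list = true;	/* models d_shrink_add */
	}
	return 0;
}
```

Only the "in use" case hands work back to the caller; the unreferenced
case is finished by shrinking the temporary list after the locks are
dropped.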


I would thread the whole thing through the proc_inode instead of coming
up with a new allocation per dentry, so no extra memory allocation is
needed.  We already have i_dentry, so going from the vfs_inode to
the dentry is trivial.



But truthfully I don't like proc_flush_task.

The problem is that proc_flush_task is a layering violation and magic
code that pretty much no one understands.  We have some very weird
cases where dput or d_invalidate wound up triggering ext3 code.  It has
been fixed for a long time now, but it was crazy, weird, unexpected
stuff.


Al, your logic above just feels very clever, and like many pieces of the
kernel have to know how other pieces of the kernel work.  If we can find
something stupid and simple that also solves the problem I would be much
happier.  Then anyone could understand and fix it if something goes
wrong.

Eric