diff mbox series

[1/3] dcache: add a new enum type for 'dentry_d_lock_class'

Message ID 1573788472-87426-2-git-send-email-yukuai3@huawei.com (mailing list archive)
State New, archived
Headers show
Series fix potential infinite loop in debugfs_remove_recursive | expand

Commit Message

Yu Kuai Nov. 15, 2019, 3:27 a.m. UTC
'dentry_d_lock_class' can be used for spin_lock_nested in case lockdep
confused about two different dentry take the 'd_lock'.

However, a single 'DENTRY_D_LOCK_NESTED' may not be enough if more than
two dentry are involed. So, and in 'DENTRY_D_LOCK_NESTED_2'

Signed-off-by: yu kuai <yukuai3@huawei.com>
---
 include/linux/dcache.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Greg KH Nov. 15, 2019, 3:27 a.m. UTC | #1
On Fri, Nov 15, 2019 at 11:27:50AM +0800, yu kuai wrote:
> 'dentry_d_lock_class' can be used for spin_lock_nested in case lockdep
> confused about two different dentry take the 'd_lock'.
> 
> However, a single 'DENTRY_D_LOCK_NESTED' may not be enough if more than
> two dentry are involed. So, and in 'DENTRY_D_LOCK_NESTED_2'
> 
> Signed-off-by: yu kuai <yukuai3@huawei.com>
> ---
>  include/linux/dcache.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 10090f1..8eb84ef 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -129,7 +129,8 @@ struct dentry {
>  enum dentry_d_lock_class
>  {
>  	DENTRY_D_LOCK_NORMAL, /* implicitly used by plain spin_lock() APIs. */
> -	DENTRY_D_LOCK_NESTED
> +	DENTRY_D_LOCK_NESTED,
> +	DENTRY_D_LOCK_NESTED_2

You should document this, as "_2" does not make much sense to anyone
only looking at the code :(

Or rename it better.

thanks,

greg k-h
Al Viro Nov. 15, 2019, 4:12 a.m. UTC | #2
On Fri, Nov 15, 2019 at 11:27:59AM +0800, Greg KH wrote:
> On Fri, Nov 15, 2019 at 11:27:50AM +0800, yu kuai wrote:
> > 'dentry_d_lock_class' can be used for spin_lock_nested in case lockdep
> > confused about two different dentry take the 'd_lock'.
> > 
> > However, a single 'DENTRY_D_LOCK_NESTED' may not be enough if more than
> > two dentry are involed. So, and in 'DENTRY_D_LOCK_NESTED_2'
> > 
> > Signed-off-by: yu kuai <yukuai3@huawei.com>
> > ---
> >  include/linux/dcache.h | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> > index 10090f1..8eb84ef 100644
> > --- a/include/linux/dcache.h
> > +++ b/include/linux/dcache.h
> > @@ -129,7 +129,8 @@ struct dentry {
> >  enum dentry_d_lock_class
> >  {
> >  	DENTRY_D_LOCK_NORMAL, /* implicitly used by plain spin_lock() APIs. */
> > -	DENTRY_D_LOCK_NESTED
> > +	DENTRY_D_LOCK_NESTED,
> > +	DENTRY_D_LOCK_NESTED_2
> 
> You should document this, as "_2" does not make much sense to anyone
> only looking at the code :(
> 
> Or rename it better.

FWIW, I'm not sure it's a good solution.  What are the rules for callers
of that thing, anyway?  If it can be called when somebody is creating
more files in that subtree, we almost certainly will have massive
problems with the lifetimes of underlying objects...

Could somebody familiar with debugfs explain how is that thing actually
used and what is required from/promised to its callers?  I can try and
grep through the tree and guess what the rules are, but I've way too
much on my platter right now and I don't want to get sidetracked into yet
another tree-wide search and analysis session ;-/
Greg KH Nov. 15, 2019, 7:20 a.m. UTC | #3
On Fri, Nov 15, 2019 at 04:12:43AM +0000, Al Viro wrote:
> On Fri, Nov 15, 2019 at 11:27:59AM +0800, Greg KH wrote:
> > On Fri, Nov 15, 2019 at 11:27:50AM +0800, yu kuai wrote:
> > > 'dentry_d_lock_class' can be used for spin_lock_nested in case lockdep
> > > confused about two different dentry take the 'd_lock'.
> > > 
> > > However, a single 'DENTRY_D_LOCK_NESTED' may not be enough if more than
> > > two dentry are involed. So, and in 'DENTRY_D_LOCK_NESTED_2'
> > > 
> > > Signed-off-by: yu kuai <yukuai3@huawei.com>
> > > ---
> > >  include/linux/dcache.h | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> > > index 10090f1..8eb84ef 100644
> > > --- a/include/linux/dcache.h
> > > +++ b/include/linux/dcache.h
> > > @@ -129,7 +129,8 @@ struct dentry {
> > >  enum dentry_d_lock_class
> > >  {
> > >  	DENTRY_D_LOCK_NORMAL, /* implicitly used by plain spin_lock() APIs. */
> > > -	DENTRY_D_LOCK_NESTED
> > > +	DENTRY_D_LOCK_NESTED,
> > > +	DENTRY_D_LOCK_NESTED_2
> > 
> > You should document this, as "_2" does not make much sense to anyone
> > only looking at the code :(
> > 
> > Or rename it better.
> 
> FWIW, I'm not sure it's a good solution.  What are the rules for callers
> of that thing, anyway?  If it can be called when somebody is creating
> more files in that subtree, we almost certainly will have massive
> problems with the lifetimes of underlying objects...
> 
> Could somebody familiar with debugfs explain how is that thing actually
> used and what is required from/promised to its callers?  I can try and
> grep through the tree and guess what the rules are, but I've way too
> much on my platter right now and I don't want to get sidetracked into yet
> another tree-wide search and analysis session ;-/

Yu wants to use simple_empty() in debugfs_remove_recursive() instead of
manually checking:
	if (!list_empty(&child->d_subdirs)) {

See patch 3 of this series for that change and why they feel it is
needed:
	https://lore.kernel.org/lkml/1573788472-87426-4-git-send-email-yukuai3@huawei.com/

As to if patch 3 really is needed, I'll leave that up to Yu given that I
thought we had resolved these types of issues already a year or so ago.

thanks,

greg k-h
Yu Kuai Nov. 15, 2019, 10:02 a.m. UTC | #4
On 2019/11/15 11:27, Greg KH wrote:
> On Fri, Nov 15, 2019 at 11:27:50AM +0800, yu kuai wrote:
>> 'dentry_d_lock_class' can be used for spin_lock_nested in case lockdep
>> confused about two different dentry take the 'd_lock'.
>>
>> However, a single 'DENTRY_D_LOCK_NESTED' may not be enough if more than
>> two dentry are involed. So, and in 'DENTRY_D_LOCK_NESTED_2'
>>
>> Signed-off-by: yu kuai <yukuai3@huawei.com>
>> ---
>>   include/linux/dcache.h | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
>> index 10090f1..8eb84ef 100644
>> --- a/include/linux/dcache.h
>> +++ b/include/linux/dcache.h
>> @@ -129,7 +129,8 @@ struct dentry {
>>   enum dentry_d_lock_class
>>   {
>>   	DENTRY_D_LOCK_NORMAL, /* implicitly used by plain spin_lock() APIs. */
>> -	DENTRY_D_LOCK_NESTED
>> +	DENTRY_D_LOCK_NESTED,
>> +	DENTRY_D_LOCK_NESTED_2
> 
> You should document this, as "_2" does not make much sense to anyone
> only looking at the code :(
> 
> Or rename it better.
> 
Thank you for your advise, I'll try to rename it better with comments.

Yu Kuai
> thanks,
> 
> greg k-h
> 
> .
>
Yu Kuai Nov. 15, 2019, 10:08 a.m. UTC | #5
On 2019/11/15 15:20, Greg KH wrote:
> On Fri, Nov 15, 2019 at 04:12:43AM +0000, Al Viro wrote:
>> On Fri, Nov 15, 2019 at 11:27:59AM +0800, Greg KH wrote:
>>> On Fri, Nov 15, 2019 at 11:27:50AM +0800, yu kuai wrote:
>>>> 'dentry_d_lock_class' can be used for spin_lock_nested in case lockdep
>>>> confused about two different dentry take the 'd_lock'.
>>>>
>>>> However, a single 'DENTRY_D_LOCK_NESTED' may not be enough if more than
>>>> two dentry are involed. So, and in 'DENTRY_D_LOCK_NESTED_2'
>>>>
>>>> Signed-off-by: yu kuai <yukuai3@huawei.com>
>>>> ---
>>>>   include/linux/dcache.h | 3 ++-
>>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
>>>> index 10090f1..8eb84ef 100644
>>>> --- a/include/linux/dcache.h
>>>> +++ b/include/linux/dcache.h
>>>> @@ -129,7 +129,8 @@ struct dentry {
>>>>   enum dentry_d_lock_class
>>>>   {
>>>>   	DENTRY_D_LOCK_NORMAL, /* implicitly used by plain spin_lock() APIs. */
>>>> -	DENTRY_D_LOCK_NESTED
>>>> +	DENTRY_D_LOCK_NESTED,
>>>> +	DENTRY_D_LOCK_NESTED_2
>>>
>>> You should document this, as "_2" does not make much sense to anyone
>>> only looking at the code :(
>>>
>>> Or rename it better.
>>
>> FWIW, I'm not sure it's a good solution.  What are the rules for callers
>> of that thing, anyway?  If it can be called when somebody is creating
>> more files in that subtree, we almost certainly will have massive
>> problems with the lifetimes of underlying objects...
>>
>> Could somebody familiar with debugfs explain how is that thing actually
>> used and what is required from/promised to its callers?  I can try and
>> grep through the tree and guess what the rules are, but I've way too
>> much on my platter right now and I don't want to get sidetracked into yet
>> another tree-wide search and analysis session ;-/
> 
> Yu wants to use simple_empty() in debugfs_remove_recursive() instead of
> manually checking:
> 	if (!list_empty(&child->d_subdirs)) {
> 
> See patch 3 of this series for that change and why they feel it is
> needed:
> 	https://lore.kernel.org/lkml/1573788472-87426-4-git-send-email-yukuai3@huawei.com/
> 
> As to if patch 3 really is needed, I'll leave that up to Yu given that I
> thought we had resolved these types of issues already a year or so ago.
> 
> thanks,
> 
> greg k-h
> 
> .
> 
The main purpose of this patchset is to fix the infinite loop in
debugfs_remove_recursive. Steven point out that simple replace
list_empty with simple_empty will cause a splat with lockdep enabled.
We try to fix it with the first two patch, do you think it's appropriate?

Thanks,
Yu Kuai
Al Viro Nov. 15, 2019, 1:16 p.m. UTC | #6
On Fri, Nov 15, 2019 at 03:20:11PM +0800, Greg KH wrote:

> > FWIW, I'm not sure it's a good solution.  What are the rules for callers
> > of that thing, anyway?  If it can be called when somebody is creating
> > more files in that subtree, we almost certainly will have massive
> > problems with the lifetimes of underlying objects...
> > 
> > Could somebody familiar with debugfs explain how is that thing actually
> > used and what is required from/promised to its callers?  I can try and
> > grep through the tree and guess what the rules are, but I've way too
> > much on my platter right now and I don't want to get sidetracked into yet
> > another tree-wide search and analysis session ;-/
> 
> Yu wants to use simple_empty() in debugfs_remove_recursive() instead of
> manually checking:
> 	if (!list_empty(&child->d_subdirs)) {
> 
> See patch 3 of this series for that change and why they feel it is
> needed:
> 	https://lore.kernel.org/lkml/1573788472-87426-4-git-send-email-yukuai3@huawei.com/
> 
> As to if patch 3 really is needed, I'll leave that up to Yu given that I
> thought we had resolved these types of issues already a year or so ago.

What I'm asking is what concurrency warranties does the whole thing
(debugfs_remove_recursive()) have to deal with.  IMO the overall
structure of the walk-and-remove the tree algorithm in there
is Not Nice(tm) and I'd like to understand if it needs to be kept
that way.  And the locking is confused in there - it either locks
too much, or not enough.

So... can debugfs_remove_recursive() rely upon the lack of attempts to create
new entries inside the subtree it's trying to kill?  If it can, the things
can be made simpler; if it can't, it's not locking enough; e.g. results of
simple_empty() on child won't be guaranteed to remain unchanged just as it
returns to caller.

What's more, the uses of simple_unlink()/simple_rmdir() there are not
imitiating the normal locking environment for ->unlink() and ->rmdir() resp. -
the victim's inode is not locked, so just for starters the call of simple_empty()
from simple_rmdir() itself is not guaranteed to give valid result.

I want to understand the overall situation.  No argument, list_empty()
in there is BS, for many reasons.  But I wonder if trying to keep the
current structure of the iterator _and_ the use of simple_rmdir()/simple_unlink()
is the right approach.
Steven Rostedt Nov. 15, 2019, 1:38 p.m. UTC | #7
On Fri, 15 Nov 2019 13:16:25 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> I want to understand the overall situation.  No argument, list_empty()
> in there is BS, for many reasons.  But I wonder if trying to keep the
> current structure of the iterator _and_ the use of simple_rmdir()/simple_unlink()
> is the right approach.

My guess is that debugfs was written to be as simple as possible.
Nothing too complex. And in doing so, may have issues as you are
pointing out. Just a way to allow communications between user space and
kernel space (as tracefs started out).

BTW, what do you mean by "can debugfs_remove_recursive() rely upon the
lack of attempts to create new entries inside the subtree it's trying
to kill?"

-- Steve
Steven Rostedt Nov. 15, 2019, 1:39 p.m. UTC | #8
On Fri, 15 Nov 2019 08:38:13 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> My guess is that debugfs was written to be as simple as possible.
> Nothing too complex. And in doing so, may have issues as you are
> pointing out. Just a way to allow communications between user space and
> kernel space (as tracefs started out).

And speaking of tracefs, as it was basically a cut and paste copy of
debugfs, it too has the same issues. Thus, I'm very much interested in
how this should be done.

-- Steve
Al Viro Nov. 15, 2019, 1:48 p.m. UTC | #9
On Fri, Nov 15, 2019 at 08:38:13AM -0500, Steven Rostedt wrote:
> On Fri, 15 Nov 2019 13:16:25 +0000
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > I want to understand the overall situation.  No argument, list_empty()
> > in there is BS, for many reasons.  But I wonder if trying to keep the
> > current structure of the iterator _and_ the use of simple_rmdir()/simple_unlink()
> > is the right approach.
> 
> My guess is that debugfs was written to be as simple as possible.
> Nothing too complex. And in doing so, may have issues as you are
> pointing out. Just a way to allow communications between user space and
> kernel space (as tracefs started out).
> 
> BTW, what do you mean by "can debugfs_remove_recursive() rely upon the
> lack of attempts to create new entries inside the subtree it's trying
> to kill?"

Is it possible for something to call e.g. debugfs_create_dir() (or any
similar primitive) with parent inside the subtree that has been
passed to debugfs_remove_recursive() call that is still in progress?

If debugfs needs to cope with that, debugfs_remove_recursive() needs
considerably heavier locking, to start with.
Steven Rostedt Nov. 15, 2019, 1:58 p.m. UTC | #10
On Fri, 15 Nov 2019 13:48:23 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> > BTW, what do you mean by "can debugfs_remove_recursive() rely upon the
> > lack of attempts to create new entries inside the subtree it's trying
> > to kill?"  
> 
> Is it possible for something to call e.g. debugfs_create_dir() (or any
> similar primitive) with parent inside the subtree that has been
> passed to debugfs_remove_recursive() call that is still in progress?
> 
> If debugfs needs to cope with that, debugfs_remove_recursive() needs
> considerably heavier locking, to start with.

I don't know about debugfs, but at least tracefs (which cut and pasted
from debugfs) does not allow that. At least in theory it doesn't allow
that (and if it does, it's a bug in the locking at the higher levels).

And perhaps debugfs shouldn't allow that either. As it is only suppose
to be a light weight way to interact with the kernel, hence the name
"debugfs".

Yu, do you have a test case for the "infinite loop" case?

-- Steve
Al Viro Nov. 15, 2019, 2:17 p.m. UTC | #11
On Fri, Nov 15, 2019 at 08:58:05AM -0500, Steven Rostedt wrote:
> On Fri, 15 Nov 2019 13:48:23 +0000
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > > BTW, what do you mean by "can debugfs_remove_recursive() rely upon the
> > > lack of attempts to create new entries inside the subtree it's trying
> > > to kill?"  
> > 
> > Is it possible for something to call e.g. debugfs_create_dir() (or any
> > similar primitive) with parent inside the subtree that has been
> > passed to debugfs_remove_recursive() call that is still in progress?
> > 
> > If debugfs needs to cope with that, debugfs_remove_recursive() needs
> > considerably heavier locking, to start with.
> 
> I don't know about debugfs, but at least tracefs (which cut and pasted
> from debugfs) does not allow that. At least in theory it doesn't allow
> that (and if it does, it's a bug in the locking at the higher levels).
> 
> And perhaps debugfs shouldn't allow that either. As it is only suppose
> to be a light weight way to interact with the kernel, hence the name
> "debugfs".
> 
> Yu, do you have a test case for the "infinite loop" case?

Infinite loop, AFAICS, is reasonably easy to trigger - just open
a non-empty subdirectory and lseek to e.g. next-to-last element
in it.  Again, list_empty() use in there is quite wrong - it can
give false negatives just on the cursors.  No arguments about
that part...
Al Viro Nov. 15, 2019, 5:54 p.m. UTC | #12
On Fri, Nov 15, 2019 at 02:17:54PM +0000, Al Viro wrote:
> On Fri, Nov 15, 2019 at 08:58:05AM -0500, Steven Rostedt wrote:
> > On Fri, 15 Nov 2019 13:48:23 +0000
> > Al Viro <viro@zeniv.linux.org.uk> wrote:
> > 
> > > > BTW, what do you mean by "can debugfs_remove_recursive() rely upon the
> > > > lack of attempts to create new entries inside the subtree it's trying
> > > > to kill?"  
> > > 
> > > Is it possible for something to call e.g. debugfs_create_dir() (or any
> > > similar primitive) with parent inside the subtree that has been
> > > passed to debugfs_remove_recursive() call that is still in progress?
> > > 
> > > If debugfs needs to cope with that, debugfs_remove_recursive() needs
> > > considerably heavier locking, to start with.
> > 
> > I don't know about debugfs, but at least tracefs (which cut and pasted
> > from debugfs) does not allow that. At least in theory it doesn't allow
> > that (and if it does, it's a bug in the locking at the higher levels).
> > 
> > And perhaps debugfs shouldn't allow that either. As it is only suppose
> > to be a light weight way to interact with the kernel, hence the name
> > "debugfs".
> > 
> > Yu, do you have a test case for the "infinite loop" case?
> 
> Infinite loop, AFAICS, is reasonably easy to trigger - just open
> a non-empty subdirectory and lseek to e.g. next-to-last element
> in it.  Again, list_empty() use in there is quite wrong - it can
> give false negatives just on the cursors.  No arguments about
> that part...

FWIW, given that debugfs_remove_recursive() has no way to report an error
*and* that we have a single chokepoint for all entry creations (start_creating())
we could make sure nothing gets added pretty easily - just mark the victim
dentry as "don't allow any creations here" when we first reach it and
have start_creating check that, using e.g. inode_lock() for serialization.
Marking could be done e.g. by setting ->d_fsdata to
(void *)DEBUGFS_FSDATA_IS_REAL_FOPS_BIT, so that ->d_release() doesn't
need any changes.

BTW, this
                if (!ret)
                        d_delete(dentry);
                if (d_is_reg(dentry))
                        __debugfs_file_removed(dentry);
                dput(dentry);
in __debugfs_remove() is both subtle and bogus.  If we get here
with refcount > 1, d_delete() is just a d_drop() - dentry can't
be made negative, so it gets unhashed instead.  If we *do* get
here with refcount 1, it will be made negative, all right.
Only to be killed off by immediately following dput(), since
debugfs doesn't retain unpinned dentries.

Why immediate?  Because d_is_reg() is obviously false on
negative dentries, so __debugfs_file_removed() is not called.
That's where the subtle part begins: there should've been
nobody for __debugfs_file_removed() to wait for, since they
would've had to hold additional references to dentry and
refcount wouldn't have been 1.  So that's actually not
a bug.  However, it's too subtle to introduce without
having bothered to even comment the damn thing.

As for the "bogus" part - that d_delete() is bollocks.
First of all, it is fully equivalent to d_drop().  Always
had been.  What's more, that sucker should've been
d_invalidate() instead.

Anyway, AFAICS removal could be done this way:

// parent is held exclusive, after is NULL or a child of parent
find_next_child(parent, prev)
	child = NULL
	node = prev ? &prev->d_child : &parent->d_subdirs;
	grab parent->d_lock
	for each entry in the list starting at node->next
		d = container_of(entry, struct dentry, d_child)
		grab d->d_lock
		if simple_positive(d)
			bump d->d_count
			child = d
		drop d->d_lock
		if child
			break
	drop parent->d_lock
	dput(prev);
	return child

kill_it(victim, parent)
	if simple_positive(victim)
		d_invalidate(victim);	// needed to avoid lost mounts
		if victim is a directory
			fsnotify_rmdir
		else
			fsnotify_unlink
		if victim is regular
			__debugfs_file_removed(victim)
		dput(victim)		// unpin it

recursive_removal(dentry)
	this = dentry;
	while (true) {
		victim = NULL;
		inode = this->d_inode;
		inode_lock(inode);
		if (d_is_dir(this))
			mark this doomed
		while ((child = find_next_child(this, victim)) == NULL) {
			// no children (left); kill and ascend
			// update metadata while it's still locked
			inode->i_ctime = current_time(inode);
			clear_nlink(inode);
			inode_unlock(inode);
			victim = this;
			this = this->d_parent;
			inode = this->d_inode;
			inode_lock(inode);
			kill_it(victim, this);
			if (victim == dentry) {
				inode->i_ctime = inode->i_mtime = current_time(inode);
				if (d_is_dir(dentry))
					drop_nlink(inode);
				inode_unlock(inode);
				dput(dentry);
				return;
			}
		}
		inode_unlock(inode);
		this = child;
	}
Al Viro Nov. 15, 2019, 6:42 p.m. UTC | #13
On Fri, Nov 15, 2019 at 05:54:23PM +0000, Al Viro wrote:
> Anyway, AFAICS removal could be done this way:
> 
> // parent is held exclusive, after is NULL or a child of parent
> find_next_child(parent, prev)
> 	child = NULL
> 	node = prev ? &prev->d_child : &parent->d_subdirs;
> 	grab parent->d_lock
> 	for each entry in the list starting at node->next
> 		d = container_of(entry, struct dentry, d_child)
> 		grab d->d_lock
> 		if simple_positive(d)
> 			bump d->d_count
> 			child = d
> 		drop d->d_lock
> 		if child
> 			break
> 	drop parent->d_lock
> 	dput(prev);
> 	return child
> 
> kill_it(victim, parent)
> 	if simple_positive(victim)
> 		d_invalidate(victim);	// needed to avoid lost mounts
> 		if victim is a directory
> 			fsnotify_rmdir
> 		else
> 			fsnotify_unlink
> 		if victim is regular
> 			__debugfs_file_removed(victim)
> 		dput(victim)		// unpin it
> 
> recursive_removal(dentry)
> 	this = dentry;
> 	while (true) {
> 		victim = NULL;
> 		inode = this->d_inode;
> 		inode_lock(inode);
> 		if (d_is_dir(this))
> 			mark this doomed
> 		while ((child = find_next_child(this, victim)) == NULL) {
> 			// no children (left); kill and ascend
> 			// update metadata while it's still locked
> 			inode->i_ctime = current_time(inode);
> 			clear_nlink(inode);
> 			inode_unlock(inode);
> 			victim = this;
> 			this = this->d_parent;
> 			inode = this->d_inode;
> 			inode_lock(inode);
> 			kill_it(victim, this);
> 			if (victim == dentry) {
> 				inode->i_ctime = inode->i_mtime = current_time(inode);
> 				if (d_is_dir(dentry))
> 					drop_nlink(inode);
> 				inode_unlock(inode);
> 				dput(dentry);
> 				return;
> 			}
> 		}
> 		inode_unlock(inode);
> 		this = child;
> 	}

Come to think of that, if we use IS_DEADDIR as "no more additions" marking,
that looks like a good candidate for all in-kernel rm -rf on ramfs-style
filesystems without cross-directory renames.  This bit in kill_it() above
 		if victim is regular
 			__debugfs_file_removed(victim)
would be an fs-specific callback passed by the caller, turning the whole
thing into this:

void simple_recursive_removal(struct dentry *dentry,
			      void (*callback)(struct dentry *))
{
	struct dentry *this = dentry;
	while (true) {
		struct dentry *victim = NULL, *child;
		struct inode *inode = this->d_inode;

		inode_lock(inode);
		if (d_is_dir(this))
			inode->i_flags |= S_DEAD;
		while ((child = find_next_child(this, victim)) == NULL) {
			// kill and ascend
			// update metadata while it's still locked
			inode->i_ctime = current_time(inode);
			clear_nlink(inode);
			inode_unlock(inode);
			victim = this;
			this = this->d_parent;
			inode = this->d_inode;
			inode_lock(inode);
			if (simple_positive(victim)) {
		 		d_invalidate(victim);	// avoid lost mounts
				if (is_dir(victim))
					fsnotify_rmdir(inode, victim);
				else
					fsnotify_unlink(inode, victim);
				if (callback)
					callback(victim);
				dput(victim)		// unpin it
			}
			if (victim == dentry) {
				inode->i_ctime = inode->i_mtime =
					current_time(inode);
				if (d_is_dir(dentry))
					drop_nlink(inode);
				inode_unlock(inode);
				dput(dentry);
				return;
			}
		}
		inode_unlock(inode);
		this = child;
	}
}

with find_next_child() easily implemented via scan_positives() already
in libfs.c...  Objections?  The above is obviously completely untested,
and I've got nowhere near enough sleep, so there may be any number of
brown paperbag bugs in it.  Review would be very welcome...
Al Viro Nov. 15, 2019, 7:41 p.m. UTC | #14
On Fri, Nov 15, 2019 at 06:42:09PM +0000, Al Viro wrote:
> Come to think of that, if we use IS_DEADDIR as "no more additions" marking,
> that looks like a good candidate for all in-kernel rm -rf on ramfs-style
> filesystems without cross-directory renames.  This bit in kill_it() above
>  		if victim is regular
>  			__debugfs_file_removed(victim)
> would be an fs-specific callback passed by the caller, turning the whole
> thing into this:

Umm...  A bit more than that, actually - the callback would be
void remove_one(struct dentry *victim)
{
	if (d_is_reg(victim))
		__debugfs_file_removed(victim);
	simple_release_fs(&debugfs_mount, &debugfs_mount_count);
}
and the caller would do
	simple_pin_fs(&debug_fs_type, &debugfs_mount, &debugfs_mount_count);
	simple_recursive_removal(dentry, remove_one);
	simple_release_fs(&debugfs_mount, &debugfs_mount_count);
Al Viro Nov. 15, 2019, 9:18 p.m. UTC | #15
On Fri, Nov 15, 2019 at 07:41:38PM +0000, Al Viro wrote:
> On Fri, Nov 15, 2019 at 06:42:09PM +0000, Al Viro wrote:
> > Come to think of that, if we use IS_DEADDIR as "no more additions" marking,
> > that looks like a good candidate for all in-kernel rm -rf on ramfs-style
> > filesystems without cross-directory renames.  This bit in kill_it() above
> >  		if victim is regular
> >  			__debugfs_file_removed(victim)
> > would be an fs-specific callback passed by the caller, turning the whole
> > thing into this:
> 
> Umm...  A bit more than that, actually - the callback would be
> void remove_one(struct dentry *victim)
> {
> 	if (d_is_reg(victim))
> 		__debugfs_file_removed(victim);
> 	simple_release_fs(&debugfs_mount, &debugfs_mount_count);
> }
> and the caller would do
> 	simple_pin_fs(&debug_fs_type, &debugfs_mount, &debugfs_mount_count);
> 	simple_recursive_removal(dentry, remove_one);
> 	simple_release_fs(&debugfs_mount, &debugfs_mount_count);

OK... debugfs and tracefs definitely convert to that; so do, AFAICS,
spufs and selinuxfs, and I wouldn't be surprised if it could be
used in a few more places...  securityfs, almost certainly qibfs,
gadgetfs looks like it could make use of that.  Maybe subrpc
as well, but I'll need to look in details.  configfs won't,
unfortunately...
Steven Rostedt Nov. 15, 2019, 9:26 p.m. UTC | #16
On Fri, 15 Nov 2019 21:18:20 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> OK... debugfs and tracefs definitely convert to that; so do, AFAICS,
> spufs and selinuxfs, and I wouldn't be surprised if it could be
> used in a few more places...  securityfs, almost certainly qibfs,
> gadgetfs looks like it could make use of that.  Maybe subrpc
> as well, but I'll need to look in details.  configfs won't,
> unfortunately...

Thanks Al for looking into this.

I'll try to test it in tracefs, and see if anything breaks. But
probably wont get to it till next week.

-- Steve
Al Viro Nov. 15, 2019, 10:10 p.m. UTC | #17
On Fri, Nov 15, 2019 at 04:26:09PM -0500, Steven Rostedt wrote:
> On Fri, 15 Nov 2019 21:18:20 +0000
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > OK... debugfs and tracefs definitely convert to that; so do, AFAICS,
> > spufs and selinuxfs, and I wouldn't be surprised if it could be
> > used in a few more places...  securityfs, almost certainly qibfs,
> > gadgetfs looks like it could make use of that.  Maybe subrpc
> > as well, but I'll need to look in details.  configfs won't,
> > unfortunately...
> 
> Thanks Al for looking into this.
> 
> I'll try to test it in tracefs, and see if anything breaks. But
> probably wont get to it till next week.

I'll probably throw that into #next.dcache - if nothing else,
that cuts down on the size of patch converting d_subdirs/d_child
from list to hlist...

Need to get some sleep first, though - only 5 hours today, so
I want to take another look at that thing tomorrow morning -
I don't trust my ability to spot obvious bugs right now... ;-/

Oh, well - that at least might finally push the old "kernel-side
rm -rf done right" pile of half-baked patches into more useful
state, probably superseding most of them.
Greg KH Nov. 16, 2019, 12:04 p.m. UTC | #18
On Fri, Nov 15, 2019 at 10:10:37PM +0000, Al Viro wrote:
> On Fri, Nov 15, 2019 at 04:26:09PM -0500, Steven Rostedt wrote:
> > On Fri, 15 Nov 2019 21:18:20 +0000
> > Al Viro <viro@zeniv.linux.org.uk> wrote:
> > 
> > > OK... debugfs and tracefs definitely convert to that; so do, AFAICS,
> > > spufs and selinuxfs, and I wouldn't be surprised if it could be
> > > used in a few more places...  securityfs, almost certainly qibfs,
> > > gadgetfs looks like it could make use of that.  Maybe subrpc
> > > as well, but I'll need to look in details.  configfs won't,
> > > unfortunately...
> > 
> > Thanks Al for looking into this.
> > 
> > I'll try to test it in tracefs, and see if anything breaks. But
> > probably wont get to it till next week.
> 
> I'll probably throw that into #next.dcache - if nothing else,
> that cuts down on the size of patch converting d_subdirs/d_child
> from list to hlist...
> 
> Need to get some sleep first, though - only 5 hours today, so
> I want to take another look at that thing tomorrow morning -
> I don't trust my ability to spot obvious bugs right now... ;-/
> 
> Oh, well - that at least might finally push the old "kernel-side
> rm -rf done right" pile of half-baked patches into more useful
> state, probably superseding most of them.

Thanks for doing this.  Sorry for the delay in getting back to this, was
on a long-haul flight...

Anyway, this looks sane to me.  debugfs "should" not be having a file
added while a directory is being removed at the same time, but I really
can't guarantee that someone is trying to do something crazy like that.
So "heavy" locking is fine with me, this never has to be a "fast"
operation, it's much more important to get it "correct".

thanks,

greg k-h
Al Viro Nov. 17, 2019, 10:24 p.m. UTC | #19
On Fri, Nov 15, 2019 at 10:10:37PM +0000, Al Viro wrote:

> I'll probably throw that into #next.dcache - if nothing else,
> that cuts down on the size of patch converting d_subdirs/d_child
> from list to hlist...
> 
> Need to get some sleep first, though - only 5 hours today, so
> I want to take another look at that thing tomorrow morning -
> I don't trust my ability to spot obvious bugs right now... ;-/
> 
> Oh, well - that at least might finally push the old "kernel-side
> rm -rf done right" pile of half-baked patches into more useful
> state, probably superseding most of them.

	Curious...  Is there any point keeping debugfs_remove() and
debugfs_remove_recursive() separate?   The thing is, the only case
when their behaviours differ is when the victim is non-empty.  In that
case the former quietly does nothing; the latter (also quietly) removes
the entire subtree.  And the caller has no way to tell if that case has
happened - they can't even look at the dentry they'd passed, since
in the normal case it's already pointing to freed (and possibly reused)
memory by that point.

	The same goes for tracefs, except that there we have only
one caller of tracefs_remove(), and it's guaranteed to be a non-directory.
So there we definitely can fold them together.

	Greg, could we declare debufs_remove() to be an alias for
debugfs_remove_recursive()?
Greg KH Nov. 18, 2019, 6:37 a.m. UTC | #20
On Sun, Nov 17, 2019 at 10:24:22PM +0000, Al Viro wrote:
> On Fri, Nov 15, 2019 at 10:10:37PM +0000, Al Viro wrote:
> 
> > I'll probably throw that into #next.dcache - if nothing else,
> > that cuts down on the size of patch converting d_subdirs/d_child
> > from list to hlist...
> > 
> > Need to get some sleep first, though - only 5 hours today, so
> > I want to take another look at that thing tomorrow morning -
> > I don't trust my ability to spot obvious bugs right now... ;-/
> > 
> > Oh, well - that at least might finally push the old "kernel-side
> > rm -rf done right" pile of half-baked patches into more useful
> > state, probably superseding most of them.
> 
> 	Curious...  Is there any point keeping debugfs_remove() and
> debugfs_remove_recursive() separate?   The thing is, the only case
> when their behaviours differ is when the victim is non-empty.  In that
> case the former quietly does nothing; the latter (also quietly) removes
> the entire subtree.  And the caller has no way to tell if that case has
> happened - they can't even look at the dentry they'd passed, since
> in the normal case it's already pointing to freed (and possibly reused)
> memory by that point.
> 
> 	The same goes for tracefs, except that there we have only
> one caller of tracefs_remove(), and it's guaranteed to be a non-directory.
> So there we definitely can fold them together.
> 
> 	Greg, could we declare debufs_remove() to be an alias for
> debugfs_remove_recursive()?

Yes, we can do that there's no reason to keep those separate at all.
Especially if it makes things easier overall.

thanks,

greg k-h
diff mbox series

Patch

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 10090f1..8eb84ef 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -129,7 +129,8 @@  struct dentry {
 enum dentry_d_lock_class
 {
 	DENTRY_D_LOCK_NORMAL, /* implicitly used by plain spin_lock() APIs. */
-	DENTRY_D_LOCK_NESTED
+	DENTRY_D_LOCK_NESTED,
+	DENTRY_D_LOCK_NESTED_2
 };
 
 struct dentry_operations {