
[REVIEW,1/2] mnt: In propgate_umount handle visiting mounts in any order

Message ID 87efvodo4l.fsf_-_@xmission.com (mailing list archive)
State New, archived

Commit Message

Eric W. Biederman May 17, 2017, 5:54 a.m. UTC
While investigating some poor umount performance I realized that, in
the case of overlapping mount trees where some of the mounts are locked,
the code has been failing to unmount all of the mounts it should
have been unmounting.

This failure to unmount all of the necessary
mounts can be reproduced with:

$ cat locked_mounts_test.sh

mount -t tmpfs test-base /mnt
mount --make-shared /mnt
mkdir -p /mnt/b

mount -t tmpfs test1 /mnt/b
mount --make-shared /mnt/b
mkdir -p /mnt/b/10

mount -t tmpfs test2 /mnt/b/10
mount --make-shared /mnt/b/10
mkdir -p /mnt/b/10/20

mount --rbind /mnt/b /mnt/b/10/20

unshare -Urm --propagation unchanged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
sleep 1
umount -l /mnt/b
wait %%

$ unshare -Urm ./locked_mounts_test.sh

This failure is corrected by removing the prepass that marks mounts
that may be umounted.

A first pass is added that umounts mounts if possible, and if not sets
the mount mark on mounts that could be unmounted if they were not locked,
and adds them to a list of umount possibilities.  This first pass
reconsiders a mount's parent if the parent is on the list of umount
possibilities, ensuring that information about unmountability will pass
from child to parent.

A second pass then walks through all mounts that are umounted and
processes their children, unmounting them or marking them for reparenting.

A last pass cleans up the state on the mounts that could not be umounted
and, if applicable, reparents them to their first parent that remained
mounted.

While a bit longer than the old code, this code is much more robust,
as it allows information to flow up from the leaves and down
from the trunk, making the order in which mounts are encountered
in the umount propagation tree irrelevant.

Cc: stable@vger.kernel.org
Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/mount.h     |   2 +-
 fs/namespace.c |   2 +-
 fs/pnode.c     | 144 ++++++++++++++++++++++++++++++++++-----------------------
 3 files changed, 88 insertions(+), 60 deletions(-)

Comments

Ram Pai May 30, 2017, 6:07 a.m. UTC | #1
On Wed, May 17, 2017 at 12:54:34AM -0500, Eric W. Biederman wrote:
> 
> While investigating some poor umount performance I realized that in
> the case of overlapping mount trees where some of the mounts are locked
> the code has been failing to unmount all of the mounts it should
> have been unmounting.
> 
> This failure to unmount all of the necessary
> mounts can be reproduced with:
> 
> $ cat locked_mounts_test.sh
> 
> mount -t tmpfs test-base /mnt
> mount --make-shared /mnt
> mkdir -p /mnt/b
> 
> mount -t tmpfs test1 /mnt/b
> mount --make-shared /mnt/b
> mkdir -p /mnt/b/10
> 
> mount -t tmpfs test2 /mnt/b/10
> mount --make-shared /mnt/b/10
> mkdir -p /mnt/b/10/20
> 
> mount --rbind /mnt/b /mnt/b/10/20
> 
> unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
> sleep 1
> umount -l /mnt/b
> wait %%
> 
> $ unshare -Urm ./locked_mounts_test.sh
> 
> This failure is corrected by removing the prepass that marks mounts
> that may be umounted.
> 
> A first pass is added that umounts mounts if possible and if not sets
> mount mark if they could be unmounted if they weren't locked and adds
> them to a list to umount possibilities.  This first pass reconsiders
> the mounts parent if it is on the list of umount possibilities, ensuring
> that information of umoutability will pass from child to mount parent.
> 
> A second pass then walks through all mounts that are umounted and processes
> their children unmounting them or marking them for reparenting.
> 
> A last pass cleans up the state on the mounts that could not be umounted
> and if applicable reparents them to their first parent that remained
> mounted.
> 
> While a bit longer than the old code this code is much more robust
> as it allows information to flow up from the leaves and down
> from the trunk making the order in which mounts are encountered
> in the umount propgation tree irrelevant.

Eric,

	I tried multiple times to understand the algorithm, but failed
	to understand the reasoning behind each of the steps. Hence
	I can't tell if the algorithm is correct or wrong.

	I know you are trying to optimize the current algorithm,
	but what is the key insight that you are trying to leverage
	to optimize it? That would probably help me analyze the
	algorithm.
	

	You walk the propagation tree, and for each element in the
	propagation tree you try to unmount its entire mount-tree.
	(I am not sure if this operation is correct, since I had
	 given an example in the past where this can go wrong.)
	And later, if you find that the unmount is successful, you try
	to walk up and see if the parent can also be unmounted (I don't
	know why this is needed).

Sorry, but if you can share some key insights, it would help.
RP

> 
> Cc: stable@vger.kernel.org
> Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/mount.h     |   2 +-
>  fs/namespace.c |   2 +-
>  fs/pnode.c     | 144 ++++++++++++++++++++++++++++++++++-----------------------
>  3 files changed, 88 insertions(+), 60 deletions(-)
> 
> diff --git a/fs/mount.h b/fs/mount.h
> index ede5a1d5cf99..de45d9e76748 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -58,7 +58,7 @@ struct mount {
>  	struct mnt_namespace *mnt_ns;	/* containing namespace */
>  	struct mountpoint *mnt_mp;	/* where is it mounted */
>  	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
> -	struct list_head mnt_reparent;	/* reparent list entry */
> +	struct list_head mnt_umounting; /* list entry for umount propagation */
>  #ifdef CONFIG_FSNOTIFY
>  	struct fsnotify_mark_connector __rcu *mnt_fsnotify_marks;
>  	__u32 mnt_fsnotify_mask;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 51e49866e1fe..5e3dcbeb1de5 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -236,7 +236,7 @@ static struct mount *alloc_vfsmnt(const char *name)
>  		INIT_LIST_HEAD(&mnt->mnt_slave_list);
>  		INIT_LIST_HEAD(&mnt->mnt_slave);
>  		INIT_HLIST_NODE(&mnt->mnt_mp_list);
> -		INIT_LIST_HEAD(&mnt->mnt_reparent);
> +		INIT_LIST_HEAD(&mnt->mnt_umounting);
>  		init_fs_pin(&mnt->mnt_umount, drop_mountpoint);
>  	}
>  	return mnt;
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 52aca0a118ff..fbaca7df2eb0 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -413,86 +413,95 @@ void propagate_mount_unlock(struct mount *mnt)
>  	}
>  }
> 
> -/*
> - * Mark all mounts that the MNT_LOCKED logic will allow to be unmounted.
> - */
> -static void mark_umount_candidates(struct mount *mnt)
> +static void umount_one(struct mount *mnt, struct list_head *to_umount)
>  {
> -	struct mount *parent = mnt->mnt_parent;
> -	struct mount *m;
> -
> -	BUG_ON(parent == mnt);
> -
> -	for (m = propagation_next(parent, parent); m;
> -			m = propagation_next(m, parent)) {
> -		struct mount *child = __lookup_mnt(&m->mnt,
> -						mnt->mnt_mountpoint);
> -		if (!child || (child->mnt.mnt_flags & MNT_UMOUNT))
> -			continue;
> -		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
> -			SET_MNT_MARK(child);
> -		}
> -	}
> +	CLEAR_MNT_MARK(mnt);
> +	mnt->mnt.mnt_flags |= MNT_UMOUNT;
> +	list_del_init(&mnt->mnt_child);
> +	list_del_init(&mnt->mnt_umounting);
> +	list_move_tail(&mnt->mnt_list, to_umount);
>  }
> 
>  /*
>   * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
>   * parent propagates to.
>   */
> -static void __propagate_umount(struct mount *mnt, struct list_head *to_reparent)
> +static bool __propagate_umount(struct mount *mnt,
> +			       struct list_head *to_umount,
> +			       struct list_head *to_restore)
>  {
> -	struct mount *parent = mnt->mnt_parent;
> -	struct mount *m;
> +	bool progress = false;
> +	struct mount *child;
> 
> -	BUG_ON(parent == mnt);
> +	/*
> +	 * The state of the parent won't change if this mount is
> +	 * already unmounted or marked as without children.
> +	 */
> +	if (mnt->mnt.mnt_flags & (MNT_UMOUNT | MNT_MARKED))
> +		goto out;
> 
> -	for (m = propagation_next(parent, parent); m;
> -			m = propagation_next(m, parent)) {
> -		struct mount *topper;
> -		struct mount *child = __lookup_mnt(&m->mnt,
> -						mnt->mnt_mountpoint);
> -		/*
> -		 * umount the child only if the child has no children
> -		 * and the child is marked safe to unmount.
> -		 */
> -		if (!child || !IS_MNT_MARKED(child))
> +	/* Verify topper is the only grandchild that has not been
> +	 * speculatively unmounted.
> +	 */
> +	list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
> +		if (child->mnt_mountpoint == mnt->mnt.mnt_root)
>  			continue;
> -		CLEAR_MNT_MARK(child);
> +		if (!list_empty(&child->mnt_umounting) && IS_MNT_MARKED(child))
> +			continue;
> +		/* Found a mounted child */
> +		goto children;
> +	}
> 
> -		/* If there is exactly one mount covering all of child
> -		 * replace child with that mount.
> -		 */
> -		topper = find_topper(child);
> -		if (topper)
> -			list_add_tail(&topper->mnt_reparent, to_reparent);
> +	/* Mark mounts that can be unmounted if not locked */
> +	SET_MNT_MARK(mnt);
> +	progress = true;
> 
> -		if (topper || list_empty(&child->mnt_mounts)) {
> -			list_del_init(&child->mnt_child);
> -			list_del_init(&child->mnt_reparent);
> -			child->mnt.mnt_flags |= MNT_UMOUNT;
> -			list_move_tail(&child->mnt_list, &mnt->mnt_list);
> +	/* If a mount is without children and not locked umount it. */
> +	if (!IS_MNT_LOCKED(mnt)) {
> +		umount_one(mnt, to_umount);
> +	} else {
> +children:
> +		list_move_tail(&mnt->mnt_umounting, to_restore);
> +	}
> +out:
> +	return progress;
> +}
> +
> +static void umount_list(struct list_head *to_umount,
> +			struct list_head *to_restore)
> +{
> +	struct mount *mnt, *child, *tmp;
> +	list_for_each_entry(mnt, to_umount, mnt_list) {
> +		list_for_each_entry_safe(child, tmp, &mnt->mnt_mounts, mnt_child) {
> +			/* topper? */
> +			if (child->mnt_mountpoint == mnt->mnt.mnt_root)
> +				list_move_tail(&child->mnt_umounting, to_restore);
> +			else
> +				umount_one(child, to_umount);
>  		}
>  	}
>  }
> 
> -static void reparent_mounts(struct list_head *to_reparent)
> +static void restore_mounts(struct list_head *to_restore)
>  {
> -	while (!list_empty(to_reparent)) {
> +	/* Restore mounts to a clean working state */
> +	while (!list_empty(to_restore)) {
>  		struct mount *mnt, *parent;
>  		struct mountpoint *mp;
> 
> -		mnt = list_first_entry(to_reparent, struct mount, mnt_reparent);
> -		list_del_init(&mnt->mnt_reparent);
> +		mnt = list_first_entry(to_restore, struct mount, mnt_umounting);
> +		CLEAR_MNT_MARK(mnt);
> +		list_del_init(&mnt->mnt_umounting);
> 
> -		/* Where should this mount be reparented to? */
> +		/* Should this mount be reparented? */
>  		mp = mnt->mnt_mp;
>  		parent = mnt->mnt_parent;
>  		while (parent->mnt.mnt_flags & MNT_UMOUNT) {
>  			mp = parent->mnt_mp;
>  			parent = parent->mnt_parent;
>  		}
> -
> -		mnt_change_mountpoint(parent, mp, mnt);
> +		if (parent != mnt->mnt_parent)
> +			mnt_change_mountpoint(parent, mp, mnt);
>  	}
>  }
> 
> @@ -506,15 +515,34 @@ static void reparent_mounts(struct list_head *to_reparent)
>  int propagate_umount(struct list_head *list)
>  {
>  	struct mount *mnt;
> -	LIST_HEAD(to_reparent);
> +	LIST_HEAD(to_restore);
> +	LIST_HEAD(to_umount);
> 
> -	list_for_each_entry_reverse(mnt, list, mnt_list)
> -		mark_umount_candidates(mnt);
> +	list_for_each_entry(mnt, list, mnt_list) {
> +		struct mount *parent = mnt->mnt_parent;
> +		struct mount *m;
> 
> -	list_for_each_entry(mnt, list, mnt_list)
> -		__propagate_umount(mnt, &to_reparent);
> +		for (m = propagation_next(parent, parent); m;
> +		     m = propagation_next(m, parent)) {
> +			struct mount *child = __lookup_mnt(&m->mnt,
> +							   mnt->mnt_mountpoint);
> +			if (!child)
> +				continue;
> +
> +			/* Check the child and parents while progress is made */
> +			while (__propagate_umount(child,
> +						  &to_umount, &to_restore)) {
> +				/* Is the parent a umount candidate? */
> +				child = child->mnt_parent;
> +				if (list_empty(&child->mnt_umounting))
> +					break;
> +			}
> +		}
> +	}
> 
> -	reparent_mounts(&to_reparent);
> +	umount_list(&to_umount, &to_restore);
> +	restore_mounts(&to_restore);
> +	list_splice_tail(&to_umount, list);
> 
>  	return 0;
>  }
> -- 
> 2.10.1
Eric W. Biederman May 30, 2017, 3:07 p.m. UTC | #2
Ram Pai <linuxram@us.ibm.com> writes:

> On Wed, May 17, 2017 at 12:54:34AM -0500, Eric W. Biederman wrote:
>> 
>> While investigating some poor umount performance I realized that in
>> the case of overlapping mount trees where some of the mounts are locked
>> the code has been failing to unmount all of the mounts it should
>> have been unmounting.
>> 
>> This failure to unmount all of the necessary
>> mounts can be reproduced with:
>> 
>> $ cat locked_mounts_test.sh
>> 
>> mount -t tmpfs test-base /mnt
>> mount --make-shared /mnt
>> mkdir -p /mnt/b
>> 
>> mount -t tmpfs test1 /mnt/b
>> mount --make-shared /mnt/b
>> mkdir -p /mnt/b/10
>> 
>> mount -t tmpfs test2 /mnt/b/10
>> mount --make-shared /mnt/b/10
>> mkdir -p /mnt/b/10/20
>> 
>> mount --rbind /mnt/b /mnt/b/10/20
>> 
>> unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
>> sleep 1
>> umount -l /mnt/b
>> wait %%
>> 
>> $ unshare -Urm ./locked_mounts_test.sh
>> 
>> This failure is corrected by removing the prepass that marks mounts
>> that may be umounted.
>> 
>> A first pass is added that umounts mounts if possible and if not sets
>> mount mark if they could be unmounted if they weren't locked and adds
>> them to a list to umount possibilities.  This first pass reconsiders
>> the mounts parent if it is on the list of umount possibilities, ensuring
>> that information of umoutability will pass from child to mount parent.
>> 
>> A second pass then walks through all mounts that are umounted and processes
>> their children unmounting them or marking them for reparenting.
>> 
>> A last pass cleans up the state on the mounts that could not be umounted
>> and if applicable reparents them to their first parent that remained
>> mounted.
>> 
>> While a bit longer than the old code this code is much more robust
>> as it allows information to flow up from the leaves and down
>> from the trunk making the order in which mounts are encountered
>> in the umount propgation tree irrelevant.
>
> Eric,
>
> 	I tried multiple time to understand the algorithm, but failed
> 	to understand the reasoning behind each of the steps. Hence
> 	I can't tell if the algorithm is correct or wrong.
>
> 	I know you are trying to optimize the current algorithm,
> 	but what is the key insight that you are trying to leverage
>        	to optimize it? That probably might help me analyze the
> 	algorithm.
> 	
>
> 	You walk the propogation tree, and for each element in the
> 	propagation-tree you try to unmount its entire mount-tree.
> 	(not sure if this operation is correct, since I know, I had
> 	 given an example in the past where this can go wrong).

I think you are referring to when I tried to propagate the entire mount
tree and was not following the individual propagation trees for
each mount.  That left some mounts mounted that, if we had
followed the individual propagation trees, would have been umounted.

This code does not do anything like that; it simply follows the
individual umount propagation trees.
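
Concretely, for each mount on the original umount list the new
propagate_umount() walks that mount's own propagation tree (a simplified
excerpt from the patch; the body of the inner loop is elided):

	list_for_each_entry(mnt, list, mnt_list) {
		struct mount *parent = mnt->mnt_parent;
		struct mount *m;

		/* Walk the peers and slaves that mnt's parent propagates
		 * to, and look up the mount sitting on the corresponding
		 * mountpoint in each of them.
		 */
		for (m = propagation_next(parent, parent); m;
		     m = propagation_next(m, parent)) {
			struct mount *child = __lookup_mnt(&m->mnt,
							   mnt->mnt_mountpoint);
			if (!child)
				continue;
			...
		}
	}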

> 	And later if you find that the unmount is successful, you try
> 	to walk up and see if the parent can also be unmounted(dont know
> 	why this is needed).

> Sorry, but if you can help with some key insights, it will help.

The first insight is that the parent-child relationship between
mounts in the set of mounts removed by MNT_DETACH is not
necessarily the same parent-child relationship between the mounts
the umount is propagated to.

Which leads to the second insight: we can not guarantee that, when
the mount propagation tree is being walked during umount, the
leaves are being walked first.

Even without tucked mounts and without locked mounts, that is enough
to say the original mount propagation code for umount was incorrect,
as it assumed that the leaves would be walked first.

I believe the test case I have given above is an example of that.  It
has locked mounts in it as well, which made everything doubly ugly.

To handle the case that the code may visit the parent before the child:
if something is mounted on top of a mount we wish to unmount, that mount
is added to the to_restore list instead of the to_umount list.

If a mount does not have children and it is unmounted, __propagate_umount
returns true.  The code then looks at the parent mount and, if it is on
the to_restore list, tries to unmount the parent again.

That is the leaf to root propagation.
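
In code, that is the inner loop of the same walk (taken from the patch;
the comments here are added to spell out the intent):

	/* Check the child and parents while progress is made */
	while (__propagate_umount(child, &to_umount, &to_restore)) {
		/* 'child' was just unmounted or marked, so its parent may
		 * now be a leaf.  Only reconsider the parent if it is
		 * itself still a umount candidate (it sits on the
		 * to_restore list); otherwise stop.
		 */
		child = child->mnt_parent;
		if (list_empty(&child->mnt_umounting))
			break;
	}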



There is also root-to-leaf propagation in umount_list to handle the case of
locked mounts.  Locked mounts come about because a more privileged user
propagated them into your mount namespace as a set, and the unprivileged
user is not allowed to break up the set lest they see something under a
mount they should not see.

When a mount could be unmounted if it were not locked to its parent,
it is marked and placed on the to_restore list.


When examining children, umount_one removes mounts from their parent's
mnt_mounts list.

Children that fully cover a mount (toppers) are ignored.
Children that, if not locked, would be unmounted are ignored,
   as those children become unmountable if their parent
   is unmountable.

   This allows a tree of mounts that is locked together to
   be unmounted if the root is unmountable.
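
That root-to-leaf sweep, which takes those children down with their
parent, is umount_list() in the patch (the comments here are added for
explanation):

	list_for_each_entry(mnt, to_umount, mnt_list) {
		list_for_each_entry_safe(child, tmp, &mnt->mnt_mounts, mnt_child) {
			/* A topper (a child covering all of mnt) is kept
			 * and later reparented by restore_mounts().
			 */
			if (child->mnt_mountpoint == mnt->mnt.mnt_root)
				list_move_tail(&child->mnt_umounting, to_restore);
			else
				/* Everything else mounted inside an
				 * unmounted mount goes too, locked or not,
				 * since its parent is going away.  It is
				 * appended to to_umount so its own children
				 * are visited by the outer loop as well.
				 */
				umount_one(child, to_umount);
		}
	}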


Does that help?

Eric





> RP
>
>> 
>> Cc: stable@vger.kernel.org
>> Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/mount.h     |   2 +-
>>  fs/namespace.c |   2 +-
>>  fs/pnode.c     | 144 ++++++++++++++++++++++++++++++++++-----------------------
>>  3 files changed, 88 insertions(+), 60 deletions(-)
>> 
>> diff --git a/fs/mount.h b/fs/mount.h
>> index ede5a1d5cf99..de45d9e76748 100644
>> --- a/fs/mount.h
>> +++ b/fs/mount.h
>> @@ -58,7 +58,7 @@ struct mount {
>>  	struct mnt_namespace *mnt_ns;	/* containing namespace */
>>  	struct mountpoint *mnt_mp;	/* where is it mounted */
>>  	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
>> -	struct list_head mnt_reparent;	/* reparent list entry */
>> +	struct list_head mnt_umounting; /* list entry for umount propagation */
>>  #ifdef CONFIG_FSNOTIFY
>>  	struct fsnotify_mark_connector __rcu *mnt_fsnotify_marks;
>>  	__u32 mnt_fsnotify_mask;
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index 51e49866e1fe..5e3dcbeb1de5 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -236,7 +236,7 @@ static struct mount *alloc_vfsmnt(const char *name)
>>  		INIT_LIST_HEAD(&mnt->mnt_slave_list);
>>  		INIT_LIST_HEAD(&mnt->mnt_slave);
>>  		INIT_HLIST_NODE(&mnt->mnt_mp_list);
>> -		INIT_LIST_HEAD(&mnt->mnt_reparent);
>> +		INIT_LIST_HEAD(&mnt->mnt_umounting);
>>  		init_fs_pin(&mnt->mnt_umount, drop_mountpoint);
>>  	}
>>  	return mnt;
>> diff --git a/fs/pnode.c b/fs/pnode.c
>> index 52aca0a118ff..fbaca7df2eb0 100644
>> --- a/fs/pnode.c
>> +++ b/fs/pnode.c
>> @@ -413,86 +413,95 @@ void propagate_mount_unlock(struct mount *mnt)
>>  	}
>>  }
>> 
>> -/*
>> - * Mark all mounts that the MNT_LOCKED logic will allow to be unmounted.
>> - */
>> -static void mark_umount_candidates(struct mount *mnt)
>> +static void umount_one(struct mount *mnt, struct list_head *to_umount)
>>  {
>> -	struct mount *parent = mnt->mnt_parent;
>> -	struct mount *m;
>> -
>> -	BUG_ON(parent == mnt);
>> -
>> -	for (m = propagation_next(parent, parent); m;
>> -			m = propagation_next(m, parent)) {
>> -		struct mount *child = __lookup_mnt(&m->mnt,
>> -						mnt->mnt_mountpoint);
>> -		if (!child || (child->mnt.mnt_flags & MNT_UMOUNT))
>> -			continue;
>> -		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
>> -			SET_MNT_MARK(child);
>> -		}
>> -	}
>> +	CLEAR_MNT_MARK(mnt);
>> +	mnt->mnt.mnt_flags |= MNT_UMOUNT;
>> +	list_del_init(&mnt->mnt_child);
>> +	list_del_init(&mnt->mnt_umounting);
>> +	list_move_tail(&mnt->mnt_list, to_umount);
>>  }
>> 
>>  /*
>>   * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
>>   * parent propagates to.
>>   */
>> -static void __propagate_umount(struct mount *mnt, struct list_head *to_reparent)
>> +static bool __propagate_umount(struct mount *mnt,
>> +			       struct list_head *to_umount,
>> +			       struct list_head *to_restore)
>>  {
>> -	struct mount *parent = mnt->mnt_parent;
>> -	struct mount *m;
>> +	bool progress = false;
>> +	struct mount *child;
>> 
>> -	BUG_ON(parent == mnt);
>> +	/*
>> +	 * The state of the parent won't change if this mount is
>> +	 * already unmounted or marked as without children.
>> +	 */
>> +	if (mnt->mnt.mnt_flags & (MNT_UMOUNT | MNT_MARKED))
>> +		goto out;
>> 
>> -	for (m = propagation_next(parent, parent); m;
>> -			m = propagation_next(m, parent)) {
>> -		struct mount *topper;
>> -		struct mount *child = __lookup_mnt(&m->mnt,
>> -						mnt->mnt_mountpoint);
>> -		/*
>> -		 * umount the child only if the child has no children
>> -		 * and the child is marked safe to unmount.
>> -		 */
>> -		if (!child || !IS_MNT_MARKED(child))
>> +	/* Verify topper is the only grandchild that has not been
>> +	 * speculatively unmounted.
>> +	 */
>> +	list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
>> +		if (child->mnt_mountpoint == mnt->mnt.mnt_root)
>>  			continue;
>> -		CLEAR_MNT_MARK(child);
>> +		if (!list_empty(&child->mnt_umounting) && IS_MNT_MARKED(child))
>> +			continue;
>> +		/* Found a mounted child */
>> +		goto children;
>> +	}
>> 
>> -		/* If there is exactly one mount covering all of child
>> -		 * replace child with that mount.
>> -		 */
>> -		topper = find_topper(child);
>> -		if (topper)
>> -			list_add_tail(&topper->mnt_reparent, to_reparent);
>> +	/* Mark mounts that can be unmounted if not locked */
>> +	SET_MNT_MARK(mnt);
>> +	progress = true;
>> 
>> -		if (topper || list_empty(&child->mnt_mounts)) {
>> -			list_del_init(&child->mnt_child);
>> -			list_del_init(&child->mnt_reparent);
>> -			child->mnt.mnt_flags |= MNT_UMOUNT;
>> -			list_move_tail(&child->mnt_list, &mnt->mnt_list);
>> +	/* If a mount is without children and not locked umount it. */
>> +	if (!IS_MNT_LOCKED(mnt)) {
>> +		umount_one(mnt, to_umount);
>> +	} else {
>> +children:
>> +		list_move_tail(&mnt->mnt_umounting, to_restore);
>> +	}
>> +out:
>> +	return progress;
>> +}
>> +
>> +static void umount_list(struct list_head *to_umount,
>> +			struct list_head *to_restore)
>> +{
>> +	struct mount *mnt, *child, *tmp;
>> +	list_for_each_entry(mnt, to_umount, mnt_list) {
>> +		list_for_each_entry_safe(child, tmp, &mnt->mnt_mounts, mnt_child) {
>> +			/* topper? */
>> +			if (child->mnt_mountpoint == mnt->mnt.mnt_root)
>> +				list_move_tail(&child->mnt_umounting, to_restore);
>> +			else
>> +				umount_one(child, to_umount);
>>  		}
>>  	}
>>  }
>> 
>> -static void reparent_mounts(struct list_head *to_reparent)
>> +static void restore_mounts(struct list_head *to_restore)
>>  {
>> -	while (!list_empty(to_reparent)) {
>> +	/* Restore mounts to a clean working state */
>> +	while (!list_empty(to_restore)) {
>>  		struct mount *mnt, *parent;
>>  		struct mountpoint *mp;
>> 
>> -		mnt = list_first_entry(to_reparent, struct mount, mnt_reparent);
>> -		list_del_init(&mnt->mnt_reparent);
>> +		mnt = list_first_entry(to_restore, struct mount, mnt_umounting);
>> +		CLEAR_MNT_MARK(mnt);
>> +		list_del_init(&mnt->mnt_umounting);
>> 
>> -		/* Where should this mount be reparented to? */
>> +		/* Should this mount be reparented? */
>>  		mp = mnt->mnt_mp;
>>  		parent = mnt->mnt_parent;
>>  		while (parent->mnt.mnt_flags & MNT_UMOUNT) {
>>  			mp = parent->mnt_mp;
>>  			parent = parent->mnt_parent;
>>  		}
>> -
>> -		mnt_change_mountpoint(parent, mp, mnt);
>> +		if (parent != mnt->mnt_parent)
>> +			mnt_change_mountpoint(parent, mp, mnt);
>>  	}
>>  }
>> 
>> @@ -506,15 +515,34 @@ static void reparent_mounts(struct list_head *to_reparent)
>>  int propagate_umount(struct list_head *list)
>>  {
>>  	struct mount *mnt;
>> -	LIST_HEAD(to_reparent);
>> +	LIST_HEAD(to_restore);
>> +	LIST_HEAD(to_umount);
>> 
>> -	list_for_each_entry_reverse(mnt, list, mnt_list)
>> -		mark_umount_candidates(mnt);
>> +	list_for_each_entry(mnt, list, mnt_list) {
>> +		struct mount *parent = mnt->mnt_parent;
>> +		struct mount *m;
>> 
>> -	list_for_each_entry(mnt, list, mnt_list)
>> -		__propagate_umount(mnt, &to_reparent);
>> +		for (m = propagation_next(parent, parent); m;
>> +		     m = propagation_next(m, parent)) {
>> +			struct mount *child = __lookup_mnt(&m->mnt,
>> +							   mnt->mnt_mountpoint);
>> +			if (!child)
>> +				continue;
>> +
>> +			/* Check the child and parents while progress is made */
>> +			while (__propagate_umount(child,
>> +						  &to_umount, &to_restore)) {
>> +				/* Is the parent a umount candidate? */
>> +				child = child->mnt_parent;
>> +				if (list_empty(&child->mnt_umounting))
>> +					break;
>> +			}
>> +		}
>> +	}
>> 
>> -	reparent_mounts(&to_reparent);
>> +	umount_list(&to_umount, &to_restore);
>> +	restore_mounts(&to_restore);
>> +	list_splice_tail(&to_umount, list);
>> 
>>  	return 0;
>>  }
>> -- 
>> 2.10.1
Ram Pai June 7, 2017, 9:54 a.m. UTC | #3
On Tue, May 30, 2017 at 10:07:49AM -0500, Eric W. Biederman wrote:
> Ram Pai <linuxram@us.ibm.com> writes:
> 
> > On Wed, May 17, 2017 at 12:54:34AM -0500, Eric W. Biederman wrote:
> >> 
> >> While investigating some poor umount performance I realized that in
> >> the case of overlapping mount trees where some of the mounts are locked
> >> the code has been failing to unmount all of the mounts it should
> >> have been unmounting.
> >> 
> >> This failure to unmount all of the necessary
> >> mounts can be reproduced with:
> >> 
> >> $ cat locked_mounts_test.sh
> >> 
> >> mount -t tmpfs test-base /mnt
> >> mount --make-shared /mnt
> >> mkdir -p /mnt/b
> >> 
> >> mount -t tmpfs test1 /mnt/b
> >> mount --make-shared /mnt/b
> >> mkdir -p /mnt/b/10
> >> 
> >> mount -t tmpfs test2 /mnt/b/10
> >> mount --make-shared /mnt/b/10
> >> mkdir -p /mnt/b/10/20
> >> 
> >> mount --rbind /mnt/b /mnt/b/10/20
> >> 
> >> unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
> >> sleep 1
> >> umount -l /mnt/b
> >> wait %%
> >> 
> >> $ unshare -Urm ./locked_mounts_test.sh
> >> 
> >> This failure is corrected by removing the prepass that marks mounts
> >> that may be umounted.
> >> 
> >> A first pass is added that umounts mounts if possible and if not sets
> >> mount mark if they could be unmounted if they weren't locked and adds
> >> them to a list to umount possibilities.  This first pass reconsiders
> >> the mounts parent if it is on the list of umount possibilities, ensuring
> >> that information of umoutability will pass from child to mount parent.
> >> 
> >> A second pass then walks through all mounts that are umounted and processes
> >> their children unmounting them or marking them for reparenting.
> >> 
> >> A last pass cleans up the state on the mounts that could not be umounted
> >> and if applicable reparents them to their first parent that remained
> >> mounted.
> >> 
> >> While a bit longer than the old code this code is much more robust
> >> as it allows information to flow up from the leaves and down
> >> from the trunk making the order in which mounts are encountered
> >> in the umount propgation tree irrelevant.
> >
> > Eric,
> >
> > 	I tried multiple time to understand the algorithm, but failed
> > 	to understand the reasoning behind each of the steps. Hence
> > 	I can't tell if the algorithm is correct or wrong.
> >
> > 	I know you are trying to optimize the current algorithm,
> > 	but what is the key insight that you are trying to leverage
> >        	to optimize it? That probably might help me analyze the
> > 	algorithm.
> > 	
> >
> > 	You walk the propogation tree, and for each element in the
> > 	propagation-tree you try to unmount its entire mount-tree.
> > 	(not sure if this operation is correct, since I know, I had
> > 	 given an example in the past where this can go wrong).
> 
> I think you are refering to when I tried to propgate the entire tree
> mount and was not following the individual propagation trees for
> each mount.  Which left some mounts mounted that if we had
> followed the individual propgation trees would have been umounted.
> 
> This code does not do anything like that it simply follows the
> individual umount propgation trees.
> 
> > 	And later if you find that the unmount is successful, you try
> > 	to walk up and see if the parent can also be unmounted(dont know
> > 	why this is needed).
> 
> > Sorry, but if you can help with some key insights, it will help.
> 
> The first insight is that the parent child relationship that happens
> between mounts in the set of mounts removed by MNT_DETACH is not
> necessarily the same parent child relationship between the mounts
> the are propagated to.
> 
> Which leads to the second insight that we can not guarantee that during
> umount when the mount propgation tree is being walked we can not
> gaurantee that the leaves are being walked first.
> 

If you agree that "(A) tucked mounts do/should NOT receive propagation
events on their root dentry", then I strongly think, short of asserting,
that the propagation tree will always reach the child first
and not the parent. I will further verify my thoughts before
elevating my strong thinking to an assertion.

But if you do not agree with (A), then yes, we could reach the
parent or the child first, and we will badly need your algorithm.

BTW: This is the same disagreement that we had earlier regarding
the semantics of tucked mounts.

Assuming that you may not agree with (A), I looked through your
explanation and the code/algorithm, and the steps make sense.

The one thing that was not clear to me was -- is it ok to umount a
child whose ancestor is MNT_LOCKED?  Your algorithm seems
to allow that.



> Without tucked mounts, without locked mounts that information is enough
> to say the original mount propgation code for umount was incorrect as it
> assumed that the leaves would be walked first.
> 
> I believe the test case I have given above is an example of that.  It
> has locked mounts in it as well which made everything doubly ugly.
> 
> To handle the case that the code may visit the parent before the child
> if something is mounted on top of a mount we wish to unmount it is
> added to the to_restore list instead of the to_umount list.
> 
> If a mount does not have children and it is unmounted __propgate_umount
> returns true.  The code then looks at the parent mount and it is on
> the to_restore list it tries to unmount the parent again.
> 
> That is the leaf to root propagation.
> 
> 
> 
> There is also root to leaf propgation in umount_list to handle the case of
> locked mounts.  Locked mounts come about because a more privileged user
> propagated them into your mount namespace as a set, and the unprivileged
> user is not allowed to break up the set lest they see something under a
> mount they should not see.
> 
> When a mount that could be unmounted if it was not locked to it's parent
> it is marked and placed on the to_restore list.
> 
> 
> When examining children umount_one removes mounts from their parents
> mnt_mounts list.
> 
> Children that fully cover a mount (toppers) are ignored.
> Children that if not locked would be unmounted are ignored
>    as those children become unmountable if their parent
>    is unmountable.

I find that children can get unmounted even when one of their ancestors is
locked.  An example case is -- the child is a leaf and one of its
ancestors is locked, but is not part of any related propagation tree.


> 
>    This allows a tree of mounts that is locked together to
>    be unmounted if the root is unmountable.
> 
> 
> Does that help?

Yes, your explanation helped.

Sorry it took some time to get to this,
RP

> 
> Eric
> >> 
> >> Cc: stable@vger.kernel.org
> >> Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >> ---
> >>  fs/mount.h     |   2 +-
> >>  fs/namespace.c |   2 +-
> >>  fs/pnode.c     | 144 ++++++++++++++++++++++++++++++++++-----------------------
> >>  3 files changed, 88 insertions(+), 60 deletions(-)
> >> 
> >> diff --git a/fs/mount.h b/fs/mount.h
> >> index ede5a1d5cf99..de45d9e76748 100644
> >> --- a/fs/mount.h
> >> +++ b/fs/mount.h
> >> @@ -58,7 +58,7 @@ struct mount {
> >>  	struct mnt_namespace *mnt_ns;	/* containing namespace */
> >>  	struct mountpoint *mnt_mp;	/* where is it mounted */
> >>  	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
> >> -	struct list_head mnt_reparent;	/* reparent list entry */
> >> +	struct list_head mnt_umounting; /* list entry for umount propagation */
> >>  #ifdef CONFIG_FSNOTIFY
> >>  	struct fsnotify_mark_connector __rcu *mnt_fsnotify_marks;
> >>  	__u32 mnt_fsnotify_mask;
> >> diff --git a/fs/namespace.c b/fs/namespace.c
> >> index 51e49866e1fe..5e3dcbeb1de5 100644
> >> --- a/fs/namespace.c
> >> +++ b/fs/namespace.c
> >> @@ -236,7 +236,7 @@ static struct mount *alloc_vfsmnt(const char *name)
> >>  		INIT_LIST_HEAD(&mnt->mnt_slave_list);
> >>  		INIT_LIST_HEAD(&mnt->mnt_slave);
> >>  		INIT_HLIST_NODE(&mnt->mnt_mp_list);
> >> -		INIT_LIST_HEAD(&mnt->mnt_reparent);
> >> +		INIT_LIST_HEAD(&mnt->mnt_umounting);
> >>  		init_fs_pin(&mnt->mnt_umount, drop_mountpoint);
> >>  	}
> >>  	return mnt;
> >> diff --git a/fs/pnode.c b/fs/pnode.c
> >> index 52aca0a118ff..fbaca7df2eb0 100644
> >> --- a/fs/pnode.c
> >> +++ b/fs/pnode.c
> >> @@ -413,86 +413,95 @@ void propagate_mount_unlock(struct mount *mnt)
> >>  	}
> >>  }
> >> 
> >> -/*
> >> - * Mark all mounts that the MNT_LOCKED logic will allow to be unmounted.
> >> - */
> >> -static void mark_umount_candidates(struct mount *mnt)
> >> +static void umount_one(struct mount *mnt, struct list_head *to_umount)
> >>  {
> >> -	struct mount *parent = mnt->mnt_parent;
> >> -	struct mount *m;
> >> -
> >> -	BUG_ON(parent == mnt);
> >> -
> >> -	for (m = propagation_next(parent, parent); m;
> >> -			m = propagation_next(m, parent)) {
> >> -		struct mount *child = __lookup_mnt(&m->mnt,
> >> -						mnt->mnt_mountpoint);
> >> -		if (!child || (child->mnt.mnt_flags & MNT_UMOUNT))
> >> -			continue;
> >> -		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
> >> -			SET_MNT_MARK(child);
> >> -		}
> >> -	}
> >> +	CLEAR_MNT_MARK(mnt);
> >> +	mnt->mnt.mnt_flags |= MNT_UMOUNT;
> >> +	list_del_init(&mnt->mnt_child);
> >> +	list_del_init(&mnt->mnt_umounting);
> >> +	list_move_tail(&mnt->mnt_list, to_umount);
> >>  }
> >> 
> >>  /*
> >>   * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
> >>   * parent propagates to.
> >>   */
> >> -static void __propagate_umount(struct mount *mnt, struct list_head *to_reparent)
> >> +static bool __propagate_umount(struct mount *mnt,
> >> +			       struct list_head *to_umount,
> >> +			       struct list_head *to_restore)
> >>  {
> >> -	struct mount *parent = mnt->mnt_parent;
> >> -	struct mount *m;
> >> +	bool progress = false;
> >> +	struct mount *child;
> >> 
> >> -	BUG_ON(parent == mnt);
> >> +	/*
> >> +	 * The state of the parent won't change if this mount is
> >> +	 * already unmounted or marked as without children.
> >> +	 */
> >> +	if (mnt->mnt.mnt_flags & (MNT_UMOUNT | MNT_MARKED))
> >> +		goto out;
> >> 
> >> -	for (m = propagation_next(parent, parent); m;
> >> -			m = propagation_next(m, parent)) {
> >> -		struct mount *topper;
> >> -		struct mount *child = __lookup_mnt(&m->mnt,
> >> -						mnt->mnt_mountpoint);
> >> -		/*
> >> -		 * umount the child only if the child has no children
> >> -		 * and the child is marked safe to unmount.
> >> -		 */
> >> -		if (!child || !IS_MNT_MARKED(child))
> >> +	/* Verify topper is the only grandchild that has not been
> >> +	 * speculatively unmounted.
> >> +	 */
> >> +	list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
> >> +		if (child->mnt_mountpoint == mnt->mnt.mnt_root)
> >>  			continue;
> >> -		CLEAR_MNT_MARK(child);
> >> +		if (!list_empty(&child->mnt_umounting) && IS_MNT_MARKED(child))
> >> +			continue;
> >> +		/* Found a mounted child */
> >> +		goto children;
> >> +	}
> >> 
> >> -		/* If there is exactly one mount covering all of child
> >> -		 * replace child with that mount.
> >> -		 */
> >> -		topper = find_topper(child);
> >> -		if (topper)
> >> -			list_add_tail(&topper->mnt_reparent, to_reparent);
> >> +	/* Mark mounts that can be unmounted if not locked */
> >> +	SET_MNT_MARK(mnt);
> >> +	progress = true;
> >> 
> >> -		if (topper || list_empty(&child->mnt_mounts)) {
> >> -			list_del_init(&child->mnt_child);
> >> -			list_del_init(&child->mnt_reparent);
> >> -			child->mnt.mnt_flags |= MNT_UMOUNT;
> >> -			list_move_tail(&child->mnt_list, &mnt->mnt_list);
> >> +	/* If a mount is without children and not locked umount it. */
> >> +	if (!IS_MNT_LOCKED(mnt)) {
> >> +		umount_one(mnt, to_umount);
> >> +	} else {
> >> +children:
> >> +		list_move_tail(&mnt->mnt_umounting, to_restore);
> >> +	}
> >> +out:
> >> +	return progress;
> >> +}
> >> +
> >> +static void umount_list(struct list_head *to_umount,
> >> +			struct list_head *to_restore)
> >> +{
> >> +	struct mount *mnt, *child, *tmp;
> >> +	list_for_each_entry(mnt, to_umount, mnt_list) {
> >> +		list_for_each_entry_safe(child, tmp, &mnt->mnt_mounts, mnt_child) {
> >> +			/* topper? */
> >> +			if (child->mnt_mountpoint == mnt->mnt.mnt_root)
> >> +				list_move_tail(&child->mnt_umounting, to_restore);
> >> +			else
> >> +				umount_one(child, to_umount);
> >>  		}
> >>  	}
> >>  }
> >> 
> >> -static void reparent_mounts(struct list_head *to_reparent)
> >> +static void restore_mounts(struct list_head *to_restore)
> >>  {
> >> -	while (!list_empty(to_reparent)) {
> >> +	/* Restore mounts to a clean working state */
> >> +	while (!list_empty(to_restore)) {
> >>  		struct mount *mnt, *parent;
> >>  		struct mountpoint *mp;
> >> 
> >> -		mnt = list_first_entry(to_reparent, struct mount, mnt_reparent);
> >> -		list_del_init(&mnt->mnt_reparent);
> >> +		mnt = list_first_entry(to_restore, struct mount, mnt_umounting);
> >> +		CLEAR_MNT_MARK(mnt);
> >> +		list_del_init(&mnt->mnt_umounting);
> >> 
> >> -		/* Where should this mount be reparented to? */
> >> +		/* Should this mount be reparented? */
> >>  		mp = mnt->mnt_mp;
> >>  		parent = mnt->mnt_parent;
> >>  		while (parent->mnt.mnt_flags & MNT_UMOUNT) {
> >>  			mp = parent->mnt_mp;
> >>  			parent = parent->mnt_parent;
> >>  		}
> >> -
> >> -		mnt_change_mountpoint(parent, mp, mnt);
> >> +		if (parent != mnt->mnt_parent)
> >> +			mnt_change_mountpoint(parent, mp, mnt);
> >>  	}
> >>  }
> >> 
> >> @@ -506,15 +515,34 @@ static void reparent_mounts(struct list_head *to_reparent)
> >>  int propagate_umount(struct list_head *list)
> >>  {
> >>  	struct mount *mnt;
> >> -	LIST_HEAD(to_reparent);
> >> +	LIST_HEAD(to_restore);
> >> +	LIST_HEAD(to_umount);
> >> 
> >> -	list_for_each_entry_reverse(mnt, list, mnt_list)
> >> -		mark_umount_candidates(mnt);
> >> +	list_for_each_entry(mnt, list, mnt_list) {
> >> +		struct mount *parent = mnt->mnt_parent;
> >> +		struct mount *m;
> >> 
> >> -	list_for_each_entry(mnt, list, mnt_list)
> >> -		__propagate_umount(mnt, &to_reparent);
> >> +		for (m = propagation_next(parent, parent); m;
> >> +		     m = propagation_next(m, parent)) {
> >> +			struct mount *child = __lookup_mnt(&m->mnt,
> >> +							   mnt->mnt_mountpoint);
> >> +			if (!child)
> >> +				continue;
> >> +
> >> +			/* Check the child and parents while progress is made */
> >> +			while (__propagate_umount(child,
> >> +						  &to_umount, &to_restore)) {
> >> +				/* Is the parent a umount candidate? */
> >> +				child = child->mnt_parent;
> >> +				if (list_empty(&child->mnt_umounting))
> >> +					break;
> >> +			}
> >> +		}
> >> +	}
> >> 
> >> -	reparent_mounts(&to_reparent);
> >> +	umount_list(&to_umount, &to_restore);
> >> +	restore_mounts(&to_restore);
> >> +	list_splice_tail(&to_umount, list);
> >> 
> >>  	return 0;
> >>  }
> >> -- 
> >> 2.10.1
Eric W. Biederman June 7, 2017, 1:09 p.m. UTC | #4
Ram Pai <linuxram@us.ibm.com> writes:

> On Tue, May 30, 2017 at 10:07:49AM -0500, Eric W. Biederman wrote:
>> Ram Pai <linuxram@us.ibm.com> writes:
>> 
>> > On Wed, May 17, 2017 at 12:54:34AM -0500, Eric W. Biederman wrote:
>> >> 
>> >> While investigating some poor umount performance I realized that in
>> >> the case of overlapping mount trees where some of the mounts are locked
>> >> the code has been failing to unmount all of the mounts it should
>> >> have been unmounting.
>> >> 
>> >> This failure to unmount all of the necessary
>> >> mounts can be reproduced with:
>> >> 
>> >> $ cat locked_mounts_test.sh
>> >> 
>> >> mount -t tmpfs test-base /mnt
>> >> mount --make-shared /mnt
>> >> mkdir -p /mnt/b
>> >> 
>> >> mount -t tmpfs test1 /mnt/b
>> >> mount --make-shared /mnt/b
>> >> mkdir -p /mnt/b/10
>> >> 
>> >> mount -t tmpfs test2 /mnt/b/10
>> >> mount --make-shared /mnt/b/10
>> >> mkdir -p /mnt/b/10/20
>> >> 
>> >> mount --rbind /mnt/b /mnt/b/10/20
>> >> 
>> >> unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
>> >> sleep 1
>> >> umount -l /mnt/b
>> >> wait %%
>> >> 
>> >> $ unshare -Urm ./locked_mounts_test.sh
>> >> 
>> >> This failure is corrected by removing the prepass that marks mounts
>> >> that may be umounted.
>> >> 
>> >> A first pass is added that umounts mounts if possible and if not sets
>> >> mount mark if they could be unmounted if they weren't locked and adds
>> >> them to a list to umount possibilities.  This first pass reconsiders
>> >> the mounts parent if it is on the list of umount possibilities, ensuring
>> >> that information of umoutability will pass from child to mount parent.
>> >> 
>> >> A second pass then walks through all mounts that are umounted and processes
>> >> their children unmounting them or marking them for reparenting.
>> >> 
>> >> A last pass cleans up the state on the mounts that could not be umounted
>> >> and if applicable reparents them to their first parent that remained
>> >> mounted.
>> >> 
>> >> While a bit longer than the old code this code is much more robust
>> >> as it allows information to flow up from the leaves and down
>> >> from the trunk making the order in which mounts are encountered
>> >> in the umount propgation tree irrelevant.
>> >
>> > Eric,
>> >
>> > 	I tried multiple time to understand the algorithm, but failed
>> > 	to understand the reasoning behind each of the steps. Hence
>> > 	I can't tell if the algorithm is correct or wrong.
>> >
>> > 	I know you are trying to optimize the current algorithm,
>> > 	but what is the key insight that you are trying to leverage
>> >        	to optimize it? That probably might help me analyze the
>> > 	algorithm.
>> > 	
>> >
>> > 	You walk the propogation tree, and for each element in the
>> > 	propagation-tree you try to unmount its entire mount-tree.
>> > 	(not sure if this operation is correct, since I know, I had
>> > 	 given an example in the past where this can go wrong).
>> 
>> I think you are refering to when I tried to propgate the entire tree
>> mount and was not following the individual propagation trees for
>> each mount.  Which left some mounts mounted that if we had
>> followed the individual propgation trees would have been umounted.
>> 
>> This code does not do anything like that it simply follows the
>> individual umount propgation trees.
>> 
>> > 	And later if you find that the unmount is successful, you try
>> > 	to walk up and see if the parent can also be unmounted(dont know
>> > 	why this is needed).
>> 
>> > Sorry, but if you can help with some key insights, it will help.
>> 
>> The first insight is that the parent child relationship that happens
>> between mounts in the set of mounts removed by MNT_DETACH is not
>> necessarily the same parent child relationship between the mounts
>> the are propagated to.
>> 
>> Which leads to the second insight that we can not guarantee that during
>> umount when the mount propgation tree is being walked we can not
>> gaurantee that the leaves are being walked first.
>> 
>
> If you agree that " (A) tucked mounts do/should NOT receive propagation
> events on its root dentry ", than i strongly think; short of asserting,
> that the propagation tree will always reach the child first
> and not the parent. I will further verify my thoughts before
> elevating my strong-thinking to an assertion.

What led to the implementation of these mounts as tucked mounts rather
than side/shadow mounts is Al Viro's assertion that these are normal
mounts.  As normal mounts, I tend to think they should be treated
normally.

My experience to date suggests that, except for artificially created
pathological cases, the possible ways we can treat tucked mounts
do not appear to matter.

So I do not yet agree to (A).

I don't think we have ever actually implemented (A).  We have
implemented something very similar, but the act of untucking a mount
resulted in what was at the root of a dentry receiving events during
the same umount MNT_DETACH instance.

All that is important for the optimizations in patch 2/2 is that
resolving the underlying mount is a separate and final pass from
dealing with mount propagation.  So your (A) could be compatible with
the code in patch 1/2.

> But if you do not agree with (A) than yes we could reach the
> parent or the child first and we will badly need your algorithm.

Please note my example does not involve any tucked mounts.  I see this
issue with a very ordinary set of mounts and mount propagation.

So I think we have rough agreement that my algorithm in patch 1/2 is
needed.

> BTW: This is the same disagreement that we had earlier regarding
> the semantics of tucked mounts.
>
> Assuming that you may not agree with (A), I looked through your
> explaination and the code/algorithm and the steps make sense.

> The one thing that was not clear to me was --  is it ok to umount
> child whose ancestor is MNT_LOCKED?  Your algorithm seems
> to allow that.

If you have a tree of mounts, with the root of the tree not locked onto
its mountpoint and the rest of the mounts in the tree locked onto their
mountpoints, then yes, it is ok to unmount the tree (just not the
individual pieces).

While user space has access to the mounts it is necessary to keep the
pieces locked together.

MNT_LOCKED guards against seeing the mountpoint or any of the files or
directories under the mountpoint.

MNT_LOCKED is a mechanism for dealing with privilege differences between
mount namespaces.
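
A concrete way to see that guard (an illustrative sketch, not part of
the patch; the paths and filesystem names here are made up):

	# In the original mount namespace, with privilege:
	mount -t tmpfs lower /mnt
	mkdir /mnt/dir
	touch /mnt/dir/secret
	mount -t tmpfs upper /mnt/dir	# 'secret' is now covered

	# Then, as an unprivileged user, create a new user + mount namespace:
	unshare -Urm /bin/sh
	# The copies of 'lower' and 'upper' arrive as a locked set, so
	# unmounting 'upper' by itself is refused -- it would expose
	# /mnt/dir/secret.  The set can only go away together.
	umount /mnt/dir		# fails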

>> Without tucked mounts, without locked mounts that information is enough
>> to say the original mount propgation code for umount was incorrect as it
>> assumed that the leaves would be walked first.
>> 
>> I believe the test case I have given above is an example of that.  It
>> has locked mounts in it as well which made everything doubly ugly.
>> 
>> To handle the case that the code may visit the parent before the child
>> if something is mounted on top of a mount we wish to unmount it is
>> added to the to_restore list instead of the to_umount list.
>> 
>> If a mount does not have children and it is unmounted __propgate_umount
>> returns true.  The code then looks at the parent mount and it is on
>> the to_restore list it tries to unmount the parent again.
>> 
>> That is the leaf to root propagation.
>> 
>> 
>> 
>> There is also root to leaf propgation in umount_list to handle the case of
>> locked mounts.  Locked mounts come about because a more privileged user
>> propagated them into your mount namespace as a set, and the unprivileged
>> user is not allowed to break up the set lest they see something under a
>> mount they should not see.
>> 
>> When a mount that could be unmounted if it was not locked to it's parent
>> it is marked and placed on the to_restore list.
>> 
>> 
>> When examining children umount_one removes mounts from their parents
>> mnt_mounts list.
>> 
>> Children that fully cover a mount (toppers) are ignored.
>> Children that if not locked would be unmounted are ignored
>>    as those children become unmountable if their parent
>>    is unmountable.
>
> I find the children can get unmounted even when one its ancestor is
> locked.  An example case is -- if the child is a leaf and one of its
> ancestor is locked, but is not part of any related propagation tree.

I am not quite certain what you are asserting.  Locking is an operation
to hide the contents of a mountpoint.  So a leaf that is not locked to
its parent is perfectly fine to unmount.

>> 
>>    This allows a tree of mounts that is locked together to
>>    be unmounted if the root is unmountable.
>> 
>> 
>> Does that help?
>
> Yes your explaination helped,
>
> Sorry it took some time to get to this,

Thank you for making the time to get to this and have the conversation.
These are tough issues to dig through, and we aren't having the easiest
of conversations.

Eric

Patch

diff --git a/fs/mount.h b/fs/mount.h
index ede5a1d5cf99..de45d9e76748 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -58,7 +58,7 @@  struct mount {
 	struct mnt_namespace *mnt_ns;	/* containing namespace */
 	struct mountpoint *mnt_mp;	/* where is it mounted */
 	struct hlist_node mnt_mp_list;	/* list mounts with the same mountpoint */
-	struct list_head mnt_reparent;	/* reparent list entry */
+	struct list_head mnt_umounting; /* list entry for umount propagation */
 #ifdef CONFIG_FSNOTIFY
 	struct fsnotify_mark_connector __rcu *mnt_fsnotify_marks;
 	__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index 51e49866e1fe..5e3dcbeb1de5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -236,7 +236,7 @@  static struct mount *alloc_vfsmnt(const char *name)
 		INIT_LIST_HEAD(&mnt->mnt_slave_list);
 		INIT_LIST_HEAD(&mnt->mnt_slave);
 		INIT_HLIST_NODE(&mnt->mnt_mp_list);
-		INIT_LIST_HEAD(&mnt->mnt_reparent);
+		INIT_LIST_HEAD(&mnt->mnt_umounting);
 		init_fs_pin(&mnt->mnt_umount, drop_mountpoint);
 	}
 	return mnt;
diff --git a/fs/pnode.c b/fs/pnode.c
index 52aca0a118ff..fbaca7df2eb0 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -413,86 +413,95 @@  void propagate_mount_unlock(struct mount *mnt)
 	}
 }
 
-/*
- * Mark all mounts that the MNT_LOCKED logic will allow to be unmounted.
- */
-static void mark_umount_candidates(struct mount *mnt)
+static void umount_one(struct mount *mnt, struct list_head *to_umount)
 {
-	struct mount *parent = mnt->mnt_parent;
-	struct mount *m;
-
-	BUG_ON(parent == mnt);
-
-	for (m = propagation_next(parent, parent); m;
-			m = propagation_next(m, parent)) {
-		struct mount *child = __lookup_mnt(&m->mnt,
-						mnt->mnt_mountpoint);
-		if (!child || (child->mnt.mnt_flags & MNT_UMOUNT))
-			continue;
-		if (!IS_MNT_LOCKED(child) || IS_MNT_MARKED(m)) {
-			SET_MNT_MARK(child);
-		}
-	}
+	CLEAR_MNT_MARK(mnt);
+	mnt->mnt.mnt_flags |= MNT_UMOUNT;
+	list_del_init(&mnt->mnt_child);
+	list_del_init(&mnt->mnt_umounting);
+	list_move_tail(&mnt->mnt_list, to_umount);
 }
 
 /*
  * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
  * parent propagates to.
  */
-static void __propagate_umount(struct mount *mnt, struct list_head *to_reparent)
+static bool __propagate_umount(struct mount *mnt,
+			       struct list_head *to_umount,
+			       struct list_head *to_restore)
 {
-	struct mount *parent = mnt->mnt_parent;
-	struct mount *m;
+	bool progress = false;
+	struct mount *child;
 
-	BUG_ON(parent == mnt);
+	/*
+	 * The state of the parent won't change if this mount is
+	 * already unmounted or marked as without children.
+	 */
+	if (mnt->mnt.mnt_flags & (MNT_UMOUNT | MNT_MARKED))
+		goto out;
 
-	for (m = propagation_next(parent, parent); m;
-			m = propagation_next(m, parent)) {
-		struct mount *topper;
-		struct mount *child = __lookup_mnt(&m->mnt,
-						mnt->mnt_mountpoint);
-		/*
-		 * umount the child only if the child has no children
-		 * and the child is marked safe to unmount.
-		 */
-		if (!child || !IS_MNT_MARKED(child))
+	/* Verify topper is the only grandchild that has not been
+	 * speculatively unmounted.
+	 */
+	list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+		if (child->mnt_mountpoint == mnt->mnt.mnt_root)
 			continue;
-		CLEAR_MNT_MARK(child);
+		if (!list_empty(&child->mnt_umounting) && IS_MNT_MARKED(child))
+			continue;
+		/* Found a mounted child */
+		goto children;
+	}
 
-		/* If there is exactly one mount covering all of child
-		 * replace child with that mount.
-		 */
-		topper = find_topper(child);
-		if (topper)
-			list_add_tail(&topper->mnt_reparent, to_reparent);
+	/* Mark mounts that can be unmounted if not locked */
+	SET_MNT_MARK(mnt);
+	progress = true;
 
-		if (topper || list_empty(&child->mnt_mounts)) {
-			list_del_init(&child->mnt_child);
-			list_del_init(&child->mnt_reparent);
-			child->mnt.mnt_flags |= MNT_UMOUNT;
-			list_move_tail(&child->mnt_list, &mnt->mnt_list);
+	/* If a mount is without children and not locked umount it. */
+	if (!IS_MNT_LOCKED(mnt)) {
+		umount_one(mnt, to_umount);
+	} else {
+children:
+		list_move_tail(&mnt->mnt_umounting, to_restore);
+	}
+out:
+	return progress;
+}
+
+static void umount_list(struct list_head *to_umount,
+			struct list_head *to_restore)
+{
+	struct mount *mnt, *child, *tmp;
+	list_for_each_entry(mnt, to_umount, mnt_list) {
+		list_for_each_entry_safe(child, tmp, &mnt->mnt_mounts, mnt_child) {
+			/* topper? */
+			if (child->mnt_mountpoint == mnt->mnt.mnt_root)
+				list_move_tail(&child->mnt_umounting, to_restore);
+			else
+				umount_one(child, to_umount);
 		}
 	}
 }
 
-static void reparent_mounts(struct list_head *to_reparent)
+static void restore_mounts(struct list_head *to_restore)
 {
-	while (!list_empty(to_reparent)) {
+	/* Restore mounts to a clean working state */
+	while (!list_empty(to_restore)) {
 		struct mount *mnt, *parent;
 		struct mountpoint *mp;
 
-		mnt = list_first_entry(to_reparent, struct mount, mnt_reparent);
-		list_del_init(&mnt->mnt_reparent);
+		mnt = list_first_entry(to_restore, struct mount, mnt_umounting);
+		CLEAR_MNT_MARK(mnt);
+		list_del_init(&mnt->mnt_umounting);
 
-		/* Where should this mount be reparented to? */
+		/* Should this mount be reparented? */
 		mp = mnt->mnt_mp;
 		parent = mnt->mnt_parent;
 		while (parent->mnt.mnt_flags & MNT_UMOUNT) {
 			mp = parent->mnt_mp;
 			parent = parent->mnt_parent;
 		}
-
-		mnt_change_mountpoint(parent, mp, mnt);
+		if (parent != mnt->mnt_parent)
+			mnt_change_mountpoint(parent, mp, mnt);
 	}
 }
 
@@ -506,15 +515,34 @@  static void reparent_mounts(struct list_head *to_reparent)
 int propagate_umount(struct list_head *list)
 {
 	struct mount *mnt;
-	LIST_HEAD(to_reparent);
+	LIST_HEAD(to_restore);
+	LIST_HEAD(to_umount);
 
-	list_for_each_entry_reverse(mnt, list, mnt_list)
-		mark_umount_candidates(mnt);
+	list_for_each_entry(mnt, list, mnt_list) {
+		struct mount *parent = mnt->mnt_parent;
+		struct mount *m;
 
-	list_for_each_entry(mnt, list, mnt_list)
-		__propagate_umount(mnt, &to_reparent);
+		for (m = propagation_next(parent, parent); m;
+		     m = propagation_next(m, parent)) {
+			struct mount *child = __lookup_mnt(&m->mnt,
+							   mnt->mnt_mountpoint);
+			if (!child)
+				continue;
+
+			/* Check the child and parents while progress is made */
+			while (__propagate_umount(child,
+						  &to_umount, &to_restore)) {
+				/* Is the parent a umount candidate? */
+				child = child->mnt_parent;
+				if (list_empty(&child->mnt_umounting))
+					break;
+			}
+		}
+	}
 
-	reparent_mounts(&to_reparent);
+	umount_list(&to_umount, &to_restore);
+	restore_mounts(&to_restore);
+	list_splice_tail(&to_umount, list);
 
 	return 0;
 }