diff mbox

Track and use TUCK mounts for precise umount semantics.

Message ID 1497254801-21104-1-git-send-email-linuxram@us.ibm.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ram Pai June 12, 2017, 8:06 a.m. UTC
Shared-subtree has had many cases where unmount operation would leave a lot of
residual mounts, because it was way too difficult to remove them. Eric
Biederman came up with the idea of tucked mount which relieved the situation
tremendously. But it still lacks the reversibility property --  A mount
followed by a umount should reverse the mount-tree to the exact same state as
before the mount.

The main goal of this patch is to bring in reversibility.

Below are listed the rules that can help reverse the operation.  These rules
are derived intuitive and are backed by lots of testing.  Would be nice to
derive formal proofs of correctness. Any takers?

Any way...So here are the rules.

Rule-1:  a mount operation that propagates to a peer or a slave, which already
has a mount at that dentry, should tuck a mount. These mounts are called tucked
mounts.

	eg:  mount --make-shared A
	     mount  --bind A  B
	     mount  -make-slave B
	     mount  C  B

	     leads to the following mount tree

  (state 1)

            A      -->       B
                             |
                             C

	if we try to mount on A, it propagates to B, but C is already
	mounted on B, so we tuck a mount, leading to the following
	tree.

	mount  D   A

  (state 2)
            A       -->       B
            |                 |
            D       -->       D1(ts, te)
                              |
                              C

	   Here D1 is a slave of D, and since D is tucked, we mark it with
	   (ts, te), meaning its tucked, and it is the start of the tuck
	   stack aswell as end of the tuck state. 'ts' means tuck-start, 'te'
	   means tuck-end.  Marking it that way is important, which will be
	   evident in the examples later.

Rule-2: a umount operation that propagates to a peer/slave, that holds a tucked
	mount, should untuck the mount.

         eg: when the mount tree is in state 2.
	 umount D should reverse the tree to state 1.

Rule-3: a umount opAration that exposes a tucked mount, is no more a tucked mount.
	clear that mount of its tucked state.

	eg: when the mount tree is in state 2,
	'umount C' should lead the tree to the following state

  (state 3)
            A       -->       B
            |                 |
            D       -->       D1

	    D1 is now exposed, and hence not a tucked-mount anymore.

Rule-4: a mount operation that propagates to a peer/slave that is also a tucked-mount, should
	mount a tucked-mount and extend the tuck boundary.

	eg: when the mount tree is in state 2,
	mount on D should propagate to D1 and tuck in a tucked-mount.

  (state 2)
            A       -->       B
            |                 |
            D       -->       D1(ts, te)
                              |
                              C

	   'mount  E   D' , leads to

  (state 4)
            A       -->       B
            |                 |
            D       -->       D1(ts)
            |                 |
            E       -->       E1(te)
                              |
                              C

	   Here E1 is a slave of E, and since E1 extends the tuck-stack we mark
	   it with (te), meaning its tucked, and is the end of its tuck stack;
	   in other words it is a tuck-end mount. Since D1 is now only the
	   start of the tuck stack, it is marked as D1(ts) meaning it is a
	   tuck-start mount.  Marking the start and the end is important to
	   determine when to entertain a umount event and when to ignore it, as
	   will be evident from the examples later.

Rule-5: a umount operation that propagates to a tucked peer/slave; but is not a
tuck-end mount, should unmount and untuck its child mount.

The reason why it should not be a tuck-end mount will be evident in the next case.

       eg: when the mount tree is in state 4,

  (state 4)
            A       -->       B
            |                 |
            D       -->       D1(ts)
            |                 |
            E       -->       E1(te)
                              |
                              C

       umount E , leads to state 2.

  (state 2)
            A       -->       B
            |                 |
            D       -->       D1(ts, te)
                              |
                              C

	The umount event propagated to D1. D1 is not a tuck-end mount, so it
	unmount its child E1.  This is logical because E1 was mounted on D1 in
	the past and hence should be unmountable.

Rule-6: a umount operation that propagates to a tucked peer/slave, and
	is *also a tuck-end mount*, should NOT unmount its child mount. (there is one
	exception which will be demonstrated in Rule-7 ).

	The reason being anything that is mounted on a tuck-end mount was not
	mounted explicitly. It got artifically mounted on top by a tucking
	action. In other words, the child existed before the parent.

	eg: assume  we have the following tree

  (state 5)
            A       -->       B
            |                 |
            D       -->       D1(ts)
            |                 |
            E       -->       E1(te)
                              |
                              C

	where E1 is also a *peer* of D1, in other words a tuck-start mount and
	its child tuck-end mount are peers of each other. It may not be
	possible to create the exact tree above. However I have contrived this
	example to highlight a state where it is possible to reach a scenario,
	where a umount operation propagates to a tuck-end mount.  In such cases
	the event has to be neglected.

	[ NOTE:A real-life example where such a scenario can happen is
	here. Its a complex state, I dont want to get into the complexity.
	So I contrived the above state (5) example. Just for completeness, here
	is the real case

	 mount -t tmpfs test-base /mnt1
	 cd /mnt1
	 mkdir -p 1 2 1/1
	 mount --bind 1 1
	 mount --make-shared 1
	 mount --bind 1 2
	 mount --bind 1/1 1/1
	 mount --bind 1/1 1/1

	]

	the operation 'umount E'

	propagates to D1 and E1 since D1 and E1 are peers. In such a case D1
	has to entertain the umount, but not E1 because C was created
	way-before E1 was created.

	The resulting tree should look like this

  (state 6)
            A       -->       B
            |                 |
            D       -->       D1(ts,te)
                              |
                              C

Rule-7) a umount operation that propagates to a tuck-end mount, has to entertain the
	umount only if the originating event was propagated by a tuck-end mount. This
	is the exception case for Rule-6.

	here is a example

	eg: assume the following example
	mount --make-shared A
	mount --bind A  B
	mount --make-slave  B
	mount --make-shared  B
	mount --bind B  Z
	mount --make-slave  Z
	mount C Z

  (state 7)
            A       -->       B               -->       Z
                                                        |
                                                        C

	mount D B
  (state 8)
           A       -->       B               -->        Z
                             |                          |
                             D              -->         D1(ts,te)
                                                        |
                                                        C
	mount E A
  (state 9)
            A       -->       B               -->       Z
                              |                         |
            E       -->       E1(ts,te)       -->       E2(ts,te)
                              |                         |
                              D               -->       D1(ts,te)
                                                        |
                                                        C

	At this point if we umount D,  it will propagate to E2 since E1 is a
	master to E2.  However E2 is a tuck-end mount. By rule Rule-6 we should
	ignore the event. However umount D1 is a legitimate operation, because
	D1 got mounted at the same time D got mounted. They were born
	to-gather, they have to die togather. The problem is, we should'nt be walking
	the E1 propagation tree at all. We should be walking B propagation
	tree since that was the tree that was walked when D and D1 got created.

	As an optimization we walk tucked-propagation tree of E1, and unmount
	D1 since we know the propagation got initiated at a tuck-end mount; E1
	in this case.

	In other words, if we start to walk the propagation tree at a tuck-end
	mount, we can entertain all unmounts on all the nodes of the
	propagation tree(Rule-6). On the contrary if we did not start at a tuck-end
	mount, then we should not entertain any umounts at any node of the
	propagation tree that has a tuck-end mount i.e rule-5 applies.

Rule-8: tucks have to maintain tuck-boundaries. Othewise it is difficult to
	determine if rule-5 or Rule-6 or Rule-7 is applicable.

	The most important part is we need to know where the tuck-ends are,
	umount on a tuck-end-mount is determined by Rule-6 and Rule-7. umount on a
	tuck-mount but not tuck-end mount, is determined by rule-5.  A
	tuck-start-mount is there as a place holder incase we stumble upon a
	case which needs us to know where the tuck-stack started.

	Currently tuck-start-mounts are used to determine the point uptill
	which the tuck-mounts are to be normalized, when their covering
	mount goes away.

	eg: lets take state 9.

  (state 9)
            A       -->       B               -->       Z
                              |                         |
            E       -->       E1(ts,te)       -->       E2(ts,te)
                              |                         |
                              D               -->       D1(ts,te)
                                                        |
                                                        C

        now 'mount F D'
            'mount G D'
        leads to the following state.

  (state 10)
            A       -->       B               -->       Z
                              |                         |
            E       -->       E1(ts,te)       -->       E2(ts,te)
                              |                         |
                              D               -->       D1(ts)
                              |                         |
                              F               -->       F1
                              |                         |
                              G               -->       G1(te)
                                                        |
                                                        C

	At this point if C, is unmounted, all the mounts below it up until the
	nearest tuck-start mount are not tucked-mounts anymore. The reason they
	were tucked was because of the presence of C.  So the tuck-start mount
	helps us to determine that boundary uptill where we need need to
	normalize the mounts. Without it we would'nt know.


TESTING:

With the above rules in place, I have generated the following test cases and found
them all working. Many scenarios were tested manually, and are not codified as test
cases. Will do. But here are some codified testcases.

----------------------------------------------------------
Testcase 1

setup()
{
	mkdir -p /tmp2 /tmp1 /tmp3
	mount -t tmpfs test-base /tmp1
	mount --make-shared /tmp1
	mount --bind /tmp1 /tmp2
	mount --make-slave /tmp2
	mount --make-shared /tmp2
	mount --bind /tmp2 /tmp3
	mount --make-slave /tmp3
	mkdir -p /tmp1/a
	mount --bind /mnt /tmp3/a
	mount --bind /home /tmp2/a
	mount --bind /bin /tmp1/a
}

uumount()
{
	umount $1
	umount $2
	umount $3
	umount /tmp2
	umount /tmp3
}

doo()
{
setup;
uumount $1 $2 $3
unshare -Urm --propagation unchanged /bin/sh -c 'sleep 5; cat /proc/self/mountinfo; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
sleep 1
umount /tmp1
wait
}

doo /tmp1/a /tmp2/a /tmp3/a
doo /tmp2/a /tmp3/a /tmp1/a
doo /tmp3/a /tmp1/a /tmp2/a
--------------------------------------------------------------

Testcase 2:

#!/bin/bash
# written by Eric to check lock-mount semantics.

mount -t tmpfs test-base /mnt
mount --make-shared /mnt
mkdir -p /mnt/b
mount -t tmpfs test1 /mnt/b
mount --make-shared /mnt/b
mkdir -p /mnt/b/10

mount -t tmpfs test2 /mnt/b/10
mount --make-shared /mnt/b/10
mkdir -p /mnt/b/10/20

mount --rbind /mnt/b /mnt/b/10/20
umount -l /mnt/b

unshare -Urm --propagation unchanged /bin/sh -c 'sleep 5; cat /proc/self/mountinfo; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
sleep 1
umount -l /mnt/b

--------------------------------------------------------------

Testcase 3:

#!/bin/bash
#test reversibility
mkdir -p /tmp2 /tmp1 /tmp3
mount -t tmpfs test-base /tmp1
mount --make-shared /tmp1
mount --bind /tmp1 /tmp2
mount --make-slave /tmp2
mount --make-shared /tmp2
mount --bind /tmp2 /tmp3
mount --make-slave /tmp3
mkdir -p /tmp1/a
mount -t tmpfs test-base9 /tmp3/a
mount -t tmpfs test-base7 /tmp2/a
mount -t tmpfs test-base6 /tmp1/a

mount -t tmpfs test-base1 /tmp2/a
mount -t tmpfs test-base10 /tmp2/a
mount -t tmpfs test-base100 /tmp2/a
mount -t tmpfs test-base20 /tmp1/a
mount -t tmpfs test-base200 /tmp1/a
mount -t tmpfs test-base2000 /tmp1/a

umount /tmp1/a
umount /tmp1/a
umount /tmp1/a
umount /tmp2/a
umount /tmp2/a
umount /tmp2/a
umount /tmp1/a
umount /tmp2/a
umount /tmp3/a
umount /tmp3 /tmp2

mount

unshare -Urm --propagation unchanged /bin/sh -c 'sleep 5; cat /proc/self/mountinfo; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
sleep 1
umount /tmp1
wait
--------------------------------------------------------------

And finally the patch :)

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 fs/mount.h            |   5 ++
 fs/namespace.c        |  19 ++++-
 fs/pnode.c            | 200 ++++++++++++++++++++++++++++++++++++++------------
 fs/pnode.h            |  11 +++
 fs/proc_namespace.c   |   4 +
 include/linux/mount.h |   3 +
 6 files changed, 193 insertions(+), 49 deletions(-)

Comments

Eric W. Biederman June 13, 2017, 11:55 a.m. UTC | #1
Ram Pai <linuxram@us.ibm.com> writes:

> Shared-subtree has had many cases where unmount operation would leave a lot of
> residual mounts, because it was way too difficult to remove them. Eric
> Biederman came up with the idea of tucked mount which relieved the situation
> tremendously. But it still lacks the reversibility property --  A mount
> followed by a umount should reverse the mount-tree to the exact same state as
> before the mount.

There is a lot here so I am taking my time an digging through it all.

I have found two issues that I am going to point out before I comment
on the big picture.

> diff --git a/fs/pnode.c b/fs/pnode.c
> index 5bc7896..7dfbe9c 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -322,19 +322,9 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
>  	return ret;
>  }
>  
> -static struct mount *find_topper(struct mount *mnt)
> +static struct mount *find_covering_child(struct mount *m)
>  {
> -	/* If there is exactly one mount covering mnt completely return it. */
> -	struct mount *child;
> -
> -	if (!list_is_singular(&mnt->mnt_mounts))
> -		return NULL;
> -
> -	child = list_first_entry(&mnt->mnt_mounts, struct mount, mnt_child);
> -	if (child->mnt_mountpoint != mnt->mnt.mnt_root)
> -		return NULL;
> -
> -	return child;
> +	return __lookup_mnt(&m->mnt, m->mnt.mnt_root);
>  }

Unless there is some reason you can't use mnt_mounts using
__lookup_mnt is a pessimization here as __lookup_mnt can go slowly
when the has chains get long.

>  /*
> @@ -357,8 +347,10 @@ static inline int do_refcount_check(struct mount *mnt, int count)
>   */
>  int propagate_mount_busy(struct mount *mnt, int refcnt)
>  {
> -	struct mount *m, *child, *topper;
> +	struct mount *m, *child;
>  	struct mount *parent = mnt->mnt_parent;
> +	bool tucks_can_ignore = !(IS_MNT_TUCK_END(parent) &&
> +				mnt->mnt_mountpoint == parent->mnt.mnt_root);
>  
>  	if (mnt == parent)
>  		return do_refcount_check(mnt, refcnt);
> @@ -372,19 +364,19 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
>  		return 1;
>  
>  	for (m = propagation_next(parent, parent); m;
> -	     		m = propagation_next(m, parent)) {
> +			m = propagation_next(m, parent)) {
>  		int count = 1;
> -		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
> -		if (!child)
> -			continue;
>  
> -		/* Is there exactly one mount on the child that covers
> -		 * it completely whose reference should be ignored?
> +		/*
> +		 * read the comment in __propagate_umount()
>  		 */
> -		topper = find_topper(child);
> -		if (topper)
> -			count += 1;
> -		else if (!list_empty(&child->mnt_mounts))
> +		if (tucks_can_ignore &&
> +			IS_MNT_TUCK_END(m) &&
> +			(mnt->mnt_mountpoint == m->mnt.mnt_root))
> +			continue;
> +
> +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
> +		if (!child || !list_empty(&child->mnt_mounts))
>  			continue;
>  
>  		if (do_refcount_check(child, count))

Unless I am misreading this code you are skipping a refcount check that
we should be performing.

Eric
Ram Pai June 15, 2017, 9:48 p.m. UTC | #2
On Tue, Jun 13, 2017 at 06:55:30AM -0500, Eric W. Biederman wrote:
> Ram Pai <linuxram@us.ibm.com> writes:
> 
> > Shared-subtree has had many cases where unmount operation would leave a lot of
> > residual mounts, because it was way too difficult to remove them. Eric
> > Biederman came up with the idea of tucked mount which relieved the situation
> > tremendously. But it still lacks the reversibility property --  A mount
> > followed by a umount should reverse the mount-tree to the exact same state as
> > before the mount.
> 
> There is a lot here so I am taking my time an digging through it all.

Yes. thanks in advance for taking the time. It will take some
concentrated effort to go through it.

> 
> I have found two issues that I am going to point out before I comment
> on the big picture.
> 
> > diff --git a/fs/pnode.c b/fs/pnode.c
> > index 5bc7896..7dfbe9c 100644
> > --- a/fs/pnode.c
> > +++ b/fs/pnode.c
> > @@ -322,19 +322,9 @@ int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
> >  	return ret;
> >  }
> >  
> > -static struct mount *find_topper(struct mount *mnt)
> > +static struct mount *find_covering_child(struct mount *m)
> >  {
> > -	/* If there is exactly one mount covering mnt completely return it. */
> > -	struct mount *child;
> > -
> > -	if (!list_is_singular(&mnt->mnt_mounts))
> > -		return NULL;
> > -
> > -	child = list_first_entry(&mnt->mnt_mounts, struct mount, mnt_child);
> > -	if (child->mnt_mountpoint != mnt->mnt.mnt_root)
> > -		return NULL;
> > -
> > -	return child;
> > +	return __lookup_mnt(&m->mnt, m->mnt.mnt_root);
> >  }
> 
> Unless there is some reason you can't use mnt_mounts using
> __lookup_mnt is a pessimization here as __lookup_mnt can go slowly
> when the has chains get long.

I can depend on ->mnt_mounts. I will have to walk the children
list, which can be slow.

But I realize that the covering child mount will generally be
towards the tail of the list, since once it covers, the only way to mount
is through propagations. So I can use the reverse-list-walking heuristic
and cut down on the time.

> 
> >  /*
> > @@ -357,8 +347,10 @@ static inline int do_refcount_check(struct mount *mnt, int count)
> >   */
> >  int propagate_mount_busy(struct mount *mnt, int refcnt)
> >  {
> > -	struct mount *m, *child, *topper;
> > +	struct mount *m, *child;
> >  	struct mount *parent = mnt->mnt_parent;
> > +	bool tucks_can_ignore = !(IS_MNT_TUCK_END(parent) &&
> > +				mnt->mnt_mountpoint == parent->mnt.mnt_root);
> >  
> >  	if (mnt == parent)
> >  		return do_refcount_check(mnt, refcnt);
> > @@ -372,19 +364,19 @@ int propagate_mount_busy(struct mount *mnt, int refcnt)
> >  		return 1;
> >  
> >  	for (m = propagation_next(parent, parent); m;
> > -	     		m = propagation_next(m, parent)) {
> > +			m = propagation_next(m, parent)) {
> >  		int count = 1;
> > -		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
> > -		if (!child)
> > -			continue;
> >  
> > -		/* Is there exactly one mount on the child that covers
> > -		 * it completely whose reference should be ignored?
> > +		/*
> > +		 * read the comment in __propagate_umount()
> >  		 */
> > -		topper = find_topper(child);
> > -		if (topper)
> > -			count += 1;
> > -		else if (!list_empty(&child->mnt_mounts))
> > +		if (tucks_can_ignore &&
> > +			IS_MNT_TUCK_END(m) &&
> > +			(mnt->mnt_mountpoint == m->mnt.mnt_root))
> > +			continue;
> > +
> > +		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
> > +		if (!child || !list_empty(&child->mnt_mounts))
> >  			continue;
> >  
> >  		if (do_refcount_check(child, count))
> 
> Unless I am misreading this code you are skipping a refcount check that
> we should be performing.

I looked through it again, and I am sure we are not skipping a refcount
check. We check the refcount of only those children which can be a
target of the umount and have no children of their own.

RP
diff mbox

Patch

diff --git a/fs/mount.h b/fs/mount.h
index bf1fda6..714a540 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -82,6 +82,11 @@  static inline int mnt_has_parent(struct mount *mnt)
 	return mnt != mnt->mnt_parent;
 }
 
+static inline int mnt_is_covering(struct mount *mnt)
+{
+	return (mnt->mnt_mountpoint == mnt->mnt_parent->mnt.mnt_root);
+}
+
 static inline int is_mounted(struct vfsmount *mnt)
 {
 	/* neither detached nor internal? */
diff --git a/fs/namespace.c b/fs/namespace.c
index 8bd3e4d..072fb62 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1512,10 +1512,19 @@  static void umount_tree(struct mount *mnt, enum umount_tree_flags how)
 		pin_insert_group(&p->mnt_umount, &p->mnt_parent->mnt,
 				 disconnect ? &unmounted : NULL);
 		if (mnt_has_parent(p)) {
+			/*
+			 * handle tucked mounts that were not
+			 * dismantled through the progagation
+			 * path.
+			 */
+			if (!(how & UMOUNT_PROPAGATE))
+				prep_for_untuck(p);
+
 			mnt_add_count(p->mnt_parent, -1);
 			if (!disconnect) {
 				/* Don't forget about p */
-				list_add_tail(&p->mnt_child, &p->mnt_parent->mnt_mounts);
+				list_add_tail(&p->mnt_child,
+					&p->mnt_parent->mnt_mounts);
 			} else {
 				umount_mnt(p);
 			}
@@ -1656,7 +1665,7 @@  void __detach_mounts(struct dentry *dentry)
 	namespace_unlock();
 }
 
-/* 
+/*
  * Is the caller allowed to modify his namespace?
  */
 static inline bool may_mount(void)
@@ -2047,8 +2056,10 @@  static int attach_recursive_mnt(struct mount *source_mnt,
 		hlist_del_init(&child->mnt_hash);
 		q = __lookup_mnt(&child->mnt_parent->mnt,
 				 child->mnt_mountpoint);
-		if (q)
+		if (q) {
+			prep_for_tuck(child);
 			mnt_change_mountpoint(child, smp, q);
+		}
 		commit_tree(child);
 	}
 	put_mountpoint(smp);
@@ -2210,7 +2221,7 @@  static int do_loopback(struct path *path, const char *old_name,
 
 	err = -EINVAL;
 	if (mnt_ns_loop(old_path.dentry))
-		goto out; 
+		goto out;
 
 	mp = lock_mount(path);
 	err = PTR_ERR(mp);
diff --git a/fs/pnode.c b/fs/pnode.c
index 5bc7896..7dfbe9c 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -322,19 +322,9 @@  int propagate_mnt(struct mount *dest_mnt, struct mountpoint *dest_mp,
 	return ret;
 }
 
-static struct mount *find_topper(struct mount *mnt)
+static struct mount *find_covering_child(struct mount *m)
 {
-	/* If there is exactly one mount covering mnt completely return it. */
-	struct mount *child;
-
-	if (!list_is_singular(&mnt->mnt_mounts))
-		return NULL;
-
-	child = list_first_entry(&mnt->mnt_mounts, struct mount, mnt_child);
-	if (child->mnt_mountpoint != mnt->mnt.mnt_root)
-		return NULL;
-
-	return child;
+	return __lookup_mnt(&m->mnt, m->mnt.mnt_root);
 }
 
 /*
@@ -357,8 +347,10 @@  static inline int do_refcount_check(struct mount *mnt, int count)
  */
 int propagate_mount_busy(struct mount *mnt, int refcnt)
 {
-	struct mount *m, *child, *topper;
+	struct mount *m, *child;
 	struct mount *parent = mnt->mnt_parent;
+	bool tucks_can_ignore = !(IS_MNT_TUCK_END(parent) &&
+				mnt->mnt_mountpoint == parent->mnt.mnt_root);
 
 	if (mnt == parent)
 		return do_refcount_check(mnt, refcnt);
@@ -372,19 +364,19 @@  int propagate_mount_busy(struct mount *mnt, int refcnt)
 		return 1;
 
 	for (m = propagation_next(parent, parent); m;
-	     		m = propagation_next(m, parent)) {
+			m = propagation_next(m, parent)) {
 		int count = 1;
-		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
-		if (!child)
-			continue;
 
-		/* Is there exactly one mount on the child that covers
-		 * it completely whose reference should be ignored?
+		/*
+		 * read the comment in __propagate_umount()
 		 */
-		topper = find_topper(child);
-		if (topper)
-			count += 1;
-		else if (!list_empty(&child->mnt_mounts))
+		if (tucks_can_ignore &&
+			IS_MNT_TUCK_END(m) &&
+			(mnt->mnt_mountpoint == m->mnt.mnt_root))
+			continue;
+
+		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
+		if (!child || !list_empty(&child->mnt_mounts))
 			continue;
 
 		if (do_refcount_check(child, count))
@@ -403,7 +395,7 @@  void propagate_mount_unlock(struct mount *mnt)
 	struct mount *parent = mnt->mnt_parent;
 	struct mount *m, *child;
 
-	BUG_ON(parent == mnt);
+	WARN_ON(parent == mnt);
 
 	for (m = propagation_next(parent, parent); m;
 			m = propagation_next(m, parent)) {
@@ -421,7 +413,7 @@  static void mark_umount_candidates(struct mount *mnt)
 	struct mount *parent = mnt->mnt_parent;
 	struct mount *m;
 
-	BUG_ON(parent == mnt);
+	WARN_ON(parent == mnt);
 
 	for (m = propagation_next(parent, parent); m;
 			m = propagation_next(m, parent)) {
@@ -436,6 +428,98 @@  static void mark_umount_candidates(struct mount *mnt)
 }
 
 /*
+ * Ready the mnt for a tuck between  parent and child.
+ * The mnt must not be in the hash of its  parent yet.
+ * Neither should the child be in the hash of the mnt.
+ */
+void prep_for_tuck(struct mount *mnt)
+{
+	struct mount *parent = mnt->mnt_parent;
+
+	/*
+	 * Tucks can  only  occur  due  to   propagation.
+	 * There  is  no  way  to  refer a   tree if its
+	 * root dentry  is  hidden.  Clone  of this tree
+	 * will  also  have their  root dentries exposed.
+	 * Hence there cannot be a covering child on mnt.
+	 */
+	WARN_ON(find_covering_child(mnt));
+
+	if (IS_MNT_TUCK_END(parent) && mnt_is_covering(mnt)) {
+		CLEAR_MNT_TUCK_END(parent);
+		SET_MNT_TUCK_END(mnt);
+		return;
+	}
+
+	/* this is the start of a new TUCK pair */
+	SET_MNT_TUCK_END(mnt);
+	SET_MNT_TUCK_START(mnt);
+}
+
+static inline void reset_tuck_pair(struct mount *mnt)
+{
+	int count = 1;
+
+	WARN_ON(!IS_MNT_TUCK_END(mnt));
+	CLEAR_MNT_TUCK_END(mnt);
+	if (IS_MNT_TUCK_START(mnt))
+		count--;
+	while (!(count == 0 && IS_MNT_TUCK_START(mnt))) {
+
+		/* we should not reach the root mount.
+		 * root mount was never a tucked one.
+		 */
+		WARN_ON(mnt == mnt->mnt_parent);
+
+		/*
+		 * if we are walking up the TUCK_END,
+		 * TUCK_START series, the mnt better
+		 * be covering.
+		 */
+		WARN_ON(!mnt_is_covering(mnt));
+
+		mnt = mnt->mnt_parent;
+		if (IS_MNT_TUCK_END(mnt))
+			count++;
+		if (IS_MNT_TUCK_START(mnt))
+			count--;
+	}
+	CLEAR_MNT_TUCK_START(mnt);
+}
+
+void prep_for_untuck(struct mount *mnt)
+{
+	struct mount *covering_child = find_covering_child(mnt);
+
+	if (IS_MNT_TUCK_START(mnt) &&
+		       IS_MNT_TUCK_END(mnt)) {
+		CLEAR_MNT_TUCK_START(mnt);
+		CLEAR_MNT_TUCK_END(mnt);
+
+		if (mnt_is_covering(mnt) && IS_MNT_TUCK_END(mnt->mnt_parent))
+			reset_tuck_pair(mnt->mnt_parent);
+
+	} else if (IS_MNT_TUCK_END(mnt)) {
+		SET_MNT_TUCK_END(mnt->mnt_parent);
+		CLEAR_MNT_TUCK_END(mnt);
+	} else if (IS_MNT_TUCK_START(mnt)) {
+		/*
+		 * There has to be a covering child.
+		 * otherwise TUCK_START would'nt be set
+		 * without a TUCK_END.
+		 */
+		WARN_ON(!covering_child);
+		SET_MNT_TUCK_START(covering_child);
+		CLEAR_MNT_TUCK_START(mnt);
+	} else if (mnt_is_covering(mnt) && IS_MNT_TUCK_END(mnt->mnt_parent))
+		reset_tuck_pair(mnt->mnt_parent);
+
+	if (covering_child)
+		mnt_change_mountpoint(mnt->mnt_parent,
+			mnt->mnt_mp, covering_child);
+}
+
+/*
  * NOTE: unmounting 'mnt' naturally propagates to all other mounts its
  * parent propagates to.
  */
@@ -443,36 +527,62 @@  static void __propagate_umount(struct mount *mnt)
 {
 	struct mount *parent = mnt->mnt_parent;
 	struct mount *m;
+	bool tucks_can_ignore = !(IS_MNT_TUCK_END(parent) &&
+				mnt->mnt_mountpoint == parent->mnt.mnt_root);
 
-	BUG_ON(parent == mnt);
+	WARN_ON(parent == mnt);
 
 	for (m = propagation_next(parent, parent); m;
 			m = propagation_next(m, parent)) {
-		struct mount *topper;
-		struct mount *child = __lookup_mnt(&m->mnt,
-						mnt->mnt_mountpoint);
-		/*
-		 * umount the child only if the child has no children
-		 * and the child is marked safe to unmount.
+		struct mount *child;
+
+		/* Ideally we should traverse the propagation tree of a nearest
+		 * ancestor which is not a tuck-mount. Reason: 'mnt' and its
+		 * peer-cousins existed even before its tucked parents existed.
+		 * 'mnt' and its peer-cousins were mounted traversing the
+		 * then-parent propagation tree, which has now become an
+		 * anscestor propagation tree, thanks to the tucks.  However
+		 * since the parent tree and the actual ancestor tree have
+		 * exactly the same tree structure, we shall optimize our walk,
+		 * and walk the parents' propagation tree.
+		 *
+		 * There is one catch. A parent on the propagation tree; if
+		 * tucked, should ignore the umount event to its covering
+		 * child.  Reason -- the child was born before the tucked
+		 * parent.  However if the parent of 'mnt' is itself a child of
+		 * a tucked parent, than there is an anscestor that existed
+		 * when the 'mnt' and its peer-cousins were created.  Hence
+		 * unmount event must not be ignored by all tucked parents
+		 * residing on the parent propagation tree.
 		 */
+		if (tucks_can_ignore &&
+				IS_MNT_TUCK_END(m) &&
+				(mnt->mnt_mountpoint == m->mnt.mnt_root))
+			continue;
+
+		child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint);
+
 		if (!child || !IS_MNT_MARKED(child))
 			continue;
+
 		CLEAR_MNT_MARK(child);
 
-		/* If there is exactly one mount covering all of child
-		 * replace child with that mount.
-		 */
-		topper = find_topper(child);
-		if (topper)
-			mnt_change_mountpoint(child->mnt_parent, child->mnt_mp,
-					      topper);
-
-		if (list_empty(&child->mnt_mounts)) {
-			list_del_init(&child->mnt_child);
-			child->mnt.mnt_flags |= MNT_UMOUNT;
-			list_move_tail(&child->mnt_list, &mnt->mnt_list);
-		}
+		prep_for_untuck(child);
+
+		/* Yes we expect no children. None. */
+		WARN_ON(!list_empty(&child->mnt_mounts));
+
+		list_del_init(&child->mnt_child);
+		child->mnt.mnt_flags |= MNT_UMOUNT;
+		list_move_tail(&child->mnt_list, &mnt->mnt_list);
 	}
+
+	/*
+	 * This explicit umount operation is exposing the parent.
+	 * In case the parent was a 'tucked' mount, it cannot be so
+	 * anymore.
+	 */
+	prep_for_untuck(mnt);
 }
 
 /*
diff --git a/fs/pnode.h b/fs/pnode.h
index dc87e65..56a55cb 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -18,8 +18,17 @@ 
 #define IS_MNT_UNBINDABLE(m) ((m)->mnt.mnt_flags & MNT_UNBINDABLE)
 #define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED)
 #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED)
+#define SET_MNT_TUCKED(m) ((m)->mnt.mnt_flags |= MNT_TUCKED)
+#define SET_MNT_TUCK_END(m) ((m)->mnt.mnt_flags |= MNT_TUCK_END)
+#define SET_MNT_TUCK_START(m) ((m)->mnt.mnt_flags |= MNT_TUCK_START)
 #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED)
+#define CLEAR_MNT_TUCKED(m) ((m)->mnt.mnt_flags &= ~MNT_TUCKED)
+#define CLEAR_MNT_TUCK_END(m) ((m)->mnt.mnt_flags &= ~MNT_TUCK_END)
+#define CLEAR_MNT_TUCK_START(m) ((m)->mnt.mnt_flags &= ~MNT_TUCK_START)
 #define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED)
+#define IS_MNT_TUCKED(m) ((m)->mnt.mnt_flags & MNT_TUCKED)
+#define IS_MNT_TUCK_END(m) ((m)->mnt.mnt_flags & MNT_TUCK_END)
+#define IS_MNT_TUCK_START(m) ((m)->mnt.mnt_flags & MNT_TUCK_START)
 
 #define CL_EXPIRE    		0x01
 #define CL_SLAVE     		0x02
@@ -55,4 +64,6 @@  void mnt_change_mountpoint(struct mount *parent, struct mountpoint *mp,
 bool is_path_reachable(struct mount *, struct dentry *,
 			 const struct path *root);
 int count_mounts(struct mnt_namespace *ns, struct mount *mnt);
+void prep_for_tuck(struct mount *mnt);
+void prep_for_untuck(struct mount *mnt);
 #endif /* _LINUX_PNODE_H */
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index b5713fe..f2e5bac 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -166,6 +166,10 @@  static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt)
 	}
 	if (IS_MNT_UNBINDABLE(r))
 		seq_puts(m, " unbindable");
+	if (IS_MNT_TUCK_START(r))
+		seq_puts(m, " tuckstart");
+	if (IS_MNT_TUCK_END(r))
+		seq_puts(m, " tuckend");
 
 	/* Filesystem specific data */
 	seq_puts(m, " - ");
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 8e0352a..2bc4750 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -34,6 +34,9 @@ 
 
 #define MNT_SHARED	0x1000	/* if the vfsmount is a shared mount */
 #define MNT_UNBINDABLE	0x2000	/* if the vfsmount is a unbindable mount */
+#define MNT_TUCKED	0x8000	/* am i tucked ? */
+#define MNT_TUCK_START	0x10000 /* state of the parent when I got tucked */
+#define MNT_TUCK_END	0x20000 /* state of the child when I got tucked */
 /*
  * MNT_SHARED_MASK is the set of flags that should be cleared when a
  * mount becomes shared.  Currently, this is only the flag that says a