Message ID | 20170524204253.GB5554@ram.oc3035372033.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Ram Pai <linuxram@us.ibm.com> writes: > On Wed, May 17, 2017 at 12:54:34AM -0500, Eric W. Biederman wrote: >> >> While investigating some poor umount performance I realized that in >> the case of overlapping mount trees where some of the mounts are locked >> the code has been failing to unmount all of the mounts it should >> have been unmounting. >> >> This failure to unmount all of the necessary >> mounts can be reproduced with: >> >> $ cat locked_mounts_test.sh >> >> mount -t tmpfs test-base /mnt >> mount --make-shared /mnt >> mkdir -p /mnt/b >> >> mount -t tmpfs test1 /mnt/b >> mount --make-shared /mnt/b >> mkdir -p /mnt/b/10 >> >> mount -t tmpfs test2 /mnt/b/10 >> mount --make-shared /mnt/b/10 >> mkdir -p /mnt/b/10/20 >> >> mount --rbind /mnt/b /mnt/b/10/20 >> >> unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi' >> sleep 1 >> umount -l /mnt/b >> wait %% >> >> $ unshare -Urm ./locked_mounts_test.sh >> >> This failure is corrected by removing the prepass that marks mounts >> that may be umounted. >> >> A first pass is added that umounts mounts if possible and if not sets >> mount mark if they could be unmounted if they weren't locked and adds >> them to a list to umount possibilities. This first pass reconsiders >> the mounts parent if it is on the list of umount possibilities, ensuring >> that information of umoutability will pass from child to mount parent. >> >> A second pass then walks through all mounts that are umounted and processes >> their children unmounting them or marking them for reparenting. >> >> A last pass cleans up the state on the mounts that could not be umounted >> and if applicable reparents them to their first parent that remained >> mounted. >> >> While a bit longer than the old code this code is much more robust >> as it allows information to flow up from the leaves and down >> from the trunk making the order in which mounts are encountered >> in the umount propgation tree irrelevant. > > Eric, > I think we can accomplish what you want in a much simpler way. > Would the patch below; UNTESTED BUT COMPILED, resolve your > issue? The reason I came up with an algorithm where the information flows both directions is that especially in the case of umount -l but even in some rare cases of a simple umount, the ordering of the mount propagation tree can result in parent mounts being visited before the child mounts. This case shows in in the case of a mount or a set of mounts being mounted below itself. So no. Irregardless of tucked mount state we can't do this. I see this also doesn't have the change to move mnt_change_mountpoint into another pass. That one is quite important from a practical point of view as that means the way the mount tree changes in umount is the same irrespective of the number of times a mount shows up in the mount propagation trees. Which is a very important property to have for optimizing umount -l. Which in the worst case allows reduces umount from O(N^2+) to roughly O(N). All of what I am doing should have not effect on an implementation of MNT_TUCKED. That said your code to deal with MNT_TUCKED seems reasonable. Eric > > Its a two pass unmount. First pass marks mounts that can > be unmounted, and second pass does the neccessary unlinks. > It does mark TUCKED mounts, and uses that information > to peel off the correct mounts. Key points are > > a) a tucked mount never entertain any unmount propagation > on its root dentry. > > b) when the child on the root dentry of a tucked mount is > unmounted, the tucked mount is not a tucked mount anymore. > > c) if the child is a tucked mount, than its child is reparented > to the parent. > > > Signed-off-by: "Ram Pai" <linuxram@us.ibm.com> > > fs/namespace.c | 4 ++- > fs/pnode.c | 53 +++++++++++++++++++++++++++++++++++++------------- > fs/pnode.h | 3 ++ > include/linux/mount.h | 1 > > diff --git a/fs/namespace.c b/fs/namespace.c > index cc1375ef..ff3ec90 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -2050,8 +2050,10 @@ static int attach_recursive_mnt(struct mount *source_mnt, > hlist_del_init(&child->mnt_hash); > q = __lookup_mnt(&child->mnt_parent->mnt, > child->mnt_mountpoint); > - if (q) > + if (q) { > mnt_change_mountpoint(child, smp, q); > + SET_MNT_TUCKED(child); > + } > commit_tree(child); > } > put_mountpoint(smp); > diff --git a/fs/pnode.c b/fs/pnode.c > index 5bc7896..b44a544 100644 > --- a/fs/pnode.c > +++ b/fs/pnode.c > @@ -448,31 +448,58 @@ static void __propagate_umount(struct mount *mnt) > > for (m = propagation_next(parent, parent); m; > m = propagation_next(m, parent)) { > - struct mount *topper; > - struct mount *child = __lookup_mnt(&m->mnt, > - mnt->mnt_mountpoint); > - /* > - * umount the child only if the child has no children > - * and the child is marked safe to unmount. > + struct mount *topper, *child; > + > + /* Tucked mount must drop umount propagation events on > + * its **root dentry**. > + * The tucked mount did not exist when that child came > + * into existence. It never received that mount propagation. > + * Hence it should never entertain the umount propagation > + * aswell. > */ > + if (IS_MNT_TUCKED(m) && list_is_singular(&mnt->mnt_mounts)) > + continue; > + > + > + child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint); > + > if (!child || !IS_MNT_MARKED(child)) > continue; > + > CLEAR_MNT_MARK(child); > > - /* If there is exactly one mount covering all of child > - * replace child with that mount. > - */ > - topper = find_topper(child); > - if (topper) > - mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, > - topper); > + if (IS_MNT_TUCKED(child) && > + (list_is_singular(&child->mnt_mounts))) { > + topper = find_topper(child); > + if (topper) { > + mnt_change_mountpoint(child->mnt_parent, > + child->mnt_mp, topper); > + CLEAR_MNT_TUCKED(child); /*lets be precise*/ > + } > + } > > if (list_empty(&child->mnt_mounts)) { > list_del_init(&child->mnt_child); > child->mnt.mnt_flags |= MNT_UMOUNT; > list_move_tail(&child->mnt_list, &mnt->mnt_list); > } > +#if 0 > + else { > + mntput(child); /* mark it for deletion. It will > + be deleted whenever it looses > + all its remaining references. > + TODO: some more thought > + needed, please validate */ > + } > +#endif > } > + > + /* > + * This explicit umount operation is exposing the parent. > + * In case the parent was a 'tucked' mount, it cannot be so > + * anymore. > + */ > + CLEAR_MNT_TUCKED(parent); > } > > /* > diff --git a/fs/pnode.h b/fs/pnode.h > index dc87e65..9ebd1a8 100644 > --- a/fs/pnode.h > +++ b/fs/pnode.h > @@ -18,8 +18,11 @@ > #define IS_MNT_UNBINDABLE(m) ((m)->mnt.mnt_flags & MNT_UNBINDABLE) > #define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED) > #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED) > +#define SET_MNT_TUCKED(m) ((m)->mnt.mnt_flags |= MNT_TUCKED) > #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED) > +#define CLEAR_MNT_TUCKED(m) ((m)->mnt.mnt_flags &= ~MNT_TUCKED) > #define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED) > +#define IS_MNT_TUCKED(m) ((m)->mnt.mnt_flags & MNT_TUCKED) > > #define CL_EXPIRE 0x01 > #define CL_SLAVE 0x02 > diff --git a/include/linux/mount.h b/include/linux/mount.h > index 8e0352a..41674e7 100644 > --- a/include/linux/mount.h > +++ b/include/linux/mount.h > @@ -62,6 +62,7 @@ > #define MNT_SYNC_UMOUNT 0x2000000 > #define MNT_MARKED 0x4000000 > #define MNT_UMOUNT 0x8000000 > +#define MNT_TUCKED 0x10000000 > > struct vfsmount { > struct dentry *mnt_root; /* root of the mounted tree */
On Wed, May 24, 2017 at 04:54:34PM -0500, Eric W. Biederman wrote: > Ram Pai <linuxram@us.ibm.com> writes: > > > On Wed, May 17, 2017 at 12:54:34AM -0500, Eric W. Biederman wrote: > >> > >> While investigating some poor umount performance I realized that in > >> the case of overlapping mount trees where some of the mounts are locked > >> the code has been failing to unmount all of the mounts it should > >> have been unmounting. > >> > >> This failure to unmount all of the necessary > >> mounts can be reproduced with: > >> > >> $ cat locked_mounts_test.sh > >> > >> mount -t tmpfs test-base /mnt > >> mount --make-shared /mnt > >> mkdir -p /mnt/b > >> > >> mount -t tmpfs test1 /mnt/b > >> mount --make-shared /mnt/b > >> mkdir -p /mnt/b/10 > >> > >> mount -t tmpfs test2 /mnt/b/10 > >> mount --make-shared /mnt/b/10 > >> mkdir -p /mnt/b/10/20 > >> > >> mount --rbind /mnt/b /mnt/b/10/20 > >> > >> unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi' > >> sleep 1 > >> umount -l /mnt/b > >> wait %% > >> > >> $ unshare -Urm ./locked_mounts_test.sh > >> > >> This failure is corrected by removing the prepass that marks mounts > >> that may be umounted. > >> > >> A first pass is added that umounts mounts if possible and if not sets > >> mount mark if they could be unmounted if they weren't locked and adds > >> them to a list to umount possibilities. This first pass reconsiders > >> the mounts parent if it is on the list of umount possibilities, ensuring > >> that information of umoutability will pass from child to mount parent. > >> > >> A second pass then walks through all mounts that are umounted and processes > >> their children unmounting them or marking them for reparenting. > >> > >> A last pass cleans up the state on the mounts that could not be umounted > >> and if applicable reparents them to their first parent that remained > >> mounted. > >> > >> While a bit longer than the old code this code is much more robust > >> as it allows information to flow up from the leaves and down > >> from the trunk making the order in which mounts are encountered > >> in the umount propgation tree irrelevant. > > > > Eric, > > I think we can accomplish what you want in a much simpler way. > > Would the patch below; UNTESTED BUT COMPILED, resolve your > > issue? > > The reason I came up with an algorithm where the information flows > both directions is that especially in the case of umount -l > but even in some rare cases of a simple umount, the ordering > of the mount propagation tree can result in parent mounts being > visited before the child mounts. > > This case shows in in the case of a mount or a set of mounts > being mounted below itself. > > So no. Irregardless of tucked mount state we can't do this. Ok. I thought I had taken care, regardles of the order in which the mounts were encountered. I need to understand your patch better. Will relook at it later tonight. RP > > I see this also doesn't have the change to move mnt_change_mountpoint > into another pass. That one is quite important from a practical > point of view as that means the way the mount tree changes in umount > is the same irrespective of the number of times a mount shows > up in the mount propagation trees. Which is a very important > property to have for optimizing umount -l. Which in > the worst case allows reduces umount from O(N^2+) to roughly O(N). > > All of what I am doing should have not effect on an implementation of > MNT_TUCKED. > > That said your code to deal with MNT_TUCKED seems reasonable. > > Eric > > > > > Its a two pass unmount. First pass marks mounts that can > > be unmounted, and second pass does the neccessary unlinks. > > It does mark TUCKED mounts, and uses that information > > to peel off the correct mounts. Key points are > > > > a) a tucked mount never entertain any unmount propagation > > on its root dentry. > > > > b) when the child on the root dentry of a tucked mount is > > unmounted, the tucked mount is not a tucked mount anymore. > > > > c) if the child is a tucked mount, than its child is reparented > > to the parent. > > > > > > Signed-off-by: "Ram Pai" <linuxram@us.ibm.com> > > > > fs/namespace.c | 4 ++- > > fs/pnode.c | 53 +++++++++++++++++++++++++++++++++++++------------- > > fs/pnode.h | 3 ++ > > include/linux/mount.h | 1 > > > > diff --git a/fs/namespace.c b/fs/namespace.c > > index cc1375ef..ff3ec90 100644 > > --- a/fs/namespace.c > > +++ b/fs/namespace.c > > @@ -2050,8 +2050,10 @@ static int attach_recursive_mnt(struct mount *source_mnt, > > hlist_del_init(&child->mnt_hash); > > q = __lookup_mnt(&child->mnt_parent->mnt, > > child->mnt_mountpoint); > > - if (q) > > + if (q) { > > mnt_change_mountpoint(child, smp, q); > > + SET_MNT_TUCKED(child); > > + } > > commit_tree(child); > > } > > put_mountpoint(smp); > > diff --git a/fs/pnode.c b/fs/pnode.c > > index 5bc7896..b44a544 100644 > > --- a/fs/pnode.c > > +++ b/fs/pnode.c > > @@ -448,31 +448,58 @@ static void __propagate_umount(struct mount *mnt) > > > > for (m = propagation_next(parent, parent); m; > > m = propagation_next(m, parent)) { > > - struct mount *topper; > > - struct mount *child = __lookup_mnt(&m->mnt, > > - mnt->mnt_mountpoint); > > - /* > > - * umount the child only if the child has no children > > - * and the child is marked safe to unmount. > > + struct mount *topper, *child; > > + > > + /* Tucked mount must drop umount propagation events on > > + * its **root dentry**. > > + * The tucked mount did not exist when that child came > > + * into existence. It never received that mount propagation. > > + * Hence it should never entertain the umount propagation > > + * aswell. > > */ > > + if (IS_MNT_TUCKED(m) && list_is_singular(&mnt->mnt_mounts)) > > + continue; > > + > > + > > + child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint); > > + > > if (!child || !IS_MNT_MARKED(child)) > > continue; > > + > > CLEAR_MNT_MARK(child); > > > > - /* If there is exactly one mount covering all of child > > - * replace child with that mount. > > - */ > > - topper = find_topper(child); > > - if (topper) > > - mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, > > - topper); > > + if (IS_MNT_TUCKED(child) && > > + (list_is_singular(&child->mnt_mounts))) { > > + topper = find_topper(child); > > + if (topper) { > > + mnt_change_mountpoint(child->mnt_parent, > > + child->mnt_mp, topper); > > + CLEAR_MNT_TUCKED(child); /*lets be precise*/ > > + } > > + } > > > > if (list_empty(&child->mnt_mounts)) { > > list_del_init(&child->mnt_child); > > child->mnt.mnt_flags |= MNT_UMOUNT; > > list_move_tail(&child->mnt_list, &mnt->mnt_list); > > } > > +#if 0 > > + else { > > + mntput(child); /* mark it for deletion. It will > > + be deleted whenever it looses > > + all its remaining references. > > + TODO: some more thought > > + needed, please validate */ > > + } > > +#endif > > } > > + > > + /* > > + * This explicit umount operation is exposing the parent. > > + * In case the parent was a 'tucked' mount, it cannot be so > > + * anymore. > > + */ > > + CLEAR_MNT_TUCKED(parent); > > } > > > > /* > > diff --git a/fs/pnode.h b/fs/pnode.h > > index dc87e65..9ebd1a8 100644 > > --- a/fs/pnode.h > > +++ b/fs/pnode.h > > @@ -18,8 +18,11 @@ > > #define IS_MNT_UNBINDABLE(m) ((m)->mnt.mnt_flags & MNT_UNBINDABLE) > > #define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED) > > #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED) > > +#define SET_MNT_TUCKED(m) ((m)->mnt.mnt_flags |= MNT_TUCKED) > > #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED) > > +#define CLEAR_MNT_TUCKED(m) ((m)->mnt.mnt_flags &= ~MNT_TUCKED) > > #define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED) > > +#define IS_MNT_TUCKED(m) ((m)->mnt.mnt_flags & MNT_TUCKED) > > > > #define CL_EXPIRE 0x01 > > #define CL_SLAVE 0x02 > > diff --git a/include/linux/mount.h b/include/linux/mount.h > > index 8e0352a..41674e7 100644 > > --- a/include/linux/mount.h > > +++ b/include/linux/mount.h > > @@ -62,6 +62,7 @@ > > #define MNT_SYNC_UMOUNT 0x2000000 > > #define MNT_MARKED 0x4000000 > > #define MNT_UMOUNT 0x8000000 > > +#define MNT_TUCKED 0x10000000 > > > > struct vfsmount { > > struct dentry *mnt_root; /* root of the mounted tree */
diff --git a/fs/namespace.c b/fs/namespace.c index cc1375ef..ff3ec90 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2050,8 +2050,10 @@ static int attach_recursive_mnt(struct mount *source_mnt, hlist_del_init(&child->mnt_hash); q = __lookup_mnt(&child->mnt_parent->mnt, child->mnt_mountpoint); - if (q) + if (q) { mnt_change_mountpoint(child, smp, q); + SET_MNT_TUCKED(child); + } commit_tree(child); } put_mountpoint(smp); diff --git a/fs/pnode.c b/fs/pnode.c index 5bc7896..b44a544 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -448,31 +448,58 @@ static void __propagate_umount(struct mount *mnt) for (m = propagation_next(parent, parent); m; m = propagation_next(m, parent)) { - struct mount *topper; - struct mount *child = __lookup_mnt(&m->mnt, - mnt->mnt_mountpoint); - /* - * umount the child only if the child has no children - * and the child is marked safe to unmount. + struct mount *topper, *child; + + /* Tucked mount must drop umount propagation events on + * its **root dentry**. + * The tucked mount did not exist when that child came + * into existence. It never received that mount propagation. + * Hence it should never entertain the umount propagation + * aswell. */ + if (IS_MNT_TUCKED(m) && list_is_singular(&mnt->mnt_mounts)) + continue; + + + child = __lookup_mnt(&m->mnt, mnt->mnt_mountpoint); + if (!child || !IS_MNT_MARKED(child)) continue; + CLEAR_MNT_MARK(child); - /* If there is exactly one mount covering all of child - * replace child with that mount. - */ - topper = find_topper(child); - if (topper) - mnt_change_mountpoint(child->mnt_parent, child->mnt_mp, - topper); + if (IS_MNT_TUCKED(child) && + (list_is_singular(&child->mnt_mounts))) { + topper = find_topper(child); + if (topper) { + mnt_change_mountpoint(child->mnt_parent, + child->mnt_mp, topper); + CLEAR_MNT_TUCKED(child); /*lets be precise*/ + } + } if (list_empty(&child->mnt_mounts)) { list_del_init(&child->mnt_child); child->mnt.mnt_flags |= MNT_UMOUNT; list_move_tail(&child->mnt_list, &mnt->mnt_list); } +#if 0 + else { + mntput(child); /* mark it for deletion. It will + be deleted whenever it looses + all its remaining references. + TODO: some more thought + needed, please validate */ + } +#endif } + + /* + * This explicit umount operation is exposing the parent. + * In case the parent was a 'tucked' mount, it cannot be so + * anymore. + */ + CLEAR_MNT_TUCKED(parent); } /* diff --git a/fs/pnode.h b/fs/pnode.h index dc87e65..9ebd1a8 100644 --- a/fs/pnode.h +++ b/fs/pnode.h @@ -18,8 +18,11 @@ #define IS_MNT_UNBINDABLE(m) ((m)->mnt.mnt_flags & MNT_UNBINDABLE) #define IS_MNT_MARKED(m) ((m)->mnt.mnt_flags & MNT_MARKED) #define SET_MNT_MARK(m) ((m)->mnt.mnt_flags |= MNT_MARKED) +#define SET_MNT_TUCKED(m) ((m)->mnt.mnt_flags |= MNT_TUCKED) #define CLEAR_MNT_MARK(m) ((m)->mnt.mnt_flags &= ~MNT_MARKED) +#define CLEAR_MNT_TUCKED(m) ((m)->mnt.mnt_flags &= ~MNT_TUCKED) #define IS_MNT_LOCKED(m) ((m)->mnt.mnt_flags & MNT_LOCKED) +#define IS_MNT_TUCKED(m) ((m)->mnt.mnt_flags & MNT_TUCKED) #define CL_EXPIRE 0x01 #define CL_SLAVE 0x02 diff --git a/include/linux/mount.h b/include/linux/mount.h index 8e0352a..41674e7 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -62,6 +62,7 @@ #define MNT_SYNC_UMOUNT 0x2000000 #define MNT_MARKED 0x4000000 #define MNT_UMOUNT 0x8000000 +#define MNT_TUCKED 0x10000000 struct vfsmount { struct dentry *mnt_root; /* root of the mounted tree */