Message ID | 87inplh8or.fsf_-_@xmission.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Jan 12, 2017 at 05:18:12AM +1300, Eric W. Biederman wrote: > > When I look at what propagate_mount_busy is trying to do and I look > at the code closely I discover there is a great disconnect between the > two. In the ordinary non-propagation case propagate_mount_busy has > been verifying that there are no submounts and that there are no > extraneous references on the mount. > > For mounts that the unmount would propagate to propagate_mount_busy has > been verifying that there are no extraneous references only if there > are no submounts. Which is nonsense. ... because? > Thefore rework the logic in propgate_mount_busy so that for each > mount it examines it considers that mount busy if that mount has > children or if there are extraneous references to that mount. > > While this check was incorrect we could leak mounts instead of simply > failing umount. What do you mean, leak? We ended up not unmounting them, and they stayed around until umount of whatever they'd been shadowed by/slipped under had exposed them and they got explicitly unmounted. This is not a leak in a sense of "data structure is unreachable and will never be freed", unlike the one your previous version had introduced. Your change might very well be a nicer behaviour - or a DoS in making. But it really deserves more detailed rationale than that and yes, it is a user-visible change. With rather insane userland setups in that area (*cough* systemd *cough* docker), it's _not_ obviously correct. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Eric, Something wrong is in this patch. Pls, take a look at this script: [root@fc24 ~]# unshare -m bash -x xxx.sh + set -x -e -m + mount --make-rprivate / + mount --make-shared / + mount -t tmpfs xxx /mnt + mount --make-private /mnt + mkdir /mnt/yyy + mount -t tmpfs xxx /mnt/yyy + sleep 1 + unshare --propagation unchanged -m sleep 1000 + pid=452 + umount /mnt/yyy + umount /mnt/ umount: /mnt/: target is busy (In some cases useful info about processes that use the device is found by lsof(8) or fuser(1).) + echo FAIL FAIL + kill 452 + wait xxx.sh: line 15: 452 Terminated unshare --propagation unchanged -m sleep 1000 [root@fc24 ~]# cat xxx.sh set -x -e -m mount --make-rprivate / mount --make-shared / mount -t tmpfs xxx /mnt mount --make-private /mnt mkdir /mnt/yyy mount -t tmpfs xxx /mnt/yyy unshare --propagation unchanged -m sleep 1000 & sleep 1 pid=$! umount /mnt/yyy umount /mnt/ || echo FAIL kill $pid wait On Thu, Jan 12, 2017 at 05:18:12AM +1300, Eric W. Biederman wrote: > > When I look at what propagate_mount_busy is trying to do and I look > at the code closely I discover there is a great disconnect between the > two. In the ordinary non-propagation case propagate_mount_busy has > been verifying that there are no submounts and that there are no > extraneous references on the mount. > > For mounts that the unmount would propagate to propagate_mount_busy has > been verifying that there are no extraneous references only if there > are no submounts. Which is nonsense. > > Thefore rework the logic in propgate_mount_busy so that for each > mount it examines it considers that mount busy if that mount has > children or if there are extraneous references to that mount. > > While this check was incorrect we could leak mounts instead of simply > failing umount. > > Cc: stable@vger.kernel.org > Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount") > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > > If you don't figure this fix is worth it after all of this time please > let me know. This feels like the proper thing to do, and I don't expect > it will break anyone to fix this. > > fs/pnode.c | 11 ++++++----- > 1 file changed, 6 insertions(+), 5 deletions(-) > > diff --git a/fs/pnode.c b/fs/pnode.c > index 06a793f4ae38..12fafa711114 100644 > --- a/fs/pnode.c > +++ b/fs/pnode.c > @@ -344,7 +344,6 @@ int propagate_mount_busy(struct mount *mnt, int refcnt) > { > struct mount *m, *child; > struct mount *parent = mnt->mnt_parent; > - int ret = 0; > > if (mnt == parent) > return do_refcount_check(mnt, refcnt); > @@ -360,11 +359,13 @@ int propagate_mount_busy(struct mount *mnt, int refcnt) > for (m = propagation_next(parent, parent); m; > m = propagation_next(m, parent)) { > child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint); > - if (child && list_empty(&child->mnt_mounts) && > - (ret = do_refcount_check(child, 1))) > - break; > + if (!child) > + continue; > + if (!list_empty(&child->mnt_mounts) || > + do_refcount_check(child, 1)) > + return 1; > } > - return ret; > + return 0; > } > > /* > -- > 2.10.1 > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jan 13, 2017 at 12:31:59PM -0800, Andrei Vagin wrote: > Hi Eric, > > Something wrong is in this patch. Pls, take a look at this script: Actually, it works as expected. The reason of this error is what was fixed in this patch. I'm sorry for this noise. Tested-by: Andrei Vagin <avagin@virtuozzo.com> > > [root@fc24 ~]# unshare -m bash -x xxx.sh > + set -x -e -m > + mount --make-rprivate / > + mount --make-shared / > + mount -t tmpfs xxx /mnt > + mount --make-private /mnt > + mkdir /mnt/yyy > + mount -t tmpfs xxx /mnt/yyy > + sleep 1 > + unshare --propagation unchanged -m sleep 1000 > + pid=452 > + umount /mnt/yyy > + umount /mnt/ > umount: /mnt/: target is busy > (In some cases useful info about processes that > use the device is found by lsof(8) or fuser(1).) > + echo FAIL > FAIL > + kill 452 > + wait > xxx.sh: line 15: 452 Terminated unshare --propagation > unchanged -m sleep 1000 > > > [root@fc24 ~]# cat xxx.sh > set -x -e -m > > mount --make-rprivate / > mount --make-shared / > mount -t tmpfs xxx /mnt > mount --make-private /mnt > mkdir /mnt/yyy > mount -t tmpfs xxx /mnt/yyy > unshare --propagation unchanged -m sleep 1000 & > sleep 1 > pid=$! > umount /mnt/yyy > umount /mnt/ || echo FAIL > kill $pid > wait > > > On Thu, Jan 12, 2017 at 05:18:12AM +1300, Eric W. Biederman wrote: > > > > When I look at what propagate_mount_busy is trying to do and I look > > at the code closely I discover there is a great disconnect between the > > two. In the ordinary non-propagation case propagate_mount_busy has > > been verifying that there are no submounts and that there are no > > extraneous references on the mount. > > > > For mounts that the unmount would propagate to propagate_mount_busy has > > been verifying that there are no extraneous references only if there > > are no submounts. Which is nonsense. > > > > Thefore rework the logic in propgate_mount_busy so that for each > > mount it examines it considers that mount busy if that mount has > > children or if there are extraneous references to that mount. > > > > While this check was incorrect we could leak mounts instead of simply > > failing umount. > > > > Cc: stable@vger.kernel.org > > Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount") > > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > > --- > > > > If you don't figure this fix is worth it after all of this time please > > let me know. This feels like the proper thing to do, and I don't expect > > it will break anyone to fix this. > > > > fs/pnode.c | 11 ++++++----- > > 1 file changed, 6 insertions(+), 5 deletions(-) > > > > diff --git a/fs/pnode.c b/fs/pnode.c > > index 06a793f4ae38..12fafa711114 100644 > > --- a/fs/pnode.c > > +++ b/fs/pnode.c > > @@ -344,7 +344,6 @@ int propagate_mount_busy(struct mount *mnt, int refcnt) > > { > > struct mount *m, *child; > > struct mount *parent = mnt->mnt_parent; > > - int ret = 0; > > > > if (mnt == parent) > > return do_refcount_check(mnt, refcnt); > > @@ -360,11 +359,13 @@ int propagate_mount_busy(struct mount *mnt, int refcnt) > > for (m = propagation_next(parent, parent); m; > > m = propagation_next(m, parent)) { > > child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint); > > - if (child && list_empty(&child->mnt_mounts) && > > - (ret = do_refcount_check(child, 1))) > > - break; > > + if (!child) > > + continue; > > + if (!list_empty(&child->mnt_mounts) || > > + do_refcount_check(child, 1)) > > + return 1; > > } > > - return ret; > > + return 0; > > } > > > > /* > > -- > > 2.10.1 > > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Al Viro <viro@ZenIV.linux.org.uk> writes: > On Thu, Jan 12, 2017 at 05:18:12AM +1300, Eric W. Biederman wrote: >> >> When I look at what propagate_mount_busy is trying to do and I look >> at the code closely I discover there is a great disconnect between the >> two. In the ordinary non-propagation case propagate_mount_busy has >> been verifying that there are no submounts and that there are no >> extraneous references on the mount. >> >> For mounts that the unmount would propagate to propagate_mount_busy has >> been verifying that there are no extraneous references only if there >> are no submounts. Which is nonsense. > > ... because? > >> Thefore rework the logic in propgate_mount_busy so that for each >> mount it examines it considers that mount busy if that mount has >> children or if there are extraneous references to that mount. >> >> While this check was incorrect we could leak mounts instead of simply >> failing umount. > > What do you mean, leak? We ended up not unmounting them, and they > stayed around until umount of whatever they'd been shadowed by/slipped under > had exposed them and they got explicitly unmounted. Leak in the sense of userspace expecting everything to be cleaned up and it was not. My concerns exist in the presence of a slave mount with something mounted on it. Nothing exotic needs to exist. > This is not a leak in a sense of "data structure is unreachable and > will never be freed", unlike the one your previous version had introduced. > > Your change might very well be a nicer behaviour - or a DoS in making. > But it really deserves more detailed rationale than that and yes, it > is a user-visible change. With rather insane userland setups in that > area (*cough* systemd *cough* docker), it's _not_ obviously correct. I wrote this patch primarily because I looked at what the code was doing and saw semantics that make no obvious sense to me given my experience with how unmount ordinarily works, I needed to point that out so we can have this conversation independently. Having just looked and doubled checked I can say this is how umount is documented to behave in Documentation/filesystems/shared-subtrees. My experience with umount in other contexts is either umount succeeds, umount fails because something is making the mount busy, or a lazy umount is requested and users of the mount are ignored. The way umount propagation works adds another case into my understanding of umount behavior. Namely that the umount will be propagated to some places but not to other places depending on the presence of submounts. For me at least that violates the principle of least surprise. I do not understand if we are going to give up in some cases if the mount is busy but not in other cases why we even bother looking at propgated mounts. Given this behavior has existed for a decade and we have some very creative pieces of userspace code I completely agree that my case for making this change is insufficiently strong. To actually make this change would require extensive testing to verify I don't introduce any regressions in userspace applications. This patch was my way of pointing out the very strange (to my eyes) behaviour of umount propagation ignoring some cases of busy mounts, and asking if that was what you were concerned about in propagate_mount_busy. As my case is insufficient for this change. And my concerns about what you were concerned about with propagate_mount_busy have been addressed I am going to drop this patch. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jan 12, 2017 at 05:18:12AM +1300, Eric W. Biederman wrote: > > When I look at what propagate_mount_busy is trying to do and I look > at the code closely I discover there is a great disconnect between the > two. In the ordinary non-propagation case propagate_mount_busy has > been verifying that there are no submounts and that there are no > extraneous references on the mount. > > For mounts that the unmount would propagate to propagate_mount_busy has > been verifying that there are no extraneous references only if there > are no submounts. Which is nonsense. the reason why we had to do it that way was because there were situations where it was impossible to umount anything... take for example. (1) mount --make-shared A (2) mount --bind A A/a The tree looks like this A | B (3) mount --bind A B/a The tree looks like this A | B B' (B' becomes a shadow mount) | C (4) mount --make-slave A At this point B and C are peers and A is a slave. (5) umount B' NOTE: This used to be possible a decade ago if the process doing the umount had access to its dentry. The tree looks like this A | B | C Now if you try to unmount C, it becomes impossible, reason being... B is the parent of C. So the umount propagates to A. But A has B mounted at the same location. But B is busy since it has got a child C. So the entire umount has to fail. There is no way to umount it all. Kind of stuck for ever. That is the reason; in those days a decade ago, we relaxed the rule to let go propagated mounts that had children. The above example is a simplest case that demonstrates the phenomenon. Given that, the current code does not allow any process to reach shadow mount B' and given that we are getting rid of shadow mounts, I think we should allow the code changes you propose in this patch. RP > > Thefore rework the logic in propgate_mount_busy so that for each > mount it examines it considers that mount busy if that mount has > children or if there are extraneous references to that mount. > > While this check was incorrect we could leak mounts instead of simply > failing umount. > > Cc: stable@vger.kernel.org > Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount") > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > > If you don't figure this fix is worth it after all of this time please > let me know. This feels like the proper thing to do, and I don't expect > it will break anyone to fix this. > > fs/pnode.c | 11 ++++++----- > 1 file changed, 6 insertions(+), 5 deletions(-) > > diff --git a/fs/pnode.c b/fs/pnode.c > index 06a793f4ae38..12fafa711114 100644 > --- a/fs/pnode.c > +++ b/fs/pnode.c > @@ -344,7 +344,6 @@ int propagate_mount_busy(struct mount *mnt, int refcnt) > { > struct mount *m, *child; > struct mount *parent = mnt->mnt_parent; > - int ret = 0; > > if (mnt == parent) > return do_refcount_check(mnt, refcnt); > @@ -360,11 +359,13 @@ int propagate_mount_busy(struct mount *mnt, int refcnt) > for (m = propagation_next(parent, parent); m; > m = propagation_next(m, parent)) { > child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint); > - if (child && list_empty(&child->mnt_mounts) && > - (ret = do_refcount_check(child, 1))) > - break; > + if (!child) > + continue; > + if (!list_empty(&child->mnt_mounts) || > + do_refcount_check(child, 1)) > + return 1; > } > - return ret; > + return 0; > } > > /* > -- > 2.10.1
Ram Pai <linuxram@us.ibm.com> writes: > On Thu, Jan 12, 2017 at 05:18:12AM +1300, Eric W. Biederman wrote: >> >> When I look at what propagate_mount_busy is trying to do and I look >> at the code closely I discover there is a great disconnect between the >> two. In the ordinary non-propagation case propagate_mount_busy has >> been verifying that there are no submounts and that there are no >> extraneous references on the mount. >> >> For mounts that the unmount would propagate to propagate_mount_busy has >> been verifying that there are no extraneous references only if there >> are no submounts. Which is nonsense. > > > the reason why we had to do it that way was because there were > situations where it was impossible to umount anything... > > take for example. > > (1) mount --make-shared A > > (2) mount --bind A A/a > > The tree looks like this > > A > | > B > > (3) mount --bind A B/a > The tree looks like this > A > | > B B' (B' becomes a shadow mount) > | > C > > > (4) mount --make-slave A > At this point B and C are peers and A is a slave. > > (5) umount B' > NOTE: This used to be possible a decade ago if the process doing > the umount had access to its dentry. > The tree looks like this > A > | > B > | > C > > Now if you try to unmount C, it becomes impossible, reason being... > > B is the parent of C. > So the umount propagates to A. But A has B mounted at the same > location. But B is busy since it has got a child C. > So the entire umount has to fail. There is no way to umount it all. > Kind of stuck for ever. That is the reason; in those days a decade ago, > we relaxed the rule to let go propagated mounts that had children. > > The above example is a simplest case that demonstrates the phenomenon. > > Given that, the current code does not allow any process to reach shadow > mount B' and given that we are getting rid of shadow mounts, I think we > should allow the code changes you propose in this patch. Thank you very much for the good description of why propagate_mount_busy works the way it does. I just finished taking a hard look at this and in fact the current code does allow reaching B' via umount propagation. My other patch changes exactly how you have to reach it but it is still possible to umount B' At the same time those mounts have alwasy been unmountable with "umount -l" aka MOUNT_DETACH. Have you ever encountered a non-contrived situation that leads to this kind of problem? I expect if we can verify that docker, and systemd are similar pieces of the linux ecosystem are not depending on the exact details of the propagation of the umount busy we should be able to remove this. Last I looked the uses of mount and umount were all quite simple, so I think it is very possible to make this change. Especially as it is now much harder to get into the situation you describe. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 23, 2017 at 09:15:02PM +1300, Eric W. Biederman wrote: > Ram Pai <linuxram@us.ibm.com> writes: > > > On Thu, Jan 12, 2017 at 05:18:12AM +1300, Eric W. Biederman wrote: > >> > >> When I look at what propagate_mount_busy is trying to do and I look > >> at the code closely I discover there is a great disconnect between the > >> two. In the ordinary non-propagation case propagate_mount_busy has > >> been verifying that there are no submounts and that there are no > >> extraneous references on the mount. > >> > >> For mounts that the unmount would propagate to propagate_mount_busy has > >> been verifying that there are no extraneous references only if there > >> are no submounts. Which is nonsense. > > > > > > the reason why we had to do it that way was because there were > > situations where it was impossible to umount anything... > > > > take for example. > > > > (1) mount --make-shared A > > > > (2) mount --bind A A/a > > > > The tree looks like this > > > > A > > | > > B > > > > (3) mount --bind A B/a > > The tree looks like this > > A > > | > > B B' (B' becomes a shadow mount) > > | > > C > > > > > > (4) mount --make-slave A > > At this point B and C are peers and A is a slave. > > > > (5) umount B' > > NOTE: This used to be possible a decade ago if the process doing > > the umount had access to its dentry. > > The tree looks like this > > A > > | > > B > > | > > C > > > > Now if you try to unmount C, it becomes impossible, reason being... > > > > B is the parent of C. > > So the umount propagates to A. But A has B mounted at the same > > location. But B is busy since it has got a child C. > > So the entire umount has to fail. There is no way to umount it all. > > Kind of stuck for ever. That is the reason; in those days a decade ago, > > we relaxed the rule to let go propagated mounts that had children. > > > > The above example is a simplest case that demonstrates the phenomenon. > > > > Given that, the current code does not allow any process to reach shadow > > mount B' and given that we are getting rid of shadow mounts, I think we > > should allow the code changes you propose in this patch. > > Thank you very much for the good description of why propagate_mount_busy > works the way it does. > > I just finished taking a hard look at this and in fact the current code > does allow reaching B' via umount propagation. My other patch changes > exactly how you have to reach it but it is still possible to umount B' > > At the same time those mounts have alwasy been unmountable with > "umount -l" aka MOUNT_DETACH. > > Have you ever encountered a non-contrived situation that leads to this > kind of problem? No. Simple cases dont expose any of these hidden issues. The devil is in the detail. There used to be a test suite; which I believe has been integrated into LTP, that had all kinds of contrived cases to expose as many hidden issues. RP -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/pnode.c b/fs/pnode.c index 06a793f4ae38..12fafa711114 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -344,7 +344,6 @@ int propagate_mount_busy(struct mount *mnt, int refcnt) { struct mount *m, *child; struct mount *parent = mnt->mnt_parent; - int ret = 0; if (mnt == parent) return do_refcount_check(mnt, refcnt); @@ -360,11 +359,13 @@ int propagate_mount_busy(struct mount *mnt, int refcnt) for (m = propagation_next(parent, parent); m; m = propagation_next(m, parent)) { child = __lookup_mnt_last(&m->mnt, mnt->mnt_mountpoint); - if (child && list_empty(&child->mnt_mounts) && - (ret = do_refcount_check(child, 1))) - break; + if (!child) + continue; + if (!list_empty(&child->mnt_mounts) || + do_refcount_check(child, 1)) + return 1; } - return ret; + return 0; } /*
When I look at what propagate_mount_busy is trying to do and I look at the code closely I discover there is a great disconnect between the two. In the ordinary non-propagation case propagate_mount_busy has been verifying that there are no submounts and that there are no extraneous references on the mount. For mounts that the unmount would propagate to propagate_mount_busy has been verifying that there are no extraneous references only if there are no submounts. Which is nonsense. Thefore rework the logic in propgate_mount_busy so that for each mount it examines it considers that mount busy if that mount has children or if there are extraneous references to that mount. While this check was incorrect we could leak mounts instead of simply failing umount. Cc: stable@vger.kernel.org Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> --- If you don't figure this fix is worth it after all of this time please let me know. This feels like the proper thing to do, and I don't expect it will break anyone to fix this. fs/pnode.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-)