Message ID | 20130218132509.0ce779de@notabene.brown (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, 18 Feb 2013 13:25:09 +1100 NeilBrown <neilb@suse.de> wrote: > On Thu, 14 Feb 2013 10:42:30 -0500 Jeff Layton <jlayton@redhat.com> wrote: > > > On Tue, 12 Feb 2013 11:38:13 +1100 > > NeilBrown <neilb@suse.de> wrote: > > > > > > > > I've been exploring difficulties with unmounting stale directories and > > > discovered another bug. > > > > > > If I: > > > > > > SERVER: mkdir /foo/bar #and make sure it is exported > > > CLIENT: mount -o vers=4 server:/foo/bar /mnt > > > SERVER: rm -r /foo > > > CLIENT: > /mnt/baz # gets an error of course > > > CLIENT: ls -l /mnt # error again > > > CLIENT: umount /mnt > > > > > > The result of that last command is: > > > > > > /mnt was not found in /proc/mounts > > > /mnt was not found in /proc/mounts > > > > > > Strange? > > > > > > cat /proc/mounts > > > > > > ..... > > > 10.0.2.2://foo/bar /mnt\040(deleted) nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.15,minorversion=0,local_lock=none,addr=10.0.2.2 0 0 > > > .... > > > > > > Notice the "\040(deleted)". > > > > > > NFS has unhashed that directory because it is obviously bad, and d_path() > > > notices and adds " (deleted)". > > > > > > Now I might be able to argue that NFS shouldn't be unhashing a directory that > > > is a mountpoint - it certainly seems strange behaviour. > > > > > > But I think I can more strongly argue that /proc/mounts shouldn't be showing > > > the mounted directory, but instead the directory that it is mounted on. > > > Obviously these both have the same name so it shouldn't matter ... except > > > that here is a case where it does. > > > > > > I "fixed" it with > > > > > > --- a/fs/proc_namespace.c > > > +++ b/fs/proc_namespace.c > > > @@ -93,7 +93,7 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt) > > > { > > > struct mount *r = real_mount(mnt); > > > int err = 0; > > > - struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt }; > > > + struct path mnt_path = { .dentry = r->mnt_mountpoint, .mnt = &(r->mnt_parent)->mnt }; > > > struct super_block *sb = mnt_path.dentry->d_sb; > > > > > > if (sb->s_op->show_devname) { > > > > > > though I suspect that isn't safe and needs some locking. > > > > > > Probably both should be fixed: NFS should not invalidate any mounted > > > directory, and show_vfsmnt() should report the mointpoint, not the mounted > > > directory. > > > > > > I can't figure out any way to get NFS to not invalidate the mounted directory. > > > I think it happens in nfs_lookup_revalidate() when it calls d_drop(), but I > > > don't know how to tell if a given dentry is a mnt_root for any mountpoint. > > > > > > Suggestions? Thoughts? > > > > > > Thanks, > > > NeilBrown > > > > > > > I've also been looking at some weird ESTALE problems. Here's another > > fun one that doesn't involve mountpoints. Assume here that we're > > working in the same exported directory on client and server: > > > > server# mkdir a > > client# cd a > > server# mv a a.bak > > client# sleep 30 # (or whatever the dir attrcache timeout is) > > client# stat . > > stat: cannot stat ‘.’: Stale NFS file handle > > > > Obviously, "." should not be stale. It got renamed, but the inode still > > exists on the server. > > > > If you sniff on the wire, you'll see that the server doesn't ever send > > an ESTALE here. What happens is that due to FS_REVAL_DOT being set, we > > end up trying to revalidate the dentry that "." refers to. We find that > > the parent changed (obviously) and then try to redo the lookup of "a". > > At that point we notice that it doesn't exist and turn it into ESTALE. > > > > I don't really understand the point of FS_REVAL_DOT. What does that > > actually buy us? I wonder if removing it would also help your testcase? > > > > I think that is a slightly different issue, but certainly related. > I have hit your problem before and have the following patch in SLES. I think > I tried pushing it upstream, but didn't get much in the way of a useful > response. > (patch is space-damaged - don't try to apply with 'patch'). > > BTW I have another problem, related to this one and which could be fixed by > removing FS_REVAL_DOT. > > If you > mount -o vers=4,noac server:/some/path /mnt > then stop nfsd on the server and > umount /mnt > > it hangs. > Partly it hangs because 'mount' tries to do a 'readlink' on the mountpoint. > I can probably get it to not do that (or use --no-canonicalize). > But then sys_umount hangs because it tries to check with the server that the > thing being unmounted really is still a directory... > > I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that > works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very > last step and returns the mounted-on directory, not the mountpoint that is > mounted there - or at least makes sure not revalidate happens on that final > mounted directory. > > > I think FS_REVAL_DOT is needed so that if you call stat("."), it will update > attributes from the server if the cache is old. However it seems to do a > whole lot more than that, including "lookup" calls which it I'm sure is wrong. > > > > Subject: [PATCH] nfs - handle d_revalidate of 'dot' correctly. > > When d_revalidate is called on a dentry because FS_REVAL_DOT is set > it isn't really appropriate to revalidate the name. > > If the path was simply ".", then the current-working-directory could > have been renamed on the server and should still be accessible as "." > even if it has a new name. > > If the path was "/some/long/path/.", then the final component ("path" in > this case) has already been revalidated and there is no particular > need to do it again. > > If we change nd->last_type to refer to "the last component looked at" > rather than just "the last component", then these cases can be > detected by "nd->last_type != LAST_NORM". > > Signed-off-by: NeilBrown <neilb@suse.de> > > --- > fs/namei.c | 2 +- > fs/nfs/dir.c | 9 +++++++++ > 2 files changed, 10 insertions(+), 1 deletion(-) > > --- linux-3.0-SLE11-SP2.orig/fs/namei.c > +++ linux-3.0-SLE11-SP2/fs/namei.c > @@ -1460,6 +1460,7 @@ static int link_path_walk(const char *na > } > } > > + nd->last_type = type; > /* remove trailing slashes? */ > if (!c) > goto last_component; > @@ -1486,7 +1487,6 @@ last_component: > /* Clear LOOKUP_CONTINUE iff it was previously unset */ > nd->flags &= lookup_flags | ~LOOKUP_CONTINUE; > nd->last = this; > - nd->last_type = type; > return 0; > } > terminate_walk(nd); > --- linux-3.0-SLE11-SP2.orig/fs/nfs/dir.c > +++ linux-3.0-SLE11-SP2/fs/nfs/dir.c > @@ -1138,6 +1138,15 @@ static int nfs_lookup_revalidate(struct > if (NFS_STALE(inode)) > goto out_bad; > > + if (nd->last_type != LAST_NORM) { > + /* name not relevant, just inode */ > + error = nfs_revalidate_inode(NFS_SERVER(inode), inode); > + if (error) > + goto out_bad; > + else > + goto out_valid; > + } > + > error = -ENOMEM; > fhandle = nfs_alloc_fhandle(); > fattr = nfs_alloc_fattr(); Ahh thanks -- that is the same problem exactly. I'll have to look over your patch and see whether and how it could be applied to current mainline code. I think the umount thing may be the same problem that steved was talking about the other day (V4 unmount causes a GETATTR). I hadn't put the two together, but you're probably right. LOOKUP_MOUNTPOINT is a very interesting idea and might even be reasonable in conjunction with removing FS_REVAL_DOT as it would make the needs of umount more explicit. As far as FS_REVAL_DOT goes though, in the current mainline code, the only place that looks at it is complete_walk(), and it just means that we won't skip doing a d_revalidate on the final component (which will only have not been done if it's a '.', right?). If we remove FS_REVAL_DOT altogether, then we can chop off the bottom half of that function, which has a certain appeal. I guess the question we have to answer is -- if we remove FS_REVAL_DOT, what (if anything) will break?
On Feb 18, 2013, at 7:41 AM, Jeff Layton <jlayton@redhat.com> wrote: > On Mon, 18 Feb 2013 13:25:09 +1100 > NeilBrown <neilb@suse.de> wrote: > >> On Thu, 14 Feb 2013 10:42:30 -0500 Jeff Layton <jlayton@redhat.com> wrote: >> >>> On Tue, 12 Feb 2013 11:38:13 +1100 >>> NeilBrown <neilb@suse.de> wrote: >>> >>>> >>>> I've been exploring difficulties with unmounting stale directories and >>>> discovered another bug. >>>> >>>> If I: >>>> >>>> SERVER: mkdir /foo/bar #and make sure it is exported >>>> CLIENT: mount -o vers=4 server:/foo/bar /mnt >>>> SERVER: rm -r /foo >>>> CLIENT: > /mnt/baz # gets an error of course >>>> CLIENT: ls -l /mnt # error again >>>> CLIENT: umount /mnt >>>> >>>> The result of that last command is: >>>> >>>> /mnt was not found in /proc/mounts >>>> /mnt was not found in /proc/mounts >>>> >>>> Strange? >>>> >>>> cat /proc/mounts >>>> >>>> ..... >>>> 10.0.2.2://foo/bar /mnt\040(deleted) nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.15,minorversion=0,local_lock=none,addr=10.0.2.2 0 0 >>>> .... >>>> >>>> Notice the "\040(deleted)". >>>> >>>> NFS has unhashed that directory because it is obviously bad, and d_path() >>>> notices and adds " (deleted)". >>>> >>>> Now I might be able to argue that NFS shouldn't be unhashing a directory that >>>> is a mountpoint - it certainly seems strange behaviour. >>>> >>>> But I think I can more strongly argue that /proc/mounts shouldn't be showing >>>> the mounted directory, but instead the directory that it is mounted on. >>>> Obviously these both have the same name so it shouldn't matter ... except >>>> that here is a case where it does. >>>> >>>> I "fixed" it with >>>> >>>> --- a/fs/proc_namespace.c >>>> +++ b/fs/proc_namespace.c >>>> @@ -93,7 +93,7 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt) >>>> { >>>> struct mount *r = real_mount(mnt); >>>> int err = 0; >>>> - struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt }; >>>> + struct path mnt_path = { .dentry = r->mnt_mountpoint, .mnt = &(r->mnt_parent)->mnt }; >>>> struct super_block *sb = mnt_path.dentry->d_sb; >>>> >>>> if (sb->s_op->show_devname) { >>>> >>>> though I suspect that isn't safe and needs some locking. >>>> >>>> Probably both should be fixed: NFS should not invalidate any mounted >>>> directory, and show_vfsmnt() should report the mointpoint, not the mounted >>>> directory. >>>> >>>> I can't figure out any way to get NFS to not invalidate the mounted directory. >>>> I think it happens in nfs_lookup_revalidate() when it calls d_drop(), but I >>>> don't know how to tell if a given dentry is a mnt_root for any mountpoint. >>>> >>>> Suggestions? Thoughts? >>>> >>>> Thanks, >>>> NeilBrown >>>> >>> >>> I've also been looking at some weird ESTALE problems. Here's another >>> fun one that doesn't involve mountpoints. Assume here that we're >>> working in the same exported directory on client and server: >>> >>> server# mkdir a >>> client# cd a >>> server# mv a a.bak >>> client# sleep 30 # (or whatever the dir attrcache timeout is) >>> client# stat . >>> stat: cannot stat ‘.’: Stale NFS file handle >>> >>> Obviously, "." should not be stale. It got renamed, but the inode still >>> exists on the server. >>> >>> If you sniff on the wire, you'll see that the server doesn't ever send >>> an ESTALE here. What happens is that due to FS_REVAL_DOT being set, we >>> end up trying to revalidate the dentry that "." refers to. We find that >>> the parent changed (obviously) and then try to redo the lookup of "a". >>> At that point we notice that it doesn't exist and turn it into ESTALE. >>> >>> I don't really understand the point of FS_REVAL_DOT. What does that >>> actually buy us? I wonder if removing it would also help your testcase? >>> >> >> I think that is a slightly different issue, but certainly related. >> I have hit your problem before and have the following patch in SLES. I think >> I tried pushing it upstream, but didn't get much in the way of a useful >> response. >> (patch is space-damaged - don't try to apply with 'patch'). >> >> BTW I have another problem, related to this one and which could be fixed by >> removing FS_REVAL_DOT. >> >> If you >> mount -o vers=4,noac server:/some/path /mnt >> then stop nfsd on the server and >> umount /mnt >> >> it hangs. >> Partly it hangs because 'mount' tries to do a 'readlink' on the mountpoint. >> I can probably get it to not do that (or use --no-canonicalize). >> But then sys_umount hangs because it tries to check with the server that the >> thing being unmounted really is still a directory... >> >> I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that >> works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very >> last step and returns the mounted-on directory, not the mountpoint that is >> mounted there - or at least makes sure not revalidate happens on that final >> mounted directory. >> >> >> I think FS_REVAL_DOT is needed so that if you call stat("."), it will update >> attributes from the server if the cache is old. However it seems to do a >> whole lot more than that, including "lookup" calls which it I'm sure is wrong. >> >> >> >> Subject: [PATCH] nfs - handle d_revalidate of 'dot' correctly. >> >> When d_revalidate is called on a dentry because FS_REVAL_DOT is set >> it isn't really appropriate to revalidate the name. >> >> If the path was simply ".", then the current-working-directory could >> have been renamed on the server and should still be accessible as "." >> even if it has a new name. >> >> If the path was "/some/long/path/.", then the final component ("path" in >> this case) has already been revalidated and there is no particular >> need to do it again. >> >> If we change nd->last_type to refer to "the last component looked at" >> rather than just "the last component", then these cases can be >> detected by "nd->last_type != LAST_NORM". >> >> Signed-off-by: NeilBrown <neilb@suse.de> >> >> --- >> fs/namei.c | 2 +- >> fs/nfs/dir.c | 9 +++++++++ >> 2 files changed, 10 insertions(+), 1 deletion(-) >> >> --- linux-3.0-SLE11-SP2.orig/fs/namei.c >> +++ linux-3.0-SLE11-SP2/fs/namei.c >> @@ -1460,6 +1460,7 @@ static int link_path_walk(const char *na >> } >> } >> >> + nd->last_type = type; >> /* remove trailing slashes? */ >> if (!c) >> goto last_component; >> @@ -1486,7 +1487,6 @@ last_component: >> /* Clear LOOKUP_CONTINUE iff it was previously unset */ >> nd->flags &= lookup_flags | ~LOOKUP_CONTINUE; >> nd->last = this; >> - nd->last_type = type; >> return 0; >> } >> terminate_walk(nd); >> --- linux-3.0-SLE11-SP2.orig/fs/nfs/dir.c >> +++ linux-3.0-SLE11-SP2/fs/nfs/dir.c >> @@ -1138,6 +1138,15 @@ static int nfs_lookup_revalidate(struct >> if (NFS_STALE(inode)) >> goto out_bad; >> >> + if (nd->last_type != LAST_NORM) { >> + /* name not relevant, just inode */ >> + error = nfs_revalidate_inode(NFS_SERVER(inode), inode); >> + if (error) >> + goto out_bad; >> + else >> + goto out_valid; >> + } >> + >> error = -ENOMEM; >> fhandle = nfs_alloc_fhandle(); >> fattr = nfs_alloc_fattr(); > > Ahh thanks -- that is the same problem exactly. I'll have to look over > your patch and see whether and how it could be applied to current > mainline code. > > I think the umount thing may be the same problem that steved was > talking about the other day (V4 unmount causes a GETATTR). I hadn't put > the two together, but you're probably right. > > LOOKUP_MOUNTPOINT is a very interesting idea and might even be > reasonable in conjunction with removing FS_REVAL_DOT as it would make > the needs of umount more explicit. > > As far as FS_REVAL_DOT goes though, in the current mainline code, the > only place that looks at it is complete_walk(), and it just means that > we won't skip doing a d_revalidate on the final component (which will > only have not been done if it's a '.', right?). > > If we remove FS_REVAL_DOT altogether, then we can chop off the bottom > half of that function, which has a certain appeal. I guess the question > we have to answer is -- if we remove FS_REVAL_DOT, what (if anything) > will break? Before removing FS_REVAL_DOT, I recommend some archaeology: find out what was fixed by adding that for NFS. I worked on something related years ago and have a vague recollection about this that there is a situation where revalidating dot is important. Unfortunately I don't see it in a quick browse through the commit log for fs/nfs/dir.c. It may be in BitKeeper.
On Mon, Feb 18, 2013 at 01:25:09PM +1100, NeilBrown wrote: > I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that > works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very > last step and returns the mounted-on directory, not the mountpoint that is > mounted there - or at least makes sure not revalidate happens on that final > mounted directory. I don't think LOOKUP_MOUNTPOINT is a good idea. For one thing, we have fairly few places that might want it, all of them in core VFS. Might as well provide a separate function for them, a-la path_lookupat() vs. path_openat(). For another, we need to decide what to do with a really nasty corner case: a/b is a mountpoint, with c/d bound on it. c/d is a symlink to c/e c/e is a mountpoint What should umount("a/b", 0) do? There are two possibilities - removing vfsmount on top of a/b or one on top of c/e... We have the latter semantics; _that_ is what this GETATTR is about. It's a fairly obscure corner case - the question is not even whether to follow symlinks, it's whether to follow _mounts_ on the last component. Hell knows; I'm seriously tempted to change it "don't follow mounts" and see if anyone complains. The only case when behaviour would change would be a symlink mounted somewhere (note that this is _not_ something that can easily happen; e.g. mount --bind does follow symlinks) and umount(2) given the path resolving to the mountpoint of that symlink. > I think FS_REVAL_DOT is needed so that if you call stat("."), it will update > attributes from the server if the cache is old. However it seems to do a > whole lot more than that, including "lookup" calls which it I'm sure is wrong. Far from only that. FS_REVAL_DOT is a misnomer - it's not just ".". Take a look at the places where LOOKUP_JUMPED is set; _that_ is what drives the damn thing. Essentially, LOOKUP_JUMPED is "we hadn't arrived here by lookup in parent"; "." (or "/") obviously qualify, but so does following a procfs-style symlink, or .., for that matter. "foo/.", OTOH, does *not*. It matters only in the end of the pathname, of course. So we don't need to do revalidate every time we step on e.g. .. or cross a mountpoint (let alone when we step on .), as long as we make sure that revalidate isn't missing in the end of path. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 18 Feb 2013 18:46:09 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote: > On Mon, Feb 18, 2013 at 01:25:09PM +1100, NeilBrown wrote: > > > I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that > > works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very > > last step and returns the mounted-on directory, not the mountpoint that is > > mounted there - or at least makes sure not revalidate happens on that final > > mounted directory. > > I don't think LOOKUP_MOUNTPOINT is a good idea. For one thing, we have > fairly few places that might want it, all of them in core VFS. Might as > well provide a separate function for them, a-la path_lookupat() vs. > path_openat(). > > For another, we need to decide what to do with a really nasty corner case: > a/b is a mountpoint, with c/d bound on it. > c/d is a symlink to c/e > c/e is a mountpoint > What should umount("a/b", 0) do? There are two possibilities - removing > vfsmount on top of a/b or one on top of c/e... > > We have the latter semantics; _that_ is what this GETATTR is about. It's > a fairly obscure corner case - the question is not even whether to follow > symlinks, it's whether to follow _mounts_ on the last component. Hell > knows; I'm seriously tempted to change it "don't follow mounts" and see if > anyone complains. The only case when behaviour would change would be > a symlink mounted somewhere (note that this is _not_ something that can easily > happen; e.g. mount --bind does follow symlinks) and umount(2) given the > path resolving to the mountpoint of that symlink. Thinking about this some more, I now realise that my LOOKUP_MOUNTPOINT idea was too simplistic and missed the real point. The real point is that for unmount, we want to resolve the the path without any reference to any filesystem at all - the lookup should be handled entirely by the dcache. Any mountpoint is pinned in the dcache, and consequently any ancestor of any mount point also is. So the dcache will lead us to the dentry that we want. And the dentry is *all* we want. It doesn't really matter what the inode is like, or whether the filesystem thinks that the inode or name still exist. All we need to do is find a dentry that must be in the cache, and detach the mount that is there. Whether that is achieved by a LOOKUP_ flag or a separate lookup function doesn't matter much to me. I suspect we need to allow for passing a symlink to unmount, and the symlink might not be in cache, so we cannot insist uniformly on only using the dcache. However if a name is in the cache, and the cached data suggests that it is a directory, then we should trust that and avoid referring to the filesystem. umount is really quite unique in this. All other times we lookup a path we want to use the thing we found. With umount, we want to stop using it. ??? NeilBrown
On Tue, 2013-02-19 at 10:10 +1100, NeilBrown wrote: > On Mon, 18 Feb 2013 18:46:09 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote: > > > On Mon, Feb 18, 2013 at 01:25:09PM +1100, NeilBrown wrote: > > > > > I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that > > > works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very > > > last step and returns the mounted-on directory, not the mountpoint that is > > > mounted there - or at least makes sure not revalidate happens on that final > > > mounted directory. > > > > I don't think LOOKUP_MOUNTPOINT is a good idea. For one thing, we have > > fairly few places that might want it, all of them in core VFS. Might as > > well provide a separate function for them, a-la path_lookupat() vs. > > path_openat(). > > > > For another, we need to decide what to do with a really nasty corner case: > > a/b is a mountpoint, with c/d bound on it. > > c/d is a symlink to c/e > > c/e is a mountpoint > > What should umount("a/b", 0) do? There are two possibilities - removing > > vfsmount on top of a/b or one on top of c/e... > > > > We have the latter semantics; _that_ is what this GETATTR is about. It's > > a fairly obscure corner case - the question is not even whether to follow > > symlinks, it's whether to follow _mounts_ on the last component. Hell > > knows; I'm seriously tempted to change it "don't follow mounts" and see if > > anyone complains. The only case when behaviour would change would be > > a symlink mounted somewhere (note that this is _not_ something that can easily > > happen; e.g. mount --bind does follow symlinks) and umount(2) given the > > path resolving to the mountpoint of that symlink. > > Thinking about this some more, I now realise that my LOOKUP_MOUNTPOINT idea > was too simplistic and missed the real point. > > The real point is that for unmount, we want to resolve the the path without > any reference to any filesystem at all - the lookup should be handled > entirely by the dcache. > Any mountpoint is pinned in the dcache, and consequently any ancestor of any > mount point also is. So the dcache will lead us to the dentry that we want. > > And the dentry is *all* we want. It doesn't really matter what the inode is > like, or whether the filesystem thinks that the inode or name still exist. > All we need to do is find a dentry that must be in the cache, and detach the > mount that is there. > > Whether that is achieved by a LOOKUP_ flag or a separate lookup function > doesn't matter much to me. > > I suspect we need to allow for passing a symlink to unmount, and the symlink > might not be in cache, so we cannot insist uniformly on only using the dcache. > However if a name is in the cache, and the cached data suggests that it is a > directory, then we should trust that and avoid referring to the filesystem. > > umount is really quite unique in this. All other times we lookup a path we > want to use the thing we found. With umount, we want to stop using it. > > ??? Add a umountat() syscall so that you can supply a file descriptor? :-)
On Mon, 18 Feb 2013 23:17:42 +0000 "Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote: > On Tue, 2013-02-19 at 10:10 +1100, NeilBrown wrote: > > On Mon, 18 Feb 2013 18:46:09 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote: > > > > > On Mon, Feb 18, 2013 at 01:25:09PM +1100, NeilBrown wrote: > > > > > > > I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that > > > > works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very > > > > last step and returns the mounted-on directory, not the mountpoint that is > > > > mounted there - or at least makes sure not revalidate happens on that final > > > > mounted directory. > > > > > > I don't think LOOKUP_MOUNTPOINT is a good idea. For one thing, we have > > > fairly few places that might want it, all of them in core VFS. Might as > > > well provide a separate function for them, a-la path_lookupat() vs. > > > path_openat(). > > > > > > For another, we need to decide what to do with a really nasty corner case: > > > a/b is a mountpoint, with c/d bound on it. > > > c/d is a symlink to c/e > > > c/e is a mountpoint > > > What should umount("a/b", 0) do? There are two possibilities - removing > > > vfsmount on top of a/b or one on top of c/e... > > > > > > We have the latter semantics; _that_ is what this GETATTR is about. It's > > > a fairly obscure corner case - the question is not even whether to follow > > > symlinks, it's whether to follow _mounts_ on the last component. Hell > > > knows; I'm seriously tempted to change it "don't follow mounts" and see if > > > anyone complains. The only case when behaviour would change would be > > > a symlink mounted somewhere (note that this is _not_ something that can easily > > > happen; e.g. mount --bind does follow symlinks) and umount(2) given the > > > path resolving to the mountpoint of that symlink. > > > > Thinking about this some more, I now realise that my LOOKUP_MOUNTPOINT idea > > was too simplistic and missed the real point. > > > > The real point is that for unmount, we want to resolve the the path without > > any reference to any filesystem at all - the lookup should be handled > > entirely by the dcache. > > Any mountpoint is pinned in the dcache, and consequently any ancestor of any > > mount point also is. So the dcache will lead us to the dentry that we want. > > > > And the dentry is *all* we want. It doesn't really matter what the inode is > > like, or whether the filesystem thinks that the inode or name still exist. > > All we need to do is find a dentry that must be in the cache, and detach the > > mount that is there. > > > > Whether that is achieved by a LOOKUP_ flag or a separate lookup function > > doesn't matter much to me. > > > > I suspect we need to allow for passing a symlink to unmount, and the symlink > > might not be in cache, so we cannot insist uniformly on only using the dcache. > > However if a name is in the cache, and the cached data suggests that it is a > > directory, then we should trust that and avoid referring to the filesystem. > > > > umount is really quite unique in this. All other times we lookup a path we > > want to use the thing we found. With umount, we want to stop using it. > > > ??? > > Add a umountat() syscall so that you can supply a file descriptor? :-) > If I could get that file descriptor by opening some magic file in /proc which led immediately to the mount point, then I'd say "yes please!". Otherwise, I don't think it helps, and so support your ":-)". NeilBrown
On Tue, 19 Feb 2013 10:10:31 +1100 NeilBrown <neilb@suse.de> wrote: > On Mon, 18 Feb 2013 18:46:09 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote: > > > On Mon, Feb 18, 2013 at 01:25:09PM +1100, NeilBrown wrote: > > > > > I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that > > > works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very > > > last step and returns the mounted-on directory, not the mountpoint that is > > > mounted there - or at least makes sure not revalidate happens on that final > > > mounted directory. > > > > I don't think LOOKUP_MOUNTPOINT is a good idea. For one thing, we have > > fairly few places that might want it, all of them in core VFS. Might as > > well provide a separate function for them, a-la path_lookupat() vs. > > path_openat(). > > > > For another, we need to decide what to do with a really nasty corner case: > > a/b is a mountpoint, with c/d bound on it. > > c/d is a symlink to c/e > > c/e is a mountpoint > > What should umount("a/b", 0) do? There are two possibilities - removing > > vfsmount on top of a/b or one on top of c/e... > > > > We have the latter semantics; _that_ is what this GETATTR is about. It's > > a fairly obscure corner case - the question is not even whether to follow > > symlinks, it's whether to follow _mounts_ on the last component. Hell > > knows; I'm seriously tempted to change it "don't follow mounts" and see if > > anyone complains. The only case when behaviour would change would be > > a symlink mounted somewhere (note that this is _not_ something that can easily > > happen; e.g. mount --bind does follow symlinks) and umount(2) given the > > path resolving to the mountpoint of that symlink. > > Thinking about this some more, I now realise that my LOOKUP_MOUNTPOINT idea > was too simplistic and missed the real point. > > The real point is that for unmount, we want to resolve the the path without > any reference to any filesystem at all - the lookup should be handled > entirely by the dcache. > Any mountpoint is pinned in the dcache, and consequently any ancestor of any > mount point also is. So the dcache will lead us to the dentry that we want. > > And the dentry is *all* we want. It doesn't really matter what the inode is > like, or whether the filesystem thinks that the inode or name still exist. > All we need to do is find a dentry that must be in the cache, and detach the > mount that is there. > > Whether that is achieved by a LOOKUP_ flag or a separate lookup function > doesn't matter much to me. > > I suspect we need to allow for passing a symlink to unmount, and the symlink > might not be in cache, so we cannot insist uniformly on only using the dcache. > However if a name is in the cache, and the cached data suggests that it is a > directory, then we should trust that and avoid referring to the filesystem. > > umount is really quite unique in this. All other times we lookup a path we > want to use the thing we found. With umount, we want to stop using it. > From an IRC conversation with Al yesterday, which may point you in the right direction: 12:49 < viro> jlayton: umount() simply shouldn't do full lookup for path 12:50 < viro> it should get the parent 12:50 < viro> _then_ do pure dcache lookup for the last step ...of course that's still tricky if the last component is a symlink since you'd need to chase it by hand, but that seems like a reasonable way to start.
--- linux-3.0-SLE11-SP2.orig/fs/namei.c +++ linux-3.0-SLE11-SP2/fs/namei.c @@ -1460,6 +1460,7 @@ static int link_path_walk(const char *na } } + nd->last_type = type; /* remove trailing slashes? */ if (!c) goto last_component; @@ -1486,7 +1487,6 @@ last_component: /* Clear LOOKUP_CONTINUE iff it was previously unset */ nd->flags &= lookup_flags | ~LOOKUP_CONTINUE; nd->last = this; - nd->last_type = type; return 0; } terminate_walk(nd); --- linux-3.0-SLE11-SP2.orig/fs/nfs/dir.c +++ linux-3.0-SLE11-SP2/fs/nfs/dir.c @@ -1138,6 +1138,15 @@ static int nfs_lookup_revalidate(struct if (NFS_STALE(inode)) goto out_bad; + if (nd->last_type != LAST_NORM) { + /* name not relevant, just inode */ + error = nfs_revalidate_inode(NFS_SERVER(inode), inode); + if (error) + goto out_bad; + else + goto out_valid; + } + error = -ENOMEM; fhandle = nfs_alloc_fhandle(); fattr = nfs_alloc_fattr();