diff mbox

More fun with unmounting ESTALE directories.

Message ID 20130218132509.0ce779de@notabene.brown (mailing list archive)
State New, archived
Headers show

Commit Message

NeilBrown Feb. 18, 2013, 2:25 a.m. UTC
On Thu, 14 Feb 2013 10:42:30 -0500 Jeff Layton <jlayton@redhat.com> wrote:

> On Tue, 12 Feb 2013 11:38:13 +1100
> NeilBrown <neilb@suse.de> wrote:
> 
> > 
> > I've been exploring difficulties with unmounting stale directories and
> > discovered another bug.
> > 
> > If I:
> > 
> > SERVER:  mkdir /foo/bar  #and make sure it is exported
> > CLIENT:  mount -o vers=4 server:/foo/bar /mnt
> > SERVER:  rm -r /foo
> > CLIENT:  > /mnt/baz # gets an error of course
> > CLIENT:  ls -l /mnt # error again
> > CLIENT:  umount /mnt
> > 
> > The result of that last command is:
> > 
> > /mnt was not found in /proc/mounts
> > /mnt was not found in /proc/mounts
> > 
> > Strange?
> > 
> > cat /proc/mounts
> > 
> > .....
> > 10.0.2.2://foo/bar /mnt\040(deleted) nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.15,minorversion=0,local_lock=none,addr=10.0.2.2 0 0
> > ....
> > 
> > Notice the "\040(deleted)".
> > 
> > NFS has unhashed that directory because it is obviously bad, and d_path()
> > notices and adds " (deleted)".
> > 
> > Now I might be able to argue that NFS shouldn't be unhashing a directory that
> > is a mountpoint - it certainly seems strange behaviour.
> > 
> > But I think I can more strongly argue that /proc/mounts shouldn't be showing
> > the mounted directory, but instead the directory that it is mounted on.
> > Obviously these both have the same name so it shouldn't matter ... except
> > that here is a case where it does.
> > 
> > I "fixed" it with
> > 
> > --- a/fs/proc_namespace.c
> > +++ b/fs/proc_namespace.c
> > @@ -93,7 +93,7 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt)
> >  {
> >  	struct mount *r = real_mount(mnt);
> >  	int err = 0;
> > -	struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt };
> > +	struct path mnt_path = { .dentry = r->mnt_mountpoint, .mnt = &(r->mnt_parent)->mnt };
> >  	struct super_block *sb = mnt_path.dentry->d_sb;
> >  
> >  	if (sb->s_op->show_devname) {
> > 
> > though I suspect that isn't safe and needs some locking.
> > 
> > Probably both should be fixed:  NFS should not invalidate any mounted
> > directory, and show_vfsmnt() should report the mointpoint, not the mounted
> > directory.
> > 
> > I can't figure out any way to get NFS to not invalidate the mounted directory.
> > I think it happens in nfs_lookup_revalidate() when it calls d_drop(), but I
> > don't know how to tell if a given dentry is a mnt_root for any mountpoint.
> > 
> > Suggestions?  Thoughts?
> > 
> > Thanks,
> > NeilBrown
> > 
> 
> I've also been looking at some weird ESTALE problems. Here's another
> fun one that doesn't involve mountpoints. Assume here that we're
> working in the same exported directory on client and server:
> 
>     server# mkdir a
>     client# cd a
>     server# mv a a.bak
>     client# sleep 30  # (or whatever the dir attrcache timeout is)
>     client# stat .
>     stat: cannot stat ‘.’: Stale NFS file handle
> 
> Obviously, "." should not be stale. It got renamed, but the inode still
> exists on the server.
> 
> If you sniff on the wire, you'll see that the server doesn't ever send
> an ESTALE here. What happens is that due to FS_REVAL_DOT being set, we
> end up trying to revalidate the dentry that "." refers to. We find that
> the parent changed (obviously) and then try to redo the lookup of "a".
> At that point we notice that it doesn't exist and turn it into ESTALE.
> 
> I don't really understand the point of FS_REVAL_DOT. What does that
> actually buy us? I wonder if removing it would also help your testcase?
> 

I think that is a slightly different issue, but certainly related. 
I have hit your problem before and have the following patch in SLES.  I think
I tried pushing it upstream, but didn't get much in the way of a useful
response.
(patch is space-damaged - don't try to apply with 'patch').

BTW I have another problem, related to this one and which could be fixed by
removing FS_REVAL_DOT.

If you
   mount -o vers=4,noac server:/some/path /mnt
then stop nfsd on the server and
   umount /mnt

it hangs.
Partly it hangs because  'mount' tries to do a 'readlink' on the mountpoint.
I can probably get it to not do that (or use --no-canonicalize).
But then sys_umount hangs because it tries to check with the server that the
thing being unmounted really is still a directory...

I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that
works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very
last step and returns the mounted-on directory, not the mountpoint that is
mounted there - or at least makes sure not revalidate happens on that final
mounted directory.


I think FS_REVAL_DOT is needed so that if you call stat("."), it will update
attributes from the server if the cache is old.  However it seems to do a
whole lot more than that, including "lookup" calls which it I'm sure is wrong.

NeilBrown


Subject: [PATCH] nfs - handle d_revalidate of 'dot' correctly.

When d_revalidate is called on a dentry because FS_REVAL_DOT is set
it isn't really appropriate to revalidate the name.

If the path was simply ".", then the current-working-directory could
have been renamed on the server and should still be accessible as "."
even if it has a new name.

If the path was "/some/long/path/.", then the final component ("path" in
this case) has already been revalidated and there is no particular
need to do it again.

If we change nd->last_type to refer to "the last component looked at"
rather than just "the last component", then these cases can be
detected by "nd->last_type != LAST_NORM".

Signed-off-by: NeilBrown <neilb@suse.de>

---
 fs/namei.c   |    2 +-
 fs/nfs/dir.c |    9 +++++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

Comments

Jeff Layton Feb. 18, 2013, 12:41 p.m. UTC | #1
On Mon, 18 Feb 2013 13:25:09 +1100
NeilBrown <neilb@suse.de> wrote:

> On Thu, 14 Feb 2013 10:42:30 -0500 Jeff Layton <jlayton@redhat.com> wrote:
> 
> > On Tue, 12 Feb 2013 11:38:13 +1100
> > NeilBrown <neilb@suse.de> wrote:
> > 
> > > 
> > > I've been exploring difficulties with unmounting stale directories and
> > > discovered another bug.
> > > 
> > > If I:
> > > 
> > > SERVER:  mkdir /foo/bar  #and make sure it is exported
> > > CLIENT:  mount -o vers=4 server:/foo/bar /mnt
> > > SERVER:  rm -r /foo
> > > CLIENT:  > /mnt/baz # gets an error of course
> > > CLIENT:  ls -l /mnt # error again
> > > CLIENT:  umount /mnt
> > > 
> > > The result of that last command is:
> > > 
> > > /mnt was not found in /proc/mounts
> > > /mnt was not found in /proc/mounts
> > > 
> > > Strange?
> > > 
> > > cat /proc/mounts
> > > 
> > > .....
> > > 10.0.2.2://foo/bar /mnt\040(deleted) nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.15,minorversion=0,local_lock=none,addr=10.0.2.2 0 0
> > > ....
> > > 
> > > Notice the "\040(deleted)".
> > > 
> > > NFS has unhashed that directory because it is obviously bad, and d_path()
> > > notices and adds " (deleted)".
> > > 
> > > Now I might be able to argue that NFS shouldn't be unhashing a directory that
> > > is a mountpoint - it certainly seems strange behaviour.
> > > 
> > > But I think I can more strongly argue that /proc/mounts shouldn't be showing
> > > the mounted directory, but instead the directory that it is mounted on.
> > > Obviously these both have the same name so it shouldn't matter ... except
> > > that here is a case where it does.
> > > 
> > > I "fixed" it with
> > > 
> > > --- a/fs/proc_namespace.c
> > > +++ b/fs/proc_namespace.c
> > > @@ -93,7 +93,7 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt)
> > >  {
> > >  	struct mount *r = real_mount(mnt);
> > >  	int err = 0;
> > > -	struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt };
> > > +	struct path mnt_path = { .dentry = r->mnt_mountpoint, .mnt = &(r->mnt_parent)->mnt };
> > >  	struct super_block *sb = mnt_path.dentry->d_sb;
> > >  
> > >  	if (sb->s_op->show_devname) {
> > > 
> > > though I suspect that isn't safe and needs some locking.
> > > 
> > > Probably both should be fixed:  NFS should not invalidate any mounted
> > > directory, and show_vfsmnt() should report the mointpoint, not the mounted
> > > directory.
> > > 
> > > I can't figure out any way to get NFS to not invalidate the mounted directory.
> > > I think it happens in nfs_lookup_revalidate() when it calls d_drop(), but I
> > > don't know how to tell if a given dentry is a mnt_root for any mountpoint.
> > > 
> > > Suggestions?  Thoughts?
> > > 
> > > Thanks,
> > > NeilBrown
> > > 
> > 
> > I've also been looking at some weird ESTALE problems. Here's another
> > fun one that doesn't involve mountpoints. Assume here that we're
> > working in the same exported directory on client and server:
> > 
> >     server# mkdir a
> >     client# cd a
> >     server# mv a a.bak
> >     client# sleep 30  # (or whatever the dir attrcache timeout is)
> >     client# stat .
> >     stat: cannot stat ‘.’: Stale NFS file handle
> > 
> > Obviously, "." should not be stale. It got renamed, but the inode still
> > exists on the server.
> > 
> > If you sniff on the wire, you'll see that the server doesn't ever send
> > an ESTALE here. What happens is that due to FS_REVAL_DOT being set, we
> > end up trying to revalidate the dentry that "." refers to. We find that
> > the parent changed (obviously) and then try to redo the lookup of "a".
> > At that point we notice that it doesn't exist and turn it into ESTALE.
> > 
> > I don't really understand the point of FS_REVAL_DOT. What does that
> > actually buy us? I wonder if removing it would also help your testcase?
> > 
> 
> I think that is a slightly different issue, but certainly related. 
> I have hit your problem before and have the following patch in SLES.  I think
> I tried pushing it upstream, but didn't get much in the way of a useful
> response.
> (patch is space-damaged - don't try to apply with 'patch').
> 
> BTW I have another problem, related to this one and which could be fixed by
> removing FS_REVAL_DOT.
> 
> If you
>    mount -o vers=4,noac server:/some/path /mnt
> then stop nfsd on the server and
>    umount /mnt
> 
> it hangs.
> Partly it hangs because  'mount' tries to do a 'readlink' on the mountpoint.
> I can probably get it to not do that (or use --no-canonicalize).
> But then sys_umount hangs because it tries to check with the server that the
> thing being unmounted really is still a directory...
> 
> I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that
> works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very
> last step and returns the mounted-on directory, not the mountpoint that is
> mounted there - or at least makes sure not revalidate happens on that final
> mounted directory.
> 
> 
> I think FS_REVAL_DOT is needed so that if you call stat("."), it will update
> attributes from the server if the cache is old.  However it seems to do a
> whole lot more than that, including "lookup" calls which it I'm sure is wrong.
> 
> 
> 
> Subject: [PATCH] nfs - handle d_revalidate of 'dot' correctly.
> 
> When d_revalidate is called on a dentry because FS_REVAL_DOT is set
> it isn't really appropriate to revalidate the name.
> 
> If the path was simply ".", then the current-working-directory could
> have been renamed on the server and should still be accessible as "."
> even if it has a new name.
> 
> If the path was "/some/long/path/.", then the final component ("path" in
> this case) has already been revalidated and there is no particular
> need to do it again.
> 
> If we change nd->last_type to refer to "the last component looked at"
> rather than just "the last component", then these cases can be
> detected by "nd->last_type != LAST_NORM".
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> 
> ---
>  fs/namei.c   |    2 +-
>  fs/nfs/dir.c |    9 +++++++++
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> --- linux-3.0-SLE11-SP2.orig/fs/namei.c
> +++ linux-3.0-SLE11-SP2/fs/namei.c
> @@ -1460,6 +1460,7 @@ static int link_path_walk(const char *na
>                         }
>                 }
>  
> +               nd->last_type = type;
>                 /* remove trailing slashes? */
>                 if (!c)
>                         goto last_component;
> @@ -1486,7 +1487,6 @@ last_component:
>                 /* Clear LOOKUP_CONTINUE iff it was previously unset */
>                 nd->flags &= lookup_flags | ~LOOKUP_CONTINUE;
>                 nd->last = this;
> -               nd->last_type = type;
>                 return 0;
>         }
>         terminate_walk(nd);
> --- linux-3.0-SLE11-SP2.orig/fs/nfs/dir.c
> +++ linux-3.0-SLE11-SP2/fs/nfs/dir.c
> @@ -1138,6 +1138,15 @@ static int nfs_lookup_revalidate(struct
>         if (NFS_STALE(inode))
>                 goto out_bad;
>  
> +       if (nd->last_type != LAST_NORM) {
> +               /* name not relevant, just inode */
> +               error = nfs_revalidate_inode(NFS_SERVER(inode), inode);
> +               if (error)
> +                       goto out_bad;
> +               else
> +                       goto out_valid;
> +       }
> +
>         error = -ENOMEM;
>         fhandle = nfs_alloc_fhandle();
>         fattr = nfs_alloc_fattr();

Ahh thanks -- that is the same problem exactly. I'll have to look over
your patch and see whether and how it could be applied to current
mainline code.

I think the umount thing may be the same problem that steved was
talking about the other day (V4 unmount causes a GETATTR). I hadn't put
the two together, but you're probably right.

LOOKUP_MOUNTPOINT is a very interesting idea and might even be
reasonable in conjunction with removing FS_REVAL_DOT as it would make
the needs of umount more explicit.

As far as FS_REVAL_DOT goes though, in the current mainline code, the
only place that looks at it is complete_walk(), and it just means that
we won't skip doing a d_revalidate on the final component (which will
only have not been done if it's a '.', right?).

If we remove FS_REVAL_DOT altogether, then we can chop off the bottom
half of that function, which has a certain appeal. I guess the question
we have to answer is -- if we remove FS_REVAL_DOT, what (if anything)
will break?
Chuck Lever III Feb. 18, 2013, 3:36 p.m. UTC | #2
On Feb 18, 2013, at 7:41 AM, Jeff Layton <jlayton@redhat.com> wrote:

> On Mon, 18 Feb 2013 13:25:09 +1100
> NeilBrown <neilb@suse.de> wrote:
> 
>> On Thu, 14 Feb 2013 10:42:30 -0500 Jeff Layton <jlayton@redhat.com> wrote:
>> 
>>> On Tue, 12 Feb 2013 11:38:13 +1100
>>> NeilBrown <neilb@suse.de> wrote:
>>> 
>>>> 
>>>> I've been exploring difficulties with unmounting stale directories and
>>>> discovered another bug.
>>>> 
>>>> If I:
>>>> 
>>>> SERVER:  mkdir /foo/bar  #and make sure it is exported
>>>> CLIENT:  mount -o vers=4 server:/foo/bar /mnt
>>>> SERVER:  rm -r /foo
>>>> CLIENT:  > /mnt/baz # gets an error of course
>>>> CLIENT:  ls -l /mnt # error again
>>>> CLIENT:  umount /mnt
>>>> 
>>>> The result of that last command is:
>>>> 
>>>> /mnt was not found in /proc/mounts
>>>> /mnt was not found in /proc/mounts
>>>> 
>>>> Strange?
>>>> 
>>>> cat /proc/mounts
>>>> 
>>>> .....
>>>> 10.0.2.2://foo/bar /mnt\040(deleted) nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.15,minorversion=0,local_lock=none,addr=10.0.2.2 0 0
>>>> ....
>>>> 
>>>> Notice the "\040(deleted)".
>>>> 
>>>> NFS has unhashed that directory because it is obviously bad, and d_path()
>>>> notices and adds " (deleted)".
>>>> 
>>>> Now I might be able to argue that NFS shouldn't be unhashing a directory that
>>>> is a mountpoint - it certainly seems strange behaviour.
>>>> 
>>>> But I think I can more strongly argue that /proc/mounts shouldn't be showing
>>>> the mounted directory, but instead the directory that it is mounted on.
>>>> Obviously these both have the same name so it shouldn't matter ... except
>>>> that here is a case where it does.
>>>> 
>>>> I "fixed" it with
>>>> 
>>>> --- a/fs/proc_namespace.c
>>>> +++ b/fs/proc_namespace.c
>>>> @@ -93,7 +93,7 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt)
>>>> {
>>>> 	struct mount *r = real_mount(mnt);
>>>> 	int err = 0;
>>>> -	struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt };
>>>> +	struct path mnt_path = { .dentry = r->mnt_mountpoint, .mnt = &(r->mnt_parent)->mnt };
>>>> 	struct super_block *sb = mnt_path.dentry->d_sb;
>>>> 
>>>> 	if (sb->s_op->show_devname) {
>>>> 
>>>> though I suspect that isn't safe and needs some locking.
>>>> 
>>>> Probably both should be fixed:  NFS should not invalidate any mounted
>>>> directory, and show_vfsmnt() should report the mointpoint, not the mounted
>>>> directory.
>>>> 
>>>> I can't figure out any way to get NFS to not invalidate the mounted directory.
>>>> I think it happens in nfs_lookup_revalidate() when it calls d_drop(), but I
>>>> don't know how to tell if a given dentry is a mnt_root for any mountpoint.
>>>> 
>>>> Suggestions?  Thoughts?
>>>> 
>>>> Thanks,
>>>> NeilBrown
>>>> 
>>> 
>>> I've also been looking at some weird ESTALE problems. Here's another
>>> fun one that doesn't involve mountpoints. Assume here that we're
>>> working in the same exported directory on client and server:
>>> 
>>>    server# mkdir a
>>>    client# cd a
>>>    server# mv a a.bak
>>>    client# sleep 30  # (or whatever the dir attrcache timeout is)
>>>    client# stat .
>>>    stat: cannot stat ‘.’: Stale NFS file handle
>>> 
>>> Obviously, "." should not be stale. It got renamed, but the inode still
>>> exists on the server.
>>> 
>>> If you sniff on the wire, you'll see that the server doesn't ever send
>>> an ESTALE here. What happens is that due to FS_REVAL_DOT being set, we
>>> end up trying to revalidate the dentry that "." refers to. We find that
>>> the parent changed (obviously) and then try to redo the lookup of "a".
>>> At that point we notice that it doesn't exist and turn it into ESTALE.
>>> 
>>> I don't really understand the point of FS_REVAL_DOT. What does that
>>> actually buy us? I wonder if removing it would also help your testcase?
>>> 
>> 
>> I think that is a slightly different issue, but certainly related. 
>> I have hit your problem before and have the following patch in SLES.  I think
>> I tried pushing it upstream, but didn't get much in the way of a useful
>> response.
>> (patch is space-damaged - don't try to apply with 'patch').
>> 
>> BTW I have another problem, related to this one and which could be fixed by
>> removing FS_REVAL_DOT.
>> 
>> If you
>>   mount -o vers=4,noac server:/some/path /mnt
>> then stop nfsd on the server and
>>   umount /mnt
>> 
>> it hangs.
>> Partly it hangs because  'mount' tries to do a 'readlink' on the mountpoint.
>> I can probably get it to not do that (or use --no-canonicalize).
>> But then sys_umount hangs because it tries to check with the server that the
>> thing being unmounted really is still a directory...
>> 
>> I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that
>> works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very
>> last step and returns the mounted-on directory, not the mountpoint that is
>> mounted there - or at least makes sure not revalidate happens on that final
>> mounted directory.
>> 
>> 
>> I think FS_REVAL_DOT is needed so that if you call stat("."), it will update
>> attributes from the server if the cache is old.  However it seems to do a
>> whole lot more than that, including "lookup" calls which it I'm sure is wrong.
>> 
>> 
>> 
>> Subject: [PATCH] nfs - handle d_revalidate of 'dot' correctly.
>> 
>> When d_revalidate is called on a dentry because FS_REVAL_DOT is set
>> it isn't really appropriate to revalidate the name.
>> 
>> If the path was simply ".", then the current-working-directory could
>> have been renamed on the server and should still be accessible as "."
>> even if it has a new name.
>> 
>> If the path was "/some/long/path/.", then the final component ("path" in
>> this case) has already been revalidated and there is no particular
>> need to do it again.
>> 
>> If we change nd->last_type to refer to "the last component looked at"
>> rather than just "the last component", then these cases can be
>> detected by "nd->last_type != LAST_NORM".
>> 
>> Signed-off-by: NeilBrown <neilb@suse.de>
>> 
>> ---
>> fs/namei.c   |    2 +-
>> fs/nfs/dir.c |    9 +++++++++
>> 2 files changed, 10 insertions(+), 1 deletion(-)
>> 
>> --- linux-3.0-SLE11-SP2.orig/fs/namei.c
>> +++ linux-3.0-SLE11-SP2/fs/namei.c
>> @@ -1460,6 +1460,7 @@ static int link_path_walk(const char *na
>>                        }
>>                }
>> 
>> +               nd->last_type = type;
>>                /* remove trailing slashes? */
>>                if (!c)
>>                        goto last_component;
>> @@ -1486,7 +1487,6 @@ last_component:
>>                /* Clear LOOKUP_CONTINUE iff it was previously unset */
>>                nd->flags &= lookup_flags | ~LOOKUP_CONTINUE;
>>                nd->last = this;
>> -               nd->last_type = type;
>>                return 0;
>>        }
>>        terminate_walk(nd);
>> --- linux-3.0-SLE11-SP2.orig/fs/nfs/dir.c
>> +++ linux-3.0-SLE11-SP2/fs/nfs/dir.c
>> @@ -1138,6 +1138,15 @@ static int nfs_lookup_revalidate(struct
>>        if (NFS_STALE(inode))
>>                goto out_bad;
>> 
>> +       if (nd->last_type != LAST_NORM) {
>> +               /* name not relevant, just inode */
>> +               error = nfs_revalidate_inode(NFS_SERVER(inode), inode);
>> +               if (error)
>> +                       goto out_bad;
>> +               else
>> +                       goto out_valid;
>> +       }
>> +
>>        error = -ENOMEM;
>>        fhandle = nfs_alloc_fhandle();
>>        fattr = nfs_alloc_fattr();
> 
> Ahh thanks -- that is the same problem exactly. I'll have to look over
> your patch and see whether and how it could be applied to current
> mainline code.
> 
> I think the umount thing may be the same problem that steved was
> talking about the other day (V4 unmount causes a GETATTR). I hadn't put
> the two together, but you're probably right.
> 
> LOOKUP_MOUNTPOINT is a very interesting idea and might even be
> reasonable in conjunction with removing FS_REVAL_DOT as it would make
> the needs of umount more explicit.
> 
> As far as FS_REVAL_DOT goes though, in the current mainline code, the
> only place that looks at it is complete_walk(), and it just means that
> we won't skip doing a d_revalidate on the final component (which will
> only have not been done if it's a '.', right?).
> 
> If we remove FS_REVAL_DOT altogether, then we can chop off the bottom
> half of that function, which has a certain appeal. I guess the question
> we have to answer is -- if we remove FS_REVAL_DOT, what (if anything)
> will break?

Before removing FS_REVAL_DOT, I recommend some archaeology:  find out what was fixed by adding that for NFS.  I worked on something related years ago and have a vague recollection about this that there is a situation where revalidating dot is important.

Unfortunately I don't see it in a quick browse through the commit log for fs/nfs/dir.c.  It may be in BitKeeper.
Al Viro Feb. 18, 2013, 6:46 p.m. UTC | #3
On Mon, Feb 18, 2013 at 01:25:09PM +1100, NeilBrown wrote:

> I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that
> works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very
> last step and returns the mounted-on directory, not the mountpoint that is
> mounted there - or at least makes sure not revalidate happens on that final
> mounted directory.

I don't think LOOKUP_MOUNTPOINT is a good idea.  For one thing, we have
fairly few places that might want it, all of them in core VFS.  Might as
well provide a separate function for them, a-la path_lookupat() vs.
path_openat().

For another, we need to decide what to do with a really nasty corner case:
	a/b is a mountpoint, with c/d bound on it.
	c/d is a symlink to c/e
	c/e is a mountpoint
What should umount("a/b", 0) do?  There are two possibilities - removing
vfsmount on top of a/b or one on top of c/e...

We have the latter semantics; _that_ is what this GETATTR is about.  It's
a fairly obscure corner case - the question is not even whether to follow
symlinks, it's whether to follow _mounts_ on the last component.  Hell
knows; I'm seriously tempted to change it "don't follow mounts" and see if
anyone complains.  The only case when behaviour would change would be
a symlink mounted somewhere (note that this is _not_ something that can easily
happen; e.g. mount --bind does follow symlinks) and umount(2) given the
path resolving to the mountpoint of that symlink.
 
> I think FS_REVAL_DOT is needed so that if you call stat("."), it will update
> attributes from the server if the cache is old.  However it seems to do a
> whole lot more than that, including "lookup" calls which it I'm sure is wrong.

Far from only that.  FS_REVAL_DOT is a misnomer - it's not just ".".  Take
a look at the places where LOOKUP_JUMPED is set; _that_ is what drives the
damn thing.  Essentially, LOOKUP_JUMPED is "we hadn't arrived here by
lookup in parent"; "." (or "/") obviously qualify, but so does following
a procfs-style symlink, or .., for that matter.  "foo/.", OTOH, does *not*.

It matters only in the end of the pathname, of course.  So we don't need to do
revalidate every time we step on e.g. .. or cross a mountpoint (let alone
when we step on .), as long as we make sure that revalidate isn't missing in
the end of path.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
NeilBrown Feb. 18, 2013, 11:10 p.m. UTC | #4
On Mon, 18 Feb 2013 18:46:09 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Mon, Feb 18, 2013 at 01:25:09PM +1100, NeilBrown wrote:
> 
> > I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that
> > works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very
> > last step and returns the mounted-on directory, not the mountpoint that is
> > mounted there - or at least makes sure not revalidate happens on that final
> > mounted directory.
> 
> I don't think LOOKUP_MOUNTPOINT is a good idea.  For one thing, we have
> fairly few places that might want it, all of them in core VFS.  Might as
> well provide a separate function for them, a-la path_lookupat() vs.
> path_openat().
> 
> For another, we need to decide what to do with a really nasty corner case:
> 	a/b is a mountpoint, with c/d bound on it.
> 	c/d is a symlink to c/e
> 	c/e is a mountpoint
> What should umount("a/b", 0) do?  There are two possibilities - removing
> vfsmount on top of a/b or one on top of c/e...
> 
> We have the latter semantics; _that_ is what this GETATTR is about.  It's
> a fairly obscure corner case - the question is not even whether to follow
> symlinks, it's whether to follow _mounts_ on the last component.  Hell
> knows; I'm seriously tempted to change it "don't follow mounts" and see if
> anyone complains.  The only case when behaviour would change would be
> a symlink mounted somewhere (note that this is _not_ something that can easily
> happen; e.g. mount --bind does follow symlinks) and umount(2) given the
> path resolving to the mountpoint of that symlink.

Thinking about this some more, I now realise that my LOOKUP_MOUNTPOINT idea
was too simplistic and missed the real point.

The real point is that for unmount, we want to resolve the the path without
any reference to any filesystem at all - the lookup should be handled
entirely by the dcache.
Any mountpoint is pinned in the dcache, and consequently any ancestor of any
mount point also is.  So the dcache will lead us to the dentry that we want.

And the dentry is *all* we want.  It doesn't really matter what the inode is
like, or whether the filesystem thinks that the inode or name still exist.
All we need to do is find a dentry that must be  in the cache, and detach the
mount that is there.

Whether that is achieved by a LOOKUP_ flag or a separate lookup function
doesn't matter much to me.

I suspect we need to allow for passing a symlink to unmount, and the symlink
might not be in cache, so we cannot insist uniformly on only using the dcache.
However if a name is in the cache, and the cached data suggests that it is a
directory, then we should trust that and avoid referring to the filesystem.

umount is really quite unique in this.  All other times we lookup a path we
want to use the thing we found.  With umount, we want to stop using it.

???

NeilBrown
Trond Myklebust Feb. 18, 2013, 11:17 p.m. UTC | #5
On Tue, 2013-02-19 at 10:10 +1100, NeilBrown wrote:
> On Mon, 18 Feb 2013 18:46:09 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> > On Mon, Feb 18, 2013 at 01:25:09PM +1100, NeilBrown wrote:
> > 
> > > I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that
> > > works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very
> > > last step and returns the mounted-on directory, not the mountpoint that is
> > > mounted there - or at least makes sure not revalidate happens on that final
> > > mounted directory.
> > 
> > I don't think LOOKUP_MOUNTPOINT is a good idea.  For one thing, we have
> > fairly few places that might want it, all of them in core VFS.  Might as
> > well provide a separate function for them, a-la path_lookupat() vs.
> > path_openat().
> > 
> > For another, we need to decide what to do with a really nasty corner case:
> > 	a/b is a mountpoint, with c/d bound on it.
> > 	c/d is a symlink to c/e
> > 	c/e is a mountpoint
> > What should umount("a/b", 0) do?  There are two possibilities - removing
> > vfsmount on top of a/b or one on top of c/e...
> > 
> > We have the latter semantics; _that_ is what this GETATTR is about.  It's
> > a fairly obscure corner case - the question is not even whether to follow
> > symlinks, it's whether to follow _mounts_ on the last component.  Hell
> > knows; I'm seriously tempted to change it "don't follow mounts" and see if
> > anyone complains.  The only case when behaviour would change would be
> > a symlink mounted somewhere (note that this is _not_ something that can easily
> > happen; e.g. mount --bind does follow symlinks) and umount(2) given the
> > path resolving to the mountpoint of that symlink.
> 
> Thinking about this some more, I now realise that my LOOKUP_MOUNTPOINT idea
> was too simplistic and missed the real point.
> 
> The real point is that for unmount, we want to resolve the the path without
> any reference to any filesystem at all - the lookup should be handled
> entirely by the dcache.
> Any mountpoint is pinned in the dcache, and consequently any ancestor of any
> mount point also is.  So the dcache will lead us to the dentry that we want.
> 
> And the dentry is *all* we want.  It doesn't really matter what the inode is
> like, or whether the filesystem thinks that the inode or name still exist.
> All we need to do is find a dentry that must be  in the cache, and detach the
> mount that is there.
> 
> Whether that is achieved by a LOOKUP_ flag or a separate lookup function
> doesn't matter much to me.
> 
> I suspect we need to allow for passing a symlink to unmount, and the symlink
> might not be in cache, so we cannot insist uniformly on only using the dcache.
> However if a name is in the cache, and the cached data suggests that it is a
> directory, then we should trust that and avoid referring to the filesystem.
> 
> umount is really quite unique in this.  All other times we lookup a path we
> want to use the thing we found.  With umount, we want to stop using it.
> 
> ???

Add a umountat() syscall so that you can supply a file descriptor? :-)
NeilBrown Feb. 18, 2013, 11:31 p.m. UTC | #6
On Mon, 18 Feb 2013 23:17:42 +0000 "Myklebust, Trond"
<Trond.Myklebust@netapp.com> wrote:

> On Tue, 2013-02-19 at 10:10 +1100, NeilBrown wrote:
> > On Mon, 18 Feb 2013 18:46:09 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > 
> > > On Mon, Feb 18, 2013 at 01:25:09PM +1100, NeilBrown wrote:
> > > 
> > > > I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that
> > > > works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very
> > > > last step and returns the mounted-on directory, not the mountpoint that is
> > > > mounted there - or at least makes sure not revalidate happens on that final
> > > > mounted directory.
> > > 
> > > I don't think LOOKUP_MOUNTPOINT is a good idea.  For one thing, we have
> > > fairly few places that might want it, all of them in core VFS.  Might as
> > > well provide a separate function for them, a-la path_lookupat() vs.
> > > path_openat().
> > > 
> > > For another, we need to decide what to do with a really nasty corner case:
> > > 	a/b is a mountpoint, with c/d bound on it.
> > > 	c/d is a symlink to c/e
> > > 	c/e is a mountpoint
> > > What should umount("a/b", 0) do?  There are two possibilities - removing
> > > vfsmount on top of a/b or one on top of c/e...
> > > 
> > > We have the latter semantics; _that_ is what this GETATTR is about.  It's
> > > a fairly obscure corner case - the question is not even whether to follow
> > > symlinks, it's whether to follow _mounts_ on the last component.  Hell
> > > knows; I'm seriously tempted to change it "don't follow mounts" and see if
> > > anyone complains.  The only case when behaviour would change would be
> > > a symlink mounted somewhere (note that this is _not_ something that can easily
> > > happen; e.g. mount --bind does follow symlinks) and umount(2) given the
> > > path resolving to the mountpoint of that symlink.
> > 
> > Thinking about this some more, I now realise that my LOOKUP_MOUNTPOINT idea
> > was too simplistic and missed the real point.
> > 
> > The real point is that for unmount, we want to resolve the the path without
> > any reference to any filesystem at all - the lookup should be handled
> > entirely by the dcache.
> > Any mountpoint is pinned in the dcache, and consequently any ancestor of any
> > mount point also is.  So the dcache will lead us to the dentry that we want.
> > 
> > And the dentry is *all* we want.  It doesn't really matter what the inode is
> > like, or whether the filesystem thinks that the inode or name still exist.
> > All we need to do is find a dentry that must be  in the cache, and detach the
> > mount that is there.
> > 
> > Whether that is achieved by a LOOKUP_ flag or a separate lookup function
> > doesn't matter much to me.
> > 
> > I suspect we need to allow for passing a symlink to unmount, and the symlink
> > might not be in cache, so we cannot insist uniformly on only using the dcache.
> > However if a name is in the cache, and the cached data suggests that it is a
> > directory, then we should trust that and avoid referring to the filesystem.
> > 
> > umount is really quite unique in this.  All other times we lookup a path we
> > want to use the thing we found.  With umount, we want to stop using it.
>
> > ???
> 
> Add a umountat() syscall so that you can supply a file descriptor? :-)
> 

If I could get that file descriptor by opening some magic file in /proc which
led immediately to the mount point, then I'd say "yes please!".
Otherwise, I don't think it helps, and so support your ":-)".

NeilBrown
Jeff Layton Feb. 19, 2013, 2:27 p.m. UTC | #7
On Tue, 19 Feb 2013 10:10:31 +1100
NeilBrown <neilb@suse.de> wrote:

> On Mon, 18 Feb 2013 18:46:09 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> > On Mon, Feb 18, 2013 at 01:25:09PM +1100, NeilBrown wrote:
> > 
> > > I would be really nice if sys_unmount used a LOOKUP_MOUNTPOINT flag that
> > > works a bit like LOOKUP_PARENT and LOOKUP_NOFOLLOW in that it skips the very
> > > last step and returns the mounted-on directory, not the mountpoint that is
> > > mounted there - or at least makes sure not revalidate happens on that final
> > > mounted directory.
> > 
> > I don't think LOOKUP_MOUNTPOINT is a good idea.  For one thing, we have
> > fairly few places that might want it, all of them in core VFS.  Might as
> > well provide a separate function for them, a-la path_lookupat() vs.
> > path_openat().
> > 
> > For another, we need to decide what to do with a really nasty corner case:
> > 	a/b is a mountpoint, with c/d bound on it.
> > 	c/d is a symlink to c/e
> > 	c/e is a mountpoint
> > What should umount("a/b", 0) do?  There are two possibilities - removing
> > vfsmount on top of a/b or one on top of c/e...
> > 
> > We have the latter semantics; _that_ is what this GETATTR is about.  It's
> > a fairly obscure corner case - the question is not even whether to follow
> > symlinks, it's whether to follow _mounts_ on the last component.  Hell
> > knows; I'm seriously tempted to change it "don't follow mounts" and see if
> > anyone complains.  The only case when behaviour would change would be
> > a symlink mounted somewhere (note that this is _not_ something that can easily
> > happen; e.g. mount --bind does follow symlinks) and umount(2) given the
> > path resolving to the mountpoint of that symlink.
> 
> Thinking about this some more, I now realise that my LOOKUP_MOUNTPOINT idea
> was too simplistic and missed the real point.
> 
> The real point is that for unmount, we want to resolve the the path without
> any reference to any filesystem at all - the lookup should be handled
> entirely by the dcache.
> Any mountpoint is pinned in the dcache, and consequently any ancestor of any
> mount point also is.  So the dcache will lead us to the dentry that we want.
> 
> And the dentry is *all* we want.  It doesn't really matter what the inode is
> like, or whether the filesystem thinks that the inode or name still exist.
> All we need to do is find a dentry that must be  in the cache, and detach the
> mount that is there.
> 
> Whether that is achieved by a LOOKUP_ flag or a separate lookup function
> doesn't matter much to me.
> 
> I suspect we need to allow for passing a symlink to unmount, and the symlink
> might not be in cache, so we cannot insist uniformly on only using the dcache.
> However if a name is in the cache, and the cached data suggests that it is a
> directory, then we should trust that and avoid referring to the filesystem.
> 
> umount is really quite unique in this.  All other times we lookup a path we
> want to use the thing we found.  With umount, we want to stop using it.
> 

From an IRC conversation with Al yesterday, which may point you in the
right direction:

12:49 < viro> jlayton: umount() simply shouldn't do full lookup for path
12:50 < viro> it should get the parent
12:50 < viro> _then_ do pure dcache lookup for the last step

...of course that's still tricky if the last component is a symlink
since you'd need to chase it by hand, but that seems like a reasonable
way to start.
diff mbox

Patch

--- linux-3.0-SLE11-SP2.orig/fs/namei.c
+++ linux-3.0-SLE11-SP2/fs/namei.c
@@ -1460,6 +1460,7 @@  static int link_path_walk(const char *na
                        }
                }
 
+               nd->last_type = type;
                /* remove trailing slashes? */
                if (!c)
                        goto last_component;
@@ -1486,7 +1487,6 @@  last_component:
                /* Clear LOOKUP_CONTINUE iff it was previously unset */
                nd->flags &= lookup_flags | ~LOOKUP_CONTINUE;
                nd->last = this;
-               nd->last_type = type;
                return 0;
        }
        terminate_walk(nd);
--- linux-3.0-SLE11-SP2.orig/fs/nfs/dir.c
+++ linux-3.0-SLE11-SP2/fs/nfs/dir.c
@@ -1138,6 +1138,15 @@  static int nfs_lookup_revalidate(struct
        if (NFS_STALE(inode))
                goto out_bad;
 
+       if (nd->last_type != LAST_NORM) {
+               /* name not relevant, just inode */
+               error = nfs_revalidate_inode(NFS_SERVER(inode), inode);
+               if (error)
+                       goto out_bad;
+               else
+                       goto out_valid;
+       }
+
        error = -ENOMEM;
        fhandle = nfs_alloc_fhandle();
        fattr = nfs_alloc_fattr();