Message ID: 20130605130541.5968d5c2@notabene.brown (mailing list archive)
State: New, archived
On Wed, Jun 05, 2013 at 01:05:41PM +1000, NeilBrown wrote:
> Hi Bruce,
>  this is a little issue that seems to keep coming up so I thought it
>  might be time to fix it.
>
> As you know, a filesystem that is exported cannot be unmounted as the
> export cache holds a reference to it.  Though if it hasn't been
> accessed for a while then it can.
>
> As I hadn't realised before, sometimes *non*-exported filesystems can
> be pinned too.  A negative entry in the cache can pin a filesystem
> just as easily as a positive entry.
> An amusing, if somewhat contrived, example is that if you export '/'
> with crossmnt and:
>
>    mount localhost:/ /mnt
>    ls -l /
>    umount /mnt
>
> the umount might fail.  This is because the "ls -l" tried to export
> every filesystem found mounted in '/'.  The export of "/mnt" failed
> of course, because you cannot re-export an NFS filesystem.  But it is
> still in the cache.
> An 'exportfs -f' fixes this, but shouldn't be necessary.
>
> So this RFC patch makes it possible to register a notifier which gets
> called on unmount, and links the export table in to the notifier
> chain.
>
> The "atomic" flavour is used so that notifiers can be registered
> under a spin_lock.  This is needed for "expkey_update" as ->update is
> called under a lock.
>
> As notifier callees cannot unregister themselves, the unregister
> needs to happen in a workqueue item, and the unmount will wait for
> that.
>
> It seems to work for me (once I figured out all the locking issues
> and found a way to make it work without deadlocking).
>
> If you are OK with it in general I'll make it into a proper patch
> series and include Al Viro for the VFS bits.
>
> @@ -1201,6 +1234,11 @@ static int do_umount(struct mount *mnt, int flags)
>  		sb->s_op->umount_begin(sb);
>  	}
>
> +	/* Some in-kernel users (nfsd) might need to be asked to release
> +	 * the filesystem
> +	 */
> +	umount_notifier_call(mnt);

NAK.  I'm sorry, but it's a fundamentally wrong approach - there are
_tons_ of places where a vfsmount could get evicted (consider
shared-subtree umount propagation, for starters), not to mention that
notifiers tend to be the wrong answer to just about any question.

I'd suggest looking at what kernel/acct.c is doing; I'm absolutely
serious about notifiers being unacceptable BS.  If you want something
generic, consider turning ->mnt_pinned into a list of callbacks, with
mntput_no_expire() calling them one by one; calling
acct_auto_close_mnt() would be replaced with callbacks, each doing a
single acct_file_reopen(acct, NULL, NULL).

I'm about to fall asleep right now, so any further details will have
to wait until tomorrow; sorry...
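A rough sketch of the callback-list idea Al describes might look like the
following.  All names here (mnt_pin_entry, mnt_add_pin, mnt_release_pins)
are illustrative rather than from any actual patch, and the locking is
simplified to a single dedicated spinlock:

	#include <linux/list.h>
	#include <linux/spinlock.h>

	/* Sketch only: replace the ->mnt_pinned count with a list of
	 * callbacks that the final mntput walks, much the way it calls
	 * acct_auto_close_mnt() today. */
	struct mnt_pin_entry {
		struct list_head node;
		/* called once, when the last reference to the mount is dropped */
		void (*release)(struct mnt_pin_entry *pin);
	};

	static DEFINE_SPINLOCK(mnt_pin_lock);

	static void mnt_add_pin(struct list_head *pins, struct mnt_pin_entry *pin)
	{
		spin_lock(&mnt_pin_lock);
		list_add(&pin->node, pins);
		spin_unlock(&mnt_pin_lock);
	}

	/* mntput_no_expire() would call this before tearing the mount down */
	static void mnt_release_pins(struct list_head *pins)
	{
		struct mnt_pin_entry *pin;

		spin_lock(&mnt_pin_lock);
		while (!list_empty(pins)) {
			pin = list_first_entry(pins, struct mnt_pin_entry, node);
			list_del_init(&pin->node);
			spin_unlock(&mnt_pin_lock);
			/* e.g. a callback doing acct_file_reopen(acct, NULL, NULL) */
			pin->release(pin);
			spin_lock(&mnt_pin_lock);
		}
		spin_unlock(&mnt_pin_lock);
	}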
On Wed, 5 Jun 2013 04:41:15 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Wed, Jun 05, 2013 at 01:05:41PM +1000, NeilBrown wrote:
> > [original RFC description and patch snipped]
>
> NAK.  I'm sorry, but it's a fundamentally wrong approach - there are
> _tons_ of places where a vfsmount could get evicted (consider
> shared-subtree umount propagation, for starters), not to mention that
> notifiers tend to be the wrong answer to just about any question.
>
> I'd suggest looking at what kernel/acct.c is doing; I'm absolutely
> serious about notifiers being unacceptable BS.  If you want something
> generic, consider turning ->mnt_pinned into a list of callbacks, with
> mntput_no_expire() calling them one by one; calling
> acct_auto_close_mnt() would be replaced with callbacks, each doing a
> single acct_file_reopen(acct, NULL, NULL).
>
> I'm about to fall asleep right now, so any further details will have
> to wait until tomorrow; sorry...

When tomorrow does come:

 - Can you say why you don't like notifiers?

 - mnt_pinned handling happens *after* the unmount has happened.  The
   unmount is effectively always 'lazy' with respect to pinning users.
   I don't think I want that.
   If an NFS request is in progress when the unmount is requested, I
   think it should fail, and with my code it will - the notifier handler
   will expire the cache entries but they will continue to exist until
   the last user goes away.
   For most requests it probably wouldn't matter if they continued on a
   lazy-unmounted filesystem, but the NFSv4 LOOKUPP (lookup parent)
   request might get confused.

 - Your point about shared-subtree umount certainly deserves
   consideration.  It is probably time I wrapped my mind around what
   that really means.  Putting the umount_notifier_call() call in
   do_refcount_check() might almost be the right place, except that it
   would introduce even more locking problems (unless it is OK to call
   flush_scheduled_work() under vfsmount_lock).

Thanks for the feedback.  I'm very open to other ideas.

NeilBrown
On Wed, Jun 05, 2013 at 04:19:34PM +1000, NeilBrown wrote:
> On Wed, 5 Jun 2013 04:41:15 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > On Wed, Jun 05, 2013 at 01:05:41PM +1000, NeilBrown wrote:
> > > As I hadn't realised before, sometimes *non*-exported filesystems
> > > can be pinned too.  A negative entry in the cache can pin a
> > > filesystem just as easily as a positive entry.
> > > An amusing, if somewhat contrived, example is that if you export
> > > '/' with crossmnt and:
> > >
> > >    mount localhost:/ /mnt
> > >    ls -l /
> > >    umount /mnt
> > >
> > > the umount might fail.  [...]
> > > An 'exportfs -f' fixes this, but shouldn't be necessary.

Yeah, ugh.  As a less contrived example, can the default v4 root export
lead to arbitrary filesystems being pinned just because a client tried
to mount the wrong path?

Could the export cache be indexed on a path string instead of a struct
path?

The problem of negative entries aside, anyone counting on the ability
to unexport mounted filesystems on a running nfs server is setting
themselves up for trouble: suppose we fix this problem, what if an rpc
is in progress, or a lock or v4 open or delegation is held?

--b.
On Wed, 5 Jun 2013 09:36:58 -0400 "J. Bruce Fields" <bfields@fieldses.org> wrote:

> Yeah, ugh.  As a less contrived example, can the default v4 root
> export lead to arbitrary filesystems being pinned just because a
> client tried to mount the wrong path?

I think it can only pin filesystems that are exported, either explicitly
or via a parent being exported with 'crossmnt'.

> Could the export cache be indexed on a path string instead of a struct
> path?

Yes.  It would mean lots of extra pathname lookups and possibly lots of
"d_path()" calls.

> The problem of negative entries aside, anyone counting on the ability
> to unexport mounted filesystems on a running nfs server is setting

                     ^unmount exported^?

> themselves up for trouble: suppose we fix this problem, what if an rpc
> is in progress, or a lock or v4 open or delegation is held?

All of these should prevent the unmount - and I think people would
expect that - you cannot unmount a filesystem that is still being used,
and all of these are examples of current use.  At the very least, the
filesystem must be mounted on some client.

In the (few) times this has been raised over the years, the position has
been "I'm not using it any more, why can't I unmount it?  fuser/lsof
doesn't show any users!".

An alternate approach to the problem might be to get rpc.mountd to hold
an open file descriptor for every exported filesystem.  That would guide
people (via lsof/fuser) to stopping nfsd, which would make the
filesystem unmountable.  However I don't find that approach particularly
elegant...

NeilBrown
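A minimal userspace sketch of that alternate approach (pin_export() and
unpin_export() are made-up names; rpc.mountd has no such code today):

	#include <fcntl.h>
	#include <unistd.h>

	/* Sketch only: hold an fd open on each export root so that
	 * fuser/lsof point at the pinning process. */
	static int pin_export(const char *path)
	{
		/* O_DIRECTORY: only ever pin via the directory itself */
		return open(path, O_RDONLY | O_DIRECTORY);
	}

	static void unpin_export(int fd)
	{
		if (fd >= 0)
			close(fd);	/* lets a pending umount succeed */
	}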
On Thu, Jun 06, 2013 at 10:05:12AM +1000, NeilBrown wrote:
> On Wed, 5 Jun 2013 09:36:58 -0400 "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > Yeah, ugh.  As a less contrived example, can the default v4 root
> > export lead to arbitrary filesystems being pinned just because a
> > client tried to mount the wrong path?
>
> I think it can only pin filesystems that are exported, either
> explicitly or via a parent being exported with 'crossmnt'.

But see utils/mountd/v4root.c, and:

	[root@server ~]# exportfs -v
	/export		<world>(rw,...)
	[root@server ~]# mount /mnt

	[root@pip4 ~]# mount pip4:/ /mnt/
	[root@pip4 ~]# ls -l /mnt/
	total 4
	drwxrwxrwt 3 root root 4096 Jun  7 10:34 export
	[root@pip4 ~]#

	[root@server ~]# umount /mnt/
	umount: /mnt: target is busy.
	...
	[root@server ~]# grep /mnt /proc/net/rpc/nfsd.export/content
	# /mnt *()

> > Could the export cache be indexed on a path string instead of a
> > struct path?
>
> Yes.  It would mean lots of extra pathname lookups and possibly lots
> of "d_path()" calls.

Right.  Ugh.  Still, struct path seems wrong as it's not something gssd
knows about, and there's not really a 1-1 mapping between the two (see
e.g. the recent complaint about the case where the struct path
represents a lazy-unmounted export:
http://mid.gmane.org/<20130625191008.GA20277@us.ibm.com> ).

--b.
On Mon, 1 Jul 2013 15:12:38 -0400 "J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Thu, Jun 06, 2013 at 10:05:12AM +1000, NeilBrown wrote:
> > I think it can only pin filesystems that are exported, either
> > explicitly or via a parent being exported with 'crossmnt'.
>
> But see utils/mountd/v4root.c, and:
>
> 	[root@server ~]# exportfs -v
> 	/export		<world>(rw,...)
> 	[root@server ~]# mount /mnt
>
> 	[root@pip4 ~]# mount pip4:/ /mnt/
> 	[...]
> 	[root@server ~]# umount /mnt/
> 	umount: /mnt: target is busy.
> 	...
> 	[root@server ~]# grep /mnt /proc/net/rpc/nfsd.export/content
> 	# /mnt *()

You make a good point.  I didn't think that would happen, and I think I
could argue that it is a bug.  I have no idea how easy it would be to
"fix" without poring over the code for a while though.  Or whether it is
worth "fixing".

> Right.  Ugh.  Still, struct path seems wrong as it's not something
> gssd knows about, and there's not really a 1-1 mapping between the two
> (see e.g. the recent complaint about the case where the struct path
> represents a lazy-unmounted export:
> http://mid.gmane.org/<20130625191008.GA20277@us.ibm.com> ).

I noticed that but haven't really followed it (though I just checked and
there isn't much to follow...)

What do we *want* to happen in this case?  I would argue that when the
filesystem is detached the export should immediately become invalid.

We could possibly put a check in exp_find_key() to see if ek->ek_path is
still attached (not sure how to do that) and if it isn't, invalidate the
ek before calling cache_check().  If the "is path detached" test is
cheap, we should probably do this; a sketch follows below.

Or - we could flush relevant entries from the cache whenever there is a
change in the mount table.  That is certainly the preferred option if
the "is path detached" test is at all expensive.  But it seems it isn't
completely clear how that flushing should be triggered...

NeilBrown
On Tue, Jul 02, 2013 at 08:24:13AM +1000, NeilBrown wrote:
> You make a good point.  I didn't think that would happen, and I think
> I could argue that it is a bug.

Definitely looks like a bug to me.

> I have no idea how easy it would be to "fix" without poring over the
> code for a while though.  Or whether it is worth "fixing".

As long as clients are mostly just LOOKUPing down to the export they
care about people may not hit this too much, but no doubt somebody will
hit this eventually.

> What do we *want* to happen in this case?  I would argue that when the
> filesystem is detached the export should immediately become invalid.
>
> We could possibly put a check in exp_find_key() to see if ek->ek_path
> is still attached (not sure how to do that) and if it isn't,
> invalidate the ek before calling cache_check().  If the "is path
> detached" test is cheap, we should probably do this.
>
> Or - we could flush relevant entries from the cache whenever there is
> a change in the mount table.  That is certainly the preferred option
> if the "is path detached" test is at all expensive.  But it seems it
> isn't completely clear how that flushing should be triggered...

There are probably other ways to get this kind of hang too: mounting
over an export?  mount --move?

--b.
On Tue, 2 Jul 2013 11:50:59 -0400 "J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Tue, Jul 02, 2013 at 08:24:13AM +1000, NeilBrown wrote:
> > We could possibly put a check in exp_find_key() to see if
> > ek->ek_path is still attached (not sure how to do that) and if it
> > isn't, invalidate the ek before calling cache_check().  [...]
> >
> > Or - we could flush relevant entries from the cache whenever there
> > is a change in the mount table.  [...]
>
> There are probably other ways to get this kind of hang too: mounting
> over an export?  mount --move?

I could argue that "mount --move" should trigger a flush just like
umount does.  Mounting over an export (or mounting over a parent of an
export) is not so easy.

The core problem which causes the hang is that the path requested by
svc_export_request cannot be used to refer to the exp->ex_path.
This is something that svc_export_request could check.
e.g. after "d_path", it could "kern_path(pth, 0, &old_path);" and if
that fails, or old_path doesn't match ex_path, then somehow invalidate
the cache entry.... though that is a bit awkward.  We really want to
catch the problem at exp_find_key time.  Doing it there would be more of
a performance burden.

Can d_path tell us that the path isn't reachable?
I think __d_path can, but that isn't exported (yet).  Using that could
avoid the "kern_path()" call.

So I'm thinking (see the sketch after this list):

 - in svc_export_request() use __d_path (which needs to be exported)
 - if that fails, set a new CACHE_STALE flag on the cache_head
 - cache_read notices CACHE_STALE and calls cache_fresh_unlocked() (I
   think) to remove it from the queue
 - cache_is_valid treats CACHE_STALE like CACHE_NEGATIVE, only -ESTALE
   is returned
 - if exp_find() gets -ESTALE from exp_get_by_name, it sets CACHE_STALE
   on 'ek'.

It gets a bit complicated, doesn't it?

An alternative is for rpc.mountd to simply flush everything whenever it
gets a request for a path that doesn't exist, or whenever a request is
still in the channel after it has responded.

This is a bit heavy-handed, but it is a rare occurrence so maybe that
doesn't matter.

It doesn't solve my desire for umount of exported filesystems to "just
work" though.

NeilBrown
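A rough sketch of the __d_path() step, under the assumption that
__d_path gets exported: it returns NULL when the path is unreachable
from the supplied root, which is exactly the signal wanted here.
CACHE_STALE is the proposed flag from the list above, not an existing
one:

	#include <linux/dcache.h>
	#include <linux/fs_struct.h>

	/* Sketch only: detect a detached export at upcall time. */
	static bool export_path_stale(struct svc_export *exp, char *buf, int buflen)
	{
		struct path root;
		char *p;

		get_fs_root(current->fs, &root);
		/* __d_path() returns NULL if ex_path is no longer reachable
		 * from root (e.g. after a lazy umount) */
		p = __d_path(&exp->ex_path, &root, buf, buflen);
		path_put(&root);

		if (!p) {
			set_bit(CACHE_STALE, &exp->h.flags);	/* proposed flag */
			return true;
		}
		return false;
	}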
On Mon, Jul 08, 2013 at 05:30:03PM +1000, NeilBrown wrote:
> So I'm thinking:
>
>  - in svc_export_request() use __d_path (which needs to be exported)
>  - if that fails, set a new CACHE_STALE flag on the cache_head
>  - cache_read notices CACHE_STALE and calls cache_fresh_unlocked() (I
>    think) to remove it from the queue

There's still a race if the problem is detected after mountd reads the
upcall but before it replies, right?

>  - cache_is_valid treats CACHE_STALE like CACHE_NEGATIVE, only -ESTALE
>    is returned
>  - if exp_find() gets -ESTALE from exp_get_by_name, it sets
>    CACHE_STALE on 'ek'.
>
> It gets a bit complicated, doesn't it?
>
> An alternative is for rpc.mountd to simply flush everything whenever
> it gets a request for a path that doesn't exist, or whenever a request
> is still in the channel after it has responded.
>
> This is a bit heavy-handed, but it is a rare occurrence so maybe that
> doesn't matter.

That sounds to me like possibly a good enough solution.

--b.

> It doesn't solve my desire for umount of exported filesystems to
> "just work" though.
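The heavy-handed mountd-side flush already has a kernel interface:
writing a time to each cache's "flush" file, which is what "exportfs -f"
does.  A sketch of what that might look like inside rpc.mountd (the
cache names follow what nfs-utils flushes; flush_all_caches() itself is
made up):

	#include <stdio.h>
	#include <time.h>

	static void flush_all_caches(void)
	{
		static const char *flushfiles[] = {
			"/proc/net/rpc/auth.unix.ip/flush",
			"/proc/net/rpc/auth.unix.gid/flush",
			"/proc/net/rpc/nfsd.fh/flush",
			"/proc/net/rpc/nfsd.export/flush",
		};
		char stamp[32];
		unsigned int i;

		/* entries older than "now" are discarded by the kernel */
		snprintf(stamp, sizeof(stamp), "%ld\n", (long)time(NULL));
		for (i = 0; i < sizeof(flushfiles) / sizeof(flushfiles[0]); i++) {
			FILE *f = fopen(flushfiles[i], "w");

			if (!f)
				continue;
			fputs(stamp, f);
			fclose(f);
		}
	}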
diff --git a/fs/mount.h b/fs/mount.h
index cd50079..544ea17 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -44,6 +44,9 @@ struct mount {
 	struct hlist_head mnt_fsnotify_marks;
 	__u32 mnt_fsnotify_mask;
 #endif
+	/* Notifier chain to call when trying to unmount */
+	struct atomic_notifier_head mnt_holders;
+
 	int mnt_id;			/* mount identifier */
 	int mnt_group_id;		/* peer group identifier */
 	int mnt_expiry_mark;		/* true if marked for expiry */
diff --git a/fs/namespace.c b/fs/namespace.c
index 341d3f5..123fcba 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -160,6 +160,37 @@ unsigned int mnt_get_count(struct mount *mnt)
 #endif
 }
 
+/* Each mount has a notifier call chain which is called on unmount
+ * so that in-kernel users can let go.  This is particularly used
+ * by nfsd.
+ * As a notify callee cannot unregister the notify block directly
+ * due to recursive locking, and as it must be unregistered before the
+ * unmount can be allowed to complete (as unregistering afterwards is
+ * impossible), notify callees should arrange for the
+ * umount_notifier_unregister() to happen via a scheduled worker.
+ * umount_notifier_call will wait for scheduled workers to finish.
+ * All callees should return NOTIFY_OK so that umount_notifier_call
+ * knows that at least one was called, and so to run
+ * flush_scheduled_work().
+ */
+static void umount_notifier_call(struct mount *mnt)
+{
+	if (atomic_notifier_call_chain(&mnt->mnt_holders, 0, NULL))
+		flush_scheduled_work();
+}
+
+int umount_notifier_register(struct vfsmount *v, struct notifier_block *nb)
+{
+	struct mount *mnt = real_mount(v);
+	return atomic_notifier_chain_register(&mnt->mnt_holders, nb);
+}
+EXPORT_SYMBOL_GPL(umount_notifier_register);
+
+int umount_notifier_unregister(struct vfsmount *v, struct notifier_block *nb)
+{
+	struct mount *mnt = real_mount(v);
+	return atomic_notifier_chain_unregister(&mnt->mnt_holders, nb);
+}
+EXPORT_SYMBOL_GPL(umount_notifier_unregister);
+
 static struct mount *alloc_vfsmnt(const char *name)
 {
 	struct mount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -198,6 +229,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 #ifdef CONFIG_FSNOTIFY
 		INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
 #endif
+		ATOMIC_INIT_NOTIFIER_HEAD(&mnt->mnt_holders);
+
 	}
 	return mnt;
@@ -1201,6 +1234,11 @@ static int do_umount(struct mount *mnt, int flags)
 		sb->s_op->umount_begin(sb);
 	}
 
+	/* Some in-kernel users (nfsd) might need to be asked to release
+	 * the filesystem
+	 */
+	umount_notifier_call(mnt);
+
 	/*
 	 * No sense to grab the lock for this test, but test itself looks
 	 * somewhat bogus. Suggestions for better replacement?
diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index 5f38ea3..e4dbd5b 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -46,8 +46,11 @@ static void expkey_put(struct kref *ref)
 	struct svc_expkey *key = container_of(ref, struct svc_expkey, h.ref);
 
 	if (test_bit(CACHE_VALID, &key->h.flags) &&
-	    !test_bit(CACHE_NEGATIVE, &key->h.flags))
+	    !test_bit(CACHE_NEGATIVE, &key->h.flags)) {
+		umount_notifier_unregister(key->ek_path.mnt,
+					   &key->ek_umount);
 		path_put(&key->ek_path);
+	}
 	auth_domain_put(key->ek_client);
 	kfree(key);
 }
@@ -71,6 +74,16 @@ static struct svc_expkey *svc_expkey_update(struct cache_detail *cd, struct svc
 					    struct svc_expkey *old);
 static struct svc_expkey *svc_expkey_lookup(struct cache_detail *cd, struct svc_expkey *);
 
+static int purge_expkey(struct notifier_block *nb,
+			unsigned long mode, void *unused)
+{
+	struct svc_expkey *ek = container_of(nb, struct svc_expkey, ek_umount);
+	ek->h.expiry_time = 1;
+	ek->cd->nextcheck = 1;
+	queue_sunrpc_cache_flush();
+	return NOTIFY_OK;
+}
+
 static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen)
 {
 	/* client fsidtype fsid [path] */
@@ -123,8 +136,9 @@ static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen)
 	if (key.h.expiry_time == 0)
 		goto out;
 
-	key.ek_client = dom;	
+	key.ek_client = dom;
 	key.ek_fsidtype = fsidtype;
+	key.cd = cd;
 	memcpy(key.ek_fsid, buf, len);
 
 	ek = svc_expkey_lookup(cd, &key);
@@ -212,6 +226,7 @@ static inline void expkey_init(struct cache_head *cnew,
 	kref_get(&item->ek_client->ref);
 	new->ek_client = item->ek_client;
 	new->ek_fsidtype = item->ek_fsidtype;
+	new->cd = item->cd;
 
 	memcpy(new->ek_fsid, item->ek_fsid, sizeof(new->ek_fsid));
 }
@@ -223,7 +238,10 @@ static inline void expkey_update(struct cache_head *cnew,
 	struct svc_expkey *item = container_of(citem, struct svc_expkey, h);
 
 	new->ek_path = item->ek_path;
-	path_get(&item->ek_path);
+	path_get(&new->ek_path);
+	new->ek_umount.notifier_call = purge_expkey;
+	umount_notifier_register(new->ek_path.mnt,
+				 &new->ek_umount);
 }
 
 static struct cache_head *expkey_alloc(void)
@@ -307,6 +325,7 @@ static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc)
 static void svc_export_put(struct kref *ref)
 {
 	struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
+	umount_notifier_unregister(exp->ex_path.mnt, &exp->ex_umount);
 	path_put(&exp->ex_path);
 	auth_domain_put(exp->ex_client);
 	nfsd4_fslocs_free(&exp->ex_fslocs);
@@ -653,6 +672,16 @@ static int svc_export_match(struct cache_head *a, struct cache_head *b)
 		orig->ex_path.mnt == new->ex_path.mnt;
 }
 
+static int purge_export(struct notifier_block *nb,
+			unsigned long mode, void *unused)
+{
+	struct svc_export *exp = container_of(nb, struct svc_export, ex_umount);
+	exp->h.expiry_time = 1;
+	exp->cd->nextcheck = 1;
+	queue_sunrpc_cache_flush();
+	return NOTIFY_OK;
+}
+
 static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
 {
 	struct svc_export *new = container_of(cnew, struct svc_export, h);
@@ -662,6 +691,8 @@ static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
 	new->ex_client = item->ex_client;
 	new->ex_path.dentry = dget(item->ex_path.dentry);
 	new->ex_path.mnt = mntget(item->ex_path.mnt);
+	new->ex_umount.notifier_call = purge_export;
+	umount_notifier_register(new->ex_path.mnt, &new->ex_umount);
 	new->ex_fslocs.locations = NULL;
 	new->ex_fslocs.locations_count = 0;
 	new->ex_fslocs.migrated = 0;
@@ -766,6 +797,7 @@ exp_find_key(struct cache_detail *cd, svc_client *clp, int fsid_type,
 
 	key.ek_client = clp;
 	key.ek_fsidtype = fsid_type;
+	key.cd = cd;
 	memcpy(key.ek_fsid, fsidv, key_len(fsid_type));
 
 	ek = svc_expkey_lookup(cd, &key);
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 73005f9..1e18926 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -76,6 +76,10 @@ extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
 extern void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list);
 extern void mark_mounts_for_expiry(struct list_head *mounts);
 
+struct notifier_block;
+extern int umount_notifier_register(struct vfsmount *v, struct notifier_block *nb);
+extern int umount_notifier_unregister(struct vfsmount *v, struct notifier_block *nb);
+
 extern dev_t name_to_dev_t(char *name);
 
 #endif /* _LINUX_MOUNT_H */
diff --git a/include/linux/nfsd/export.h b/include/linux/nfsd/export.h
index 7898c99..696cf62 100644
--- a/include/linux/nfsd/export.h
+++ b/include/linux/nfsd/export.h
@@ -49,6 +49,7 @@ struct svc_export {
 	struct auth_domain *	ex_client;
 	int			ex_flags;
 	struct path		ex_path;
+	struct notifier_block	ex_umount;
 	kuid_t			ex_anon_uid;
 	kgid_t			ex_anon_gid;
 	int			ex_fsid;
@@ -71,6 +72,8 @@ struct svc_expkey {
 	u32			ek_fsid[6];
 
 	struct path		ek_path;
+	struct notifier_block	ek_umount;
+	struct cache_detail	*cd;
 };
 
 #define EX_ISSYNC(exp) (!((exp)->ex_flags & NFSEXP_ASYNC))
diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
index 303399b..ed23c31 100644
--- a/include/linux/sunrpc/cache.h
+++ b/include/linux/sunrpc/cache.h
@@ -195,6 +195,7 @@ static inline int cache_valid(struct cache_head *h)
 extern int cache_check(struct cache_detail *detail,
 		       struct cache_head *h, struct cache_req *rqstp);
 extern void cache_flush(void);
+extern void queue_sunrpc_cache_flush(void);
 extern void cache_purge(struct cache_detail *detail);
 #define NEVER (0x7FFFFFFF)
 extern void __init cache_initialize(void);
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 25d58e76..bf7d351 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -471,10 +471,15 @@ static int cache_clean(void)
 /*
  * We want to regularly clean the cache, so we need to schedule some work ...
  */
+static int cache_flush_required = 0;
 static void do_cache_clean(struct work_struct *work)
 {
 	int delay = 5;
-	if (cache_clean() == -1)
+
+	if (cache_flush_required) {
+		cache_flush_required = 0;
+		cache_flush();
+	} else if (cache_clean() == -1)
 		delay = round_jiffies_relative(30*HZ);
 
 	if (list_empty(&cache_list))
@@ -508,6 +513,13 @@ void cache_purge(struct cache_detail *detail)
 }
 EXPORT_SYMBOL_GPL(cache_purge);
 
+void queue_sunrpc_cache_flush(void)
+{
+	cache_flush_required = 1;
+	cancel_delayed_work(&cache_cleaner);
+	schedule_delayed_work(&cache_cleaner, 0);
+}
+EXPORT_SYMBOL_GPL(queue_sunrpc_cache_flush);
+
 /*
  * Deferral and Revisiting of Requests.