Message ID | 5561E9FA.4050808@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sun, May 24, 2015 at 11:10:50PM +0800, Kinglong Mee wrote: > --- a/fs/nfsd/export.c > +++ b/fs/nfsd/export.c > @@ -43,9 +43,9 @@ static void expkey_put(struct kref *ref) > > if (test_bit(CACHE_VALID, &key->h.flags) && > !test_bit(CACHE_NEGATIVE, &key->h.flags)) > - path_put(&key->ek_path); > + path_put_unpin(&key->ek_path, &key->ek_pin); > auth_domain_put(key->ek_client); > - kfree(key); > + kfree_rcu(key, rcu_head); > } That looks wrong. OK, so you want umount() to proceed; fine, no problem with that. However, what happens if the final mntput() hits while you are just approaching that path_put_unpin()? ->kill() will be triggered, and it would bloody better a) make sure that expkey_put() is called for that key if it hadn't already been done and b) do not return until such expkey_put() completes. Including the ones that might have been already entered by the time we'd got to ->kill(). Am I missing something subtle here? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jun 05, 2015 at 04:02:13PM +0100, Al Viro wrote: > On Sun, May 24, 2015 at 11:10:50PM +0800, Kinglong Mee wrote: > > --- a/fs/nfsd/export.c > > +++ b/fs/nfsd/export.c > > @@ -43,9 +43,9 @@ static void expkey_put(struct kref *ref) > > > > if (test_bit(CACHE_VALID, &key->h.flags) && > > !test_bit(CACHE_NEGATIVE, &key->h.flags)) > > - path_put(&key->ek_path); > > + path_put_unpin(&key->ek_path, &key->ek_pin); > > auth_domain_put(key->ek_client); > > - kfree(key); > > + kfree_rcu(key, rcu_head); > > } > > That looks wrong. OK, so you want umount() to proceed; fine, no problem > with that. However, what happens if the final mntput() hits while you > are just approaching that path_put_unpin()? ->kill() will be triggered, > and it would bloody better > a) make sure that expkey_put() is called for that key if it hadn't > already been done and > b) do not return until such expkey_put() completes. Including the > ones that might have been already entered by the time we'd got to ->kill(). > > Am I missing something subtle here? Having looked through that code... It *is* wrong. Note that the normal approach is to have pin_remove() called via pin_kill(), directly or triggered from group_pin_kill() and/or cleanup_mnt() on the mount it's attached to. pin_remove() should never be called outside of ->kill() callbacks. It should be called at the point where you are OK with fs being shut down. The fundamental reason why it's broken is different, though - you *can't* grab a reference if all you've got is a pin. By the time the callback is called, the mount in question is already irretrievably committed to being killed. There's one hell of a wide window between the point of no return and the point where you are notified of anything, and that's by design - you might very well have had several mounts doomed by a syscall and they all get through cleanup_mnt() just before return to userland. One by one. So between the point where this puppy is doomed and the call of your callback there might have been several filesystems going through shutdown, with tons of IO, waiting for remote servers, etc. We could add a primitive that would _try_ to grab a reference - that can be done (lock_mount_hash(), check if it has MNT_DOOMED or MNT_SYNC_UMOUNT, fail if it does, otherwise mnt_add_count(mnt, 1) and succeed, doing unlock_mount_hash() on both exit paths). HOWEVER, you'll need to think very carefully where to use that primitive - unlike mntget() it _can_ fail and lock_mount_hash() can inflict quite a bit of cacheline pingpong if used heavily. Could you give details on lifecycle of those objects, including the stages at which we might try to grab references? Combination of such primitive with a pin (doing just "NULL the references to vfsmount/dentry, do dput() on what that dentry used to be and call pin_remove()") might work, if the lifecycle is good enough. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 6/6/2015 10:21 AM, Al Viro wrote: > On Fri, Jun 05, 2015 at 04:02:13PM +0100, Al Viro wrote: >> On Sun, May 24, 2015 at 11:10:50PM +0800, Kinglong Mee wrote: >>> --- a/fs/nfsd/export.c >>> +++ b/fs/nfsd/export.c >>> @@ -43,9 +43,9 @@ static void expkey_put(struct kref *ref) >>> >>> if (test_bit(CACHE_VALID, &key->h.flags) && >>> !test_bit(CACHE_NEGATIVE, &key->h.flags)) >>> - path_put(&key->ek_path); >>> + path_put_unpin(&key->ek_path, &key->ek_pin); >>> auth_domain_put(key->ek_client); >>> - kfree(key); >>> + kfree_rcu(key, rcu_head); >>> } >> >> That looks wrong. OK, so you want umount() to proceed; fine, no problem >> with that. However, what happens if the final mntput() hits while you >> are just approaching that path_put_unpin()? ->kill() will be triggered, >> and it would bloody better >> a) make sure that expkey_put() is called for that key if it hadn't >> already been done and >> b) do not return until such expkey_put() completes. Including the >> ones that might have been already entered by the time we'd got to ->kill(). You are right. Sorry for my fault, the above patch misses caring the race. >> >> Am I missing something subtle here? > > Having looked through that code... It *is* wrong. Note that the normal > approach is to have pin_remove() called via pin_kill(), directly or triggered > from group_pin_kill() and/or cleanup_mnt() on the mount it's attached to. > pin_remove() should never be called outside of ->kill() callbacks. It should > be called at the point where you are OK with fs being shut down. Thank you very much for your comments. I will try to using fs_pin as the restrict. > > The fundamental reason why it's broken is different, though - you *can't* > grab a reference if all you've got is a pin. By the time the callback is > called, the mount in question is already irretrievably committed to being > killed. There's one hell of a wide window between the point of no return > and the point where you are notified of anything, and that's by design - > you might very well have had several mounts doomed by a syscall and they > all get through cleanup_mnt() just before return to userland. One by one. > So between the point where this puppy is doomed and the call of your callback > there might have been several filesystems going through shutdown, with tons > of IO, waiting for remote servers, etc. > > We could add a primitive that would _try_ to grab a reference - that can > be done (lock_mount_hash(), check if it has MNT_DOOMED or MNT_SYNC_UMOUNT, > fail if it does, otherwise mnt_add_count(mnt, 1) and succeed, doing > unlock_mount_hash() on both exit paths). HOWEVER, you'll need to think > very carefully where to use that primitive - unlike mntget() it _can_ > fail and lock_mount_hash() can inflict quite a bit of cacheline pingpong > if used heavily. Do you mean adding a new feature? > > Could you give details on lifecycle of those objects, including the stages > at which we might try to grab references? Combination of such primitive with > a pin (doing just "NULL the references to vfsmount/dentry, do dput() on > what that dentry used to be and call pin_remove()") might work, if the > lifecycle is good enough. NFSD has two caches named expkey and export which are managed by sunrpc cache fundamental. I will only explain export following for expkey is similar as export. struct cache_head { struct kref ref; ... ... }; struct svc_export { struct cache_head h; struct path ex_path; ... ... }; 1. svc_export has a reference, will be freed when the reference is decreased to zero. 2. ex_path must be put when freed (Want change mntget to fs_pin for ex_path's vfsmnt). 3. With fs_pin, there are two logic (one is the normal logic, the other is pin_kill) which can cause free svc_export. 4. The reference of the normal logic is zero, but the pin_kill logic is not zero. the second logic will decrease the reference indirectly, if decrease to zero, umount will go though the normal logic's code, at last frees the svc_export; if not zero, umount must don't free the svc_export. I try to solve the window as, struct svc_export { struct cache_head h; struct path ex_path; ... ... struct fs_pin ex_pin; struct rcu_head rcu_head; /* For cache_put and fs umounting window */ struct completion ex_done; struct work_struct ex_work; }; 1. ex_done is for umount waiting the reference is decreased to zero. 2. ex_work is for umount decrease the reference indirectly. 3. The normal logic don't free the svc_export, calls complete() and go though pin_kill() logic as, (svc_export_put will be called when reference is decreased to zero) static void svc_export_put(struct kref *ref) { struct svc_export *exp = container_of(ref, struct svc_export, h.ref); rcu_read_lock(); complete(&exp->ex_done); pin_kill(&exp->ex_pin); } 4. pin_kill() logic will schedules to decrease the reference though ex_work, and at last path_put_unpin and destroy the svc_export. static void export_pin_kill(struct fs_pin *pin) { struct svc_export *exp = container_of(pin, struct svc_export, ex_pin); if (!completion_done(&exp->ex_done)) { schedule_work(&exp->ex_work); wait_for_completion(&exp->ex_done); } path_put_unpin(&exp->ex_path, &exp->ex_pin); svc_export_destroy(exp); } The full patches will be sent later. Thanks again. thanks, Kinglong Mee -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c index f79521a..cc34b0b 100644 --- a/fs/nfsd/export.c +++ b/fs/nfsd/export.c @@ -43,9 +43,9 @@ static void expkey_put(struct kref *ref) if (test_bit(CACHE_VALID, &key->h.flags) && !test_bit(CACHE_NEGATIVE, &key->h.flags)) - path_put(&key->ek_path); + path_put_unpin(&key->ek_path, &key->ek_pin); auth_domain_put(key->ek_client); - kfree(key); + kfree_rcu(key, rcu_head); } static void expkey_request(struct cache_detail *cd, @@ -83,7 +83,7 @@ static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen) return -EINVAL; mesg[mlen-1] = 0; - buf = kmalloc(PAGE_SIZE, GFP_KERNEL); + buf = kzalloc(PAGE_SIZE, GFP_KERNEL); err = -ENOMEM; if (!buf) goto out; @@ -120,6 +120,7 @@ static int expkey_parse(struct cache_detail *cd, char *mesg, int mlen) goto out; key.ek_client = dom; + key.cd = cd; key.ek_fsidtype = fsidtype; memcpy(key.ek_fsid, buf, len); @@ -210,6 +211,13 @@ static inline void expkey_init(struct cache_head *cnew, new->ek_fsidtype = item->ek_fsidtype; memcpy(new->ek_fsid, item->ek_fsid, sizeof(new->ek_fsid)); + new->cd = item->cd; +} + +static void expkey_pin_kill(struct fs_pin *pin) +{ + struct svc_expkey *key = container_of(pin, struct svc_expkey, ek_pin); + cache_force_expire(key->cd, &key->h); } static inline void expkey_update(struct cache_head *cnew, @@ -218,13 +226,14 @@ static inline void expkey_update(struct cache_head *cnew, struct svc_expkey *new = container_of(cnew, struct svc_expkey, h); struct svc_expkey *item = container_of(citem, struct svc_expkey, h); + init_fs_pin(&new->ek_pin, expkey_pin_kill); new->ek_path = item->ek_path; - path_get(&item->ek_path); + path_get_pin(&new->ek_path, &new->ek_pin); } static struct cache_head *expkey_alloc(void) { - struct svc_expkey *i = kmalloc(sizeof(*i), GFP_KERNEL); + struct svc_expkey *i = kzalloc(sizeof(*i), GFP_KERNEL); if (i) return &i->h; else @@ -309,11 +318,16 @@ static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc) static void svc_export_put(struct kref *ref) { struct svc_export *exp = container_of(ref, struct svc_export, h.ref); - path_put(&exp->ex_path); + + if (EX_ALLOW_UMOUNT(exp)) + path_put_unpin(&exp->ex_path, &exp->ex_pin); + else + path_put(&exp->ex_path); + auth_domain_put(exp->ex_client); nfsd4_fslocs_free(&exp->ex_fslocs); kfree(exp->ex_uuid); - kfree(exp); + kfree_rcu(exp, rcu_head); } static void svc_export_request(struct cache_detail *cd, @@ -520,7 +534,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen) return -EINVAL; mesg[mlen-1] = 0; - buf = kmalloc(PAGE_SIZE, GFP_KERNEL); + buf = kzalloc(PAGE_SIZE, GFP_KERNEL); if (!buf) return -ENOMEM; @@ -694,15 +708,23 @@ static int svc_export_match(struct cache_head *a, struct cache_head *b) path_equal(&orig->ex_path, &new->ex_path); } +static void export_pin_kill(struct fs_pin *pin) +{ + struct svc_export *exp = container_of(pin, struct svc_export, ex_pin); + cache_force_expire(exp->cd, &exp->h); +} + static void svc_export_init(struct cache_head *cnew, struct cache_head *citem) { struct svc_export *new = container_of(cnew, struct svc_export, h); struct svc_export *item = container_of(citem, struct svc_export, h); + init_fs_pin(&new->ex_pin, export_pin_kill); kref_get(&item->ex_client->ref); + new->ex_flags = NFSEXP_ALLOW_UMOUNT; new->ex_client = item->ex_client; new->ex_path = item->ex_path; - path_get(&item->ex_path); + path_get_pin(&new->ex_path, &new->ex_pin); new->ex_fslocs.locations = NULL; new->ex_fslocs.locations_count = 0; new->ex_fslocs.migrated = 0; @@ -717,6 +739,14 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem) struct svc_export *item = container_of(citem, struct svc_export, h); int i; + if (!EX_ALLOW_UMOUNT(item)) { + path_get(&new->ex_path); + if (EX_ALLOW_UMOUNT(new)) + path_put_unpin(&new->ex_path, &new->ex_pin); + else + path_put(&new->ex_path); + } + new->ex_flags = item->ex_flags; new->ex_anon_uid = item->ex_anon_uid; new->ex_anon_gid = item->ex_anon_gid; @@ -740,7 +770,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem) static struct cache_head *svc_export_alloc(void) { - struct svc_export *i = kmalloc(sizeof(*i), GFP_KERNEL); + struct svc_export *i = kzalloc(sizeof(*i), GFP_KERNEL); if (i) return &i->h; else @@ -811,6 +841,7 @@ exp_find_key(struct cache_detail *cd, struct auth_domain *clp, int fsid_type, key.ek_client = clp; key.ek_fsidtype = fsid_type; + key.cd = cd; memcpy(key.ek_fsid, fsidv, key_len(fsid_type)); ek = svc_expkey_lookup(cd, &key); @@ -1159,6 +1190,7 @@ static struct flags { { NFSEXP_NOAUTHNLM, {"insecure_locks", ""}}, { NFSEXP_V4ROOT, {"v4root", ""}}, { NFSEXP_PNFS, {"pnfs", ""}}, + { NFSEXP_ALLOW_UMOUNT, {"allow_umount", ""}}, { 0, {"", ""}} }; diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h index 1f52bfc..1134875 100644 --- a/fs/nfsd/export.h +++ b/fs/nfsd/export.h @@ -4,6 +4,7 @@ #ifndef NFSD_EXPORT_H #define NFSD_EXPORT_H +#include <linux/fs_pin.h> #include <linux/sunrpc/cache.h> #include <uapi/linux/nfsd/export.h> @@ -46,6 +47,8 @@ struct exp_flavor_info { struct svc_export { struct cache_head h; + struct cache_detail *cd; + struct auth_domain * ex_client; int ex_flags; struct path ex_path; @@ -58,7 +61,9 @@ struct svc_export { struct exp_flavor_info ex_flavors[MAX_SECINFO_LIST]; enum pnfs_layouttype ex_layout_type; struct nfsd4_deviceid_map *ex_devid_map; - struct cache_detail *cd; + + struct fs_pin ex_pin; + struct rcu_head rcu_head; }; /* an "export key" (expkey) maps a filehandlefragement to an @@ -67,17 +72,21 @@ struct svc_export { */ struct svc_expkey { struct cache_head h; + struct cache_detail *cd; struct auth_domain * ek_client; int ek_fsidtype; u32 ek_fsid[6]; struct path ek_path; + struct fs_pin ek_pin; + struct rcu_head rcu_head; }; #define EX_ISSYNC(exp) (!((exp)->ex_flags & NFSEXP_ASYNC)) #define EX_NOHIDE(exp) ((exp)->ex_flags & NFSEXP_NOHIDE) #define EX_WGATHER(exp) ((exp)->ex_flags & NFSEXP_GATHERED_WRITES) +#define EX_ALLOW_UMOUNT(exp) ((exp)->ex_flags & NFSEXP_ALLOW_UMOUNT) int nfsexp_flags(struct svc_rqst *rqstp, struct svc_export *exp); __be32 check_nfsd_access(struct svc_export *exp, struct svc_rqst *rqstp); diff --git a/include/uapi/linux/nfsd/export.h b/include/uapi/linux/nfsd/export.h index 0df7bd5..61aa8bb 100644 --- a/include/uapi/linux/nfsd/export.h +++ b/include/uapi/linux/nfsd/export.h @@ -51,9 +51,10 @@ */ #define NFSEXP_V4ROOT 0x10000 #define NFSEXP_PNFS 0x20000 +#define NFSEXP_ALLOW_UMOUNT 0x40000 /* All flags that we claim to support. (Note we don't support NOACL.) */ -#define NFSEXP_ALLFLAGS 0x3FE7F +#define NFSEXP_ALLFLAGS 0x7FE7F /* The flags that may vary depending on security flavor: */ #define NFSEXP_SECINFO_FLAGS (NFSEXP_READONLY | NFSEXP_ROOTSQUASH \
If there are some mount points(not exported for nfs) under pseudo root, after client's operation of those entry under the root, anyone *can't* unmount those mount points until export cache expired. /nfs/xfs *(rw,insecure,no_subtree_check,no_root_squash) /nfs/pnfs *(rw,insecure,no_subtree_check,no_root_squash) total 0 drwxr-xr-x. 3 root root 84 Apr 21 22:27 pnfs drwxr-xr-x. 3 root root 84 Apr 21 22:27 test drwxr-xr-x. 2 root root 6 Apr 20 22:01 xfs Filesystem 1K-blocks Used Available Use% Mounted on ...... /dev/sdd 1038336 32944 1005392 4% /nfs/pnfs /dev/sdc 10475520 32928 10442592 1% /nfs/xfs /dev/sde 999320 1284 929224 1% /nfs/test /mnt/pnfs/: total 0 -rw-r--r--. 1 root root 0 Apr 21 22:23 attr drwxr-xr-x. 2 root root 6 Apr 21 22:19 tmp /mnt/xfs/: total 0 umount: /nfs/test/: target is busy (In some cases useful info about processes that use the device is found by lsof(8) or fuser(1).) I don't think that's user expect, they want umount /nfs/test/. It's caused by exports cache of nfsd holds the reference of the path (here is /nfs/test/), so, it can't be umounted. v2, 1. Update exports according to the "allow_umount" option. Pin to vfsmnt default, change when updating. 2. Using kzalloc for all memory allocating without kmalloc. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> --- fs/nfsd/export.c | 52 ++++++++++++++++++++++++++++++++-------- fs/nfsd/export.h | 11 ++++++++- include/uapi/linux/nfsd/export.h | 3 ++- 3 files changed, 54 insertions(+), 12 deletions(-)