Message ID | 165247081391.6691.14842389384935416109.stgit@bazille.1015granger.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Make NFSv4 OPEN(CREATE) less brittle | expand |
On Fri, May 13, 2022 at 03:40:13PM -0400, Chuck Lever wrote: > +/** > + * dentry_create - Create and open a file > + * @path: path to create > + * @flags: O_ flags > + * @mode: mode bits for new file > + * @cred: credentials to use > + * > + * Caller must hold the parent directory's lock, and have prepared > + * a negative dentry, placed in @path->dentry, for the new file. > + * > + * On success, returns a "struct file *". Otherwise a ERR_PTR > + * is returned. > + */ > +struct file *dentry_create(const struct path *path, int flags, umode_t mode, > + const struct cred *cred) > +{ > + struct dentry *parent; > + struct file *f; > + int error; > + > + validate_creds(cred); > + f = alloc_empty_file(flags, cred); > + if (IS_ERR(f)) > + return f; > + > + parent = dget_parent(path->dentry); > + error = vfs_create(mnt_user_ns(path->mnt), d_inode(parent), > + path->dentry, mode, true); > + dput(parent); Yuck. dget_parent() is not entirely without valid uses, but this isn't one. It's for the cases when parent is *not* stable and you need to grab what had been the parent at some point (even though it might not be the parent anymore by the time dget_parent() returns). Here you seriously depend upon it remaining the parent of that sucker all the way through - otherwise vfs_create() would break. And you really, really depend upon its survival - the caller is holding it locked, so they would better have it pinned. So this dget_parent() (and dput()) is pointless and confusing; path->dentry->d_parent would suffice. And document that path->dentry must not be root and that its parent should match path->mnt. > + if (error) { > + fput(f); > + return ERR_PTR(error); > + } > + > + error = vfs_open(path, f); > + if (error) { > + fput(f); > + return ERR_PTR(error); > + } > + > + return f; FWIW, I'd rather have it done as error = vfs_create(...); if (!error) error = vfs_open(path, f); if (unlikely(error)) { fput(f); return ERR_PTR(error); } return f;
On Fri, May 13, 2022 at 10:20:18PM +0000, Al Viro wrote: > Yuck. dget_parent() is not entirely without valid uses, but this isn't > one. It's for the cases when parent is *not* stable and you need to grab > what had been the parent at some point (even though it might not be the > parent anymore by the time dget_parent() returns). Here you seriously > depend upon it remaining the parent of that sucker all the way through - > otherwise vfs_create() would break. And you really, really depend upon > its survival - the caller is holding it locked, so they would better > have it pinned. As an aside, the reason why vfs_create() takes inode of parent directory and dentry of child is basically that it's easier to describe the locking rules that way: vfs_create(..., dir, child, ...) must be called with 1) dir being held by caller (exclusive) and 2) child->d_parent->d_inode == dir, which is stabilized by (1) inode of parent directory is a redundant argument - it can be easily derived from the child dentry, for all that family. The only real objection against dropping it from vfs_create() and friends is that having rules described as "inode of parent dentry of child must be held exclusive by the caller" invites breakage along the lines of parent = dget_parent(child); inode_lock(d_inode(parent)); vfs_create(..., child, ...); // WRONG inode_unlock(d_inode(parent)); dput(parent); which *seems* to match the rules, but actually breaks them badly - 'parent' in the snippet above might be no longer related to child by the time dget_parent() returns it, so we end up calling vfs_create() with wrong directory locked, child->d_parent being completely unstable, etc. Note that the difference from your code (which is correct, if redundant) is rather subtle. If you have any suggestions how to describe these locking rules without an explicit inode-of-parent argument, I would really like to hear those. The best I'd been able to come up with had been "there's an inode locked exclusive by the caller that had been observed to be equal to child->d_parent->d_inode at some point after it had been locked", which is both cumbersome and confusing...
> On May 14, 2022, at 12:44 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Fri, May 13, 2022 at 10:20:18PM +0000, Al Viro wrote: > >> Yuck. dget_parent() is not entirely without valid uses, but this isn't >> one. It's for the cases when parent is *not* stable and you need to grab >> what had been the parent at some point (even though it might not be the >> parent anymore by the time dget_parent() returns). Here you seriously >> depend upon it remaining the parent of that sucker all the way through - >> otherwise vfs_create() would break. And you really, really depend upon >> its survival - the caller is holding it locked, so they would better >> have it pinned. > > As an aside, the reason why vfs_create() takes inode of parent directory > and dentry of child is basically that it's easier to describe the locking > rules that way: vfs_create(..., dir, child, ...) must be called with > 1) dir being held by caller (exclusive) and > 2) child->d_parent->d_inode == dir, which is stabilized by (1) > > inode of parent directory is a redundant argument - it can be easily > derived from the child dentry, for all that family. The only real > objection against dropping it from vfs_create() and friends is that > having rules described as "inode of parent dentry of child must be held > exclusive by the caller" invites breakage along the lines of > > parent = dget_parent(child); > inode_lock(d_inode(parent)); > vfs_create(..., child, ...); // WRONG > inode_unlock(d_inode(parent)); > dput(parent); > > which *seems* to match the rules, but actually breaks them badly - > 'parent' in the snippet above might be no longer related to child by the > time dget_parent() returns it, so we end up calling vfs_create() with > wrong directory locked, child->d_parent being completely unstable, etc. > Note that the difference from your code (which is correct, if redundant) is > rather subtle. I'm not sure I have anything informed to say, but here are some random thoughts: vfs_create() has to work for both exclusive and non-exclusive create requests, but its caller is responsible for those semantics. So perhaps the current vfs_create() synopsis is correct (though as you point out above, the locking requirements could be documented a little better?). NFSD also has the interesting problem of re-exporting NFS mounts. We'd like the exclusiveness of the creation request to be passed through to the originating NFS server properly (maybe via ->atomic_open?). Perhaps that's for another day. But locking the parent means that file creation is serialized, which for large directories or workloads like "tar", impacts throughput. Again, perhaps that's for another day. > If you have any suggestions how to describe these locking rules without > an explicit inode-of-parent argument, I would really like to hear those. > The best I'd been able to come up with had been "there's an inode > locked exclusive by the caller that had been observed to be equal to > child->d_parent->d_inode at some point after it had been locked", which > is both cumbersome and confusing... I have a v2 of my patch which hopefully addresses your comments (many thanks for those). I'm not sure I captured the gist of your documentation request though. Looking at some other callers of vfs_create() such as do_mknodat(), dentry_create's use of the @path argument might be, well, unexpected. If it looks insane to you, it can be adjusted. I'll post v2 shortly and we can iterate on that. -- Chuck Lever
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c index 781bef7e42d9..24772f246461 100644 --- a/fs/nfsd/filecache.c +++ b/fs/nfsd/filecache.c @@ -897,9 +897,9 @@ nfsd_file_is_cached(struct inode *inode) return ret; } -__be32 -nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, - unsigned int may_flags, struct nfsd_file **pnf) +static __be32 +nfsd_do_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, + unsigned int may_flags, struct nfsd_file **pnf, bool open) { __be32 status; struct net *net = SVC_NET(rqstp); @@ -994,10 +994,13 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, nfsd_file_gc(); nf->nf_mark = nfsd_file_mark_find_or_create(nf); - if (nf->nf_mark) - status = nfsd_open_verified(rqstp, fhp, may_flags, - &nf->nf_file); - else + if (nf->nf_mark) { + if (open) + status = nfsd_open_verified(rqstp, fhp, may_flags, + &nf->nf_file); + else + status = nfs_ok; + } else status = nfserr_jukebox; /* * If construction failed, or we raced with a call to unlink() @@ -1017,6 +1020,40 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, goto out; } +/** + * nfsd_file_acquire - Get a struct nfsd_file with an open file + * @rqstp: the RPC transaction being executed + * @fhp: the NFS filehandle of the file to be opened + * @may_flags: NFSD_MAY_ settings for the file + * @pnf: OUT: new or found "struct nfsd_file" object + * + * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in + * network byte order is returned. + */ +__be32 +nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, + unsigned int may_flags, struct nfsd_file **pnf) +{ + return nfsd_do_file_acquire(rqstp, fhp, may_flags, pnf, true); +} + +/** + * nfsd_file_create - Get a struct nfsd_file, do not open + * @rqstp: the RPC transaction being executed + * @fhp: the NFS filehandle of the file just created + * @may_flags: NFSD_MAY_ settings for the file + * @pnf: OUT: new or found "struct nfsd_file" object + * + * Returns nfs_ok and sets @pnf on success; otherwise an nfsstat in + * network byte order is returned. + */ +__be32 +nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp, + unsigned int may_flags, struct nfsd_file **pnf) +{ + return nfsd_do_file_acquire(rqstp, fhp, may_flags, pnf, false); +} + /* * Note that fields may be added, removed or reordered in the future. Programs * scraping this file for info should test the labels to ensure they're diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h index 435ceab27897..1da0c79a5580 100644 --- a/fs/nfsd/filecache.h +++ b/fs/nfsd/filecache.h @@ -59,5 +59,7 @@ void nfsd_file_close_inode_sync(struct inode *inode); bool nfsd_file_is_cached(struct inode *inode); __be32 nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, unsigned int may_flags, struct nfsd_file **nfp); +__be32 nfsd_file_create(struct svc_rqst *rqstp, struct svc_fh *fhp, + unsigned int may_flags, struct nfsd_file **nfp); int nfsd_file_cache_stats_open(struct inode *, struct file *); #endif /* _FS_NFSD_FILECACHE_H */ diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c index 99c0485a29e9..28bae84d7809 100644 --- a/fs/nfsd/nfs4proc.c +++ b/fs/nfsd/nfs4proc.c @@ -243,6 +243,37 @@ static inline bool nfsd4_create_is_exclusive(int createmode) createmode == NFS4_CREATE_EXCLUSIVE4_1; } +static __be32 +nfsd4_vfs_create(struct svc_fh *fhp, struct dentry *child, + struct nfsd4_open *open) +{ + struct file *filp; + struct path path; + int oflags; + + oflags = O_CREAT | O_LARGEFILE; + switch (open->op_share_access & NFS4_SHARE_ACCESS_BOTH) { + case NFS4_SHARE_ACCESS_WRITE: + oflags |= O_WRONLY; + break; + case NFS4_SHARE_ACCESS_BOTH: + oflags |= O_RDWR; + break; + default: + oflags |= O_RDONLY; + } + + path.mnt = fhp->fh_export->ex_path.mnt; + path.dentry = child; + filp = dentry_create(&path, oflags, open->op_iattr.ia_mode, + current_cred()); + if (IS_ERR(filp)) + return nfserrno(PTR_ERR(filp)); + + open->op_filp = filp; + return nfs_ok; +} + /* * Implement NFSv4's unchecked, guarded, and exclusive create * semantics for regular files. Open state for this new file is @@ -355,11 +386,9 @@ nfsd4_create_file(struct svc_rqst *rqstp, struct svc_fh *fhp, if (!IS_POSIXACL(inode)) iap->ia_mode &= ~current_umask(); - host_err = vfs_create(&init_user_ns, inode, child, iap->ia_mode, true); - if (host_err < 0) { - status = nfserrno(host_err); + status = nfsd4_vfs_create(fhp, child, open); + if (status != nfs_ok) goto out; - } open->op_created = true; /* A newly created file already has a file size of zero. */ @@ -517,6 +546,8 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, (int)open->op_fnamelen, open->op_fname, open->op_openowner); + open->op_filp = NULL; + /* This check required by spec. */ if (open->op_create && open->op_claim_type != NFS4_OPEN_CLAIM_NULL) return nfserr_inval; @@ -613,6 +644,10 @@ nfsd4_open(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, if (reclaim && !status) nn->somebody_reclaimed = true; out: + if (open->op_filp) { + fput(open->op_filp); + open->op_filp = NULL; + } if (resfh && resfh != &cstate->current_fh) { fh_dup2(&cstate->current_fh, resfh); fh_put(resfh); diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index 231e5c19cdb7..131102cba06b 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -5093,9 +5093,19 @@ static __be32 nfs4_get_vfs_file(struct svc_rqst *rqstp, struct nfs4_file *fp, if (!fp->fi_fds[oflag]) { spin_unlock(&fp->fi_lock); - status = nfsd_file_acquire(rqstp, cur_fh, access, &nf); - if (status) - goto out_put_access; + + if (!open->op_filp) { + status = nfsd_file_acquire(rqstp, cur_fh, access, &nf); + if (status != nfs_ok) + goto out_put_access; + } else { + status = nfsd_file_create(rqstp, cur_fh, access, &nf); + if (status != nfs_ok) + goto out_put_access; + nf->nf_file = open->op_filp; + open->op_filp = NULL; + } + spin_lock(&fp->fi_lock); if (!fp->fi_fds[oflag]) { fp->fi_fds[oflag] = nf; diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h index 846ab6df9d48..7b744011f2d3 100644 --- a/fs/nfsd/xdr4.h +++ b/fs/nfsd/xdr4.h @@ -273,6 +273,7 @@ struct nfsd4_open { bool op_truncate; /* used during processing */ bool op_created; /* used during processing */ struct nfs4_openowner *op_openowner; /* used during processing */ + struct file *op_filp; /* used during processing */ struct nfs4_file *op_file; /* used during processing */ struct nfs4_ol_stateid *op_stp; /* used during processing */ struct nfs4_clnt_odstate *op_odstate; /* used during processing */ diff --git a/fs/open.c b/fs/open.c index 1315253e0247..e3cf72c269e1 100644 --- a/fs/open.c +++ b/fs/open.c @@ -981,6 +981,50 @@ struct file *dentry_open(const struct path *path, int flags, } EXPORT_SYMBOL(dentry_open); +/** + * dentry_create - Create and open a file + * @path: path to create + * @flags: O_ flags + * @mode: mode bits for new file + * @cred: credentials to use + * + * Caller must hold the parent directory's lock, and have prepared + * a negative dentry, placed in @path->dentry, for the new file. + * + * On success, returns a "struct file *". Otherwise a ERR_PTR + * is returned. + */ +struct file *dentry_create(const struct path *path, int flags, umode_t mode, + const struct cred *cred) +{ + struct dentry *parent; + struct file *f; + int error; + + validate_creds(cred); + f = alloc_empty_file(flags, cred); + if (IS_ERR(f)) + return f; + + parent = dget_parent(path->dentry); + error = vfs_create(mnt_user_ns(path->mnt), d_inode(parent), + path->dentry, mode, true); + dput(parent); + if (error) { + fput(f); + return ERR_PTR(error); + } + + error = vfs_open(path, f); + if (error) { + fput(f); + return ERR_PTR(error); + } + + return f; +} +EXPORT_SYMBOL(dentry_create); + struct file *open_with_fake_path(const struct path *path, int flags, struct inode *inode, const struct cred *cred) { diff --git a/include/linux/fs.h b/include/linux/fs.h index aa6c1bbdb8c4..b848355b5e6c 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2640,6 +2640,8 @@ static inline struct file *file_open_root_mnt(struct vfsmount *mnt, name, flags, mode); } extern struct file * dentry_open(const struct path *, int, const struct cred *); +extern struct file *dentry_create(const struct path *path, int flags, + umode_t mode, const struct cred *cred); extern struct file * open_with_fake_path(const struct path *, int, struct inode*, const struct cred *); static inline struct file *file_clone_open(struct file *file)