| Message ID | 20250220171205.12092-1-snitzer@kernel.org (mailing list archive) |
|---|---|
| State | Handled Elsewhere |
| Series | nfsd: add the ability to enable use of RWF_DONTCACHE for all nfsd IO |
[ Adding NFSD reviewers ... ] On 2/20/25 12:12 PM, Mike Snitzer wrote: > Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written > by nfsd will be removed from the page cache upon completion." > > nfsd_dontcache is disabled by default. It may be enabled with: > echo Y > /sys/module/nfsd/parameters/nfsd_dontcache A per-export setting like an export option would be nicer. Also, does it make sense to make it a separate control for READ and one for WRITE? My trick knee suggests caching read results is still going to add significant value, but write, not so much. However, to add any such administrative control, I'd like to see some performance numbers. I think we need to enumerate the cases (I/O types) that are most interesting to examine: small memory NFS servers; lots of small unaligned I/O; server-side CPU per byte; storage interrupt rates; any others? And let's see some user/admin documentation (eg when should this setting be enabled? when would it be contra-indicated?) The same arguments that applied to Cedric's request to make maximum RPC size a tunable setting apply here. Do we want to carry a manual setting for this mechanism for a long time, or do we expect that the setting can become automatic/uninteresting after a period of experimentation? * It might be argued that putting these experimental tunables under /sys eliminates the support longevity question, since there aren't strict rules about removing files under /sys. > FOP_DONTCACHE must be advertised as supported by the underlying > filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled > all IO will fail with -EOPNOTSUPP. It would be better all around if NFSD simply ignored the setting in the cases where the underlying file system doesn't implement DONTCACHE. > Signed-off-by: Mike Snitzer <snitzer@kernel.org> > --- > fs/nfsd/vfs.c | 17 ++++++++++++++++- > 1 file changed, 16 insertions(+), 1 deletion(-) > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > index 29cb7b812d71..d7e49004e93d 100644 > --- a/fs/nfsd/vfs.c > +++ b/fs/nfsd/vfs.c > @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp) > return __nfsd_open(fhp, S_IFREG, may_flags, filp); > } > > +static bool nfsd_dontcache __read_mostly = false; > +module_param(nfsd_dontcache, bool, 0644); > +MODULE_PARM_DESC(nfsd_dontcache, > + "Any data read or written by nfsd will be removed from the page cache upon completion."); > + > /* > * Grab and keep cached pages associated with a file in the svc_rqst > * so that they can be passed to the network sendmsg routines > @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > loff_t ppos = offset; > struct page *page; > ssize_t host_err; > + rwf_t flags = 0; > > v = 0; > total = *count; > @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > } > WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec)); > > + if (nfsd_dontcache) > + flags |= RWF_DONTCACHE; > + > trace_nfsd_read_vector(rqstp, fhp, offset, *count); > iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count); > - host_err = vfs_iter_read(file, &iter, &ppos, 0); > + host_err = vfs_iter_read(file, &iter, &ppos, flags); > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err); > } > > @@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > if (stable && !fhp->fh_use_wgather) > flags |= RWF_SYNC; > > + if (nfsd_dontcache) > + flags |= RWF_DONTCACHE; > + > iov_iter_kvec(&iter, ITER_SOURCE, vec, 
vlen, *cnt); > since = READ_ONCE(file->f_wb_err); > if (verf) > @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > */ > bool nfsd_read_splice_ok(struct svc_rqst *rqstp) > { > + if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */ > + return false; > + Urgh. So I've been mulling over simply removing the splice read path. - Less code, less complexity, smaller test matrix - How much of a performance loss would result? - Would such a change make it easier to pass whole folios from the file system directly to the network layer? > switch (svc_auth_flavor(rqstp)) { > case RPC_AUTH_GSS_KRB5I: > case RPC_AUTH_GSS_KRB5P:
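Chuck's suggestion above of separate READ and WRITE controls could look roughly like the following minimal sketch, reusing the module-parameter style of the patch; the parameter names are invented for illustration and are not part of the posted series:

```c
/*
 * Hypothetical split of the single nfsd_dontcache knob into separate
 * read and write controls (names invented for illustration).
 */
static bool nfsd_dontcache_read __read_mostly;
static bool nfsd_dontcache_write __read_mostly;

module_param(nfsd_dontcache_read, bool, 0644);
MODULE_PARM_DESC(nfsd_dontcache_read,
	"Drop data read by nfsd from the page cache upon completion.");
module_param(nfsd_dontcache_write, bool, 0644);
MODULE_PARM_DESC(nfsd_dontcache_write,
	"Drop data written by nfsd from the page cache upon completion.");
```

nfsd_iter_read() would then test nfsd_dontcache_read and nfsd_vfs_write() would test nfsd_dontcache_write before OR-ing RWF_DONTCACHE into the rwf_t flags.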
On Thu, 2025-02-20 at 12:12 -0500, Mike Snitzer wrote: > Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written > by nfsd will be removed from the page cache upon completion." > > nfsd_dontcache is disabled by default. It may be enabled with: > echo Y > /sys/module/nfsd/parameters/nfsd_dontcache > > FOP_DONTCACHE must be advertised as supported by the underlying > filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled > all IO will fail with -EOPNOTSUPP. > A little more explanation here would be good. What problem is this solving? In general we don't go for tunables like this unless there is just no alternative. What might help me understand this is to add some documentation that explains when an admin would want to enable this. > Signed-off-by: Mike Snitzer <snitzer@kernel.org> > --- > fs/nfsd/vfs.c | 17 ++++++++++++++++- > 1 file changed, 16 insertions(+), 1 deletion(-) > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > index 29cb7b812d71..d7e49004e93d 100644 > --- a/fs/nfsd/vfs.c > +++ b/fs/nfsd/vfs.c > @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp) > return __nfsd_open(fhp, S_IFREG, may_flags, filp); > } > > +static bool nfsd_dontcache __read_mostly = false; > +module_param(nfsd_dontcache, bool, 0644); > +MODULE_PARM_DESC(nfsd_dontcache, > + "Any data read or written by nfsd will be removed from the page cache upon completion."); > + > /* > * Grab and keep cached pages associated with a file in the svc_rqst > * so that they can be passed to the network sendmsg routines > @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > loff_t ppos = offset; > struct page *page; > ssize_t host_err; > + rwf_t flags = 0; > > v = 0; > total = *count; > @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > } > WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec)); > > + if (nfsd_dontcache) > + flags |= RWF_DONTCACHE; > + > trace_nfsd_read_vector(rqstp, fhp, offset, *count); > iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count); > - host_err = vfs_iter_read(file, &iter, &ppos, 0); > + host_err = vfs_iter_read(file, &iter, &ppos, flags); > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err); > } > > @@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > if (stable && !fhp->fh_use_wgather) > flags |= RWF_SYNC; > > + if (nfsd_dontcache) > + flags |= RWF_DONTCACHE; > + > iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt); > since = READ_ONCE(file->f_wb_err); > if (verf) > @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > */ > bool nfsd_read_splice_ok(struct svc_rqst *rqstp) > { > + if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */ > + return false; > + > switch (svc_auth_flavor(rqstp)) { > case RPC_AUTH_GSS_KRB5I: > case RPC_AUTH_GSS_KRB5P:
On 2/20/25 2:00 PM, Jeff Layton wrote: > On Thu, 2025-02-20 at 12:12 -0500, Mike Snitzer wrote: >> Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written >> by nfsd will be removed from the page cache upon completion." >> >> nfsd_dontcache is disabled by default. It may be enabled with: >> echo Y > /sys/module/nfsd/parameters/nfsd_dontcache >> >> FOP_DONTCACHE must be advertised as supported by the underlying >> filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled >> all IO will fail with -EOPNOTSUPP. >> > > A little more explanation here would be good. What problem is this > solving? In general we don't go for tunables like this unless there is > just no alternative. > > What might help me understand this is to add some documentation that > explains when an admin would want to enable this. Agreed: I might know why this is interesting, since Mike and I discussed it during bake-a-thon. But other reviewers don't, so it would be helpful to provide a little more context in the patch description. >> Signed-off-by: Mike Snitzer <snitzer@kernel.org> >> --- >> fs/nfsd/vfs.c | 17 ++++++++++++++++- >> 1 file changed, 16 insertions(+), 1 deletion(-) >> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c >> index 29cb7b812d71..d7e49004e93d 100644 >> --- a/fs/nfsd/vfs.c >> +++ b/fs/nfsd/vfs.c >> @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp) >> return __nfsd_open(fhp, S_IFREG, may_flags, filp); >> } >> >> +static bool nfsd_dontcache __read_mostly = false; >> +module_param(nfsd_dontcache, bool, 0644); >> +MODULE_PARM_DESC(nfsd_dontcache, >> + "Any data read or written by nfsd will be removed from the page cache upon completion."); >> + >> /* >> * Grab and keep cached pages associated with a file in the svc_rqst >> * so that they can be passed to the network sendmsg routines >> @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, >> loff_t ppos = offset; >> struct page *page; >> ssize_t host_err; >> + rwf_t flags = 0; >> >> v = 0; >> total = *count; >> @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, >> } >> WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec)); >> >> + if (nfsd_dontcache) >> + flags |= RWF_DONTCACHE; >> + >> trace_nfsd_read_vector(rqstp, fhp, offset, *count); >> iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count); >> - host_err = vfs_iter_read(file, &iter, &ppos, 0); >> + host_err = vfs_iter_read(file, &iter, &ppos, flags); >> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err); >> } >> >> @@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, >> if (stable && !fhp->fh_use_wgather) >> flags |= RWF_SYNC; >> >> + if (nfsd_dontcache) >> + flags |= RWF_DONTCACHE; >> + >> iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt); >> since = READ_ONCE(file->f_wb_err); >> if (verf) >> @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, >> */ >> bool nfsd_read_splice_ok(struct svc_rqst *rqstp) >> { >> + if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */ >> + return false; >> + >> switch (svc_auth_flavor(rqstp)) { >> case RPC_AUTH_GSS_KRB5I: >> case RPC_AUTH_GSS_KRB5P: >
On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote: > [ Adding NFSD reviewers ... ] > > On 2/20/25 12:12 PM, Mike Snitzer wrote: > > Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written > > by nfsd will be removed from the page cache upon completion." > > > > nfsd_dontcache is disabled by default. It may be enabled with: > > echo Y > /sys/module/nfsd/parameters/nfsd_dontcache > > A per-export setting like an export option would be nicer. Also, does > it make sense to make it a separate control for READ and one for WRITE? > My trick knee suggests caching read results is still going to add > significant value, but write, not so much. My intent was to make 6.14's DONTCACHE feature able to be tested in the context of nfsd in a no-frills way. I realize adding the nfsd_dontcache knob skews toward too raw, lacks polish. But I'm inclined to expose such coarse-grained opt-in knobs to encourage others' discovery (and answers to some of the questions you pose below). I also hope to enlist all NFSD reviewers' help in categorizing/documenting where DONTCACHE helps/hurts. ;) And I agree that ultimately per-export control is needed. I'll take the time to implement that, hopeful to have something more suitable in time for LSF. > However, to add any such administrative control, I'd like to see some > performance numbers. I think we need to enumerate the cases (I/O types) > that are most interesting to examine: small memory NFS servers; lots of > small unaligned I/O; server-side CPU per byte; storage interrupt rates; > any others? > > And let's see some user/admin documentation (eg when should this setting > be enabled? when would it be contra-indicated?) > > The same arguments that applied to Cedric's request to make maximum RPC > size a tunable setting apply here. Do we want to carry a manual setting > for this mechanism for a long time, or do we expect that the setting can > become automatic/uninteresting after a period of experimentation? > > * It might be argued that putting these experimental tunables under /sys > eliminates the support longevity question, since there aren't strict > rules about removing files under /sys. Right, I do think a sysfs knob (that defaults to disabled, requires user opt-in) is a pretty useful and benign means to expose experimental functionality. And I agree with all you said is needed above; I haven't had the time to focus on DONTCACHE since ~December, I just picked up my old patches from that time and decided to send the NFSD one since DONTCACHE has been merged for 6.14. > > FOP_DONTCACHE must be advertised as supported by the underlying > > filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled > > all IO will fail with -EOPNOTSUPP. > > It would be better all around if NFSD simply ignored the setting in the > cases where the underlying file system doesn't implement DONTCACHE. I'll work on making it so.
> > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org> > > --- > > fs/nfsd/vfs.c | 17 ++++++++++++++++- > > 1 file changed, 16 insertions(+), 1 deletion(-) > > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > > index 29cb7b812d71..d7e49004e93d 100644 > > --- a/fs/nfsd/vfs.c > > +++ b/fs/nfsd/vfs.c > > @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp) > > return __nfsd_open(fhp, S_IFREG, may_flags, filp); > > } > > > > +static bool nfsd_dontcache __read_mostly = false; > > +module_param(nfsd_dontcache, bool, 0644); > > +MODULE_PARM_DESC(nfsd_dontcache, > > + "Any data read or written by nfsd will be removed from the page cache upon completion."); > > + > > /* > > * Grab and keep cached pages associated with a file in the svc_rqst > > * so that they can be passed to the network sendmsg routines > > @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > > loff_t ppos = offset; > > struct page *page; > > ssize_t host_err; > > + rwf_t flags = 0; > > > > v = 0; > > total = *count; > > @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > > } > > WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec)); > > > > + if (nfsd_dontcache) > > + flags |= RWF_DONTCACHE; > > + > > trace_nfsd_read_vector(rqstp, fhp, offset, *count); > > iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count); > > - host_err = vfs_iter_read(file, &iter, &ppos, 0); > > + host_err = vfs_iter_read(file, &iter, &ppos, flags); > > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err); > > } > > > > @@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > > if (stable && !fhp->fh_use_wgather) > > flags |= RWF_SYNC; > > > > + if (nfsd_dontcache) > > + flags |= RWF_DONTCACHE; > > + > > iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt); > > since = READ_ONCE(file->f_wb_err); > > if (verf) > > @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > > */ > > bool nfsd_read_splice_ok(struct svc_rqst *rqstp) > > { > > + if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */ > > + return false; > > + > > Urgh. Heh, yeah, bypassing splice was needed given dontcache hooks off vfs_iter_read. > So I've been mulling over simply removing the splice read path. > > - Less code, less complexity, smaller test matrix > > - How much of a performance loss would result? > > - Would such a change make it easier to pass whole folios from > the file system directly to the network layer? Good to know. Thanks, Mike
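Regarding the point above about ignoring the setting when the underlying filesystem lacks support, here is a minimal sketch of one way to do it, assuming the FOP_DONTCACHE fop_flags bit the commit message refers to; the helper name is hypothetical:

```c
/*
 * Hypothetical helper: only request RWF_DONTCACHE when the underlying
 * filesystem advertises FOP_DONTCACHE, so filesystems without support
 * keep working instead of failing the I/O with -EOPNOTSUPP.
 */
static rwf_t nfsd_dontcache_rwf(struct file *file)
{
	if (nfsd_dontcache && (file->f_op->fop_flags & FOP_DONTCACHE))
		return RWF_DONTCACHE;
	return 0;
}
```

The read and write paths would then OR in nfsd_dontcache_rwf(file) instead of testing the module parameter directly.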
On Fri, 2025-02-21 at 10:02 -0500, Mike Snitzer wrote: > On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote: > > [ Adding NFSD reviewers ... ] > > > > On 2/20/25 12:12 PM, Mike Snitzer wrote: > > > Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written > > > by nfsd will be removed from the page cache upon completion." > > > > > > nfsd_dontcache is disabled by default. It may be enabled with: > > > echo Y > /sys/module/nfsd/parameters/nfsd_dontcache > > > > A per-export setting like an export option would be nicer. Also, does > > it make sense to make it a separate control for READ and one for WRITE? > > My trick knee suggests caching read results is still going to add > > significant value, but write, not so much. > > My intent was to make 6.14's DONTCACHE feature able to be tested in > the context of nfsd in a no-frills way. I realize adding the > nfsd_dontcache knob skews toward too raw, lacks polish. But I'm > inclined to expose such course-grained opt-in knobs to encourage > others' discovery (and answers to some of the questions you pose > below). I also hope to enlist all NFSD reviewers' help in > categorizing/documenting where DONTCACHE helps/hurts. ;) > > And I agree that ultimately per-export control is needed. I'll take > the time to implement that, hopeful to have something more suitable in > time for LSF. > Would it make more sense to hook DONTCACHE up to the IO_ADVISE operation in RFC7862? IO_ADVISE4_NOREUSE sounds like it has similar meaning? That would give the clients a way to do this on a per-open basis. > > However, to add any such administrative control, I'd like to see some > > performance numbers. I think we need to enumerate the cases (I/O types) > > that are most interesting to examine: small memory NFS servers; lots of > > small unaligned I/O; server-side CPU per byte; storage interrupt rates; > > any others? > > > > And let's see some user/admin documentation (eg when should this setting > > be enabled? when would it be contra-indicated?) > > > > The same arguments that applied to Cedric's request to make maximum RPC > > size a tunable setting apply here. Do we want to carry a manual setting > > for this mechanism for a long time, or do we expect that the setting can > > become automatic/uninteresting after a period of experimentation? > > > > * It might be argued that putting these experimental tunables under /sys > > eliminates the support longevity question, since there aren't strict > > rules about removing files under /sys. Isn't /sys covered by the same ABI guarantees? I know debugfs isn't, but I'm not sure about /sys. > > Right, I do think a sysfs knob (that defaults to disabled, requires > user opt-in) is a pretty useful and benign means to expose > experimental functionality. > > And I agree with all you said needed above, I haven't had the time to > focus on DONTCACHE since ~Decemeber, I just picked up my old patches > from that time and decided to send the NFSD one since DONTCACHE has > been merged for 6.14. > > > > > > FOP_DONTCACHE must be advertised as supported by the underlying > > > filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled > > > all IO will fail with -EOPNOTSUPP. > > > > It would be better all around if NFSD simply ignored the setting in the > > cases where the underlying file system doesn't implement DONTCACHE. > > I'll work on making it so. 
> > > > > > > > Signed-off-by: Mike Snitzer <snitzer@kernel.org> > > > --- > > > fs/nfsd/vfs.c | 17 ++++++++++++++++- > > > 1 file changed, 16 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > > > index 29cb7b812d71..d7e49004e93d 100644 > > > --- a/fs/nfsd/vfs.c > > > +++ b/fs/nfsd/vfs.c > > > @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp) > > > return __nfsd_open(fhp, S_IFREG, may_flags, filp); > > > } > > > > > > +static bool nfsd_dontcache __read_mostly = false; > > > +module_param(nfsd_dontcache, bool, 0644); > > > +MODULE_PARM_DESC(nfsd_dontcache, > > > + "Any data read or written by nfsd will be removed from the page cache upon completion."); > > > + > > > /* > > > * Grab and keep cached pages associated with a file in the svc_rqst > > > * so that they can be passed to the network sendmsg routines > > > @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > > > loff_t ppos = offset; > > > struct page *page; > > > ssize_t host_err; > > > + rwf_t flags = 0; > > > > > > v = 0; > > > total = *count; > > > @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > > > } > > > WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec)); > > > > > > + if (nfsd_dontcache) > > > + flags |= RWF_DONTCACHE; > > > + > > > trace_nfsd_read_vector(rqstp, fhp, offset, *count); > > > iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count); > > > - host_err = vfs_iter_read(file, &iter, &ppos, 0); > > > + host_err = vfs_iter_read(file, &iter, &ppos, flags); > > > return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err); > > > } > > > > > > @@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > > > if (stable && !fhp->fh_use_wgather) > > > flags |= RWF_SYNC; > > > > > > + if (nfsd_dontcache) > > > + flags |= RWF_DONTCACHE; > > > + > > > iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt); > > > since = READ_ONCE(file->f_wb_err); > > > if (verf) > > > @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > > > */ > > > bool nfsd_read_splice_ok(struct svc_rqst *rqstp) > > > { > > > + if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */ > > > + return false; > > > + > > > > Urgh. > > Heh, yeah, bypassing splice was needed given dontcache hooks off vfs_iter_read. > > > So I've been mulling over simply removing the splice read path. > > > > - Less code, less complexity, smaller test matrix > > > > - How much of a performance loss would result? > > > > - Would such a change make it easier to pass whole folios from > > the file system directly to the network layer? > > Good to know. >
On Thu, Feb 20, 2025 at 02:15:05PM -0500, Chuck Lever wrote: > On 2/20/25 2:00 PM, Jeff Layton wrote: > > On Thu, 2025-02-20 at 12:12 -0500, Mike Snitzer wrote: > >> Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written > >> by nfsd will be removed from the page cache upon completion." > >> > >> nfsd_dontcache is disabled by default. It may be enabled with: > >> echo Y > /sys/module/nfsd/parameters/nfsd_dontcache > >> > >> FOP_DONTCACHE must be advertised as supported by the underlying > >> filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled > >> all IO will fail with -EOPNOTSUPP. > >> > > > > A little more explanation here would be good. What problem is this > > solving? In general we don't go for tunables like this unless there is > > just no alternative. > > > > What might help me understand this is to add some documentation that > > explains when an admin would want to enable this. > > Agreed: I might know why this is interesting, since Mike and I discussed > it during bake-a-thon. But other reviewers don't, so it would be helpful > to provide a little more context in the patch description. Buffered IO can bring about some serious inefficiencies due to page reclaim -- we (at Hammerspace) have seen some really nasty pathological cliffs (particularly with NFSD) due to buffered IO. I agree that all needs to be unpacked and detailed. To be continued... ;) But in general, Jens details the problem DONTCACHE solves in his various commit headers associated with DONTCACHE, highlight v6.14-rc1 commits include: overview: 9ad6344568cc mm/filemap: change filemap_create_folio() to take a struct kiocb read results: 8026e49bff9b mm/filemap: add read support for RWF_DONTCACHE write results with XFS (not yet upstream): https://git.kernel.dk/cgit/linux/commit/?h=buffered-uncached-fs.11&id=1facf34cb1b7101fa2226c1856dad651e47d916f Mike > > > >> Signed-off-by: Mike Snitzer <snitzer@kernel.org> > >> --- > >> fs/nfsd/vfs.c | 17 ++++++++++++++++- > >> 1 file changed, 16 insertions(+), 1 deletion(-) > >> > >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c > >> index 29cb7b812d71..d7e49004e93d 100644 > >> --- a/fs/nfsd/vfs.c > >> +++ b/fs/nfsd/vfs.c > >> @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp) > >> return __nfsd_open(fhp, S_IFREG, may_flags, filp); > >> } > >> > >> +static bool nfsd_dontcache __read_mostly = false; > >> +module_param(nfsd_dontcache, bool, 0644); > >> +MODULE_PARM_DESC(nfsd_dontcache, > >> + "Any data read or written by nfsd will be removed from the page cache upon completion."); > >> + > >> /* > >> * Grab and keep cached pages associated with a file in the svc_rqst > >> * so that they can be passed to the network sendmsg routines > >> @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > >> loff_t ppos = offset; > >> struct page *page; > >> ssize_t host_err; > >> + rwf_t flags = 0; > >> > >> v = 0; > >> total = *count; > >> @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp, > >> } > >> WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec)); > >> > >> + if (nfsd_dontcache) > >> + flags |= RWF_DONTCACHE; > >> + > >> trace_nfsd_read_vector(rqstp, fhp, offset, *count); > >> iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count); > >> - host_err = vfs_iter_read(file, &iter, &ppos, 0); > >> + host_err = vfs_iter_read(file, &iter, &ppos, flags); > >> return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err); > >> } > >> > >> @@ -1186,6 
+1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > >> if (stable && !fhp->fh_use_wgather) > >> flags |= RWF_SYNC; > >> > >> + if (nfsd_dontcache) > >> + flags |= RWF_DONTCACHE; > >> + > >> iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt); > >> since = READ_ONCE(file->f_wb_err); > >> if (verf) > >> @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf, > >> */ > >> bool nfsd_read_splice_ok(struct svc_rqst *rqstp) > >> { > >> + if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */ > >> + return false; > >> + > >> switch (svc_auth_flavor(rqstp)) { > >> case RPC_AUTH_GSS_KRB5I: > >> case RPC_AUTH_GSS_KRB5P: > > > > > -- > Chuck Lever
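For context on what RWF_DONTCACHE does at the VFS level, here is a small userspace sketch, assuming a 6.14+ kernel and a filesystem with read support (e.g. XFS); the fallback define mirrors the value in the 6.14 uapi headers and should be verified against your kernel:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080	/* value from the 6.14 uapi headers */
#endif

int main(int argc, char **argv)
{
	char buf[1 << 16];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);
	ssize_t n;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Read 64k; the kernel drops these pages from the page cache
	 * once the read completes instead of leaving them resident. */
	n = preadv2(fd, &iov, 1, 0, RWF_DONTCACHE);
	if (n < 0)
		perror("preadv2");	/* e.g. EOPNOTSUPP without fs support */
	else
		printf("read %zd bytes uncached\n", n);
	close(fd);
	return 0;
}
```

Comparing the Cached line of /proc/meminfo before and after a large run, with and without the flag, is a quick way to see the effect.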
On Fri, Feb 21, 2025 at 10:25:03AM -0500, Jeff Layton wrote: > On Fri, 2025-02-21 at 10:02 -0500, Mike Snitzer wrote: > > On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote: > > > [ Adding NFSD reviewers ... ] > > > > > > On 2/20/25 12:12 PM, Mike Snitzer wrote: > > > > Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written > > > > by nfsd will be removed from the page cache upon completion." > > > > > > > > nfsd_dontcache is disabled by default. It may be enabled with: > > > > echo Y > /sys/module/nfsd/parameters/nfsd_dontcache > > > > > > A per-export setting like an export option would be nicer. Also, does > > > it make sense to make it a separate control for READ and one for WRITE? > > > My trick knee suggests caching read results is still going to add > > > significant value, but write, not so much. > > > > My intent was to make 6.14's DONTCACHE feature able to be tested in > > the context of nfsd in a no-frills way. I realize adding the > > nfsd_dontcache knob skews toward too raw, lacks polish. But I'm > > inclined to expose such course-grained opt-in knobs to encourage > > others' discovery (and answers to some of the questions you pose > > below). I also hope to enlist all NFSD reviewers' help in > > categorizing/documenting where DONTCACHE helps/hurts. ;) > > > > And I agree that ultimately per-export control is needed. I'll take > > the time to implement that, hopeful to have something more suitable in > > time for LSF. > > > > Would it make more sense to hook DONTCACHE up to the IO_ADVISE > operation in RFC7862? IO_ADVISE4_NOREUSE sounds like it has similar > meaning? That would give the clients a way to do this on a per-open > basis. Just thinking aloud here but: Using a DONTCACHE scalpel on a per open basis quite likely wouldn't provide the required page reclaim relief if the server is being hammered with normal buffered IO. Sure that particular DONTCACHE IO wouldn't contribute to the problem but it would still be impacted by those not opting to use DONTCACHE on entry to the server due to needing pages for its DONTCACHE buffered IO. > > > However, to add any such administrative control, I'd like to see some > > > performance numbers. I think we need to enumerate the cases (I/O types) > > > that are most interesting to examine: small memory NFS servers; lots of > > > small unaligned I/O; server-side CPU per byte; storage interrupt rates; > > > any others? > > > > > > And let's see some user/admin documentation (eg when should this setting > > > be enabled? when would it be contra-indicated?) > > > > > > The same arguments that applied to Cedric's request to make maximum RPC > > > size a tunable setting apply here. Do we want to carry a manual setting > > > for this mechanism for a long time, or do we expect that the setting can > > > become automatic/uninteresting after a period of experimentation? > > > > > > * It might be argued that putting these experimental tunables under /sys > > > eliminates the support longevity question, since there aren't strict > > > rules about removing files under /sys. > > Isn't /sys covered by the same ABI guarantees? I know debugfs isn't, > but I'm not sure about /sys. Only if you add them to the ABI docs as supported (at least that is my experience relative to various block limits knobs, etc). But yeah, invariably that invites a cat and mouse game of users using the knob and then complaining loudly if/when it goes away. Mike
On 2/21/25 10:02 AM, Mike Snitzer wrote: > On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote: >> * It might be argued that putting these experimental tunables under /sys >> eliminates the support longevity question, since there aren't strict >> rules about removing files under /sys. > > Right, I do think a sysfs knob (that defaults to disabled, requires > user opt-in) is a pretty useful and benign means to expose > experimental functionality. Seems like we want to figure out a blessed way to add this kind of experimental "hidden" tunable in a way that can be easily removed once we have the answers we need. I'd really like to keep the documented administrative interface as straightforward as possible, but I agree that having a way to experiment is valuable.
On Fri, 2025-02-21 at 10:36 -0500, Mike Snitzer wrote: > On Fri, Feb 21, 2025 at 10:25:03AM -0500, Jeff Layton wrote: > > On Fri, 2025-02-21 at 10:02 -0500, Mike Snitzer wrote: > > > On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote: > > > > [ Adding NFSD reviewers ... ] > > > > > > > > On 2/20/25 12:12 PM, Mike Snitzer wrote: > > > > > Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written > > > > > by nfsd will be removed from the page cache upon completion." > > > > > > > > > > nfsd_dontcache is disabled by default. It may be enabled with: > > > > > echo Y > /sys/module/nfsd/parameters/nfsd_dontcache > > > > > > > > A per-export setting like an export option would be nicer. Also, does > > > > it make sense to make it a separate control for READ and one for WRITE? > > > > My trick knee suggests caching read results is still going to add > > > > significant value, but write, not so much. > > > > > > My intent was to make 6.14's DONTCACHE feature able to be tested in > > > the context of nfsd in a no-frills way. I realize adding the > > > nfsd_dontcache knob skews toward too raw, lacks polish. But I'm > > > inclined to expose such course-grained opt-in knobs to encourage > > > others' discovery (and answers to some of the questions you pose > > > below). I also hope to enlist all NFSD reviewers' help in > > > categorizing/documenting where DONTCACHE helps/hurts. ;) > > > > > > And I agree that ultimately per-export control is needed. I'll take > > > the time to implement that, hopeful to have something more suitable in > > > time for LSF. > > > > > > > Would it make more sense to hook DONTCACHE up to the IO_ADVISE > > operation in RFC7862? IO_ADVISE4_NOREUSE sounds like it has similar > > meaning? That would give the clients a way to do this on a per-open > > basis. > > Just thinking aloud here but: Using a DONTCACHE scalpel on a per open > basis quite likely wouldn't provide the required page reclaim relief > if the server is being hammered with normal buffered IO. Sure that > particular DONTCACHE IO wouldn't contribute to the problem but it > would still be impacted by those not opting to use DONTCACHE on entry > to the server due to needing pages for its DONTCACHE buffered IO. > Actually, now that I read the spec, it looks like you could just embed an IO_ADVISE operation in the read compound: PUTFH + IO_ADVISE(for the range that you're reading) + READ() operation That said, that does nothing for v3 reads, which I imagine you're interested in hooking up here too. > > > > However, to add any such administrative control, I'd like to see some > > > > performance numbers. I think we need to enumerate the cases (I/O types) > > > > that are most interesting to examine: small memory NFS servers; lots of > > > > small unaligned I/O; server-side CPU per byte; storage interrupt rates; > > > > any others? > > > > > > > > And let's see some user/admin documentation (eg when should this setting > > > > be enabled? when would it be contra-indicated?) > > > > > > > > The same arguments that applied to Cedric's request to make maximum RPC > > > > size a tunable setting apply here. Do we want to carry a manual setting > > > > for this mechanism for a long time, or do we expect that the setting can > > > > become automatic/uninteresting after a period of experimentation? > > > > > > > > * It might be argued that putting these experimental tunables under /sys > > > > eliminates the support longevity question, since there aren't strict > > > > rules about removing files under /sys. 
> > > > Isn't /sys covered by the same ABI guarantees? I know debugfs isn't, > > but I'm not sure about /sys. > > Only if you add them to the ABI docs as supported (at least that is my > experience relative to various block limits knobs, etc). But yeah, > invariably that invites a cat and mouse game of users using the knob > and then complaining loudly if/when it goes away. > > Mike
On Fri, 2025-02-21 at 10:39 -0500, Chuck Lever wrote: > On 2/21/25 10:02 AM, Mike Snitzer wrote: > > On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote: > > > * It might be argued that putting these experimental tunables under /sys > > > eliminates the support longevity question, since there aren't strict > > > rules about removing files under /sys. > > > > Right, I do think a sysfs knob (that defaults to disabled, requires > > user opt-in) is a pretty useful and benign means to expose > > experimental functionality. > > Seems like we want to figure out a blessed way to add this kind of > experimental "hidden" tunable in a way that can be easily removed > once we have the answers we need. > > I'd really like to keep the documented administrative interface as > straightforward as possible, but I agree that having a way to > experiment is valuable. > We do have this fancy new netlink interface that is extensible. You could extend the "threads" call to add a server-wide boolean for this and then extend nfsdctl to set that value in some cases.
On 2/21/25 10:36 AM, Mike Snitzer wrote: > On Fri, Feb 21, 2025 at 10:25:03AM -0500, Jeff Layton wrote: >> On Fri, 2025-02-21 at 10:02 -0500, Mike Snitzer wrote: >>> My intent was to make 6.14's DONTCACHE feature able to be tested in >>> the context of nfsd in a no-frills way. I realize adding the >>> nfsd_dontcache knob skews toward too raw, lacks polish. But I'm >>> inclined to expose such course-grained opt-in knobs to encourage >>> others' discovery (and answers to some of the questions you pose >>> below). I also hope to enlist all NFSD reviewers' help in >>> categorizing/documenting where DONTCACHE helps/hurts. ;) >>> >>> And I agree that ultimately per-export control is needed. I'll take >>> the time to implement that, hopeful to have something more suitable in >>> time for LSF. >> >> Would it make more sense to hook DONTCACHE up to the IO_ADVISE >> operation in RFC7862? IO_ADVISE4_NOREUSE sounds like it has similar >> meaning? That would give the clients a way to do this on a per-open >> basis. > > Just thinking aloud here but: Using a DONTCACHE scalpel on a per open > basis quite likely wouldn't provide the required page reclaim relief > if the server is being hammered with normal buffered IO. Sure that > particular DONTCACHE IO wouldn't contribute to the problem but it > would still be impacted by those not opting to use DONTCACHE on entry > to the server due to needing pages for its DONTCACHE buffered IO. For this initial work, which is to provide a mechanism for experimentation, IMO exposing the setting to clients won't be all that helpful. But there are some applications/workloads on clients where exposure could be beneficial -- for instance, a backup job, where NFSD would benefit by knowing it doesn't have to maintain the job's written data in its page cache. I regard that as a later evolutionary improvement, though. Jorge proposed adding the NFSv4.2 IO_ADVISE operation to NFSD, but I think we first need to a) work out and document appropriate semantics for each hint, because the spec does not provide specifics, and b) perform some extensive benchmarking to understand their value and impact.
On 2/21/25 10:46 AM, Jeff Layton wrote: > On Fri, 2025-02-21 at 10:39 -0500, Chuck Lever wrote: >> On 2/21/25 10:02 AM, Mike Snitzer wrote: >>> On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote: >>>> * It might be argued that putting these experimental tunables under /sys >>>> eliminates the support longevity question, since there aren't strict >>>> rules about removing files under /sys. >>> >>> Right, I do think a sysfs knob (that defaults to disabled, requires >>> user opt-in) is a pretty useful and benign means to expose >>> experimental functionality. >> >> Seems like we want to figure out a blessed way to add this kind of >> experimental "hidden" tunable in a way that can be easily removed >> once we have the answers we need. >> >> I'd really like to keep the documented administrative interface as >> straightforward as possible, but I agree that having a way to >> experiment is valuable. > > We do have this fancy new netlink interface that is extensible. > > You could extend the "threads" call to add a server-wide boolean for > this and then extend nfsdctl to set that value in some cases. If we don't have other options that are more familiar outside of the NFSD world (someone mentioned debugfs), then adding a netlink operation that is explicitly documented as "access to experimental temporary features that no-one can rely on existing" seems like a good alternative. We can make our own rules there.
On Fri, 2025-02-21 at 10:46 -0500, Chuck Lever wrote: > On 2/21/25 10:36 AM, Mike Snitzer wrote: > > On Fri, Feb 21, 2025 at 10:25:03AM -0500, Jeff Layton wrote: > > > On Fri, 2025-02-21 at 10:02 -0500, Mike Snitzer wrote: > > > > My intent was to make 6.14's DONTCACHE feature able to be > > > > tested in > > > > the context of nfsd in a no-frills way. I realize adding the > > > > nfsd_dontcache knob skews toward too raw, lacks polish. But > > > > I'm > > > > inclined to expose such course-grained opt-in knobs to > > > > encourage > > > > others' discovery (and answers to some of the questions you > > > > pose > > > > below). I also hope to enlist all NFSD reviewers' help in > > > > categorizing/documenting where DONTCACHE helps/hurts. ;) > > > > > > > > And I agree that ultimately per-export control is needed. I'll > > > > take > > > > the time to implement that, hopeful to have something more > > > > suitable in > > > > time for LSF. > > > > > > Would it make more sense to hook DONTCACHE up to the IO_ADVISE > > > operation in RFC7862? IO_ADVISE4_NOREUSE sounds like it has > > > similar > > > meaning? That would give the clients a way to do this on a per- > > > open > > > basis. > > > > Just thinking aloud here but: Using a DONTCACHE scalpel on a per > > open > > basis quite likely wouldn't provide the required page reclaim > > relief > > if the server is being hammered with normal buffered IO. Sure that > > particular DONTCACHE IO wouldn't contribute to the problem but it > > would still be impacted by those not opting to use DONTCACHE on > > entry > > to the server due to needing pages for its DONTCACHE buffered IO. > > For this initial work, which is to provide a mechanism for > experimentation, IMO exposing the setting to clients won't be all > that helpful. > > But there are some applications/workloads on clients where exposure > could be beneficial -- for instance, a backup job, where NFSD would > benefit by knowing it doesn't have to maintain the job's written data > in > its page cache. I regard that as a later evolutionary improvement, > though. > > Jorge proposed adding the NFSv4.2 IO_ADVISE operation to NFSD, but I > think we first need to a) work out and document appropriate semantics > for each hint, because the spec does not provide specifics, and b) > perform some extensive benchmarking to understand their value and > impact. > > That puts the onus on the application running on the client to decide the caching semantics of the server which: A. Is a terrible idea™. The application may know how it wants to use the cached data, and be able to somewhat confidently manage its own pagecache. However in almost all cases, it will have no basis for understanding how the server should manage its cache. The latter really is a job for the sysadmin to figure out. B. Is impractical, because even if you can figure out a policy, it requires rewriting the application to manage the server cache. C. Will require additional APIs on the NFSv4.2 client to expose the IO_ADVISE operation. You cannot just map it to posix_fadvise() and/or posix_madvise(), because IO_ADVISE is designed to manage a completely different caching layer. At best, we might be able to rally one or two more distributed filesystems to implement similar functionality and share an API, however there is no chance this API will be useful for ordinary filesystems.
On Fri, 2025-02-21 at 16:13 +0000, Trond Myklebust wrote: > On Fri, 2025-02-21 at 10:46 -0500, Chuck Lever wrote: > > On 2/21/25 10:36 AM, Mike Snitzer wrote: > > > On Fri, Feb 21, 2025 at 10:25:03AM -0500, Jeff Layton wrote: > > > > On Fri, 2025-02-21 at 10:02 -0500, Mike Snitzer wrote: > > > > > My intent was to make 6.14's DONTCACHE feature able to be > > > > > tested in > > > > > the context of nfsd in a no-frills way. I realize adding the > > > > > nfsd_dontcache knob skews toward too raw, lacks polish. But > > > > > I'm > > > > > inclined to expose such course-grained opt-in knobs to > > > > > encourage > > > > > others' discovery (and answers to some of the questions you > > > > > pose > > > > > below). I also hope to enlist all NFSD reviewers' help in > > > > > categorizing/documenting where DONTCACHE helps/hurts. ;) > > > > > > > > > > And I agree that ultimately per-export control is needed. I'll > > > > > take > > > > > the time to implement that, hopeful to have something more > > > > > suitable in > > > > > time for LSF. > > > > > > > > Would it make more sense to hook DONTCACHE up to the IO_ADVISE > > > > operation in RFC7862? IO_ADVISE4_NOREUSE sounds like it has > > > > similar > > > > meaning? That would give the clients a way to do this on a per- > > > > open > > > > basis. > > > > > > Just thinking aloud here but: Using a DONTCACHE scalpel on a per > > > open > > > basis quite likely wouldn't provide the required page reclaim > > > relief > > > if the server is being hammered with normal buffered IO. Sure that > > > particular DONTCACHE IO wouldn't contribute to the problem but it > > > would still be impacted by those not opting to use DONTCACHE on > > > entry > > > to the server due to needing pages for its DONTCACHE buffered IO. > > > > For this initial work, which is to provide a mechanism for > > experimentation, IMO exposing the setting to clients won't be all > > that helpful. > > > > But there are some applications/workloads on clients where exposure > > could be beneficial -- for instance, a backup job, where NFSD would > > benefit by knowing it doesn't have to maintain the job's written data > > in > > its page cache. I regard that as a later evolutionary improvement, > > though. > > > > Jorge proposed adding the NFSv4.2 IO_ADVISE operation to NFSD, but I > > think we first need to a) work out and document appropriate semantics > > for each hint, because the spec does not provide specifics, and b) > > perform some extensive benchmarking to understand their value and > > impact. > > > > > > That puts the onus on the application running on the client to decide > the caching semantics of the server which: > A. Is a terrible idea™. The application may know how it wants to use > the cached data, and be able to somewhat confidently manage its > own pagecache. However in almost all cases, it will have no basis > for understanding how the server should manage its cache. The > latter really is a job for the sysadmin to figure out. > B. Is impractical, because even if you can figure out a policy, it > requires rewriting the application to manage the server cache. > C. Will require additional APIs on the NFSv4.2 client to expose the > IO_ADVISE operation. You cannot just map it to posix_fadvise() > and/or posix_madvise(), because IO_ADVISE is designed to manage a > completely different caching layer. 
At best, we might be able to > rally one or two more distributed filesystems to implement > similar functionality and share an API, however there is no > chance this API will be useful for ordinary filesystems. > You could map this to RWF_DONTCACHE itself. I know that's really intended as a hint to the local kernel, but it seems reasonable that if the application is giving the kernel a DONTCACHE hint, we could pass that along to the server as well. The server is under no obligation to do anything with it, just like the kernel with RWF_DONTCACHE. We could put an IO_ADVISE in a READ or READ_PLUS compound like so: PUTFH + IO_ADVISE(IO_ADVISE_NOREUSE for ranges being read) + READ_PLUS or READ ... On the server, we could track those ranges in the compound and enable RWF_DONTCACHE for any subsequent reads or writes. All that said, I don't object to some sort of mechanism to turn this on more globally, particularly since that would allow us to use this with v3 I/O as well.
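A conceptual kernel-side sketch of that compound idea follows; every name below is hypothetical and only illustrates carrying an IO_ADVISE4_NOREUSE hint forward to a later READ in the same compound:

```c
/* All names hypothetical: state an IO_ADVISE op could leave behind in
 * the compound so a following READ can opt into uncached I/O. */
struct nfsd4_noreuse_hint {
	u64	offset;
	u64	count;
	bool	valid;
};

/* Return RWF_DONTCACHE when the READ falls inside an advised range. */
static rwf_t nfsd4_read_rwf(const struct nfsd4_noreuse_hint *hint,
			    u64 offset, u64 count)
{
	if (hint && hint->valid &&
	    offset >= hint->offset &&
	    offset + count <= hint->offset + hint->count)
		return RWF_DONTCACHE;
	return 0;
}
```

The server would remain free to ignore the hint, as Jeff notes above.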
On Fri, 2025-02-21 at 13:42 -0500, Jeff Layton wrote: > > All that said, I don't object to some sort of mechanism to turn this > on > more globally, particularly since that would allow us to use this > with > v3 I/O as well. Personally, I think an export option would give the most flexibility. That would allow the sysadmin to, for instance, put a set of libraries + executables that are shared by all clients on one volume which is exported in the normal fashion. Files that are typically accessed by just a single client, or that are too huge to fit in cache memory, could then be put on a separate volume exported with the new option set.
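A minimal sketch of what such a per-export control could look like, assuming a new export flag; the NFSEXP_DONTCACHE bit and helper are invented for illustration and would need a real flag assignment in the export UAPI:

```c
#define NFSEXP_DONTCACHE	0x100000	/* hypothetical flag bit */

/* Request uncached I/O only for exports marked with the new option,
 * and only when the underlying filesystem supports FOP_DONTCACHE. */
static rwf_t nfsd_export_dontcache_rwf(struct svc_export *exp,
					struct file *file)
{
	if ((exp->ex_flags & NFSEXP_DONTCACHE) &&
	    (file->f_op->fop_flags & FOP_DONTCACHE))
		return RWF_DONTCACHE;
	return 0;
}
```

exportfs would also need to learn a matching "dontcache" export option to set the bit.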
Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written
by nfsd will be removed from the page cache upon completion."

nfsd_dontcache is disabled by default. It may be enabled with:

  echo Y > /sys/module/nfsd/parameters/nfsd_dontcache

FOP_DONTCACHE must be advertised as supported by the underlying
filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled
all IO will fail with -EOPNOTSUPP.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/vfs.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 29cb7b812d71..d7e49004e93d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp)
 	return __nfsd_open(fhp, S_IFREG, may_flags, filp);
 }
 
+static bool nfsd_dontcache __read_mostly = false;
+module_param(nfsd_dontcache, bool, 0644);
+MODULE_PARM_DESC(nfsd_dontcache,
+	"Any data read or written by nfsd will be removed from the page cache upon completion.");
+
 /*
  * Grab and keep cached pages associated with a file in the svc_rqst
  * so that they can be passed to the network sendmsg routines
@@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	loff_t ppos = offset;
 	struct page *page;
 	ssize_t host_err;
+	rwf_t flags = 0;
 
 	v = 0;
 	total = *count;
@@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	}
 	WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec));
 
+	if (nfsd_dontcache)
+		flags |= RWF_DONTCACHE;
+
 	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
 	iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count);
-	host_err = vfs_iter_read(file, &iter, &ppos, 0);
+	host_err = vfs_iter_read(file, &iter, &ppos, flags);
 	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
 }
 
@@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 	if (stable && !fhp->fh_use_wgather)
 		flags |= RWF_SYNC;
 
+	if (nfsd_dontcache)
+		flags |= RWF_DONTCACHE;
+
 	iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt);
 	since = READ_ONCE(file->f_wb_err);
 	if (verf)
@@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
  */
 bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
 {
+	if (nfsd_dontcache)	/* force the use of vfs_iter_read for reads */
+		return false;
+
 	switch (svc_auth_flavor(rqstp)) {
 	case RPC_AUTH_GSS_KRB5I:
 	case RPC_AUTH_GSS_KRB5P: