nfsd: add the ability to enable use of RWF_DONTCACHE for all nfsd IO

Message ID 20250220171205.12092-1-snitzer@kernel.org (mailing list archive)
State Handled Elsewhere
Series nfsd: add the ability to enable use of RWF_DONTCACHE for all nfsd IO

Commit Message

Mike Snitzer Feb. 20, 2025, 5:12 p.m. UTC
Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written
by nfsd will be removed from the page cache upon completion."

nfsd_dontcache is disabled by default.  It may be enabled with:
  echo Y > /sys/module/nfsd/parameters/nfsd_dontcache

FOP_DONTCACHE must be advertised as supported by the underlying
filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled
all IO will fail with -EOPNOTSUPP.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfsd/vfs.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

Comments

Chuck Lever Feb. 20, 2025, 6:17 p.m. UTC | #1
[ Adding NFSD reviewers ... ]

On 2/20/25 12:12 PM, Mike Snitzer wrote:
> Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written
> by nfsd will be removed from the page cache upon completion."
> 
> nfsd_dontcache is disabled by default.  It may be enabled with:
>   echo Y > /sys/module/nfsd/parameters/nfsd_dontcache

A per-export setting like an export option would be nicer. Also, does
it make sense to make it a separate control for READ and one for WRITE?
My trick knee suggests caching read results is still going to add
significant value, but write, not so much.
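
As a rough sketch of that split (untested; the parameter names below are
invented and not part of the posted patch):

    static bool nfsd_dontcache_read __read_mostly;
    static bool nfsd_dontcache_write __read_mostly;
    module_param(nfsd_dontcache_read, bool, 0644);
    module_param(nfsd_dontcache_write, bool, 0644);

    /* nfsd_iter_read():  if (nfsd_dontcache_read)  flags |= RWF_DONTCACHE; */
    /* nfsd_vfs_write():  if (nfsd_dontcache_write) flags |= RWF_DONTCACHE; */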

However, to add any such administrative control, I'd like to see some
performance numbers. I think we need to enumerate the cases (I/O types)
that are most interesting to examine: small memory NFS servers; lots of
small unaligned I/O; server-side CPU per byte; storage interrupt rates;
any others?

And let's see some user/admin documentation (eg when should this setting
be enabled? when would it be contra-indicated?)

The same arguments that applied to Cedric's request to make maximum RPC
size a tunable setting apply here. Do we want to carry a manual setting
for this mechanism for a long time, or do we expect that the setting can
become automatic/uninteresting after a period of experimentation?

* It might be argued that putting these experimental tunables under /sys
  eliminates the support longevity question, since there aren't strict
  rules about removing files under /sys.


> FOP_DONTCACHE must be advertised as supported by the underlying
> filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled
> all IO will fail with -EOPNOTSUPP.

It would be better all around if NFSD simply ignored the setting in the
cases where the underlying file system doesn't implement DONTCACHE.
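
A minimal sketch of that fallback, assuming the fop_flags/FOP_DONTCACHE
interface that landed in 6.14 (the helper name below is invented):

    static rwf_t nfsd_dontcache_rwf(struct file *file)
    {
            /* Only request dropbehind when the underlying filesystem
             * advertises support; otherwise fall back to normal cached
             * IO rather than failing the request with -EOPNOTSUPP.
             */
            if (nfsd_dontcache && (file->f_op->fop_flags & FOP_DONTCACHE))
                    return RWF_DONTCACHE;
            return 0;
    }

nfsd_iter_read() and nfsd_vfs_write() would then OR in
nfsd_dontcache_rwf(file) instead of testing the bool directly.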


> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/vfs.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 29cb7b812d71..d7e49004e93d 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp)
>  	return __nfsd_open(fhp, S_IFREG, may_flags, filp);
>  }
>  
> +static bool nfsd_dontcache __read_mostly = false;
> +module_param(nfsd_dontcache, bool, 0644);
> +MODULE_PARM_DESC(nfsd_dontcache,
> +		 "Any data read or written by nfsd will be removed from the page cache upon completion.");
> +
>  /*
>   * Grab and keep cached pages associated with a file in the svc_rqst
>   * so that they can be passed to the network sendmsg routines
> @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	loff_t ppos = offset;
>  	struct page *page;
>  	ssize_t host_err;
> +	rwf_t flags = 0;
>  
>  	v = 0;
>  	total = *count;
> @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	}
>  	WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec));
>  
> +	if (nfsd_dontcache)
> +		flags |= RWF_DONTCACHE;
> +
>  	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
>  	iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count);
> -	host_err = vfs_iter_read(file, &iter, &ppos, 0);
> +	host_err = vfs_iter_read(file, &iter, &ppos, flags);
>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>  }
>  
> @@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>  	if (stable && !fhp->fh_use_wgather)
>  		flags |= RWF_SYNC;
>  
> +	if (nfsd_dontcache)
> +		flags |= RWF_DONTCACHE;
> +
>  	iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt);
>  	since = READ_ONCE(file->f_wb_err);
>  	if (verf)
> @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>   */
>  bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
>  {
> +	if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */
> +		return false;
> +

Urgh.

So I've been mulling over simply removing the splice read path.

 - Less code, less complexity, smaller test matrix

 - How much of a performance loss would result?

 - Would such a change make it easier to pass whole folios from
   the file system directly to the network layer?


>  	switch (svc_auth_flavor(rqstp)) {
>  	case RPC_AUTH_GSS_KRB5I:
>  	case RPC_AUTH_GSS_KRB5P:
Jeff Layton Feb. 20, 2025, 7 p.m. UTC | #2
On Thu, 2025-02-20 at 12:12 -0500, Mike Snitzer wrote:
> Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written
> by nfsd will be removed from the page cache upon completion."
> 
> nfsd_dontcache is disabled by default.  It may be enabled with:
>   echo Y > /sys/module/nfsd/parameters/nfsd_dontcache
>
> FOP_DONTCACHE must be advertised as supported by the underlying
> filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled
> all IO will fail with -EOPNOTSUPP.
> 

A little more explanation here would be good. What problem is this
solving? In general we don't go for tunables like this unless there is
just no alternative.

What might help me understand this is to add some documentation that
explains when an admin would want to enable this.


> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> ---
>  fs/nfsd/vfs.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 29cb7b812d71..d7e49004e93d 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp)
>  	return __nfsd_open(fhp, S_IFREG, may_flags, filp);
>  }
>  
> +static bool nfsd_dontcache __read_mostly = false;
> +module_param(nfsd_dontcache, bool, 0644);
> +MODULE_PARM_DESC(nfsd_dontcache,
> +		 "Any data read or written by nfsd will be removed from the page cache upon completion.");
> +
>  /*
>   * Grab and keep cached pages associated with a file in the svc_rqst
>   * so that they can be passed to the network sendmsg routines
> @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	loff_t ppos = offset;
>  	struct page *page;
>  	ssize_t host_err;
> +	rwf_t flags = 0;
>  
>  	v = 0;
>  	total = *count;
> @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	}
>  	WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec));
>  
> +	if (nfsd_dontcache)
> +		flags |= RWF_DONTCACHE;
> +
>  	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
>  	iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count);
> -	host_err = vfs_iter_read(file, &iter, &ppos, 0);
> +	host_err = vfs_iter_read(file, &iter, &ppos, flags);
>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>  }
>  
> @@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>  	if (stable && !fhp->fh_use_wgather)
>  		flags |= RWF_SYNC;
>  
> +	if (nfsd_dontcache)
> +		flags |= RWF_DONTCACHE;
> +
>  	iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt);
>  	since = READ_ONCE(file->f_wb_err);
>  	if (verf)
> @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>   */
>  bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
>  {
> +	if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */
> +		return false;
> +
>  	switch (svc_auth_flavor(rqstp)) {
>  	case RPC_AUTH_GSS_KRB5I:
>  	case RPC_AUTH_GSS_KRB5P:
Chuck Lever Feb. 20, 2025, 7:15 p.m. UTC | #3
On 2/20/25 2:00 PM, Jeff Layton wrote:
> On Thu, 2025-02-20 at 12:12 -0500, Mike Snitzer wrote:
>> Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written
>> by nfsd will be removed from the page cache upon completion."
>>
>> nfsd_dontcache is disabled by default.  It may be enabled with:
>>   echo Y > /sys/module/nfsd/parameters/nfsd_dontcache
>>
>> FOP_DONTCACHE must be advertised as supported by the underlying
>> filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled
>> all IO will fail with -EOPNOTSUPP.
>>
> 
> A little more explanation here would be good. What problem is this
> solving? In general we don't go for tunables like this unless there is
> just no alternative.
> 
> What might help me understand this is to add some documentation that
> explains when an admin would want to enable this.

Agreed: I might know why this is interesting, since Mike and I discussed
it during bake-a-thon. But other reviewers don't, so it would be helpful
to provide a little more context in the patch description.


>> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
>> ---
>>  fs/nfsd/vfs.c | 17 ++++++++++++++++-
>>  1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 29cb7b812d71..d7e49004e93d 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp)
>>  	return __nfsd_open(fhp, S_IFREG, may_flags, filp);
>>  }
>>  
>> +static bool nfsd_dontcache __read_mostly = false;
>> +module_param(nfsd_dontcache, bool, 0644);
>> +MODULE_PARM_DESC(nfsd_dontcache,
>> +		 "Any data read or written by nfsd will be removed from the page cache upon completion.");
>> +
>>  /*
>>   * Grab and keep cached pages associated with a file in the svc_rqst
>>   * so that they can be passed to the network sendmsg routines
>> @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>  	loff_t ppos = offset;
>>  	struct page *page;
>>  	ssize_t host_err;
>> +	rwf_t flags = 0;
>>  
>>  	v = 0;
>>  	total = *count;
>> @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>  	}
>>  	WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec));
>>  
>> +	if (nfsd_dontcache)
>> +		flags |= RWF_DONTCACHE;
>> +
>>  	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
>>  	iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count);
>> -	host_err = vfs_iter_read(file, &iter, &ppos, 0);
>> +	host_err = vfs_iter_read(file, &iter, &ppos, flags);
>>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
>>  }
>>  
>> @@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>>  	if (stable && !fhp->fh_use_wgather)
>>  		flags |= RWF_SYNC;
>>  
>> +	if (nfsd_dontcache)
>> +		flags |= RWF_DONTCACHE;
>> +
>>  	iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt);
>>  	since = READ_ONCE(file->f_wb_err);
>>  	if (verf)
>> @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>>   */
>>  bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
>>  {
>> +	if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */
>> +		return false;
>> +
>>  	switch (svc_auth_flavor(rqstp)) {
>>  	case RPC_AUTH_GSS_KRB5I:
>>  	case RPC_AUTH_GSS_KRB5P:
>
Mike Snitzer Feb. 21, 2025, 3:02 p.m. UTC | #4
On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote:
> [ Adding NFSD reviewers ... ]
> 
> On 2/20/25 12:12 PM, Mike Snitzer wrote:
> > Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written
> > by nfsd will be removed from the page cache upon completion."
> > 
> > nfsd_dontcache is disabled by default.  It may be enabled with:
> >   echo Y > /sys/module/nfsd/parameters/nfsd_dontcache
> 
> A per-export setting like an export option would be nicer. Also, does
> it make sense to make it a separate control for READ and one for WRITE?
> My trick knee suggests caching read results is still going to add
> significant value, but write, not so much.

My intent was to make 6.14's DONTCACHE feature able to be tested in
the context of nfsd in a no-frills way.  I realize adding the
nfsd_dontcache knob skews toward too raw, lacks polish.  But I'm
inclined to expose such coarse-grained opt-in knobs to encourage
others' discovery (and answers to some of the questions you pose
below).  I also hope to enlist all NFSD reviewers' help in
categorizing/documenting where DONTCACHE helps/hurts. ;)

And I agree that ultimately per-export control is needed.  I'll take
the time to implement that, hopeful to have something more suitable in
time for LSF.

> However, to add any such administrative control, I'd like to see some
> performance numbers. I think we need to enumerate the cases (I/O types)
> that are most interesting to examine: small memory NFS servers; lots of
> small unaligned I/O; server-side CPU per byte; storage interrupt rates;
> any others?
> 
> And let's see some user/admin documentation (eg when should this setting
> be enabled? when would it be contra-indicated?)
> 
> The same arguments that applied to Cedric's request to make maximum RPC
> size a tunable setting apply here. Do we want to carry a manual setting
> for this mechanism for a long time, or do we expect that the setting can
> become automatic/uninteresting after a period of experimentation?
> 
> * It might be argued that putting these experimental tunables under /sys
>   eliminates the support longevity question, since there aren't strict
>   rules about removing files under /sys.

Right, I do think a sysfs knob (that defaults to disabled, requires
user opt-in) is a pretty useful and benign means to expose
experimental functionality.

And I agree with all you said is needed above, I haven't had the time to
focus on DONTCACHE since ~December, I just picked up my old patches
from that time and decided to send the NFSD one since DONTCACHE has
been merged for 6.14.
 
> > FOP_DONTCACHE must be advertised as supported by the underlying
> > filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled
> > all IO will fail with -EOPNOTSUPP.
> 
> It would be better all around if NFSD simply ignored the setting in the
> cases where the underlying file system doesn't implement DONTCACHE.

I'll work on making it so.

> 
> 
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > ---
> >  fs/nfsd/vfs.c | 17 ++++++++++++++++-
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 29cb7b812d71..d7e49004e93d 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp)
> >  	return __nfsd_open(fhp, S_IFREG, may_flags, filp);
> >  }
> >  
> > +static bool nfsd_dontcache __read_mostly = false;
> > +module_param(nfsd_dontcache, bool, 0644);
> > +MODULE_PARM_DESC(nfsd_dontcache,
> > +		 "Any data read or written by nfsd will be removed from the page cache upon completion.");
> > +
> >  /*
> >   * Grab and keep cached pages associated with a file in the svc_rqst
> >   * so that they can be passed to the network sendmsg routines
> > @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  	loff_t ppos = offset;
> >  	struct page *page;
> >  	ssize_t host_err;
> > +	rwf_t flags = 0;
> >  
> >  	v = 0;
> >  	total = *count;
> > @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >  	}
> >  	WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec));
> >  
> > +	if (nfsd_dontcache)
> > +		flags |= RWF_DONTCACHE;
> > +
> >  	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> >  	iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count);
> > -	host_err = vfs_iter_read(file, &iter, &ppos, 0);
> > +	host_err = vfs_iter_read(file, &iter, &ppos, flags);
> >  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> >  }
> >  
> > @@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
> >  	if (stable && !fhp->fh_use_wgather)
> >  		flags |= RWF_SYNC;
> >  
> > +	if (nfsd_dontcache)
> > +		flags |= RWF_DONTCACHE;
> > +
> >  	iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt);
> >  	since = READ_ONCE(file->f_wb_err);
> >  	if (verf)
> > @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
> >   */
> >  bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
> >  {
> > +	if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */
> > +		return false;
> > +
> 
> Urgh.

Heh, yeah, bypassing splice was needed given dontcache hooks off vfs_iter_read.

> So I've been mulling over simply removing the splice read path.
> 
>  - Less code, less complexity, smaller test matrix
> 
>  - How much of a performance loss would result?
> 
>  - Would such a change make it easier to pass whole folios from
>    the file system directly to the network layer?

Good to know.

Thanks,
Mike
Jeff Layton Feb. 21, 2025, 3:25 p.m. UTC | #5
On Fri, 2025-02-21 at 10:02 -0500, Mike Snitzer wrote:
> On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote:
> > [ Adding NFSD reviewers ... ]
> > 
> > On 2/20/25 12:12 PM, Mike Snitzer wrote:
> > > Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written
> > > by nfsd will be removed from the page cache upon completion."
> > > 
> > > nfsd_dontcache is disabled by default.  It may be enabled with:
> > >   echo Y > /sys/module/nfsd/parameters/nfsd_dontcache
> > 
> > A per-export setting like an export option would be nicer. Also, does
> > it make sense to make it a separate control for READ and one for WRITE?
> > My trick knee suggests caching read results is still going to add
> > significant value, but write, not so much.
> 
> My intent was to make 6.14's DONTCACHE feature able to be tested in
> the context of nfsd in a no-frills way.  I realize adding the
> nfsd_dontcache knob skews toward too raw, lacks polish.  But I'm
> inclined to expose such coarse-grained opt-in knobs to encourage
> others' discovery (and answers to some of the questions you pose
> below).  I also hope to enlist all NFSD reviewers' help in
> categorizing/documenting where DONTCACHE helps/hurts. ;)
> 
> And I agree that ultimately per-export control is needed.  I'll take
> the time to implement that, hopeful to have something more suitable in
> time for LSF.
> 

Would it make more sense to hook DONTCACHE up to the IO_ADVISE
operation in RFC7862? IO_ADVISE4_NOREUSE sounds like it has similar
meaning? That would give the clients a way to do this on a per-open
basis.

> > However, to add any such administrative control, I'd like to see some
> > performance numbers. I think we need to enumerate the cases (I/O types)
> > that are most interesting to examine: small memory NFS servers; lots of
> > small unaligned I/O; server-side CPU per byte; storage interrupt rates;
> > any others?
> > 
> > And let's see some user/admin documentation (eg when should this setting
> > be enabled? when would it be contra-indicated?)
> > 
> > The same arguments that applied to Cedric's request to make maximum RPC
> > size a tunable setting apply here. Do we want to carry a manual setting
> > for this mechanism for a long time, or do we expect that the setting can
> > become automatic/uninteresting after a period of experimentation?
> > 
> > * It might be argued that putting these experimental tunables under /sys
> >   eliminates the support longevity question, since there aren't strict
> >   rules about removing files under /sys.

Isn't /sys covered by the same ABI guarantees? I know debugfs isn't,
but I'm not sure about /sys.

> 
> Right, I do think a sysfs knob (that defaults to disabled, requires
> user opt-in) is a pretty useful and benign means to expose
> experimental functionality.
> 
> And I agree with all you said is needed above, I haven't had the time to
> focus on DONTCACHE since ~December, I just picked up my old patches
> from that time and decided to send the NFSD one since DONTCACHE has
> been merged for 6.14.
> 
>
>  
> > > FOP_DONTCACHE must be advertised as supported by the underlying
> > > filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled
> > > all IO will fail with -EOPNOTSUPP.
> > 
> > It would be better all around if NFSD simply ignored the setting in the
> > cases where the underlying file system doesn't implement DONTCACHE.
> 
> I'll work on making it so.
> 
> > 
> > 
> > > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> > > ---
> > >  fs/nfsd/vfs.c | 17 ++++++++++++++++-
> > >  1 file changed, 16 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > index 29cb7b812d71..d7e49004e93d 100644
> > > --- a/fs/nfsd/vfs.c
> > > +++ b/fs/nfsd/vfs.c
> > > @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp)
> > >  	return __nfsd_open(fhp, S_IFREG, may_flags, filp);
> > >  }
> > >  
> > > +static bool nfsd_dontcache __read_mostly = false;
> > > +module_param(nfsd_dontcache, bool, 0644);
> > > +MODULE_PARM_DESC(nfsd_dontcache,
> > > +		 "Any data read or written by nfsd will be removed from the page cache upon completion.");
> > > +
> > >  /*
> > >   * Grab and keep cached pages associated with a file in the svc_rqst
> > >   * so that they can be passed to the network sendmsg routines
> > > @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > >  	loff_t ppos = offset;
> > >  	struct page *page;
> > >  	ssize_t host_err;
> > > +	rwf_t flags = 0;
> > >  
> > >  	v = 0;
> > >  	total = *count;
> > > @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > >  	}
> > >  	WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec));
> > >  
> > > +	if (nfsd_dontcache)
> > > +		flags |= RWF_DONTCACHE;
> > > +
> > >  	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> > >  	iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count);
> > > -	host_err = vfs_iter_read(file, &iter, &ppos, 0);
> > > +	host_err = vfs_iter_read(file, &iter, &ppos, flags);
> > >  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> > >  }
> > >  
> > > @@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
> > >  	if (stable && !fhp->fh_use_wgather)
> > >  		flags |= RWF_SYNC;
> > >  
> > > +	if (nfsd_dontcache)
> > > +		flags |= RWF_DONTCACHE;
> > > +
> > >  	iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt);
> > >  	since = READ_ONCE(file->f_wb_err);
> > >  	if (verf)
> > > @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
> > >   */
> > >  bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
> > >  {
> > > +	if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */
> > > +		return false;
> > > +
> > 
> > Urgh.
> 
> Heh, yeah, bypassing splice was needed given dontcache hooks off vfs_iter_read.
> 
> > So I've been mulling over simply removing the splice read path.
> > 
> >  - Less code, less complexity, smaller test matrix
> > 
> >  - How much of a performance loss would result?
> > 
> >  - Would such a change make it easier to pass whole folios from
> >    the file system directly to the network layer?
> 
> Good to know.
>
Mike Snitzer Feb. 21, 2025, 3:25 p.m. UTC | #6
On Thu, Feb 20, 2025 at 02:15:05PM -0500, Chuck Lever wrote:
> On 2/20/25 2:00 PM, Jeff Layton wrote:
> > On Thu, 2025-02-20 at 12:12 -0500, Mike Snitzer wrote:
> >> Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written
> >> by nfsd will be removed from the page cache upon completion."
> >>
> >> nfsd_dontcache is disabled by default.  It may be enabled with:
> >>   echo Y > /sys/module/nfsd/parameters/nfsd_dontcache
> >>
> >> FOP_DONTCACHE must be advertised as supported by the underlying
> >> filesystem (e.g. XFS), otherwise if/when 'nfsd_dontcache' is enabled
> >> all IO will fail with -EOPNOTSUPP.
> >>
> > 
> > A little more explanation here would be good. What problem is this
> > solving? In general we don't go for tunables like this unless there is
> > just no alternative.
> > 
> > What might help me understand this is to add some documentation that
> > explains when an admin would want to enable this.
> 
> Agreed: I might know why this is interesting, since Mike and I discussed
> it during bake-a-thon. But other reviewers don't, so it would be helpful
> to provide a little more context in the patch description.

Buffered IO can bring about some serious inefficiencies due to page
reclaim -- we (at Hammerspace) have seen some really nasty pathological
cliffs (particularly with NFSD) due to buffered IO.  I agree that
all needs to be unpacked and detailed.  To be continued... ;)

But in general, Jens details the problem DONTCACHE solves in his various
commit headers associated with DONTCACHE; notable v6.14-rc1 commits
include:

overview:
9ad6344568cc mm/filemap: change filemap_create_folio() to take a struct kiocb

read results:
8026e49bff9b mm/filemap: add read support for RWF_DONTCACHE

write results with XFS (not yet upstream):
https://git.kernel.dk/cgit/linux/commit/?h=buffered-uncached-fs.11&id=1facf34cb1b7101fa2226c1856dad651e47d916f

Mike


> 
> 
> >> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> >> ---
> >>  fs/nfsd/vfs.c | 17 ++++++++++++++++-
> >>  1 file changed, 16 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> >> index 29cb7b812d71..d7e49004e93d 100644
> >> --- a/fs/nfsd/vfs.c
> >> +++ b/fs/nfsd/vfs.c
> >> @@ -955,6 +955,11 @@ nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp)
> >>  	return __nfsd_open(fhp, S_IFREG, may_flags, filp);
> >>  }
> >>  
> >> +static bool nfsd_dontcache __read_mostly = false;
> >> +module_param(nfsd_dontcache, bool, 0644);
> >> +MODULE_PARM_DESC(nfsd_dontcache,
> >> +		 "Any data read or written by nfsd will be removed from the page cache upon completion.");
> >> +
> >>  /*
> >>   * Grab and keep cached pages associated with a file in the svc_rqst
> >>   * so that they can be passed to the network sendmsg routines
> >> @@ -1084,6 +1089,7 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >>  	loff_t ppos = offset;
> >>  	struct page *page;
> >>  	ssize_t host_err;
> >> +	rwf_t flags = 0;
> >>  
> >>  	v = 0;
> >>  	total = *count;
> >> @@ -1097,9 +1103,12 @@ __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >>  	}
> >>  	WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec));
> >>  
> >> +	if (nfsd_dontcache)
> >> +		flags |= RWF_DONTCACHE;
> >> +
> >>  	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
> >>  	iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count);
> >> -	host_err = vfs_iter_read(file, &iter, &ppos, 0);
> >> +	host_err = vfs_iter_read(file, &iter, &ppos, flags);
> >>  	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
> >>  }
> >>  
> >> @@ -1186,6 +1195,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
> >>  	if (stable && !fhp->fh_use_wgather)
> >>  		flags |= RWF_SYNC;
> >>  
> >> +	if (nfsd_dontcache)
> >> +		flags |= RWF_DONTCACHE;
> >> +
> >>  	iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt);
> >>  	since = READ_ONCE(file->f_wb_err);
> >>  	if (verf)
> >> @@ -1237,6 +1249,9 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
> >>   */
> >>  bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
> >>  {
> >> +	if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */
> >> +		return false;
> >> +
> >>  	switch (svc_auth_flavor(rqstp)) {
> >>  	case RPC_AUTH_GSS_KRB5I:
> >>  	case RPC_AUTH_GSS_KRB5P:
> > 
> 
> 
> -- 
> Chuck Lever
Mike Snitzer Feb. 21, 2025, 3:36 p.m. UTC | #7
On Fri, Feb 21, 2025 at 10:25:03AM -0500, Jeff Layton wrote:
> On Fri, 2025-02-21 at 10:02 -0500, Mike Snitzer wrote:
> > On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote:
> > > [ Adding NFSD reviewers ... ]
> > > 
> > > On 2/20/25 12:12 PM, Mike Snitzer wrote:
> > > > Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written
> > > > by nfsd will be removed from the page cache upon completion."
> > > > 
> > > > nfsd_dontcache is disabled by default.  It may be enabled with:
> > > >   echo Y > /sys/module/nfsd/parameters/nfsd_dontcache
> > > 
> > > A per-export setting like an export option would be nicer. Also, does
> > > it make sense to make it a separate control for READ and one for WRITE?
> > > My trick knee suggests caching read results is still going to add
> > > significant value, but write, not so much.
> > 
> > My intent was to make 6.14's DONTCACHE feature able to be tested in
> > the context of nfsd in a no-frills way.  I realize adding the
> > nfsd_dontcache knob skews toward too raw, lacks polish.  But I'm
> > inclined to expose such coarse-grained opt-in knobs to encourage
> > others' discovery (and answers to some of the questions you pose
> > below).  I also hope to enlist all NFSD reviewers' help in
> > categorizing/documenting where DONTCACHE helps/hurts. ;)
> > 
> > And I agree that ultimately per-export control is needed.  I'll take
> > the time to implement that, hopeful to have something more suitable in
> > time for LSF.
> > 
> 
> Would it make more sense to hook DONTCACHE up to the IO_ADVISE
> operation in RFC7862? IO_ADVISE4_NOREUSE sounds like it has similar
> meaning? That would give the clients a way to do this on a per-open
> basis.

Just thinking aloud here but: Using a DONTCACHE scalpel on a per open
basis quite likely wouldn't provide the required page reclaim relief
if the server is being hammered with normal buffered IO.  Sure that
particular DONTCACHE IO wouldn't contribute to the problem but it
would still be impacted by those not opting to use DONTCACHE on entry
to the server due to needing pages for its DONTCACHE buffered IO.

> > > However, to add any such administrative control, I'd like to see some
> > > performance numbers. I think we need to enumerate the cases (I/O types)
> > > that are most interesting to examine: small memory NFS servers; lots of
> > > small unaligned I/O; server-side CPU per byte; storage interrupt rates;
> > > any others?
> > > 
> > > And let's see some user/admin documentation (eg when should this setting
> > > be enabled? when would it be contra-indicated?)
> > > 
> > > The same arguments that applied to Cedric's request to make maximum RPC
> > > size a tunable setting apply here. Do we want to carry a manual setting
> > > for this mechanism for a long time, or do we expect that the setting can
> > > become automatic/uninteresting after a period of experimentation?
> > > 
> > > * It might be argued that putting these experimental tunables under /sys
> > >   eliminates the support longevity question, since there aren't strict
> > >   rules about removing files under /sys.
> 
> Isn't /sys covered by the same ABI guarantees? I know debugfs isn't,
> but I'm not sure about /sys.

Only if you add them to the ABI docs as supported (at least that is my
experience relative to various block limits knobs, etc).  But yeah,
invariably that invites a cat and mouse game of users using the knob
and then complaining loudly if/when it goes away.

Mike
Chuck Lever Feb. 21, 2025, 3:39 p.m. UTC | #8
On 2/21/25 10:02 AM, Mike Snitzer wrote:
> On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote:
>> * It might be argued that putting these experimental tunables under /sys
>>   eliminates the support longevity question, since there aren't strict
>>   rules about removing files under /sys.
> 
> Right, I do think a sysfs knob (that defaults to disabled, requires
> user opt-in) is a pretty useful and benign means to expose
> experimental functionality.

Seems like we want to figure out a blessed way to add this kind of
experimental "hidden" tunable in a way that can be easily removed
once we have the answers we need.

I'd really like to keep the documented administrative interface as
straightforward as possible, but I agree that having a way to
experiment is valuable.
Jeff Layton Feb. 21, 2025, 3:42 p.m. UTC | #9
On Fri, 2025-02-21 at 10:36 -0500, Mike Snitzer wrote:
> On Fri, Feb 21, 2025 at 10:25:03AM -0500, Jeff Layton wrote:
> > On Fri, 2025-02-21 at 10:02 -0500, Mike Snitzer wrote:
> > > On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote:
> > > > [ Adding NFSD reviewers ... ]
> > > > 
> > > > On 2/20/25 12:12 PM, Mike Snitzer wrote:
> > > > > Add nfsd 'nfsd_dontcache' modparam so that "Any data read or written
> > > > > by nfsd will be removed from the page cache upon completion."
> > > > > 
> > > > > nfsd_dontcache is disabled by default.  It may be enabled with:
> > > > >   echo Y > /sys/module/nfsd/parameters/nfsd_dontcache
> > > > 
> > > > A per-export setting like an export option would be nicer. Also, does
> > > > it make sense to make it a separate control for READ and one for WRITE?
> > > > My trick knee suggests caching read results is still going to add
> > > > significant value, but write, not so much.
> > > 
> > > My intent was to make 6.14's DONTCACHE feature able to be tested in
> > > the context of nfsd in a no-frills way.  I realize adding the
> > > nfsd_dontcache knob skews toward too raw, lacks polish.  But I'm
> > > inclined to expose such coarse-grained opt-in knobs to encourage
> > > others' discovery (and answers to some of the questions you pose
> > > below).  I also hope to enlist all NFSD reviewers' help in
> > > categorizing/documenting where DONTCACHE helps/hurts. ;)
> > > 
> > > And I agree that ultimately per-export control is needed.  I'll take
> > > the time to implement that, hopeful to have something more suitable in
> > > time for LSF.
> > > 
> > 
> > Would it make more sense to hook DONTCACHE up to the IO_ADVISE
> > operation in RFC7862? IO_ADVISE4_NOREUSE sounds like it has similar
> > meaning? That would give the clients a way to do this on a per-open
> > basis.
> 
> Just thinking aloud here but: Using a DONTCACHE scalpel on a per open
> basis quite likely wouldn't provide the required page reclaim relief
> if the server is being hammered with normal buffered IO.  Sure that
> particular DONTCACHE IO wouldn't contribute to the problem but it
> would still be impacted by those not opting to use DONTCACHE on entry
> to the server due to needing pages for its DONTCACHE buffered IO.
> 

Actually, now that I read the spec, it looks like you could just embed
an IO_ADVISE operation in the read compound:

    PUTFH + IO_ADVISE(for the range that you're reading) + READ() operation

That said, that does nothing for v3 reads, which I imagine you're
interested in hooking up here too.

> > > > However, to add any such administrative control, I'd like to see some
> > > > performance numbers. I think we need to enumerate the cases (I/O types)
> > > > that are most interesting to examine: small memory NFS servers; lots of
> > > > small unaligned I/O; server-side CPU per byte; storage interrupt rates;
> > > > any others?
> > > > 
> > > > And let's see some user/admin documentation (eg when should this setting
> > > > be enabled? when would it be contra-indicated?)
> > > > 
> > > > The same arguments that applied to Cedric's request to make maximum RPC
> > > > size a tunable setting apply here. Do we want to carry a manual setting
> > > > for this mechanism for a long time, or do we expect that the setting can
> > > > become automatic/uninteresting after a period of experimentation?
> > > > 
> > > > * It might be argued that putting these experimental tunables under /sys
> > > >   eliminates the support longevity question, since there aren't strict
> > > >   rules about removing files under /sys.
> > 
> > Isn't /sys covered by the same ABI guarantees? I know debugfs isn't,
> > but I'm not sure about /sys.
> 
> Only if you add them to the ABI docs as supported (at least that is my
> experience relative to various block limits knobs, etc).  But yeah,
> invariably that invites a cat and mouse game of users using the knob
> and then complaining loudly if/when it goes away.
> 
> Mike
Jeff Layton Feb. 21, 2025, 3:46 p.m. UTC | #10
On Fri, 2025-02-21 at 10:39 -0500, Chuck Lever wrote:
> On 2/21/25 10:02 AM, Mike Snitzer wrote:
> > On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote:
> > > * It might be argued that putting these experimental tunables under /sys
> > >   eliminates the support longevity question, since there aren't strict
> > >   rules about removing files under /sys.
> > 
> > Right, I do think a sysfs knob (that defaults to disabled, requires
> > user opt-in) is a pretty useful and benign means to expose
> > experimental functionality.
> 
> Seems like we want to figure out a blessed way to add this kind of
> experimental "hidden" tunable in a way that can be easily removed
> once we have the answers we need.
> 
> I'd really like to keep the documented administrative interface as
> straightforward as possible, but I agree that having a way to
> experiment is valuable.
> 

We do have this fancy new netlink interface that is extensible.

You could extend the "threads" call to add a server-wide boolean for
this and then extend nfsdctl to set that value in some cases.
Chuck Lever Feb. 21, 2025, 3:46 p.m. UTC | #11
On 2/21/25 10:36 AM, Mike Snitzer wrote:
> On Fri, Feb 21, 2025 at 10:25:03AM -0500, Jeff Layton wrote:
>> On Fri, 2025-02-21 at 10:02 -0500, Mike Snitzer wrote:
>>> My intent was to make 6.14's DONTCACHE feature able to be tested in
>>> the context of nfsd in a no-frills way.  I realize adding the
>>> nfsd_dontcache knob skews toward too raw, lacks polish.  But I'm
>>> inclined to expose such coarse-grained opt-in knobs to encourage
>>> others' discovery (and answers to some of the questions you pose
>>> below).  I also hope to enlist all NFSD reviewers' help in
>>> categorizing/documenting where DONTCACHE helps/hurts. ;)
>>>
>>> And I agree that ultimately per-export control is needed.  I'll take
>>> the time to implement that, hopeful to have something more suitable in
>>> time for LSF.
>>
>> Would it make more sense to hook DONTCACHE up to the IO_ADVISE
>> operation in RFC7862? IO_ADVISE4_NOREUSE sounds like it has similar
>> meaning? That would give the clients a way to do this on a per-open
>> basis.
> 
> Just thinking aloud here but: Using a DONTCACHE scalpel on a per open
> basis quite likely wouldn't provide the required page reclaim relief
> if the server is being hammered with normal buffered IO.  Sure that
> particular DONTCACHE IO wouldn't contribute to the problem but it
> would still be impacted by those not opting to use DONTCACHE on entry
> to the server due to needing pages for its DONTCACHE buffered IO.

For this initial work, which is to provide a mechanism for
experimentation, IMO exposing the setting to clients won't be all
that helpful.

But there are some applications/workloads on clients where exposure
could be beneficial -- for instance, a backup job, where NFSD would
benefit by knowing it doesn't have to maintain the job's written data in
its page cache. I regard that as a later evolutionary improvement,
though.

Jorge proposed adding the NFSv4.2 IO_ADVISE operation to NFSD, but I
think we first need to a) work out and document appropriate semantics
for each hint, because the spec does not provide specifics, and b)
perform some extensive benchmarking to understand their value and
impact.
Chuck Lever Feb. 21, 2025, 3:50 p.m. UTC | #12
On 2/21/25 10:46 AM, Jeff Layton wrote:
> On Fri, 2025-02-21 at 10:39 -0500, Chuck Lever wrote:
>> On 2/21/25 10:02 AM, Mike Snitzer wrote:
>>> On Thu, Feb 20, 2025 at 01:17:42PM -0500, Chuck Lever wrote:
>>>> * It might be argued that putting these experimental tunables under /sys
>>>>   eliminates the support longevity question, since there aren't strict
>>>>   rules about removing files under /sys.
>>>
>>> Right, I do think a sysfs knob (that defaults to disabled, requires
>>> user opt-in) is a pretty useful and benign means to expose
>>> experimental functionality.
>>
>> Seems like we want to figure out a blessed way to add this kind of
>> experimental "hidden" tunable in a way that can be easily removed
>> once we have the answers we need.
>>
>> I'd really like to keep the documented administrative interface as
>> straightforward as possible, but I agree that having a way to
>> experiment is valuable.
> 
> We do have this fancy new netlink interface that is extensible.
> 
> You could extend the "threads" call to add a server-wide boolean for
> this and then extend nfsdctl to set that value in some cases.

If we don't have other options that are more familiar outside of
the NFSD world (someone mentioned debugfs), then adding a netlink
operation that is explicitly documented as "access to experimental
temporary features that no-one can rely on existing" seems like a good
alternative. We can make our own rules there.
Trond Myklebust Feb. 21, 2025, 4:13 p.m. UTC | #13
On Fri, 2025-02-21 at 10:46 -0500, Chuck Lever wrote:
> On 2/21/25 10:36 AM, Mike Snitzer wrote:
> > On Fri, Feb 21, 2025 at 10:25:03AM -0500, Jeff Layton wrote:
> > > On Fri, 2025-02-21 at 10:02 -0500, Mike Snitzer wrote:
> > > > My intent was to make 6.14's DONTCACHE feature able to be
> > > > tested in
> > > > the context of nfsd in a no-frills way.  I realize adding the
> > > > nfsd_dontcache knob skews toward too raw, lacks polish.  But
> > > > I'm
> > > > inclined to expose such coarse-grained opt-in knobs to
> > > > encourage
> > > > others' discovery (and answers to some of the questions you
> > > > pose
> > > > below).  I also hope to enlist all NFSD reviewers' help in
> > > > categorizing/documenting where DONTCACHE helps/hurts. ;)
> > > > 
> > > > And I agree that ultimately per-export control is needed.  I'll
> > > > take
> > > > the time to implement that, hopeful to have something more
> > > > suitable in
> > > > time for LSF.
> > > 
> > > Would it make more sense to hook DONTCACHE up to the IO_ADVISE
> > > operation in RFC7862? IO_ADVISE4_NOREUSE sounds like it has
> > > similar
> > > meaning? That would give the clients a way to do this on a per-
> > > open
> > > basis.
> > 
> > Just thinking aloud here but: Using a DONTCACHE scalpel on a per
> > open
> > basis quite likely wouldn't provide the required page reclaim
> > relief
> > if the server is being hammered with normal buffered IO.  Sure that
> > particular DONTCACHE IO wouldn't contribute to the problem but it
> > would still be impacted by those not opting to use DONTCACHE on
> > entry
> > to the server due to needing pages for its DONTCACHE buffered IO.
> 
> For this initial work, which is to provide a mechanism for
> experimentation, IMO exposing the setting to clients won't be all
> that helpful.
> 
> But there are some applications/workloads on clients where exposure
> could be beneficial -- for instance, a backup job, where NFSD would
> benefit by knowing it doesn't have to maintain the job's written data
> in
> its page cache. I regard that as a later evolutionary improvement,
> though.
> 
> Jorge proposed adding the NFSv4.2 IO_ADVISE operation to NFSD, but I
> think we first need to a) work out and document appropriate semantics
> for each hint, because the spec does not provide specifics, and b)
> perform some extensive benchmarking to understand their value and
> impact.
> 
> 

That puts the onus on the application running on the client to decide
the caching semantics of the server which:
   A. Is a terrible idea™. The application may know how it wants to use
      the cached data, and be able to somewhat confidently manage its
      own pagecache. However in almost all cases, it will have no basis
      for understanding how the server should manage its cache. The
      latter really is a job for the sysadmin to figure out.
   B. Is impractical, because even if you can figure out a policy, it
      requires rewriting the application to manage the server cache.
   C. Will require additional APIs on the NFSv4.2 client to expose the
      IO_ADVISE operation. You cannot just map it to posix_fadvise()
      and/or posix_madvise(), because IO_ADVISE is designed to manage a
      completely different caching layer. At best, we might be able to
      rally one or two more distributed filesystems to implement
      similar functionality and share an API, however there is no
      chance this API will be useful for ordinary filesystems.
Jeff Layton Feb. 21, 2025, 6:42 p.m. UTC | #14
On Fri, 2025-02-21 at 16:13 +0000, Trond Myklebust wrote:
> On Fri, 2025-02-21 at 10:46 -0500, Chuck Lever wrote:
> > On 2/21/25 10:36 AM, Mike Snitzer wrote:
> > > On Fri, Feb 21, 2025 at 10:25:03AM -0500, Jeff Layton wrote:
> > > > On Fri, 2025-02-21 at 10:02 -0500, Mike Snitzer wrote:
> > > > > My intent was to make 6.14's DONTCACHE feature able to be
> > > > > tested in
> > > > > the context of nfsd in a no-frills way.  I realize adding the
> > > > > nfsd_dontcache knob skews toward too raw, lacks polish.  But
> > > > > I'm
> > > > > inclined to expose such coarse-grained opt-in knobs to
> > > > > encourage
> > > > > others' discovery (and answers to some of the questions you
> > > > > pose
> > > > > below).  I also hope to enlist all NFSD reviewers' help in
> > > > > categorizing/documenting where DONTCACHE helps/hurts. ;)
> > > > > 
> > > > > And I agree that ultimately per-export control is needed.  I'll
> > > > > take
> > > > > the time to implement that, hopeful to have something more
> > > > > suitable in
> > > > > time for LSF.
> > > > 
> > > > Would it make more sense to hook DONTCACHE up to the IO_ADVISE
> > > > operation in RFC7862? IO_ADVISE4_NOREUSE sounds like it has
> > > > similar
> > > > meaning? That would give the clients a way to do this on a per-
> > > > open
> > > > basis.
> > > 
> > > Just thinking aloud here but: Using a DONTCACHE scalpel on a per
> > > open
> > > basis quite likely wouldn't provide the required page reclaim
> > > relief
> > > if the server is being hammered with normal buffered IO.  Sure that
> > > particular DONTCACHE IO wouldn't contribute to the problem but it
> > > would still be impacted by those not opting to use DONTCACHE on
> > > entry
> > > to the server due to needing pages for its DONTCACHE buffered IO.
> > 
> > For this initial work, which is to provide a mechanism for
> > experimentation, IMO exposing the setting to clients won't be all
> > that helpful.
> > 
> > But there are some applications/workloads on clients where exposure
> > could be beneficial -- for instance, a backup job, where NFSD would
> > benefit by knowing it doesn't have to maintain the job's written data
> > in
> > its page cache. I regard that as a later evolutionary improvement,
> > though.
> > 
> > Jorge proposed adding the NFSv4.2 IO_ADVISE operation to NFSD, but I
> > think we first need to a) work out and document appropriate semantics
> > for each hint, because the spec does not provide specifics, and b)
> > perform some extensive benchmarking to understand their value and
> > impact.
> > 
> > 
> 
> That puts the onus on the application running on the client to decide
> the caching semantics of the server which:
>    A. Is a terrible idea™. The application may know how it wants to use
>       the cached data, and be able to somewhat confidently manage its
>       own pagecache. However in almost all cases, it will have no basis
>       for understanding how the server should manage its cache. The
>       latter really is a job for the sysadmin to figure out.
>    B. Is impractical, because even if you can figure out a policy, it
>       requires rewriting the application to manage the server cache.
>    C. Will require additional APIs on the NFSv4.2 client to expose the
>       IO_ADVISE operation. You cannot just map it to posix_fadvise()
>       and/or posix_madvise(), because IO_ADVISE is designed to manage a
>       completely different caching layer. At best, we might be able to
>       rally one or two more distributed filesystems to implement
>       similar functionality and share an API, however there is no
>       chance this API will be useful for ordinary filesystems.
> 

You could map this to RWF_DONTCACHE itself. I know that's really
intended as a hint to the local kernel, but it seems reasonable that if
the application is giving the kernel a DONTCACHE hint, we could pass
that along to the server as well. The server is under no obligation to
do anything with it, just like the kernel with RWF_DONTCACHE.

We could put an IO_ADVISE in a READ or READ_PLUS compound like so:

    PUTFH + IO_ADVISE(IO_ADVISE_NOREUSE for ranges being read) + READ_PLUS or READ ...

On the server, we could track those ranges in the compound and enable
RWF_DONTCACHE for any subsequent reads or writes.
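
A very rough server-side sketch of that idea (every struct field and
helper below is hypothetical; none of it exists in nfsd today):

    /* Hypothetical: an IO_ADVISE handler records a NOREUSE range in new
     * nfsd4_compound_state fields; later READ/WRITE ops in the same
     * compound consult it before picking their RWF_* flags.
     */
    static bool nfsd4_noreuse_hinted(const struct nfsd4_compound_state *cstate,
                                     u64 offset, u64 count)
    {
            return cstate->noreuse_count &&
                   offset >= cstate->noreuse_offset &&
                   offset + count <= cstate->noreuse_offset + cstate->noreuse_count;
    }

    /* READ path, for example:
     *     if (nfsd4_noreuse_hinted(cstate, offset, *count))
     *             flags |= RWF_DONTCACHE;
     */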

All that said, I don't object to some sort of mechanism to turn this on
more globally, particularly since that would allow us to use this with
v3 I/O as well.
Trond Myklebust Feb. 21, 2025, 7:18 p.m. UTC | #15
On Fri, 2025-02-21 at 13:42 -0500, Jeff Layton wrote:
> 
> All that said, I don't object to some sort of mechanism to turn this
> on
> more globally, particularly since that would allow us to use this
> with
> v3 I/O as well.

Personally, I think an export option would give the most flexibility.

That would allow the sysadmin to, for instance, put a set of libraries
+ executables that are shared by all clients on one volume which is
exported in the normal fashion, while files that are typically accessed
by just a single client, or that are too huge to fit in cache memory, go
on a separate volume that is exported with the new option set.
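
For illustration only -- "dontcache" is a made-up name here, not an
existing exportfs option -- such an /etc/exports split might look like:

    # shared libraries/executables: keep normal server-side caching
    /srv/shared    *(ro,sync,no_subtree_check)
    # large, mostly single-client data: drop from the page cache after IO
    /srv/scratch   *(rw,sync,no_subtree_check,dontcache)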

Patch

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 29cb7b812d71..d7e49004e93d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -955,6 +955,11 @@  nfsd_open_verified(struct svc_fh *fhp, int may_flags, struct file **filp)
 	return __nfsd_open(fhp, S_IFREG, may_flags, filp);
 }
 
+static bool nfsd_dontcache __read_mostly = false;
+module_param(nfsd_dontcache, bool, 0644);
+MODULE_PARM_DESC(nfsd_dontcache,
+		 "Any data read or written by nfsd will be removed from the page cache upon completion.");
+
 /*
  * Grab and keep cached pages associated with a file in the svc_rqst
  * so that they can be passed to the network sendmsg routines
@@ -1084,6 +1089,7 @@  __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	loff_t ppos = offset;
 	struct page *page;
 	ssize_t host_err;
+	rwf_t flags = 0;
 
 	v = 0;
 	total = *count;
@@ -1097,9 +1103,12 @@  __be32 nfsd_iter_read(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	}
 	WARN_ON_ONCE(v > ARRAY_SIZE(rqstp->rq_vec));
 
+	if (nfsd_dontcache)
+		flags |= RWF_DONTCACHE;
+
 	trace_nfsd_read_vector(rqstp, fhp, offset, *count);
 	iov_iter_kvec(&iter, ITER_DEST, rqstp->rq_vec, v, *count);
-	host_err = vfs_iter_read(file, &iter, &ppos, 0);
+	host_err = vfs_iter_read(file, &iter, &ppos, flags);
 	return nfsd_finish_read(rqstp, fhp, file, offset, count, eof, host_err);
 }
 
@@ -1186,6 +1195,9 @@  nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 	if (stable && !fhp->fh_use_wgather)
 		flags |= RWF_SYNC;
 
+	if (nfsd_dontcache)
+		flags |= RWF_DONTCACHE;
+
 	iov_iter_kvec(&iter, ITER_SOURCE, vec, vlen, *cnt);
 	since = READ_ONCE(file->f_wb_err);
 	if (verf)
@@ -1237,6 +1249,9 @@  nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
  */
 bool nfsd_read_splice_ok(struct svc_rqst *rqstp)
 {
+	if (nfsd_dontcache) /* force the use of vfs_iter_read for reads */
+		return false;
+
 	switch (svc_auth_flavor(rqstp)) {
 	case RPC_AUTH_GSS_KRB5I:
 	case RPC_AUTH_GSS_KRB5P: