diff mbox series

nfsd/filecache: add nfsd_file_acquire_gc_cached

Message ID 20241024185526.76146-1-snitzer@kernel.org (mailing list archive)
State New
Headers show
Series nfsd/filecache: add nfsd_file_acquire_gc_cached | expand

Commit Message

Mike Snitzer Oct. 24, 2024, 6:55 p.m. UTC
Rather than make nfsd_file_do_acquire() more complex (by training
it to conditionally skip both fh_verify() and nfsd_file allocation
and contruction) introduce nfsd_file_acquire_gc_cached() -- which
duplicates the minimalist subset of nfsd_file_do_acquire() needed to
achieve nfsd_file lookup using an opaque @inode_key.

nfsd_file_acquire_gc_cached() only returns a cached and GC'd nfsd_file
obtained using the opaque @inode_key, established from a previous call
to nfsd_file_do_acquire_local() that originally added the GC'd
nfsd_file to the filecache.

Update nfsd_open_local_fh to store @inode_key in @nfs_fh so later
calls can check if it maps to an open GC'd nfsd_file in the filecache
using nfsd_file_acquire_gc_cached().  Its nfsd_file_lookup_locked()
call will only find a match if @cred matches the nfsd_file's nf_cred.

And care is taken to clear the inode_key in nfsd_file_free() if the
nfsd_file has a non-NULL nf_inodep (which is a pointer to the address
of the opaque inode_key stored in the nfs_fh).  This avoids any risk
of re-using the old inode_key for a different nfsd_file.

This commit's cached nfsd_file lookup dramatically speeds up LOCALIO
performance, especially for small 4K O_DIRECT IO, e.g.:

before: read: IOPS=376k,  BW=1469MiB/s (1541MB/s)(28.7GiB/20001msec)
after:  read: IOPS=1037k, BW=4050MiB/s (4247MB/s)(79.1GiB/20002msec)

Note that LOCALIO calls nfsd_open_local_fh() for every IO it issues to
the underlying filesystem using the returned nfsd_file.  This is why
caching the opaque 'inode_key' in 'struct nfs_fh' is so helpful for
LOCALIO to quickly return the cached nfsd_file.  Whereas regular NFS
avoids fh_verify()'s costly duplicate lookups of the underlying
filesystem's dentry by caching it in 'fh_dentry' of 'struct svc_fh'.
LOCALIO cannot take the same approach, of storing the dentry, without
creating object lifetime issues associated with dentry references
holding server mounts open and preventing unmounts.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
---
 fs/nfs/inode.c             |  3 ++
 fs/nfs_common/nfslocalio.c |  2 +-
 fs/nfsd/filecache.c        | 78 ++++++++++++++++++++++++++++++++++++++
 fs/nfsd/filecache.h        |  7 ++++
 fs/nfsd/localio.c          | 46 +++++++++++++++++++---
 include/linux/nfs.h        |  4 ++
 include/linux/nfslocalio.h |  6 +--
 7 files changed, 136 insertions(+), 10 deletions(-)

Comments

NeilBrown Oct. 25, 2024, 3 a.m. UTC | #1
On Fri, 25 Oct 2024, Mike Snitzer wrote:
> Rather than make nfsd_file_do_acquire() more complex (by training
> it to conditionally skip both fh_verify() and nfsd_file allocation
> and contruction) introduce nfsd_file_acquire_gc_cached() -- which
> duplicates the minimalist subset of nfsd_file_do_acquire() needed to
> achieve nfsd_file lookup using an opaque @inode_key.
> 
> nfsd_file_acquire_gc_cached() only returns a cached and GC'd nfsd_file
> obtained using the opaque @inode_key, established from a previous call
> to nfsd_file_do_acquire_local() that originally added the GC'd
> nfsd_file to the filecache.
> 
> Update nfsd_open_local_fh to store @inode_key in @nfs_fh so later
> calls can check if it maps to an open GC'd nfsd_file in the filecache
> using nfsd_file_acquire_gc_cached().  Its nfsd_file_lookup_locked()
> call will only find a match if @cred matches the nfsd_file's nf_cred.
> 
> And care is taken to clear the inode_key in nfsd_file_free() if the
> nfsd_file has a non-NULL nf_inodep (which is a pointer to the address
> of the opaque inode_key stored in the nfs_fh).  This avoids any risk
> of re-using the old inode_key for a different nfsd_file.
> 
> This commit's cached nfsd_file lookup dramatically speeds up LOCALIO
> performance, especially for small 4K O_DIRECT IO, e.g.:
> 
> before: read: IOPS=376k,  BW=1469MiB/s (1541MB/s)(28.7GiB/20001msec)
> after:  read: IOPS=1037k, BW=4050MiB/s (4247MB/s)(79.1GiB/20002msec)
> 
> Note that LOCALIO calls nfsd_open_local_fh() for every IO it issues to
> the underlying filesystem using the returned nfsd_file.  This is why
> caching the opaque 'inode_key' in 'struct nfs_fh' is so helpful for
> LOCALIO to quickly return the cached nfsd_file.  Whereas regular NFS
> avoids fh_verify()'s costly duplicate lookups of the underlying
> filesystem's dentry by caching it in 'fh_dentry' of 'struct svc_fh'.
> LOCALIO cannot take the same approach, of storing the dentry, without
> creating object lifetime issues associated with dentry references
> holding server mounts open and preventing unmounts.
> 
> Signed-off-by: Mike Snitzer <snitzer@kernel.org>

I think this is a good idea.  If we are to avoid a complete lookup for
every IO we need some back-pointer from the nfsd filecache to something
in nfs so that a cached lookup can be invalidated.  Various other
schemes have been suggested before.  This one seems particularly simple.

I'm wondering about the request for a garbage-collected nfsd_file
though.  For NFSv3 that makes sense.  For NFSv4 we would expect the file
to already be open as a non-garbage-collected nfsd_file and opening it
again seems wasteful.  That doesn't need to be fixed for this patch and
maybe doesn't need to be fixed at all, but it seemed worth highlighting.

More below

> ---
>  fs/nfs/inode.c             |  3 ++
>  fs/nfs_common/nfslocalio.c |  2 +-
>  fs/nfsd/filecache.c        | 78 ++++++++++++++++++++++++++++++++++++++
>  fs/nfsd/filecache.h        |  7 ++++
>  fs/nfsd/localio.c          | 46 +++++++++++++++++++---
>  include/linux/nfs.h        |  4 ++
>  include/linux/nfslocalio.h |  6 +--
>  7 files changed, 136 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index cc7a32b32676..3051d65e3a89 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -2413,6 +2413,9 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
>  #endif /* CONFIG_NFS_V4 */
>  #ifdef CONFIG_NFS_V4_2
>  	nfsi->xattr_cache = NULL;
> +#endif
> +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> +	nfsi->fh.inode_key = NULL;
>  #endif
>  	nfs_netfs_inode_init(nfsi);
>  
> diff --git a/fs/nfs_common/nfslocalio.c b/fs/nfs_common/nfslocalio.c
> index 09404d142d1a..bacebaa1e15c 100644
> --- a/fs/nfs_common/nfslocalio.c
> +++ b/fs/nfs_common/nfslocalio.c
> @@ -130,7 +130,7 @@ EXPORT_SYMBOL_GPL(nfs_uuid_invalidate_one_client);
>  
>  struct nfsd_file *nfs_open_local_fh(nfs_uuid_t *uuid,
>  		   struct rpc_clnt *rpc_clnt, const struct cred *cred,
> -		   const struct nfs_fh *nfs_fh, const fmode_t fmode)
> +		   struct nfs_fh *nfs_fh, const fmode_t fmode)
>  {
>  	struct net *net;
>  	struct nfsd_file *localio;
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 1408166222c5..5ab978ac3555 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -221,6 +221,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
>  	INIT_LIST_HEAD(&nf->nf_gc);
>  	nf->nf_birthtime = ktime_get();
>  	nf->nf_file = NULL;
> +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> +	nf->nf_inodep = NULL;
> +#endif

All these "#if IS_ENABLED" are ugly.  I wonder if we could get rid of
them.
Using __GFP_ZERO for the alloc would work here, but might be an unwanted
cost.  Maybe nf_inodep could be ignored if nf_file is NULL.

>  	nf->nf_cred = get_current_cred();
>  	nf->nf_net = net;
>  	nf->nf_flags = want_gc ?
> @@ -285,6 +288,12 @@ nfsd_file_free(struct nfsd_file *nf)
>  		nfsd_file_check_write_error(nf);
>  		nfsd_filp_close(nf->nf_file);
>  	}
> +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> +	if (nf->nf_inodep) {
> +		*(nf->nf_inodep) = NULL;
> +		nf->nf_inodep = NULL;
> +	}
> +#endif

This one is harder to hide.  We don't really need the final assignment
though so maybe we could

#define NF_INODEP(nf) (nf->nf_inodep)
or
#define NF_INODEP(nf) (NULL)

in a header (where #if are more acceptable), then make this code:

 if (NF_INODEP(nf))
	*NF_INODEP(nf) = NULL;

Is that better or worse I wonder.

>  
>  	/*
>  	 * If this item is still linked via nf_lru, that's a bug.
> @@ -1255,6 +1264,75 @@ nfsd_file_acquire_local(struct net *net, struct svc_cred *cred,
>  	return beres;
>  }
>  
> +/**
> + * nfsd_file_acquire_cached - Get cached GC'd open file using inode
> + * @net: The network namespace in which to perform a lookup
> + * @cred: the user credential with which to validate access
> + * @inode_key: inode to use as opaque lookup key
> + * @may_flags: NFSD_MAY_ settings for the file
> + * @pnf: OUT: found cached GC'd "struct nfsd_file" object
> + *
> + * Rather than make nfsd_file_do_acquire more complex (by training
> + * it to conditionally skip fh_verify(), nfsd_file allocation and
> + * contruction) duplicate the minimalist subset of it that is
> + * needed to achieve nfsd_file lookup using the opaque @inode_key.
> + *
> + * The nfsd_file object returned by this API is reference-counted
> + * and garbage-collected. The object is retained for a few
> + * seconds after the final nfsd_file_put() in case the caller
> + * wants to re-use it.
> + *
> + * Return values:
> + *   %nfs_ok - @pnf points to an nfsd_file with its reference
> + *   count boosted.
> + *
> + * On error, an nfsstat value in network byte order is returned.
> + */
> +__be32
> +nfsd_file_acquire_cached(struct net *net, const struct cred *cred,
> +			 void *inode_key, unsigned int may_flags,
> +			 struct nfsd_file **pnf)
> +{
> +	unsigned char need = may_flags & NFSD_FILE_MAY_MASK;
> +	struct nfsd_file *nf;
> +	__be32 status;
> +
> +	rcu_read_lock();
> +	nf = nfsd_file_lookup_locked(net, cred, inode_key, need, true);
> +	rcu_read_unlock();
> +
> +	if (unlikely(!nf))
> +		return nfserr_noent;
> +
> +	/*
> +	 * If the nf is on the LRU then it holds an extra reference
> +	 * that must be put if it's removed. It had better not be
> +	 * the last one however, since we should hold another.
> +	 */
> +	if (nfsd_file_lru_remove(nf))
> +		WARN_ON_ONCE(refcount_dec_and_test(&nf->nf_ref));

Just use refcount_dec(&nf->nf_ref).  It will provide a warning of the
count reaches zero.  In general you should never need warnings when
using refcount as it generates the needed warnings itself.

> +
> +	if (WARN_ON_ONCE(test_bit(NFSD_FILE_PENDING, &nf->nf_flags) ||
> +			 !test_bit(NFSD_FILE_HASHED, &nf->nf_flags))) {
> +		status = nfserr_inval;
> +		goto error;
> +	}

Do we really want the above?  I guess you were following the pattern in
nfsd_file_do_acquire() which waits for FILE_PENDING and then re-tests
FILE_HASHED (nfsd_file_lookup_locked() already tested it).
I guess it doesn't hurt, but I'm not sure it helps.

> +	this_cpu_inc(nfsd_file_cache_hits);
> +
> +	status = nfserrno(nfsd_open_break_lease(file_inode(nf->nf_file), may_flags));
> +	if (status != nfs_ok) {
> +error:
> +		nfsd_file_put(nf);
> +		nf = NULL;
> +	} else {
> +		this_cpu_inc(nfsd_file_acquisitions);
> +		nfsd_file_check_write_error(nf);
> +		*pnf = nf;
> +	}
> +	trace_nfsd_file_acquire(NULL, inode_key, may_flags, nf, status);
> +	return status;
> +}
> +
>  /**
>   * nfsd_file_acquire_opened - Get a struct nfsd_file using existing open file
>   * @rqstp: the RPC transaction being executed
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index cadf3c2689c4..e000f6988dc8 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -47,6 +47,10 @@ struct nfsd_file {
>  	struct list_head	nf_gc;
>  	struct rcu_head		nf_rcu;
>  	ktime_t			nf_birthtime;
> +
> +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> +	void **			nf_inodep;
> +#endif
>  };
>  
>  int nfsd_file_cache_init(void);
> @@ -71,5 +75,8 @@ __be32 nfsd_file_acquire_opened(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  __be32 nfsd_file_acquire_local(struct net *net, struct svc_cred *cred,
>  			       struct auth_domain *client, struct svc_fh *fhp,
>  			       unsigned int may_flags, struct nfsd_file **pnf);
> +__be32 nfsd_file_acquire_cached(struct net *net, const struct cred *cred,
> +			       void *inode_key, unsigned int may_flags,
> +			       struct nfsd_file **pnf);
>  int nfsd_file_cache_stats_show(struct seq_file *m, void *v);
>  #endif /* _FS_NFSD_FILECACHE_H */
> diff --git a/fs/nfsd/localio.c b/fs/nfsd/localio.c
> index f441cb9f74d5..34a229409117 100644
> --- a/fs/nfsd/localio.c
> +++ b/fs/nfsd/localio.c
> @@ -58,33 +58,67 @@ void nfsd_localio_ops_init(void)
>  struct nfsd_file *
>  nfsd_open_local_fh(struct net *net, struct auth_domain *dom,
>  		   struct rpc_clnt *rpc_clnt, const struct cred *cred,
> -		   const struct nfs_fh *nfs_fh, const fmode_t fmode)
> +		   struct nfs_fh *nfs_fh, const fmode_t fmode)
>  {
>  	int mayflags = NFSD_MAY_LOCALIO;
>  	struct svc_cred rq_cred;
>  	struct svc_fh fh;
>  	struct nfsd_file *localio;
> +	void *nf_inode_key;
>  	__be32 beres;
>  
>  	if (nfs_fh->size > NFS4_FHSIZE)
>  		return ERR_PTR(-EINVAL);
>  
> -	/* nfs_fh -> svc_fh */
> -	fh_init(&fh, NFS4_FHSIZE);
> -	fh.fh_handle.fh_size = nfs_fh->size;
> -	memcpy(fh.fh_handle.fh_raw, nfs_fh->data, nfs_fh->size);
> -
>  	if (fmode & FMODE_READ)
>  		mayflags |= NFSD_MAY_READ;
>  	if (fmode & FMODE_WRITE)
>  		mayflags |= NFSD_MAY_WRITE;
>  
> +	/*
> +	 * Check if 'inode_key' stored in @nfs_fh maps to an
> +	 * open nfsd_file in the filecache (from a previous
> +	 * nfsd_open_local_fh).
> +	 *
> +	 * The 'inode_key' may have become stale (due to nfsd_file
> +	 * being free'd by filecache GC) so the lookup will fail
> +	 * gracefully by falling back to using nfsd_file_acquire_local().
> +	 */
> +	nf_inode_key = READ_ONCE(nfs_fh->inode_key);

I think you want the above in an rcu-locked region with
nfsd_file_acquire_cached().  Else the inode could be freed and
reallocated after the READ_ONCE and before the lookup.
Maybe pass the address if inode_key and have nfsd_file_acquire_cached()
deref after getting rcu_read_lock().

> +	if (nf_inode_key) {
> +		beres = nfsd_file_acquire_cached(net, cred,
> +						 nf_inode_key,
> +						 mayflags, &localio);
> +		if (beres == nfs_ok)
> +			return localio;
> +		/*
> +		 * reset stale nfs_fh->inode_key and fallthru
> +		 * to use normal nfsd_file_acquire_local().
> +		 */
> +		nfs_fh->inode_key = NULL;
> +	}
> +
> +	/* nfs_fh -> svc_fh */
> +	fh_init(&fh, NFS4_FHSIZE);
> +	fh.fh_handle.fh_size = nfs_fh->size;
> +	memcpy(fh.fh_handle.fh_raw, nfs_fh->data, nfs_fh->size);
> +
>  	svcauth_map_clnt_to_svc_cred_local(rpc_clnt, cred, &rq_cred);
>  
>  	beres = nfsd_file_acquire_local(net, &rq_cred, dom,
>  					&fh, mayflags, &localio);
>  	if (beres)
>  		localio = ERR_PTR(nfs_stat_to_errno(be32_to_cpu(beres)));
> +	else {
> +		/*
> +		 * opaque 'inode_key' has a 1:1 mapping to both an
> +		 * nfsd_file and nfs_fh struct (And the nfs_fh is shared
> +		 * by all NFS client threads. So there is no risk of
> +		 * storing competing addresses in nfsd_file->nf_inodep
> +		 */
> +		localio->nf_inodep = (void **) &nfs_fh->inode_key;
> +		nfs_fh->inode_key = localio->nf_inode;
> +	}
>  
>  	fh_put(&fh);
>  	if (rq_cred.cr_group_info)
> diff --git a/include/linux/nfs.h b/include/linux/nfs.h
> index 9ad727ddfedb..56c894575c70 100644
> --- a/include/linux/nfs.h
> +++ b/include/linux/nfs.h
> @@ -29,6 +29,10 @@
>  struct nfs_fh {
>  	unsigned short		size;
>  	unsigned char		data[NFS_MAXFHSIZE];
> +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> +	/* 'inode_key' is an opaque key used for nfsd_file hash lookups */
> +	void *			inode_key;
> +#endif

I wonder if this is really the right place to store the inode_key.
'struct nfs_fh' appears in lots of places and they only place where you
wan the inode_key is inside the nfs_inode.  Maybe store an augmented
nfs_fh in there...


Thanks,
NeilBrown

>  };
>  
>  /*
> diff --git a/include/linux/nfslocalio.h b/include/linux/nfslocalio.h
> index 3982fea79919..619eb1961ed6 100644
> --- a/include/linux/nfslocalio.h
> +++ b/include/linux/nfslocalio.h
> @@ -43,7 +43,7 @@ void nfs_uuid_invalidate_one_client(nfs_uuid_t *nfs_uuid);
>  /* localio needs to map filehandle -> struct nfsd_file */
>  extern struct nfsd_file *
>  nfsd_open_local_fh(struct net *, struct auth_domain *, struct rpc_clnt *,
> -		   const struct cred *, const struct nfs_fh *,
> +		   const struct cred *, struct nfs_fh *,
>  		   const fmode_t) __must_hold(rcu);
>  
>  struct nfsd_localio_operations {
> @@ -53,7 +53,7 @@ struct nfsd_localio_operations {
>  						struct auth_domain *,
>  						struct rpc_clnt *,
>  						const struct cred *,
> -						const struct nfs_fh *,
> +						struct nfs_fh *,
>  						const fmode_t);
>  	void (*nfsd_file_put_local)(struct nfsd_file *);
>  	struct file *(*nfsd_file_file)(struct nfsd_file *);
> @@ -64,7 +64,7 @@ extern const struct nfsd_localio_operations *nfs_to;
>  
>  struct nfsd_file *nfs_open_local_fh(nfs_uuid_t *,
>  		   struct rpc_clnt *, const struct cred *,
> -		   const struct nfs_fh *, const fmode_t);
> +		   struct nfs_fh *, const fmode_t);
>  
>  static inline void nfs_to_nfsd_file_put_local(struct nfsd_file *localio)
>  {
> -- 
> 2.44.0
> 
>
Chuck Lever III Oct. 25, 2024, 12:52 p.m. UTC | #2
> On Oct 24, 2024, at 11:00 PM, NeilBrown <neilb@suse.de> wrote:
> 
> I'm wondering about the request for a garbage-collected nfsd_file
> though.  For NFSv3 that makes sense.  For NFSv4 we would expect the file
> to already be open as a non-garbage-collected nfsd_file and opening it
> again seems wasteful.  That doesn't need to be fixed for this patch and
> maybe doesn't need to be fixed at all, but it seemed worth highlighting.

I don't think using a GC'd nfsd_file for LOCALIO is a bug.

NFSv4 guarantees an OPEN to pin the nfsd_file, and a CLOSE
(or lease expiry) to release it. There's no desire for or
need for garbage collection.

NFSv3 and LOCALIO use each nfsd_file for the life of one I/O
operation, and that nfsd_file is cached in the expectation
that another I/O operation on the same file will happen with
frequent temporarl proximity. Garbage collection is needed
for both cases.


--
Chuck Lever
Trond Myklebust Oct. 25, 2024, 1:31 p.m. UTC | #3
On Fri, 2024-10-25 at 12:52 +0000, Chuck Lever III wrote:
> 
> 
> > On Oct 24, 2024, at 11:00 PM, NeilBrown <neilb@suse.de> wrote:
> > 
> > I'm wondering about the request for a garbage-collected nfsd_file
> > though.  For NFSv3 that makes sense.  For NFSv4 we would expect the
> > file
> > to already be open as a non-garbage-collected nfsd_file and opening
> > it
> > again seems wasteful.  That doesn't need to be fixed for this patch
> > and
> > maybe doesn't need to be fixed at all, but it seemed worth
> > highlighting.
> 
> I don't think using a GC'd nfsd_file for LOCALIO is a bug.
> 
> NFSv4 guarantees an OPEN to pin the nfsd_file, and a CLOSE
> (or lease expiry) to release it. There's no desire for or
> need for garbage collection.
> 
> NFSv3 and LOCALIO use each nfsd_file for the life of one I/O
> operation, and that nfsd_file is cached in the expectation
> that another I/O operation on the same file will happen with
> frequent temporarl proximity. Garbage collection is needed
> for both cases.
> 

No. There is a huge difference between the two. In the case of NFSv3,
the server has no idea whether or not there is further need for the
file (stateless), whereas in the localio case, the client is able to
tell exactly when the application holds the file open or not
(stateful).

The only reason for jumping through all these hoops in the case of
localio is the 'user may want to unmount' exception.

Is there any reason why we couldn't add a notification for that, to
give knfsd the opportunity to clear out any open file state? The
current approach appears to be flat out optimising for the exception.
Chuck Lever III Oct. 25, 2024, 1:51 p.m. UTC | #4
> On Oct 25, 2024, at 9:31 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Fri, 2024-10-25 at 12:52 +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Oct 24, 2024, at 11:00 PM, NeilBrown <neilb@suse.de> wrote:
>>> 
>>> I'm wondering about the request for a garbage-collected nfsd_file
>>> though.  For NFSv3 that makes sense.  For NFSv4 we would expect the
>>> file
>>> to already be open as a non-garbage-collected nfsd_file and opening
>>> it
>>> again seems wasteful.  That doesn't need to be fixed for this patch
>>> and
>>> maybe doesn't need to be fixed at all, but it seemed worth
>>> highlighting.
>> 
>> I don't think using a GC'd nfsd_file for LOCALIO is a bug.
>> 
>> NFSv4 guarantees an OPEN to pin the nfsd_file, and a CLOSE
>> (or lease expiry) to release it. There's no desire for or
>> need for garbage collection.
>> 
>> NFSv3 and LOCALIO use each nfsd_file for the life of one I/O
>> operation, and that nfsd_file is cached in the expectation
>> that another I/O operation on the same file will happen with
>> frequent temporarl proximity. Garbage collection is needed
>> for both cases.
>> 
> 
> No. There is a huge difference between the two. In the case of NFSv3,
> the server has no idea whether or not there is further need for the
> file (stateless), whereas in the localio case, the client is able to
> tell exactly when the application holds the file open or not
> (stateful).
> 
> The only reason for jumping through all these hoops in the case of
> localio is the 'user may want to unmount' exception.
> 
> Is there any reason why we couldn't add a notification for that, to
> give knfsd the opportunity to clear out any open file state? The
> current approach appears to be flat out optimising for the exception.

We've discussed this privately. A notification callback
is possible, but that's code that would have to be written,
and using GC nfsd_files was an expedient for getting LOCALIO
merged.

There are some corner cases that will need to be worked
through to help determine whether a callback is truly
a simpler design.


--
Chuck Lever
Jeff Layton Oct. 25, 2024, 2 p.m. UTC | #5
On Fri, 2024-10-25 at 13:31 +0000, Trond Myklebust wrote:
> On Fri, 2024-10-25 at 12:52 +0000, Chuck Lever III wrote:
> > 
> > 
> > > On Oct 24, 2024, at 11:00 PM, NeilBrown <neilb@suse.de> wrote:
> > > 
> > > I'm wondering about the request for a garbage-collected nfsd_file
> > > though.  For NFSv3 that makes sense.  For NFSv4 we would expect the
> > > file
> > > to already be open as a non-garbage-collected nfsd_file and opening
> > > it
> > > again seems wasteful.  That doesn't need to be fixed for this patch
> > > and
> > > maybe doesn't need to be fixed at all, but it seemed worth
> > > highlighting.
> > 
> > I don't think using a GC'd nfsd_file for LOCALIO is a bug.
> > 
> > NFSv4 guarantees an OPEN to pin the nfsd_file, and a CLOSE
> > (or lease expiry) to release it. There's no desire for or
> > need for garbage collection.
> > 
> > NFSv3 and LOCALIO use each nfsd_file for the life of one I/O
> > operation, and that nfsd_file is cached in the expectation
> > that another I/O operation on the same file will happen with
> > frequent temporarl proximity. Garbage collection is needed
> > for both cases.
> > 
> 
> No. There is a huge difference between the two. In the case of NFSv3,
> the server has no idea whether or not there is further need for the
> file (stateless), whereas in the localio case, the client is able to
> tell exactly when the application holds the file open or not
> (stateful).
> 
> The only reason for jumping through all these hoops in the case of
> localio is the 'user may want to unmount' exception.
> 
> Is there any reason why we couldn't add a notification for that, to
> give knfsd the opportunity to clear out any open file state? The
> current approach appears to be flat out optimising for the exception.
> 
> 

No, and after discussing it with others here at the bake-a-thon, I
think that might be the best path forward here. At a high level:

Add a callback to the client that runs just before calling
nfsd_file_cache_purge() in expkey_flush(). The client would be expected
to return all of its nfsd_files "soon" and switch back to normal
network I/O. After that point, it can try to get a new nfsd_file and
start up localio at that point. With that change too you can switch to
using non-GC'ed nfsd_files. The client can just hold them open and do
I/O to them at will.

I think that's probably less brittle than trying to keep opaque inode
pointers around, and would probably mean better performance since you
won't be thrashing the filecache as much.
Mike Snitzer Oct. 25, 2024, 2:21 p.m. UTC | #6
On Fri, Oct 25, 2024 at 02:00:56PM +1100, NeilBrown wrote:
> On Fri, 25 Oct 2024, Mike Snitzer wrote:
> > Rather than make nfsd_file_do_acquire() more complex (by training
> > it to conditionally skip both fh_verify() and nfsd_file allocation
> > and contruction) introduce nfsd_file_acquire_gc_cached() -- which
> > duplicates the minimalist subset of nfsd_file_do_acquire() needed to
> > achieve nfsd_file lookup using an opaque @inode_key.
> > 
> > nfsd_file_acquire_gc_cached() only returns a cached and GC'd nfsd_file
> > obtained using the opaque @inode_key, established from a previous call
> > to nfsd_file_do_acquire_local() that originally added the GC'd
> > nfsd_file to the filecache.
> > 
> > Update nfsd_open_local_fh to store @inode_key in @nfs_fh so later
> > calls can check if it maps to an open GC'd nfsd_file in the filecache
> > using nfsd_file_acquire_gc_cached().  Its nfsd_file_lookup_locked()
> > call will only find a match if @cred matches the nfsd_file's nf_cred.
> > 
> > And care is taken to clear the inode_key in nfsd_file_free() if the
> > nfsd_file has a non-NULL nf_inodep (which is a pointer to the address
> > of the opaque inode_key stored in the nfs_fh).  This avoids any risk
> > of re-using the old inode_key for a different nfsd_file.
> > 
> > This commit's cached nfsd_file lookup dramatically speeds up LOCALIO
> > performance, especially for small 4K O_DIRECT IO, e.g.:
> > 
> > before: read: IOPS=376k,  BW=1469MiB/s (1541MB/s)(28.7GiB/20001msec)
> > after:  read: IOPS=1037k, BW=4050MiB/s (4247MB/s)(79.1GiB/20002msec)
> > 
> > Note that LOCALIO calls nfsd_open_local_fh() for every IO it issues to
> > the underlying filesystem using the returned nfsd_file.  This is why
> > caching the opaque 'inode_key' in 'struct nfs_fh' is so helpful for
> > LOCALIO to quickly return the cached nfsd_file.  Whereas regular NFS
> > avoids fh_verify()'s costly duplicate lookups of the underlying
> > filesystem's dentry by caching it in 'fh_dentry' of 'struct svc_fh'.
> > LOCALIO cannot take the same approach, of storing the dentry, without
> > creating object lifetime issues associated with dentry references
> > holding server mounts open and preventing unmounts.
> > 
> > Signed-off-by: Mike Snitzer <snitzer@kernel.org>
> 
> I think this is a good idea.  If we are to avoid a complete lookup for
> every IO we need some back-pointer from the nfsd filecache to something
> in nfs so that a cached lookup can be invalidated.  Various other
> schemes have been suggested before.  This one seems particularly simple.
> 
> I'm wondering about the request for a garbage-collected nfsd_file
> though.  For NFSv3 that makes sense.  For NFSv4 we would expect the file
> to already be open as a non-garbage-collected nfsd_file and opening it
> again seems wasteful.  That doesn't need to be fixed for this patch and
> maybe doesn't need to be fixed at all, but it seemed worth highlighting.

Maybe you're referring to the nfsd_file_acquire_gc_cached() kernel doc
text?  The code doesn't impose the nfsd_file must be a GC'd nfsd_file
(nor do I ever create an nfsd_file, if it isn't in the filecache then
it returns NULL).

GC'd just buys us a more likely chance of finding the nfsd_file in the
cache so it pairs nicely with GC having been requested/used when the
nfsd_file originally created and added to the cache.

Anyway, what follows in my reply is all moot given this thread has
teased out the desire to utilize a callback mechanism to allow the NFS
client to hold a longterm reference on the open nfsd_file.

BUT I will be folding in all the things covered below so I can at
least put nfsd_file_acquire_cached() out to pasture in more fully
formed state (should there ever be a need to revisit it)...

> More below
> 
> > ---
> >  fs/nfs/inode.c             |  3 ++
> >  fs/nfs_common/nfslocalio.c |  2 +-
> >  fs/nfsd/filecache.c        | 78 ++++++++++++++++++++++++++++++++++++++
> >  fs/nfsd/filecache.h        |  7 ++++
> >  fs/nfsd/localio.c          | 46 +++++++++++++++++++---
> >  include/linux/nfs.h        |  4 ++
> >  include/linux/nfslocalio.h |  6 +--
> >  7 files changed, 136 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> > index cc7a32b32676..3051d65e3a89 100644
> > --- a/fs/nfs/inode.c
> > +++ b/fs/nfs/inode.c
> > @@ -2413,6 +2413,9 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
> >  #endif /* CONFIG_NFS_V4 */
> >  #ifdef CONFIG_NFS_V4_2
> >  	nfsi->xattr_cache = NULL;
> > +#endif
> > +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> > +	nfsi->fh.inode_key = NULL;
> >  #endif
> >  	nfs_netfs_inode_init(nfsi);
> >  
> > diff --git a/fs/nfs_common/nfslocalio.c b/fs/nfs_common/nfslocalio.c
> > index 09404d142d1a..bacebaa1e15c 100644
> > --- a/fs/nfs_common/nfslocalio.c
> > +++ b/fs/nfs_common/nfslocalio.c
> > @@ -130,7 +130,7 @@ EXPORT_SYMBOL_GPL(nfs_uuid_invalidate_one_client);
> >  
> >  struct nfsd_file *nfs_open_local_fh(nfs_uuid_t *uuid,
> >  		   struct rpc_clnt *rpc_clnt, const struct cred *cred,
> > -		   const struct nfs_fh *nfs_fh, const fmode_t fmode)
> > +		   struct nfs_fh *nfs_fh, const fmode_t fmode)
> >  {
> >  	struct net *net;
> >  	struct nfsd_file *localio;
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index 1408166222c5..5ab978ac3555 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -221,6 +221,9 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
> >  	INIT_LIST_HEAD(&nf->nf_gc);
> >  	nf->nf_birthtime = ktime_get();
> >  	nf->nf_file = NULL;
> > +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> > +	nf->nf_inodep = NULL;
> > +#endif
> 
> All these "#if IS_ENABLED" are ugly.  I wonder if we could get rid of
> them.
> Using __GFP_ZERO for the alloc would work here, but might be an unwanted
> cost.  Maybe nf_inodep could be ignored if nf_file is NULL.

I did originally nest the nf_inodep backpointer reset within
nfsd_file_free()'s nf->nf_file check, will go back to that.

Then nf_inodep is otherwise ignored everywhere.

> >  	nf->nf_cred = get_current_cred();
> >  	nf->nf_net = net;
> >  	nf->nf_flags = want_gc ?
> > @@ -285,6 +288,12 @@ nfsd_file_free(struct nfsd_file *nf)
> >  		nfsd_file_check_write_error(nf);
> >  		nfsd_filp_close(nf->nf_file);
> >  	}
> > +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> > +	if (nf->nf_inodep) {
> > +		*(nf->nf_inodep) = NULL;
> > +		nf->nf_inodep = NULL;
> > +	}
> > +#endif
> 
> This one is harder to hide.  We don't really need the final assignment
> though so maybe we could
>
> #define NF_INODEP(nf) (nf->nf_inodep)
> or
> #define NF_INODEP(nf) (NULL)
> 
> in a header (where #if are more acceptable), then make this code:
> 
>  if (NF_INODEP(nf))
> 	*NF_INODEP(nf) = NULL;
> 
> Is that better or worse I wonder.

Not opposed to this, will do.

> >  
> >  	/*
> >  	 * If this item is still linked via nf_lru, that's a bug.
> > @@ -1255,6 +1264,75 @@ nfsd_file_acquire_local(struct net *net, struct svc_cred *cred,
> >  	return beres;
> >  }
> >  
> > +/**
> > + * nfsd_file_acquire_cached - Get cached GC'd open file using inode
> > + * @net: The network namespace in which to perform a lookup
> > + * @cred: the user credential with which to validate access
> > + * @inode_key: inode to use as opaque lookup key
> > + * @may_flags: NFSD_MAY_ settings for the file
> > + * @pnf: OUT: found cached GC'd "struct nfsd_file" object
> > + *
> > + * Rather than make nfsd_file_do_acquire more complex (by training
> > + * it to conditionally skip fh_verify(), nfsd_file allocation and
> > + * contruction) duplicate the minimalist subset of it that is
> > + * needed to achieve nfsd_file lookup using the opaque @inode_key.
> > + *
> > + * The nfsd_file object returned by this API is reference-counted
> > + * and garbage-collected. The object is retained for a few
> > + * seconds after the final nfsd_file_put() in case the caller
> > + * wants to re-use it.
> > + *
> > + * Return values:
> > + *   %nfs_ok - @pnf points to an nfsd_file with its reference
> > + *   count boosted.
> > + *
> > + * On error, an nfsstat value in network byte order is returned.
> > + */
> > +__be32
> > +nfsd_file_acquire_cached(struct net *net, const struct cred *cred,
> > +			 void *inode_key, unsigned int may_flags,
> > +			 struct nfsd_file **pnf)
> > +{
> > +	unsigned char need = may_flags & NFSD_FILE_MAY_MASK;
> > +	struct nfsd_file *nf;
> > +	__be32 status;
> > +
> > +	rcu_read_lock();
> > +	nf = nfsd_file_lookup_locked(net, cred, inode_key, need, true);
> > +	rcu_read_unlock();
> > +
> > +	if (unlikely(!nf))
> > +		return nfserr_noent;
> > +
> > +	/*
> > +	 * If the nf is on the LRU then it holds an extra reference
> > +	 * that must be put if it's removed. It had better not be
> > +	 * the last one however, since we should hold another.
> > +	 */
> > +	if (nfsd_file_lru_remove(nf))
> > +		WARN_ON_ONCE(refcount_dec_and_test(&nf->nf_ref));
> 
> Just use refcount_dec(&nf->nf_ref).  It will provide a warning if the
> count reaches zero.  In general you should never need warnings when
> using refcount as it generates the needed warnings itself.

OK, I lifted the code from nfsd_file_do_acquire() as a starting point;
so it should probably be cleaned up there (independent of this patch,
not it).

> > +
> > +	if (WARN_ON_ONCE(test_bit(NFSD_FILE_PENDING, &nf->nf_flags) ||
> > +			 !test_bit(NFSD_FILE_HASHED, &nf->nf_flags))) {
> > +		status = nfserr_inval;
> > +		goto error;
> > +	}
> 
> Do we really want the above?  I guess you were following the pattern in
> nfsd_file_do_acquire() which waits for FILE_PENDING and then re-tests
> FILE_HASHED (nfsd_file_lookup_locked() already tested it).
> I guess it doesn't hurt, but I'm not sure it helps.

Right, was just trying to maintain status-quo.  I think it doesn't
hurt, and may actually help given spin_lock isn't held around
nfsd_file_lookup_locked?  But the WARN_ON_ONCE should be removed.

> > diff --git a/fs/nfsd/localio.c b/fs/nfsd/localio.c
> > index f441cb9f74d5..34a229409117 100644
> > --- a/fs/nfsd/localio.c
> > +++ b/fs/nfsd/localio.c
> > @@ -58,33 +58,67 @@ void nfsd_localio_ops_init(void)
> >  struct nfsd_file *
> >  nfsd_open_local_fh(struct net *net, struct auth_domain *dom,
> >  		   struct rpc_clnt *rpc_clnt, const struct cred *cred,
> > -		   const struct nfs_fh *nfs_fh, const fmode_t fmode)
> > +		   struct nfs_fh *nfs_fh, const fmode_t fmode)
> >  {
> >  	int mayflags = NFSD_MAY_LOCALIO;
> >  	struct svc_cred rq_cred;
> >  	struct svc_fh fh;
> >  	struct nfsd_file *localio;
> > +	void *nf_inode_key;
> >  	__be32 beres;
> >  
> >  	if (nfs_fh->size > NFS4_FHSIZE)
> >  		return ERR_PTR(-EINVAL);
> >  
> > -	/* nfs_fh -> svc_fh */
> > -	fh_init(&fh, NFS4_FHSIZE);
> > -	fh.fh_handle.fh_size = nfs_fh->size;
> > -	memcpy(fh.fh_handle.fh_raw, nfs_fh->data, nfs_fh->size);
> > -
> >  	if (fmode & FMODE_READ)
> >  		mayflags |= NFSD_MAY_READ;
> >  	if (fmode & FMODE_WRITE)
> >  		mayflags |= NFSD_MAY_WRITE;
> >  
> > +	/*
> > +	 * Check if 'inode_key' stored in @nfs_fh maps to an
> > +	 * open nfsd_file in the filecache (from a previous
> > +	 * nfsd_open_local_fh).
> > +	 *
> > +	 * The 'inode_key' may have become stale (due to nfsd_file
> > +	 * being free'd by filecache GC) so the lookup will fail
> > +	 * gracefully by falling back to using nfsd_file_acquire_local().
> > +	 */
> > +	nf_inode_key = READ_ONCE(nfs_fh->inode_key);
> 
> I think you want the above in an rcu-locked region with
> nfsd_file_acquire_cached().  Else the inode could be freed and
> reallocated after the READ_ONCE and before the lookup.
> Maybe pass the address of inode_key and have nfsd_file_acquire_cached()
> deref after getting rcu_read_lock().

Good point, will do.

> > +	if (nf_inode_key) {
> > +		beres = nfsd_file_acquire_cached(net, cred,
> > +						 nf_inode_key,
> > +						 mayflags, &localio);
> > +		if (beres == nfs_ok)
> > +			return localio;
> > +		/*
> > +		 * reset stale nfs_fh->inode_key and fallthru
> > +		 * to use normal nfsd_file_acquire_local().
> > +		 */
> > +		nfs_fh->inode_key = NULL;
> > +	}

> > diff --git a/include/linux/nfs.h b/include/linux/nfs.h
> > index 9ad727ddfedb..56c894575c70 100644
> > --- a/include/linux/nfs.h
> > +++ b/include/linux/nfs.h
> > @@ -29,6 +29,10 @@
> >  struct nfs_fh {
> >  	unsigned short		size;
> >  	unsigned char		data[NFS_MAXFHSIZE];
> > +#if IS_ENABLED(CONFIG_NFS_LOCALIO)
> > +	/* 'inode_key' is an opaque key used for nfsd_file hash lookups */
> > +	void *			inode_key;
> > +#endif
> 
> I wonder if this is really the right place to store the inode_key.
> 'struct nfs_fh' appears in lots of places and the only place where you
> want the inode_key is inside the nfs_inode.  Maybe store an augmented
> nfs_fh in there...

Yeah, I went with nfs_fh because it served my needs (and nfs_fh is
passed to nfs_open_local_fh).  But very much open to suggestions.
While I'm not yet sure about your nfs_inode suggestion, I'll certainly
take a closer look after addressing your other feedback.

Thanks for your review!

Mike
NeilBrown Oct. 27, 2024, 9:49 p.m. UTC | #7
On Sat, 26 Oct 2024, Jeff Layton wrote:
> On Fri, 2024-10-25 at 13:31 +0000, Trond Myklebust wrote:
> > On Fri, 2024-10-25 at 12:52 +0000, Chuck Lever III wrote:
> > > 
> > > 
> > > > On Oct 24, 2024, at 11:00 PM, NeilBrown <neilb@suse.de> wrote:
> > > > 
> > > > I'm wondering about the request for a garbage-collected nfsd_file
> > > > though.  For NFSv3 that makes sense.  For NFSv4 we would expect the
> > > > file
> > > > to already be open as a non-garbage-collected nfsd_file and opening
> > > > it
> > > > again seems wasteful.  That doesn't need to be fixed for this patch
> > > > and
> > > > maybe doesn't need to be fixed at all, but it seemed worth
> > > > highlighting.
> > > 
> > > I don't think using a GC'd nfsd_file for LOCALIO is a bug.
> > > 
> > > NFSv4 guarantees an OPEN to pin the nfsd_file, and a CLOSE
> > > (or lease expiry) to release it. There's no desire for or
> > > need for garbage collection.
> > > 
> > > NFSv3 and LOCALIO use each nfsd_file for the life of one I/O
> > > operation, and that nfsd_file is cached in the expectation
> > > that another I/O operation on the same file will happen with
> > > frequent temporarl proximity. Garbage collection is needed
> > > for both cases.
> > > 
> > 
> > No. There is a huge difference between the two. In the case of NFSv3,
> > the server has no idea whether or not there is further need for the
> > file (stateless), whereas in the localio case, the client is able to
> > tell exactly when the application holds the file open or not
> > (stateful).
> > 
> > The only reason for jumping through all these hoops in the case of
> > localio is the 'user may want to unmount' exception.
> > 
> > Is there any reason why we couldn't add a notification for that, to
> > give knfsd the opportunity to clear out any open file state? The
> > current approach appears to be flat out optimising for the exception.
> > 
> > 
> 
> No, and after discussing it with others here at the bake-a-thon, I
> think that might be the best path forward here. At a high level:
> 
> Add a callback to the client that runs just before calling
> nfsd_file_cache_purge() in expkey_flush(). The client would be expected
> to return all of its nfsd_files "soon" and switch back to normal
> network I/O. After that point, it can try to get a new nfsd_file and
> start up localio at that point. With that change too you can switch to
> using non-GC'ed nfsd_files. The client can just hold them open and do
> I/O to them at will.

I don't think a "callback" is the best approach - certainly not in the
sense of a function pointer that nfs gives to nfsd.  Rather we want some
protocol, mediated by nfs-local, which allows nfsd to invalidate a thing
and nfs to know when it has been invalidated and to signal its release.

I suspect the "thing" is an nfsd_file.  I think a suitable protocol
involves SRCU.  There would be one SRCU state for all of nfsd.

nfs gets a reference to an nfsd_file and holds it as long as it likes.
Before accessing nf_file (or nf_inode) it gets srcu_read_lock() on nfsd
and checks that the nfsd_file is still valid - probably if it is still
hashed.  If not it drops the reference.  If it is valid it can safely
use nf_file for an IO, and then srcu_read_unlock() to release it.

nfsd_file_cache_purge() add a synchronize_srcu() call just before
calling nfsd_file_dispose_list()

We would need two different refcounts - on for localio to use which must
be augmented with srcu and which doesn't prevent nfsd_file_cond_queue()
from queuing for deletion, and the current one which does prevent queuing.

The new refcount would be a kref and when when it reaches zero the
nfsd_file is freed.  So nfsd_file_slab_free() wouldn't free the
nfsd_file but would kref_put() it.

NeilBrown

> 
> I think that's probably less brittle than trying to keep opaque inode
> pointers around, and would probably mean better performance since you
> won't be thrashing the filecache as much.
> -- 
> Jeff Layton <jlayton@kernel.org>
> 
>
diff mbox series

Patch

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index cc7a32b32676..3051d65e3a89 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -2413,6 +2413,9 @@  struct inode *nfs_alloc_inode(struct super_block *sb)
 #endif /* CONFIG_NFS_V4 */
 #ifdef CONFIG_NFS_V4_2
 	nfsi->xattr_cache = NULL;
+#endif
+#if IS_ENABLED(CONFIG_NFS_LOCALIO)
+	nfsi->fh.inode_key = NULL;
 #endif
 	nfs_netfs_inode_init(nfsi);
 
diff --git a/fs/nfs_common/nfslocalio.c b/fs/nfs_common/nfslocalio.c
index 09404d142d1a..bacebaa1e15c 100644
--- a/fs/nfs_common/nfslocalio.c
+++ b/fs/nfs_common/nfslocalio.c
@@ -130,7 +130,7 @@  EXPORT_SYMBOL_GPL(nfs_uuid_invalidate_one_client);
 
 struct nfsd_file *nfs_open_local_fh(nfs_uuid_t *uuid,
 		   struct rpc_clnt *rpc_clnt, const struct cred *cred,
-		   const struct nfs_fh *nfs_fh, const fmode_t fmode)
+		   struct nfs_fh *nfs_fh, const fmode_t fmode)
 {
 	struct net *net;
 	struct nfsd_file *localio;
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 1408166222c5..5ab978ac3555 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -221,6 +221,9 @@  nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
 	INIT_LIST_HEAD(&nf->nf_gc);
 	nf->nf_birthtime = ktime_get();
 	nf->nf_file = NULL;
+#if IS_ENABLED(CONFIG_NFS_LOCALIO)
+	nf->nf_inodep = NULL;
+#endif
 	nf->nf_cred = get_current_cred();
 	nf->nf_net = net;
 	nf->nf_flags = want_gc ?
@@ -285,6 +288,12 @@  nfsd_file_free(struct nfsd_file *nf)
 		nfsd_file_check_write_error(nf);
 		nfsd_filp_close(nf->nf_file);
 	}
+#if IS_ENABLED(CONFIG_NFS_LOCALIO)
+	if (nf->nf_inodep) {
+		*(nf->nf_inodep) = NULL;
+		nf->nf_inodep = NULL;
+	}
+#endif
 
 	/*
 	 * If this item is still linked via nf_lru, that's a bug.
@@ -1255,6 +1264,75 @@  nfsd_file_acquire_local(struct net *net, struct svc_cred *cred,
 	return beres;
 }
 
+/**
+ * nfsd_file_acquire_cached - Get cached GC'd open file using inode
+ * @net: The network namespace in which to perform a lookup
+ * @cred: the user credential with which to validate access
+ * @inode_key: inode to use as opaque lookup key
+ * @may_flags: NFSD_MAY_ settings for the file
+ * @pnf: OUT: found cached GC'd "struct nfsd_file" object
+ *
+ * Rather than make nfsd_file_do_acquire more complex (by training
+ * it to conditionally skip fh_verify(), nfsd_file allocation and
+ * contruction) duplicate the minimalist subset of it that is
+ * needed to achieve nfsd_file lookup using the opaque @inode_key.
+ *
+ * The nfsd_file object returned by this API is reference-counted
+ * and garbage-collected. The object is retained for a few
+ * seconds after the final nfsd_file_put() in case the caller
+ * wants to re-use it.
+ *
+ * Return values:
+ *   %nfs_ok - @pnf points to an nfsd_file with its reference
+ *   count boosted.
+ *
+ * On error, an nfsstat value in network byte order is returned.
+ */
+__be32
+nfsd_file_acquire_cached(struct net *net, const struct cred *cred,
+			 void *inode_key, unsigned int may_flags,
+			 struct nfsd_file **pnf)
+{
+	unsigned char need = may_flags & NFSD_FILE_MAY_MASK;
+	struct nfsd_file *nf;
+	__be32 status;
+
+	rcu_read_lock();
+	nf = nfsd_file_lookup_locked(net, cred, inode_key, need, true);
+	rcu_read_unlock();
+
+	if (unlikely(!nf))
+		return nfserr_noent;
+
+	/*
+	 * If the nf is on the LRU then it holds an extra reference
+	 * that must be put if it's removed. It had better not be
+	 * the last one however, since we should hold another.
+	 */
+	if (nfsd_file_lru_remove(nf))
+		WARN_ON_ONCE(refcount_dec_and_test(&nf->nf_ref));
+
+	if (WARN_ON_ONCE(test_bit(NFSD_FILE_PENDING, &nf->nf_flags) ||
+			 !test_bit(NFSD_FILE_HASHED, &nf->nf_flags))) {
+		status = nfserr_inval;
+		goto error;
+	}
+	this_cpu_inc(nfsd_file_cache_hits);
+
+	status = nfserrno(nfsd_open_break_lease(file_inode(nf->nf_file), may_flags));
+	if (status != nfs_ok) {
+error:
+		nfsd_file_put(nf);
+		nf = NULL;
+	} else {
+		this_cpu_inc(nfsd_file_acquisitions);
+		nfsd_file_check_write_error(nf);
+		*pnf = nf;
+	}
+	trace_nfsd_file_acquire(NULL, inode_key, may_flags, nf, status);
+	return status;
+}
+
 /**
  * nfsd_file_acquire_opened - Get a struct nfsd_file using existing open file
  * @rqstp: the RPC transaction being executed
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index cadf3c2689c4..e000f6988dc8 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -47,6 +47,10 @@  struct nfsd_file {
 	struct list_head	nf_gc;
 	struct rcu_head		nf_rcu;
 	ktime_t			nf_birthtime;
+
+#if IS_ENABLED(CONFIG_NFS_LOCALIO)
+	void **			nf_inodep;
+#endif
 };
 
 int nfsd_file_cache_init(void);
@@ -71,5 +75,8 @@  __be32 nfsd_file_acquire_opened(struct svc_rqst *rqstp, struct svc_fh *fhp,
 __be32 nfsd_file_acquire_local(struct net *net, struct svc_cred *cred,
 			       struct auth_domain *client, struct svc_fh *fhp,
 			       unsigned int may_flags, struct nfsd_file **pnf);
+__be32 nfsd_file_acquire_cached(struct net *net, const struct cred *cred,
+			       void *inode_key, unsigned int may_flags,
+			       struct nfsd_file **pnf);
 int nfsd_file_cache_stats_show(struct seq_file *m, void *v);
 #endif /* _FS_NFSD_FILECACHE_H */
diff --git a/fs/nfsd/localio.c b/fs/nfsd/localio.c
index f441cb9f74d5..34a229409117 100644
--- a/fs/nfsd/localio.c
+++ b/fs/nfsd/localio.c
@@ -58,33 +58,67 @@  void nfsd_localio_ops_init(void)
 struct nfsd_file *
 nfsd_open_local_fh(struct net *net, struct auth_domain *dom,
 		   struct rpc_clnt *rpc_clnt, const struct cred *cred,
-		   const struct nfs_fh *nfs_fh, const fmode_t fmode)
+		   struct nfs_fh *nfs_fh, const fmode_t fmode)
 {
 	int mayflags = NFSD_MAY_LOCALIO;
 	struct svc_cred rq_cred;
 	struct svc_fh fh;
 	struct nfsd_file *localio;
+	void *nf_inode_key;
 	__be32 beres;
 
 	if (nfs_fh->size > NFS4_FHSIZE)
 		return ERR_PTR(-EINVAL);
 
-	/* nfs_fh -> svc_fh */
-	fh_init(&fh, NFS4_FHSIZE);
-	fh.fh_handle.fh_size = nfs_fh->size;
-	memcpy(fh.fh_handle.fh_raw, nfs_fh->data, nfs_fh->size);
-
 	if (fmode & FMODE_READ)
 		mayflags |= NFSD_MAY_READ;
 	if (fmode & FMODE_WRITE)
 		mayflags |= NFSD_MAY_WRITE;
 
+	/*
+	 * Check if 'inode_key' stored in @nfs_fh maps to an
+	 * open nfsd_file in the filecache (from a previous
+	 * nfsd_open_local_fh).
+	 *
+	 * The 'inode_key' may have become stale (due to nfsd_file
+	 * being free'd by filecache GC) so the lookup will fail
+	 * gracefully by falling back to using nfsd_file_acquire_local().
+	 */
+	nf_inode_key = READ_ONCE(nfs_fh->inode_key);
+	if (nf_inode_key) {
+		beres = nfsd_file_acquire_cached(net, cred,
+						 nf_inode_key,
+						 mayflags, &localio);
+		if (beres == nfs_ok)
+			return localio;
+		/*
+		 * reset stale nfs_fh->inode_key and fallthru
+		 * to use normal nfsd_file_acquire_local().
+		 */
+		nfs_fh->inode_key = NULL;
+	}
+
+	/* nfs_fh -> svc_fh */
+	fh_init(&fh, NFS4_FHSIZE);
+	fh.fh_handle.fh_size = nfs_fh->size;
+	memcpy(fh.fh_handle.fh_raw, nfs_fh->data, nfs_fh->size);
+
 	svcauth_map_clnt_to_svc_cred_local(rpc_clnt, cred, &rq_cred);
 
 	beres = nfsd_file_acquire_local(net, &rq_cred, dom,
 					&fh, mayflags, &localio);
 	if (beres)
 		localio = ERR_PTR(nfs_stat_to_errno(be32_to_cpu(beres)));
+	else {
+		/*
+		 * opaque 'inode_key' has a 1:1 mapping to both an
+		 * nfsd_file and nfs_fh struct (And the nfs_fh is shared
+		 * by all NFS client threads. So there is no risk of
+		 * storing competing addresses in nfsd_file->nf_inodep
+		 */
+		localio->nf_inodep = (void **) &nfs_fh->inode_key;
+		nfs_fh->inode_key = localio->nf_inode;
+	}
 
 	fh_put(&fh);
 	if (rq_cred.cr_group_info)
diff --git a/include/linux/nfs.h b/include/linux/nfs.h
index 9ad727ddfedb..56c894575c70 100644
--- a/include/linux/nfs.h
+++ b/include/linux/nfs.h
@@ -29,6 +29,10 @@ 
 struct nfs_fh {
 	unsigned short		size;
 	unsigned char		data[NFS_MAXFHSIZE];
+#if IS_ENABLED(CONFIG_NFS_LOCALIO)
+	/* 'inode_key' is an opaque key used for nfsd_file hash lookups */
+	void *			inode_key;
+#endif
 };
 
 /*
diff --git a/include/linux/nfslocalio.h b/include/linux/nfslocalio.h
index 3982fea79919..619eb1961ed6 100644
--- a/include/linux/nfslocalio.h
+++ b/include/linux/nfslocalio.h
@@ -43,7 +43,7 @@  void nfs_uuid_invalidate_one_client(nfs_uuid_t *nfs_uuid);
 /* localio needs to map filehandle -> struct nfsd_file */
 extern struct nfsd_file *
 nfsd_open_local_fh(struct net *, struct auth_domain *, struct rpc_clnt *,
-		   const struct cred *, const struct nfs_fh *,
+		   const struct cred *, struct nfs_fh *,
 		   const fmode_t) __must_hold(rcu);
 
 struct nfsd_localio_operations {
@@ -53,7 +53,7 @@  struct nfsd_localio_operations {
 						struct auth_domain *,
 						struct rpc_clnt *,
 						const struct cred *,
-						const struct nfs_fh *,
+						struct nfs_fh *,
 						const fmode_t);
 	void (*nfsd_file_put_local)(struct nfsd_file *);
 	struct file *(*nfsd_file_file)(struct nfsd_file *);
@@ -64,7 +64,7 @@  extern const struct nfsd_localio_operations *nfs_to;
 
 struct nfsd_file *nfs_open_local_fh(nfs_uuid_t *,
 		   struct rpc_clnt *, const struct cred *,
-		   const struct nfs_fh *, const fmode_t);
+		   struct nfs_fh *, const fmode_t);
 
 static inline void nfs_to_nfsd_file_put_local(struct nfsd_file *localio)
 {