diff mbox series

[v2,7/9] NFSD: Use rhashtable for managing nfs4_file objects

Message ID 166507324882.1802.884870684212914640.stgit@manet.1015granger.net (mailing list archive)
State New, archived
Series: A course adjustment, maybe...

Commit Message

Chuck Lever Oct. 6, 2022, 4:20 p.m. UTC
fh_match() is expensive to use for hash chains that contain more
than a few objects. With common workloads, I see multiple thousands
of objects stored in file_hashtbl[], which always has only 256
buckets.

Replace it with an rhashtable, which dynamically resizes its bucket
array to keep hash chains short.

This also removes the need for the state_lock to serialize
operations on the new rhashtable.

The result is improved latency for NFSv4 operations and reduced
nfsd CPU utilization, as nfsd no longer suffers the cache misses
of walking long hash chains in file_hashtbl.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs4state.c |  235 ++++++++++++++++++++++++++++++++++++---------------
 fs/nfsd/state.h     |    5 -
 2 files changed, 165 insertions(+), 75 deletions(-)

Comments

NeilBrown Oct. 11, 2022, 12:16 a.m. UTC | #1
On Fri, 07 Oct 2022, Chuck Lever wrote:
>  
> -static unsigned int file_hashval(struct svc_fh *fh)
> +/*
> + * The returned hash value is based solely on the address of an in-core
> + * inode, a pointer to a slab-allocated object. The entropy in such a
> + * pointer is concentrated in its middle bits.

I think you need more justification than that for throwing away some of
the entropy, even if you don't think it is much.
Presumably you think hashing 32 bits is faster than hashing 64 bits.
Maybe it is, but is it worth it?

rhashtable depends heavily on having a strong hash function.  In
particular if any bucket ends up with more than 16 elements it will
choose a new seed and rehash.  If you deliberately remove some bits that
might have been used to spread those 16 out, then you are asking for
trouble.

We know that two different file handles can refer to the same inode
("rarely"), and you deliberately place them in the same hash bucket.
So if an attacker arranges to access 17 files with the same inode but
different file handles, then the hashtable will be continuously
rehashed.

The preferred approach when you require things to share a hash chain is
to use an rhl table.  This allows multiple instances with the same key.
You would then key the rhl-table with the inode, and search a
linked-list to find the entry with the desired file handle.  This would
be no worse in search time than the current code for aliased inodes, but
less susceptible to attack.

> +/**
> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
> + * @arg: search criteria
> + * @ptr: cache item to check
> + *
> + * Return values:
> + *   %0 - Item matches search criteria
> + *   %1 - Item does not match search criteria

I *much* prefer %-ESRCH for "does not match search criteria".  It is
self-documenting.  Any non-zero value will do.

NeilBrown
Chuck Lever Oct. 11, 2022, 12:56 p.m. UTC | #2
> On Oct 10, 2022, at 8:16 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Fri, 07 Oct 2022, Chuck Lever wrote:
>> 
>> -static unsigned int file_hashval(struct svc_fh *fh)
>> +/*
>> + * The returned hash value is based solely on the address of an in-core
>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>> + * pointer is concentrated in its middle bits.
> 
> I think you need more justification than that for throwing away some of
> the entropy, even if you don't think it is much.

We might have that justification:

https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/

Actually I believe we are not discarding /any/ entropy in
this function. The bits we discard are invariant.

And, note that this is exactly the same situation we just merged
in the filecache overhaul, and is a common trope amongst other
hash tables that base their function on the inode's address.


> Presumably you think hashing 32 bits is faster than hashing 64 bits.
> Maybe it is, but is it worth it?
> 
> rhashtable depends heavily on having a strong hash function.  In
> particular if any bucket ends up with more than 16 elements it will
> choose a new seed and rehash.  If you deliberately remove some bits that
> might have been used to spread those 16 out, then you are asking for
> trouble.
> 
> We know that two different file handles can refer to the same inode
> ("rarely"), and you deliberately place them in the same hash bucket.
> So if an attacker arranges to access 17 files with the same inode but
> different file handles, then the hashtable will be continuously
> rehashed.
> 
> The preferred approach when you require things to share a hash chain is
> to use an rhl table.

Again, this is the same situation for the filecache. Do you
believe it is worth reworking that? I'm guessing "yes".


> This allows multiple instances with the same key.
> You would then key the rhl-table with the inode, and search a
> linked-list to find the entry with the desired file handle.  This would
> be no worse in search time than the current code for aliased inodes, but
> less susceptible to attack.
> 
>> +/**
>> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
>> + * @arg: search criteria
>> + * @ptr: cache item to check
>> + *
>> + * Return values:
>> + *   %0 - Item matches search criteria
>> + *   %1 - Item does not match search criteria
> 
> I *much* prefer %-ESRCH for "does not match search criteria".  It is
> self-documenting.  Any non-zero value will do.

Noted, but returning 1 appears to be the typical arrangement for
existing obj_cmpfn methods in most other areas.


--
Chuck Lever
NeilBrown Oct. 11, 2022, 11:37 p.m. UTC | #3
On Tue, 11 Oct 2022, Chuck Lever III wrote:
> > On Oct 10, 2022, at 8:16 PM, NeilBrown <neilb@suse.de> wrote:
> > 
> > On Fri, 07 Oct 2022, Chuck Lever wrote:
> >> 
> >> -static unsigned int file_hashval(struct svc_fh *fh)
> >> +/*
> >> + * The returned hash value is based solely on the address of an in-core
> >> + * inode, a pointer to a slab-allocated object. The entropy in such a
> >> + * pointer is concentrated in its middle bits.
> > 
> > I think you need more justification than that for throwing away some of
> > the entropy, even if you don't think it is much.
> 
> We might have that justification:
> 
> https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/
> 
> Actually I believe we are not discarding /any/ entropy in
> this function. The bits we discard are invariant.
> 

Ok, I can accept that this:

+	k = ptr >> L1_CACHE_SHIFT;

only discards invariant bits, but how can you justify this:

+	k &= 0x00ffffff;

??

And given that you pass it all to jhash anyway, why not just pass all of
it?


> And, note that this is exactly the same situation we just merged
> in the filecache overhaul, and is a common trope amongst other
> hash tables that base their function on the inode's address.
> 
> 
> > Presumably you think hashing 32 bits is faster than hashing 64 bits.
> > Maybe it is, but is it worth it?
> > 
> > rhashtable depends heavily on having a strong hash function.  In
> > particular if any bucket ends up with more than 16 elements it will
> > choose a new seed and rehash.  If you deliberately remove some bits that
> > might have been used to spread those 16 out, then you are asking for
> > trouble.
> > 
> > We know that two different file handles can refer to the same inode
> > ("rarely"), and you deliberately place them in the same hash bucket.
> > So if an attacker arranges to access 17 files with the same inode but
> > different file handles, then the hashtable will be continuously
> > rehashed.
> > 
> > The preferred approach when you require things to share a hash chain is
> > to use an rhl table.
> 
> Again, this is the same situation for the filecache. Do you
> believe it is worth reworking that? I'm guessing "yes".

As a matter of principle: yes.
rhashtable is designed to assume that hash collisions are bad and can be
changed by choosing a different seed.
rhl_tables was explicitly added for cases when people wanted multiple
elements to hash to the same value.

The chance of it causing a problem without an attack is admittedly
tiny.  Attacks are only possible with subtree_check enabled, or if the
underlying filesystem does something "clever" with file handles, so
there wouldn't be many situations where an attack would even be
possible.  But if it were possible, it would likely be easy.
The cost of the attack would be a minor-to-modest performance impact.

So it is hard to argue "this code is dangerous and must be changed", but
we have different tools for a reason, and I believe that rhl-tables is
the right tool for this job.

> 
> 
> > This allows multiple instances with the same key.
> > You would then key the rhl-table with the inode, and search a
> > linked-list to find the entry with the desired file handle.  This would
> > be no worse in search time than the current code for aliased inodes, but
> > less susceptible to attack.
> > 
> >> +/**
> >> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
> >> + * @arg: search criteria
> >> + * @ptr: cache item to check
> >> + *
> >> + * Return values:
> >> + *   %0 - Item matches search criteria
> >> + *   %1 - Item does not match search criteria
> > 
> > I *much* prefer %-ESRCH for "does not match search criteria".  It is
> > self-documenting.  Any non-zero value will do.
> 
> Noted, but returning 1 appears to be the typical arrangement for
> existing obj_cmpfn methods in most other areas.

Some people have no imagination :-)

[Same argument could be used against implementing kernel modules in Rust
 - Using C appears to be the typical arrangement for existing modules]

Thanks,
NeilBrown
NeilBrown Oct. 12, 2022, 6:45 a.m. UTC | #4
On Tue, 11 Oct 2022, Chuck Lever III wrote:
> > On Oct 10, 2022, at 8:16 PM, NeilBrown <neilb@suse.de> wrote:
> > 
> > On Fri, 07 Oct 2022, Chuck Lever wrote:
> >> 
> >> -static unsigned int file_hashval(struct svc_fh *fh)
> >> +/*
> >> + * The returned hash value is based solely on the address of an in-core
> >> + * inode, a pointer to a slab-allocated object. The entropy in such a
> >> + * pointer is concentrated in its middle bits.
> > 
> > I think you need more justification than that for throwing away some of
> > the entropy, even if you don't think it is much.
> 
> We might have that justification:
> 
> https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/
> 
> Actually I believe we are not discarding /any/ entropy in
> this function. The bits we discard are invariant.
> 
> And, note that this is exactly the same situation we just merged
> in the filecache overhaul, and is a common trope amongst other
> hash tables that base their function on the inode's address.

Common?  I searched for ">> *L1_CACHE_SHIFT".  Apart from the nfsd
filecache you mentioned I find two.  One in quota and one in reiserfs.
Both work with traditional hash tables which are more forgiving of
longer chains.
Do you have other evidence of this being a common trope?

Thanks,
NeilBrown
Chuck Lever Oct. 12, 2022, 3:01 p.m. UTC | #5
> On Oct 11, 2022, at 7:37 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Tue, 11 Oct 2022, Chuck Lever III wrote:
>>> On Oct 10, 2022, at 8:16 PM, NeilBrown <neilb@suse.de> wrote:
>>> 
>>> On Fri, 07 Oct 2022, Chuck Lever wrote:
>>>> 
>>>> -static unsigned int file_hashval(struct svc_fh *fh)
>>>> +/*
>>>> + * The returned hash value is based solely on the address of an in-core
>>>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>>>> + * pointer is concentrated in its middle bits.
>>> 
>>> I think you need more justification than that for throwing away some of
>>> the entropy, even if you don't think it is much.
>> 
>> We might have that justification:
>> 
>> https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/
>> 
>> Actually I believe we are not discarding /any/ entropy in
>> this function. The bits we discard are invariant.
>> 
> 
> Ok, I can accept that this:
> 
> +	k = ptr >> L1_CACHE_SHIFT;

> I searched for ">> *L1_CACHE_SHIFT".  Apart from the nfsd
> filecache you mentioned I find two.  One in quota and one in reiserfs.
> Both work with traditional hash tables which are more forgiving of
> longer chains.
> Do you have other evidence of this being a common trope?


This approach is based on the hash function in fs/inode.c,
which uses integer division instead of a shift.

static unsigned long hash(struct super_block *sb, unsigned long hashval)
{
        unsigned long tmp;

        tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
                        L1_CACHE_BYTES;
        tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
        return tmp & i_hash_mask;
}


> only discards invariant bits, but how can you justify this:
> 
> +	k &= 0x00ffffff;
> 
> ??

After shifting an address, the top byte generally contains
invariant bits as well.


> And given that you pass it all to jhash anyway, why not just pass all of
> it?

I don't think this is a big deal, but these functions are
basically the same as what was recently merged without
complaint. It's not a high priority to revisit those.

It might be worth a clean-up to share this hash function
between the two hash tables... at that point we might
consider removing the extra mask.


>> And, note that this is exactly the same situation we just merged
>> in the filecache overhaul, and is a common trope amongst other
>> hash tables that base their function on the inode's address.
>> 
>> 
>>> Presumably you think hashing 32 bits is faster than hashing 64 bits.
>>> Maybe it is, but is it worth it?
>>> 
>>> rhashtable depends heavily on having a strong hash function.  In
>>> particular if any bucket ends up with more than 16 elements it will
>>> choose a new seed and rehash.  If you deliberately remove some bits that
>>> might have been used to spread those 16 out, then you are asking for
>>> trouble.
>>> 
>>> We know that two different file handles can refer to the same inode
>>> ("rarely"), and you deliberately place them in the same hash bucket.
>>> So if an attacker arranges to access 17 files with the same inode but
>>> different file handles, then the hashtable will be continuously
>>> rehashed.
>>> 
>>> The preferred approach when you require things to share a hash chain is
>>> to use an rhl table.
>> 
>> Again, this is the same situation for the filecache. Do you
>> believe it is worth reworking that? I'm guessing "yes".
> 
> As a matter of principle: yes.
> rhashtable is designed to assume that hash collisions are bad and can be
> changed by choosing a different seed.
> rhl_tables was explicitly added for cases when people wanted multiple
> elements to hash to the same value.
> 
> The chance of it causing a problem without an attack is admittedly
> tiny.  Attacks are only possible with subtree_check enabled, or if the
> underlying filesystem does something "clever" with file handles, so
> there wouldn't be many situations where an attack would even be
> possible.  But if it were possible, it would likely be easy.
> The cost of the attack would be a minor-to-modest performance impact.
> 
> So it is hard to argue "this code is dangerous and must be changed", but
> we have different tools for a reason, and I believe that rhl-tables is
> the right tool for this job.

Agreed. I wasn't suggesting it's an emergency situation, but it's
something that should get fixed at some point if there's a problem
with it, even a minor one.

I think I stopped at the non-list variant of rhashtable because
using rhl was more complex, and the non-list variant seemed to
work fine. There's no architectural reason either file_hashtbl
or the filecache must use the non-list variant.

In any event, it's worth taking the trouble now to change the
nfs4_file implementation proposed here as you suggest.


--
Chuck Lever
NeilBrown Oct. 12, 2022, 9:18 p.m. UTC | #6
On Thu, 13 Oct 2022, Chuck Lever III wrote:
> 
> I think I stopped at the non-list variant of rhashtable because
> using rhl was more complex, and the non-list variant seemed to
> work fine. There's no architectural reason either file_hashtbl
> or the filecache must use the non-list variant.
> 
> In any event, it's worth taking the trouble now to change the
> nfs4_file implementation proposed here as you suggest.

If you like you could leave it as-is for now and I can provide a patch
to convert to rhl-tables later (won't be until late October).
There is one thing I would need to understand though: why are the
nfsd_files per-filehandle instead of per-inode?  There is probably a
good reason, but I cannot think of one.

Thanks,
NeilBrown
Chuck Lever Oct. 13, 2022, 2:19 p.m. UTC | #7
> On Oct 12, 2022, at 5:18 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Thu, 13 Oct 2022, Chuck Lever III wrote:
>> 
>> I think I stopped at the non-list variant of rhashtable because
>> using rhl was more complex, and the non-list variant seemed to
>> work fine. There's no architectural reason either file_hashtbl
>> or the filecache must use the non-list variant.
>> 
>> In any event, it's worth taking the trouble now to change the
>> nfs4_file implementation proposed here as you suggest.
> 
> If you like you could leave it as-is for now and I can provide a patch
> to convert to rhl-tables later (won't be until late October).
> There is one thing I would need to understand though: why are the
> nfsd_files per-filehandle instead of per-inode?  There is probably a
> good reason, but I cannot think of one.

I'm not clear on your offer: do you mean converting the nfsd_file
cache from rhashtable to rhl, or converting the proposed nfs4_file
rework? I had planned to do the latter myself and post a refresh.

The nfsd_file_acquire API is the only place that seems to want a
filehandle, and it's just to lookup the underlying inode. Perhaps
I don't understand your question?


--
Chuck Lever
NeilBrown Oct. 13, 2022, 10:14 p.m. UTC | #8
On Fri, 14 Oct 2022, Chuck Lever III wrote:
> 
> > On Oct 12, 2022, at 5:18 PM, NeilBrown <neilb@suse.de> wrote:
> > 
> > On Thu, 13 Oct 2022, Chuck Lever III wrote:
> >> 
> >> I think I stopped at the non-list variant of rhashtable because
> >> using rhl was more complex, and the non-list variant seemed to
> >> work fine. There's no architectural reason either file_hashtbl
> >> or the filecache must use the non-list variant.
> >> 
> >> In any event, it's worth taking the trouble now to change the
> >> nfs4_file implementation proposed here as you suggest.
> > 
> > If you like you could leave it as-is for now and I can provide a patch
> > to convert to rhl-tables later (won't be until late October).
> > There is one thing I would need to understand though: why are the
> > nfsd_files per-filehandle instead of per-inode?  There is probably a
> > good reason, but I cannot think of one.
> 
> I'm not clear on your offer: do you mean converting the nfsd_file
> cache from rhashtable to rhl, or converting the proposed nfs4_file
> rework? I had planned to do the latter myself and post a refresh.

Either/both.  Of course if you do the refresh, then I'll just review it.

> 
> The nfsd_file_acquire API is the only place that seems to want a
> filehandle, and it's just to lookup the underlying inode. Perhaps
> I don't understand your question?

Sorry, I meant nfs4_files, not nfsd_file: find_file() and find_or_add_file().
Why is there one nfs4_file per filehandle?

I see that there can be several nfsd_file per inode - in different
network namespaces, or with different credentials or different access
modes.  That really needs to be fixed.

Thanks,
NeilBrown
Chuck Lever Oct. 14, 2022, 12:48 p.m. UTC | #9
> On Oct 13, 2022, at 6:14 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Fri, 14 Oct 2022, Chuck Lever III wrote:
>> 
>>> On Oct 12, 2022, at 5:18 PM, NeilBrown <neilb@suse.de> wrote:
>>> 
>>> On Thu, 13 Oct 2022, Chuck Lever III wrote:
>>>> 
>>>> I think I stopped at the non-list variant of rhashtable because
>>>> using rhl was more complex, and the non-list variant seemed to
>>>> work fine. There's no architectural reason either file_hashtbl
>>>> or the filecache must use the non-list variant.
>>>> 
>>>> In any event, it's worth taking the trouble now to change the
>>>> nfs4_file implementation proposed here as you suggest.
>>> 
>>> If you like you could leave it as-is for now and I can provide a patch
>>> to convert to rhl-tables later (won't be until late October).
>>> There is one thing I would need to understand though: why are the
>>> nfsd_files per-filehandle instead of per-inode?  There is probably a
>>> good reason, but I cannot think of one.
>> 
>> I'm not clear on your offer: do you mean converting the nfsd_file
>> cache from rhashtable to rhl, or converting the proposed nfs4_file
>> rework? I had planned to do the latter myself and post a refresh.
> 
> Either/both.  Of course if you do the refresh, then I'll just review it.

Yep, I plan to repost, as part of addressing (your) review comments.


>> The nfsd_file_acquire API is the only place that seems to want a
>> filehandle, and it's just to lookup the underlying inode. Perhaps
>> I don't understand your question?
> 
> Sorry, I meant nfs4_files, not nfsd_file: find_file() and find_or_add_file().
> Why is there one nfs4_file per filehandle?

I can't answer that (yet), but I suspect there is some semantic
association between the [current] filehandle and a particular
state ID that makes this a sensible arrangement.


> I see that there can be several nfsd_file per inode - in different
> network namespaces, or with different credentials or different access
> modes.

My impression is that is by design. Each namespace and unique
credential needs a distinct nfsd_file. Each nfsd_file acts like
an open file descriptor in user space.


> That really needs to be fixed.

I'm not sure I agree, but maybe "that" and "fixed" are doing some
heavy lifting. Jeff might be able to add some insight on design
purpose, and we can write a patch that adds that as a documenting
comment somewhere.

--
Chuck Lever

Patch

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 2b850de288cf..bb52510a6a83 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -44,7 +44,9 @@ 
 #include <linux/jhash.h>
 #include <linux/string_helpers.h>
 #include <linux/fsnotify.h>
+#include <linux/rhashtable.h>
 #include <linux/nfs_ssc.h>
+
 #include "xdr4.h"
 #include "xdr4cb.h"
 #include "vfs.h"
@@ -84,6 +86,7 @@  static bool check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
 static void nfs4_free_ol_stateid(struct nfs4_stid *stid);
 void nfsd4_end_grace(struct nfsd_net *nn);
 static void _free_cpntf_state_locked(struct nfsd_net *nn, struct nfs4_cpntf_state *cps);
+static void remove_nfs4_file(struct nfs4_file *fp);
 
 /* Locking: */
 
@@ -577,11 +580,8 @@  static void nfsd4_free_file_rcu(struct rcu_head *rcu)
 void
 put_nfs4_file(struct nfs4_file *fi)
 {
-	might_lock(&state_lock);
-
-	if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
-		hlist_del_rcu(&fi->fi_hash);
-		spin_unlock(&state_lock);
+	if (refcount_dec_and_test(&fi->fi_ref)) {
+		remove_nfs4_file(fi);
 		WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
 		WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
 		call_rcu(&fi->fi_rcu, nfsd4_free_file_rcu);
@@ -695,19 +695,85 @@  static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
 	return ret & OWNER_HASH_MASK;
 }
 
-/* hash table for nfs4_file */
-#define FILE_HASH_BITS                   8
-#define FILE_HASH_SIZE                  (1 << FILE_HASH_BITS)
+static struct rhashtable nfs4_file_rhashtbl ____cacheline_aligned_in_smp;
 
-static unsigned int file_hashval(struct svc_fh *fh)
+/*
+ * The returned hash value is based solely on the address of an in-core
+ * inode, a pointer to a slab-allocated object. The entropy in such a
+ * pointer is concentrated in its middle bits.
+ */
+static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
 {
-	struct inode *inode = d_inode(fh->fh_dentry);
+	unsigned long ptr = (unsigned long)inode;
+	u32 k;
 
-	/* XXX: why not (here & in file cache) use inode? */
-	return (unsigned int)hash_long(inode->i_ino, FILE_HASH_BITS);
+	k = ptr >> L1_CACHE_SHIFT;
+	k &= 0x00ffffff;
+	return jhash2(&k, 1, seed);
 }
 
-static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
+/**
+ * nfs4_file_key_hashfn - Compute the hash value of a lookup key
+ * @data: key on which to compute the hash value
+ * @len: rhash table's key_len parameter (unused)
+ * @seed: rhash table's random seed of the day
+ *
+ * Return value:
+ *   Computed 32-bit hash value
+ */
+static u32 nfs4_file_key_hashfn(const void *data, u32 len, u32 seed)
+{
+	const struct svc_fh *fhp = data;
+
+	return nfs4_file_inode_hash(d_inode(fhp->fh_dentry), seed);
+}
+
+/**
+ * nfs4_file_obj_hashfn - Compute the hash value of an nfs4_file object
+ * @data: object on which to compute the hash value
+ * @len: rhash table's key_len parameter (unused)
+ * @seed: rhash table's random seed of the day
+ *
+ * Return value:
+ *   Computed 32-bit hash value
+ */
+static u32 nfs4_file_obj_hashfn(const void *data, u32 len, u32 seed)
+{
+	const struct nfs4_file *fi = data;
+
+	return nfs4_file_inode_hash(fi->fi_inode, seed);
+}
+
+/**
+ * nfs4_file_obj_cmpfn - Match a cache item against search criteria
+ * @arg: search criteria
+ * @ptr: cache item to check
+ *
+ * Return values:
+ *   %0 - Item matches search criteria
+ *   %1 - Item does not match search criteria
+ */
+static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
+			       const void *ptr)
+{
+	const struct svc_fh *fhp = arg->key;
+	const struct nfs4_file *fi = ptr;
+
+	return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ? 0 : 1;
+}
+
+static const struct rhashtable_params nfs4_file_rhash_params = {
+	.key_len		= sizeof_field(struct nfs4_file, fi_inode),
+	.key_offset		= offsetof(struct nfs4_file, fi_inode),
+	.head_offset		= offsetof(struct nfs4_file, fi_rhash),
+	.hashfn			= nfs4_file_key_hashfn,
+	.obj_hashfn		= nfs4_file_obj_hashfn,
+	.obj_cmpfn		= nfs4_file_obj_cmpfn,
+
+	/* Reduce resizing churn on light workloads */
+	.min_size		= 512,		/* buckets */
+	.automatic_shrinking	= true,
+};
 
 /*
  * Check if courtesy clients have conflicting access and resolve it if possible
@@ -4251,11 +4317,8 @@  static struct nfs4_file *nfsd4_alloc_file(void)
 }
 
 /* OPEN Share state helper functions */
-static void nfsd4_init_file(struct svc_fh *fh, unsigned int hashval,
-				struct nfs4_file *fp)
+static void init_nfs4_file(const struct svc_fh *fh, struct nfs4_file *fp)
 {
-	lockdep_assert_held(&state_lock);
-
 	refcount_set(&fp->fi_ref, 1);
 	spin_lock_init(&fp->fi_lock);
 	INIT_LIST_HEAD(&fp->fi_stateids);
@@ -4273,7 +4336,6 @@  static void nfsd4_init_file(struct svc_fh *fh, unsigned int hashval,
 	INIT_LIST_HEAD(&fp->fi_lo_states);
 	atomic_set(&fp->fi_lo_recalls, 0);
 #endif
-	hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
 }
 
 void
@@ -4626,71 +4688,92 @@  move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
 		nfs4_put_stid(&last->st_stid);
 }
 
-/* search file_hashtbl[] for file */
-static struct nfs4_file *
-find_file_locked(struct svc_fh *fh, unsigned int hashval)
-{
-	struct nfs4_file *fp;
+/*
+ * On hash insertion, identify any inode aliases. They will
+ * all be in the same hash bucket because we hash by the
+ * address of the struct inode.
+ */
+static void check_nfs4_file_aliases(struct nfs4_file *new,
+				    const struct svc_fh *fhp)
+{
+	struct rhashtable *ht = &nfs4_file_rhashtbl;
+	struct rhash_lock_head __rcu *const *bkt;
+	struct rhashtable_compare_arg arg = {
+		.ht	= ht,
+		.key	= fhp,
+	};
+	struct bucket_table *tbl;
+	struct rhash_head *he;
+	unsigned int hash;
 
-	hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
-				lockdep_is_held(&state_lock)) {
-		if (fh_match(&fp->fi_fhandle, &fh->fh_handle)) {
-			if (refcount_inc_not_zero(&fp->fi_ref))
-				return fp;
+	/*
+	 * rhashtable avoids large buckets, thus this loop stays
+	 * efficient.
+	 */
+	rcu_read_lock();
+	tbl = rht_dereference_rcu(ht->tbl, ht);
+	hash = rht_key_hashfn(ht, tbl, fhp, nfs4_file_rhash_params);
+	bkt = rht_bucket(tbl, hash);
+	rht_for_each_rcu_from(he, rht_ptr_rcu(bkt), tbl, hash) {
+		struct nfs4_file *fi;
+
+		fi = rht_obj(ht, he);
+		if (nfs4_file_obj_cmpfn(&arg, fi) == 0)
+			continue;
+		if (d_inode(fhp->fh_dentry) == fi->fi_inode) {
+			fi->fi_aliased = true;
+			new->fi_aliased = true;
 		}
 	}
-	return NULL;
-}
-
-static struct nfs4_file *insert_file(struct nfs4_file *new, struct svc_fh *fh,
-				     unsigned int hashval)
-{
-	struct nfs4_file *fp;
-	struct nfs4_file *ret = NULL;
-	bool alias_found = false;
-
-	spin_lock(&state_lock);
-	hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
-				 lockdep_is_held(&state_lock)) {
-		if (fh_match(&fp->fi_fhandle, &fh->fh_handle)) {
-			if (refcount_inc_not_zero(&fp->fi_ref))
-				ret = fp;
-		} else if (d_inode(fh->fh_dentry) == fp->fi_inode)
-			fp->fi_aliased = alias_found = true;
-	}
-	if (likely(ret == NULL)) {
-		nfsd4_init_file(fh, hashval, new);
-		new->fi_aliased = alias_found;
-		ret = new;
-	}
-	spin_unlock(&state_lock);
-	return ret;
+	rcu_read_unlock();
 }
 
-static struct nfs4_file * find_file(struct svc_fh *fh)
+static struct nfs4_file *find_nfs4_file(const struct svc_fh *fhp)
 {
-	struct nfs4_file *fp;
-	unsigned int hashval = file_hashval(fh);
+	struct nfs4_file *fi;
 
 	rcu_read_lock();
-	fp = find_file_locked(fh, hashval);
+	fi = rhashtable_lookup(&nfs4_file_rhashtbl, fhp,
+			       nfs4_file_rhash_params);
+	if (fi)
+		if (!refcount_inc_not_zero(&fi->fi_ref))
+			fi = NULL;
 	rcu_read_unlock();
-	return fp;
+	return fi;
 }
 
 static struct nfs4_file *
-find_or_add_file(struct nfs4_file *new, struct svc_fh *fh)
+find_or_insert_nfs4_file(struct nfs4_file *new, const struct svc_fh *fhp)
 {
-	struct nfs4_file *fp;
-	unsigned int hashval = file_hashval(fh);
+	struct nfs4_file *fi;
+	int err;
 
-	rcu_read_lock();
-	fp = find_file_locked(fh, hashval);
-	rcu_read_unlock();
-	if (fp)
-		return fp;
+	while (true) {
+		fi = find_nfs4_file(fhp);
+		if (fi)
+			break;
+
+		init_nfs4_file(fhp, new);
+		err = rhashtable_lookup_insert_key(&nfs4_file_rhashtbl, fhp,
+						   &new->fi_rhash,
+						   nfs4_file_rhash_params);
+		if (!err) {
+			fi = new;
+			check_nfs4_file_aliases(fi, fhp);
+			break;
+		}
+		if (err != -EEXIST) {
+			fi = NULL;
+			break;
+		}
+	}
+	return fi;
+}
 
-	return insert_file(new, fh, hashval);
+static void remove_nfs4_file(struct nfs4_file *fi)
+{
+	rhashtable_remove_fast(&nfs4_file_rhashtbl, &fi->fi_rhash,
+			       nfs4_file_rhash_params);
 }
 
 /*
@@ -4703,9 +4786,10 @@  nfs4_share_conflict(struct svc_fh *current_fh, unsigned int deny_type)
 	struct nfs4_file *fp;
 	__be32 ret = nfs_ok;
 
-	fp = find_file(current_fh);
+	fp = find_nfs4_file(current_fh);
 	if (!fp)
 		return ret;
+
 	/* Check for conflicting share reservations */
 	spin_lock(&fp->fi_lock);
 	if (fp->fi_share_deny & deny_type)
@@ -5548,7 +5632,9 @@  nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	 * and check for delegations in the process of being recalled.
 	 * If not found, create the nfs4_file struct
 	 */
-	fp = find_or_add_file(open->op_file, current_fh);
+	fp = find_or_insert_nfs4_file(open->op_file, current_fh);
+	if (unlikely(!fp))
+		return nfserr_jukebox;
 	if (fp != open->op_file) {
 		status = nfs4_check_deleg(cl, open, &dp);
 		if (status)
@@ -7905,10 +7991,16 @@  nfs4_state_start(void)
 {
 	int ret;
 
-	ret = nfsd4_create_callback_queue();
+	ret = rhashtable_init(&nfs4_file_rhashtbl, &nfs4_file_rhash_params);
 	if (ret)
 		return ret;
 
+	ret = nfsd4_create_callback_queue();
+	if (ret) {
+		rhashtable_destroy(&nfs4_file_rhashtbl);
+		return ret;
+	}
+
 	set_max_delegations();
 	return 0;
 }
@@ -7939,6 +8031,7 @@  nfs4_state_shutdown_net(struct net *net)
 
 	nfsd4_client_tracking_exit(net);
 	nfs4_state_destroy_net(net);
+	rhashtable_destroy(&nfs4_file_rhashtbl);
 #ifdef CONFIG_NFSD_V4_2_INTER_SSC
 	nfsd4_ssc_shutdown_umount(nn);
 #endif
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index ae596dbf8667..879f085bc39e 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -536,16 +536,13 @@  struct nfs4_clnt_odstate {
  * inode can have multiple filehandles associated with it, so there is
  * (potentially) a many to one relationship between this struct and struct
  * inode.
- *
- * These are hashed by filehandle in the file_hashtbl, which is protected by
- * the global state_lock spinlock.
  */
 struct nfs4_file {
 	refcount_t		fi_ref;
 	struct inode *		fi_inode;
 	bool			fi_aliased;
 	spinlock_t		fi_lock;
-	struct hlist_node       fi_hash;	/* hash on fi_fhandle */
+	struct rhash_head	fi_rhash;
 	struct list_head        fi_stateids;
 	union {
 		struct list_head	fi_delegations;