Message ID: 166507324882.1802.884870684212914640.stgit@manet.1015granger.net
Series: A course adjustment, maybe...
On Fri, 07 Oct 2022, Chuck Lever wrote:
>
> -static unsigned int file_hashval(struct svc_fh *fh)
> +/*
> + * The returned hash value is based solely on the address of an in-core
> + * inode, a pointer to a slab-allocated object. The entropy in such a
> + * pointer is concentrated in its middle bits.

I think you need more justification than that for throwing away some of
the entropy, even if you don't think it is much.

Presumably you think hashing 32 bits is faster than hashing 64 bits.
Maybe it is, but is it worth it?

rhashtable depends heavily on having a strong hash function. In
particular if any bucket ends up with more than 16 elements it will
choose a new seed and rehash. If you deliberately remove some bits that
might have been used to spread those 16 out, then you are asking for
trouble.

We know that two different file handles can refer to the same inode
("rarely"), and you deliberately place them in the same hash bucket.
So if an attacker arranges to access 17 files with the same inode but
different file handles, then the hashtable will be continuously
rehashed.

The preferred approach when you require things to share a hash chain is
to use an rhl table. This allows multiple instances with the same key.
You would then key the rhl-table with the inode, and search a
linked-list to find the entry with the desired file handle. This would
be no worse in search time than the current code for aliased inodes, but
less susceptible to attack.

> +/**
> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
> + * @arg: search criteria
> + * @ptr: cache item to check
> + *
> + * Return values:
> + *   %0 - Item matches search criteria
> + *   %1 - Item does not match search criteria

I *much* prefer %-ESRCH for "does not match search criteria". It is
self-documenting. Any non-zero value will do.

NeilBrown
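[For reference, the rhl-table shape Neil describes might look like the
sketch below: the table is keyed by the inode pointer, and fh_match()
distinguishes aliased file handles within one chain. This is only an
illustration, not code from the posted series; the table name, the
params, and the fi_rlist field are hypothetical.]

	#include <linux/rhashtable.h>

	/* Hypothetical: an rhl table keyed by the struct inode address. */
	static struct rhltable nfs4_file_rhltbl;

	static const struct rhashtable_params nfs4_file_rhl_params = {
		.key_len	= sizeof_field(struct nfs4_file, fi_inode),
		.key_offset	= offsetof(struct nfs4_file, fi_inode),
		.head_offset	= offsetof(struct nfs4_file, fi_rlist),
	};

	/*
	 * Look up the chain for this inode, then walk the per-key list
	 * to find the entry whose file handle matches. Aliased file
	 * handles legitimately share one key, so no amount of aliasing
	 * can trigger a rehash.
	 */
	static struct nfs4_file *find_nfs4_file_rhl(const struct svc_fh *fhp)
	{
		struct inode *inode = d_inode(fhp->fh_dentry);
		struct rhlist_head *list, *tmp;
		struct nfs4_file *fi;

		rcu_read_lock();
		list = rhltable_lookup(&nfs4_file_rhltbl, &inode,
				       nfs4_file_rhl_params);
		rhl_for_each_entry_rcu(fi, tmp, list, fi_rlist) {
			if (fh_match(&fi->fi_fhandle, &fhp->fh_handle) &&
			    refcount_inc_not_zero(&fi->fi_ref)) {
				rcu_read_unlock();
				return fi;
			}
		}
		rcu_read_unlock();
		return NULL;
	}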
> On Oct 10, 2022, at 8:16 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Fri, 07 Oct 2022, Chuck Lever wrote:
>> 
>> -static unsigned int file_hashval(struct svc_fh *fh)
>> +/*
>> + * The returned hash value is based solely on the address of an in-core
>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>> + * pointer is concentrated in its middle bits.
> 
> I think you need more justification than that for throwing away some of
> the entropy, even if you don't think it is much.

We might have that justification:

https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/

Actually I believe we are not discarding /any/ entropy in
this function. The bits we discard are invariant.

And, note that this is exactly the same situation we just merged
in the filecache overhaul, and is a common trope amongst other
hash tables that base their function on the inode's address.

> Presumably you think hashing 32 bits is faster than hashing 64 bits.
> Maybe it is, but is it worth it?
> 
> rhashtable depends heavily on having a strong hash function. In
> particular if any bucket ends up with more than 16 elements it will
> choose a new seed and rehash. If you deliberately remove some bits that
> might have been used to spread those 16 out, then you are asking for
> trouble.
> 
> We know that two different file handles can refer to the same inode
> ("rarely"), and you deliberately place them in the same hash bucket.
> So if an attacker arranges to access 17 files with the same inode but
> different file handles, then the hashtable will be continuously
> rehashed.
> 
> The preferred approach when you require things to share a hash chain is
> to use an rhl table.

Again, this is the same situation for the filecache. Do you
believe it is worth reworking that? I'm guessing "yes".

> This allows multiple instances with the same key.
> You would then key the rhl-table with the inode, and search a
> linked-list to find the entry with the desired file handle. This would
> be no worse in search time than the current code for aliased inodes, but
> less susceptible to attack.
> 
>> +/**
>> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
>> + * @arg: search criteria
>> + * @ptr: cache item to check
>> + *
>> + * Return values:
>> + *   %0 - Item matches search criteria
>> + *   %1 - Item does not match search criteria
> 
> I *much* prefer %-ESRCH for "does not match search criteria". It is
> self-documenting. Any non-zero value will do.

Noted, but returning 1 appears to be the typical arrangement for
existing obj_cmpfn methods in most other areas.

--
Chuck Lever
On Tue, 11 Oct 2022, Chuck Lever III wrote:
> > On Oct 10, 2022, at 8:16 PM, NeilBrown <neilb@suse.de> wrote:
> > 
> > On Fri, 07 Oct 2022, Chuck Lever wrote:
> >> 
> >> -static unsigned int file_hashval(struct svc_fh *fh)
> >> +/*
> >> + * The returned hash value is based solely on the address of an in-core
> >> + * inode, a pointer to a slab-allocated object. The entropy in such a
> >> + * pointer is concentrated in its middle bits.
> > 
> > I think you need more justification than that for throwing away some of
> > the entropy, even if you don't think it is much.
> 
> We might have that justification:
> 
> https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/
> 
> Actually I believe we are not discarding /any/ entropy in
> this function. The bits we discard are invariant.
> 

Ok, I can accept that this:

+	k = ptr >> L1_CACHE_SHIFT;

only discards invariant bits, but how can you justify this:

+	k &= 0x00ffffff;

??
And given that you pass it all to jhash anyway, why not just pass all of
it?

> And, note that this is exactly the same situation we just merged
> in the filecache overhaul, and is a common trope amongst other
> hash tables that base their function on the inode's address.
> 
> > Presumably you think hashing 32 bits is faster than hashing 64 bits.
> > Maybe it is, but is it worth it?
> > 
> > rhashtable depends heavily on having a strong hash function. In
> > particular if any bucket ends up with more than 16 elements it will
> > choose a new seed and rehash. If you deliberately remove some bits that
> > might have been used to spread those 16 out, then you are asking for
> > trouble.
> > 
> > We know that two different file handles can refer to the same inode
> > ("rarely"), and you deliberately place them in the same hash bucket.
> > So if an attacker arranges to access 17 files with the same inode but
> > different file handles, then the hashtable will be continuously
> > rehashed.
> > 
> > The preferred approach when you require things to share a hash chain is
> > to use an rhl table.
> 
> Again, this is the same situation for the filecache. Do you
> believe it is worth reworking that? I'm guessing "yes".

As a matter of principle: yes.
rhashtable is designed to assume that hash collisions are bad and can be
changed by choosing a different seed.
rhl_tables was explicitly added for cases when people wanted multiple
elements to hash to the same value.

The chance of it causing a problem without an attack is admittedly
tiny. Attacks are only possible with subtree_check enabled, or if the
underlying filesystem does something "clever" with file handles, so
there wouldn't be many situations where an attack would even be
possible. But if it were possible, it would likely be easy.
The cost of the attack would be a minor-to-modest performance impact.

So it is hard to argue "this code is dangerous and must be changed", but
we have different tools for a reason, and I believe that rhl-tables is
the right tool for this job.

> 
> > This allows multiple instances with the same key.
> > You would then key the rhl-table with the inode, and search a
> > linked-list to find the entry with the desired file handle. This would
> > be no worse in search time than the current code for aliased inodes, but
> > less susceptible to attack.
> >
> >> +/**
> >> + * nfs4_file_obj_cmpfn - Match a cache item against search criteria
> >> + * @arg: search criteria
> >> + * @ptr: cache item to check
> >> + *
> >> + * Return values:
> >> + *   %0 - Item matches search criteria
> >> + *   %1 - Item does not match search criteria
> > 
> > I *much* prefer %-ESRCH for "does not match search criteria". It is
> > self-documenting. Any non-zero value will do.
> 
> Noted, but returning 1 appears to be the typical arrangement for
> existing obj_cmpfn methods in most other areas.

Some people have no imagination :-)
[Same argument could be used against implementing kernel modules in
Rust - using C appears to be the typical arrangement for existing
modules.]

Thanks,
NeilBrown
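[To make Neil's "why not just pass all of it?" concrete: a variant
that feeds every bit of the shifted pointer to jhash, with no 24-bit
mask, might look like the sketch below. The function name is
hypothetical and this is not code from the posted series.]

	#include <linux/cache.h>
	#include <linux/jhash.h>

	/*
	 * Hypothetical variant of nfs4_file_inode_hash(): drop only the
	 * cache-line-invariant low bits, then hash the whole remainder.
	 */
	static u32 nfs4_file_inode_hash_full(const struct inode *inode, u32 seed)
	{
		unsigned long k = (unsigned long)inode >> L1_CACHE_SHIFT;

		/* On a 64-bit kernel this hashes two 32-bit words. */
		return jhash2((u32 *)&k, sizeof(k) / sizeof(u32), seed);
	}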
On Tue, 11 Oct 2022, Chuck Lever III wrote:
> > On Oct 10, 2022, at 8:16 PM, NeilBrown <neilb@suse.de> wrote:
> > 
> > On Fri, 07 Oct 2022, Chuck Lever wrote:
> >> 
> >> -static unsigned int file_hashval(struct svc_fh *fh)
> >> +/*
> >> + * The returned hash value is based solely on the address of an in-core
> >> + * inode, a pointer to a slab-allocated object. The entropy in such a
> >> + * pointer is concentrated in its middle bits.
> > 
> > I think you need more justification than that for throwing away some of
> > the entropy, even if you don't think it is much.
> 
> We might have that justification:
> 
> https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/
> 
> Actually I believe we are not discarding /any/ entropy in
> this function. The bits we discard are invariant.
> 
> And, note that this is exactly the same situation we just merged
> in the filecache overhaul, and is a common trope amongst other
> hash tables that base their function on the inode's address.

Common? I searched for ">> *L1_CACHE_SHIFT". Apart from the nfsd
filecache you mentioned I find two. One in quota and one in reiserfs.
Both work with traditional hash tables which are more forgiving of
longer chains.

Do you have other evidence of this being a common trope?

Thanks,
NeilBrown
> On Oct 11, 2022, at 7:37 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Tue, 11 Oct 2022, Chuck Lever III wrote:
>>> On Oct 10, 2022, at 8:16 PM, NeilBrown <neilb@suse.de> wrote:
>>> 
>>> On Fri, 07 Oct 2022, Chuck Lever wrote:
>>>> 
>>>> -static unsigned int file_hashval(struct svc_fh *fh)
>>>> +/*
>>>> + * The returned hash value is based solely on the address of an in-core
>>>> + * inode, a pointer to a slab-allocated object. The entropy in such a
>>>> + * pointer is concentrated in its middle bits.
>>> 
>>> I think you need more justification than that for throwing away some of
>>> the entropy, even if you don't think it is much.
>> 
>> We might have that justification:
>> 
>> https://lore.kernel.org/linux-nfs/YrUFbLJ5uVbWtZbf@ZenIV/
>> 
>> Actually I believe we are not discarding /any/ entropy in
>> this function. The bits we discard are invariant.
>> 
> Ok, I can accept that this:
> 
> +	k = ptr >> L1_CACHE_SHIFT;
> 
> I searched for ">> *L1_CACHE_SHIFT". Apart from the nfsd
> filecache you mentioned I find two. One in quota and one in reiserfs.
> Both work with traditional hash tables which are more forgiving of
> longer chains.
> 
> Do you have other evidence of this being a common trope?

This approach is based on the hash function in fs/inode.c, which uses
integer division instead of a shift.

509 static unsigned long hash(struct super_block *sb, unsigned long hashval)
510 {
511         unsigned long tmp;
512 
513         tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
514                 L1_CACHE_BYTES;
515         tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
516         return tmp & i_hash_mask;
517 }

> only discards invariant bits, but how can you justify this:
> 
> +	k &= 0x00ffffff;
> 
> ??

After shifting an address, the top byte generally contains invariant
bits as well.

> And given that you pass it all to jhash anyway, why not just pass all of
> it?

I don't think this is a big deal, but these functions are basically
the same as what was recently merged without complaint. It's not a
high priority to revisit those.

It might be worth a clean-up to share this hash function between the
two hash tables... at that point we might consider removing the extra
mask.

>> And, note that this is exactly the same situation we just merged
>> in the filecache overhaul, and is a common trope amongst other
>> hash tables that base their function on the inode's address.
>> 
>>> Presumably you think hashing 32 bits is faster than hashing 64 bits.
>>> Maybe it is, but is it worth it?
>>> 
>>> rhashtable depends heavily on having a strong hash function. In
>>> particular if any bucket ends up with more than 16 elements it will
>>> choose a new seed and rehash. If you deliberately remove some bits that
>>> might have been used to spread those 16 out, then you are asking for
>>> trouble.
>>> 
>>> We know that two different file handles can refer to the same inode
>>> ("rarely"), and you deliberately place them in the same hash bucket.
>>> So if an attacker arranges to access 17 files with the same inode but
>>> different file handles, then the hashtable will be continuously
>>> rehashed.
>>> 
>>> The preferred approach when you require things to share a hash chain is
>>> to use an rhl table.
>> 
>> Again, this is the same situation for the filecache. Do you
>> believe it is worth reworking that? I'm guessing "yes".
> 
> As a matter of principle: yes.
> rhashtable is designed to assume that hash collisions are bad and can be
> changed by choosing a different seed.
> rhl_tables was explicitly added for cases when people wanted multiple
> elements to hash to the same value.
> 
> The chance of it causing a problem without an attack is admittedly
> tiny. Attacks are only possible with subtree_check enabled, or if the
> underlying filesystem does something "clever" with file handles, so
> there wouldn't be many situations where an attack would even be
> possible. But if it were possible, it would likely be easy.
> The cost of the attack would be a minor-to-modest performance impact.
> 
> So it is hard to argue "this code is dangerous and must be changed", but
> we have different tools for a reason, and I believe that rhl-tables is
> the right tool for this job.

Agreed. I wasn't suggesting it's an emergency situation, but it's
something that should get fixed at some point if there's a problem
with it, even a minor one.

I think I stopped at the non-list variant of rhashtable because
using rhl was more complex, and the non-list variant seemed to
work fine. There's no architectural reason either file_hashtbl
or the filecache must use the non-list variant.

In any event, it's worth taking the trouble now to change the
nfs4_file implementation proposed here as you suggest.

--
Chuck Lever
On Thu, 13 Oct 2022, Chuck Lever III wrote:
> 
> I think I stopped at the non-list variant of rhashtable because
> using rhl was more complex, and the non-list variant seemed to
> work fine. There's no architectural reason either file_hashtbl
> or the filecache must use the non-list variant.
> 
> In any event, it's worth taking the trouble now to change the
> nfs4_file implementation proposed here as you suggest.

If you like you could leave it as-is for now and I can provide a patch
to convert to rhl-tables later (won't be until late October).

There is one thing I would need to understand though: why are the
nfsd_files per-filehandle instead of per-inode? There is probably a
good reason, but I cannot think of one.

Thanks,
NeilBrown
> On Oct 12, 2022, at 5:18 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Thu, 13 Oct 2022, Chuck Lever III wrote:
>> 
>> I think I stopped at the non-list variant of rhashtable because
>> using rhl was more complex, and the non-list variant seemed to
>> work fine. There's no architectural reason either file_hashtbl
>> or the filecache must use the non-list variant.
>> 
>> In any event, it's worth taking the trouble now to change the
>> nfs4_file implementation proposed here as you suggest.
> 
> If you like you could leave it as-is for now and I can provide a patch
> to convert to rhl-tables later (won't be until late October).
> There is one thing I would need to understand though: why are the
> nfsd_files per-filehandle instead of per-inode? There is probably a
> good reason, but I cannot think of one.

I'm not clear on your offer: do you mean converting the nfsd_file
cache from rhashtable to rhl, or converting the proposed nfs4_file
rework? I had planned to do the latter myself and post a refresh.

The nfsd_file_acquire API is the only place that seems to want a
filehandle, and it's just to lookup the underlying inode. Perhaps
I don't understand your question?

--
Chuck Lever
On Fri, 14 Oct 2022, Chuck Lever III wrote:
> > On Oct 12, 2022, at 5:18 PM, NeilBrown <neilb@suse.de> wrote:
> > 
> > On Thu, 13 Oct 2022, Chuck Lever III wrote:
> >> 
> >> I think I stopped at the non-list variant of rhashtable because
> >> using rhl was more complex, and the non-list variant seemed to
> >> work fine. There's no architectural reason either file_hashtbl
> >> or the filecache must use the non-list variant.
> >> 
> >> In any event, it's worth taking the trouble now to change the
> >> nfs4_file implementation proposed here as you suggest.
> > 
> > If you like you could leave it as-is for now and I can provide a patch
> > to convert to rhl-tables later (won't be until late October).
> > There is one thing I would need to understand though: why are the
> > nfsd_files per-filehandle instead of per-inode? There is probably a
> > good reason, but I cannot think of one.
> 
> I'm not clear on your offer: do you mean converting the nfsd_file
> cache from rhashtable to rhl, or converting the proposed nfs4_file
> rework? I had planned to do the latter myself and post a refresh.

Either/both. Of course if you do the refresh, then I'll just review it.

> 
> The nfsd_file_acquire API is the only place that seems to want a
> filehandle, and it's just to lookup the underlying inode. Perhaps
> I don't understand your question?

Sorry, I meant nfs4_files, not nfsd_file: find_file() and
find_or_add_file(). Why is there one nfs4_file per filehandle?

I see that there can be several nfsd_file per inode - in different
network namespaces, or with different credentials or different access
modes.

That really needs to be fixed.

Thanks,
NeilBrown
> On Oct 13, 2022, at 6:14 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Fri, 14 Oct 2022, Chuck Lever III wrote:
>> 
>>> On Oct 12, 2022, at 5:18 PM, NeilBrown <neilb@suse.de> wrote:
>>> 
>>> On Thu, 13 Oct 2022, Chuck Lever III wrote:
>>>> 
>>>> I think I stopped at the non-list variant of rhashtable because
>>>> using rhl was more complex, and the non-list variant seemed to
>>>> work fine. There's no architectural reason either file_hashtbl
>>>> or the filecache must use the non-list variant.
>>>> 
>>>> In any event, it's worth taking the trouble now to change the
>>>> nfs4_file implementation proposed here as you suggest.
>>> 
>>> If you like you could leave it as-is for now and I can provide a patch
>>> to convert to rhl-tables later (won't be until late October).
>>> There is one thing I would need to understand though: why are the
>>> nfsd_files per-filehandle instead of per-inode? There is probably a
>>> good reason, but I cannot think of one.
>> 
>> I'm not clear on your offer: do you mean converting the nfsd_file
>> cache from rhashtable to rhl, or converting the proposed nfs4_file
>> rework? I had planned to do the latter myself and post a refresh.
> 
> Either/both. Of course if you do the refresh, then I'll just review it.

Yep, I plan to repost, as part of addressing (your) review comments.

>> The nfsd_file_acquire API is the only place that seems to want a
>> filehandle, and it's just to lookup the underlying inode. Perhaps
>> I don't understand your question?
> 
> Sorry, I meant nfs4_files, not nfsd_file: find_file() and
> find_or_add_file(). Why is there one nfs4_file per filehandle?

I can't answer that (yet), but I suspect there is some semantic
association between the [current] filehandle and a particular state ID
that makes this a sensible arrangement.

> I see that there can be several nfsd_file per inode - in different
> network namespaces, or with different credentials or different access
> modes.

My impression is that is by design. Each namespace and unique
credential needs a distinct nfsd_file. Each nfsd_file acts like an
open file descriptor in user space.

> That really needs to be fixed.

I'm not sure I agree, but maybe "that" and "fixed" are doing some
heavy lifting. Jeff might be able to add some insight on design
purpose, and we can write a patch that adds that as a documenting
comment somewhere.

--
Chuck Lever
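[If the documenting comment Chuck mentions were drafted purely from
what this thread establishes, it might read like the sketch below;
its placement and exact wording are hypothetical.]

	/*
	 * An nfsd_file behaves like a user space open file descriptor
	 * held by the server: it is scoped to one network namespace
	 * and one set of credentials, and carries a particular access
	 * mode. An inode can therefore be represented by several
	 * nfsd_file objects at once, one per distinct namespace,
	 * credential, and access-mode combination.
	 */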
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 2b850de288cf..bb52510a6a83 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -44,7 +44,9 @@
 #include <linux/jhash.h>
 #include <linux/string_helpers.h>
 #include <linux/fsnotify.h>
+#include <linux/rhashtable.h>
 #include <linux/nfs_ssc.h>
+
 #include "xdr4.h"
 #include "xdr4cb.h"
 #include "vfs.h"
@@ -84,6 +86,7 @@ static bool check_for_locks(struct nfs4_file *fp, struct nfs4_lockowner *lowner)
 static void nfs4_free_ol_stateid(struct nfs4_stid *stid);
 void nfsd4_end_grace(struct nfsd_net *nn);
 static void _free_cpntf_state_locked(struct nfsd_net *nn, struct nfs4_cpntf_state *cps);
+static void remove_nfs4_file(struct nfs4_file *fp);
 
 /* Locking: */
 
@@ -577,11 +580,8 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
 void
 put_nfs4_file(struct nfs4_file *fi)
 {
-	might_lock(&state_lock);
-
-	if (refcount_dec_and_lock(&fi->fi_ref, &state_lock)) {
-		hlist_del_rcu(&fi->fi_hash);
-		spin_unlock(&state_lock);
+	if (refcount_dec_and_test(&fi->fi_ref)) {
+		remove_nfs4_file(fi);
 		WARN_ON_ONCE(!list_empty(&fi->fi_clnt_odstate));
 		WARN_ON_ONCE(!list_empty(&fi->fi_delegations));
 		call_rcu(&fi->fi_rcu, nfsd4_free_file_rcu);
@@ -695,19 +695,85 @@ static unsigned int ownerstr_hashval(struct xdr_netobj *ownername)
 	return ret & OWNER_HASH_MASK;
 }
 
-/* hash table for nfs4_file */
-#define FILE_HASH_BITS		8
-#define FILE_HASH_SIZE		(1 << FILE_HASH_BITS)
+static struct rhashtable nfs4_file_rhashtbl ____cacheline_aligned_in_smp;
 
-static unsigned int file_hashval(struct svc_fh *fh)
+/*
+ * The returned hash value is based solely on the address of an in-core
+ * inode, a pointer to a slab-allocated object. The entropy in such a
+ * pointer is concentrated in its middle bits.
+ */
+static u32 nfs4_file_inode_hash(const struct inode *inode, u32 seed)
 {
-	struct inode *inode = d_inode(fh->fh_dentry);
+	unsigned long ptr = (unsigned long)inode;
+	u32 k;
 
-	/* XXX: why not (here & in file cache) use inode? */
-	return (unsigned int)hash_long(inode->i_ino, FILE_HASH_BITS);
+	k = ptr >> L1_CACHE_SHIFT;
+	k &= 0x00ffffff;
+	return jhash2(&k, 1, seed);
 }
 
-static struct hlist_head file_hashtbl[FILE_HASH_SIZE];
+/**
+ * nfs4_file_key_hashfn - Compute the hash value of a lookup key
+ * @data: key on which to compute the hash value
+ * @len: rhash table's key_len parameter (unused)
+ * @seed: rhash table's random seed of the day
+ *
+ * Return value:
+ *   Computed 32-bit hash value
+ */
+static u32 nfs4_file_key_hashfn(const void *data, u32 len, u32 seed)
+{
+	const struct svc_fh *fhp = data;
+
+	return nfs4_file_inode_hash(d_inode(fhp->fh_dentry), seed);
+}
+
+/**
+ * nfs4_file_obj_hashfn - Compute the hash value of an nfs4_file object
+ * @data: object on which to compute the hash value
+ * @len: rhash table's key_len parameter (unused)
+ * @seed: rhash table's random seed of the day
+ *
+ * Return value:
+ *   Computed 32-bit hash value
+ */
+static u32 nfs4_file_obj_hashfn(const void *data, u32 len, u32 seed)
+{
+	const struct nfs4_file *fi = data;
+
+	return nfs4_file_inode_hash(fi->fi_inode, seed);
+}
+
+/**
+ * nfs4_file_obj_cmpfn - Match a cache item against search criteria
+ * @arg: search criteria
+ * @ptr: cache item to check
+ *
+ * Return values:
+ *   %0 - Item matches search criteria
+ *   %1 - Item does not match search criteria
+ */
+static int nfs4_file_obj_cmpfn(struct rhashtable_compare_arg *arg,
+			       const void *ptr)
+{
+	const struct svc_fh *fhp = arg->key;
+	const struct nfs4_file *fi = ptr;
+
+	return fh_match(&fi->fi_fhandle, &fhp->fh_handle) ?
+		0 : 1;
+}
+
+static const struct rhashtable_params nfs4_file_rhash_params = {
+	.key_len		= sizeof_field(struct nfs4_file, fi_inode),
+	.key_offset		= offsetof(struct nfs4_file, fi_inode),
+	.head_offset		= offsetof(struct nfs4_file, fi_rhash),
+	.hashfn			= nfs4_file_key_hashfn,
+	.obj_hashfn		= nfs4_file_obj_hashfn,
+	.obj_cmpfn		= nfs4_file_obj_cmpfn,
+
+	/* Reduce resizing churn on light workloads */
+	.min_size		= 512,		/* buckets */
+	.automatic_shrinking	= true,
+};
 
 /*
  * Check if courtesy clients have conflicting access and resolve it if possible
@@ -4251,11 +4317,8 @@ static struct nfs4_file *nfsd4_alloc_file(void)
 }
 
 /* OPEN Share state helper functions */
-static void nfsd4_init_file(struct svc_fh *fh, unsigned int hashval,
-			    struct nfs4_file *fp)
+static void init_nfs4_file(const struct svc_fh *fh, struct nfs4_file *fp)
 {
-	lockdep_assert_held(&state_lock);
-
 	refcount_set(&fp->fi_ref, 1);
 	spin_lock_init(&fp->fi_lock);
 	INIT_LIST_HEAD(&fp->fi_stateids);
@@ -4273,7 +4336,6 @@ static void nfsd4_init_file(struct svc_fh *fh, unsigned int hashval,
 	INIT_LIST_HEAD(&fp->fi_lo_states);
 	atomic_set(&fp->fi_lo_recalls, 0);
 #endif
-	hlist_add_head_rcu(&fp->fi_hash, &file_hashtbl[hashval]);
 }
 
 void
@@ -4626,71 +4688,92 @@ move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
 	nfs4_put_stid(&last->st_stid);
 }
 
-/* search file_hashtbl[] for file */
-static struct nfs4_file *
-find_file_locked(struct svc_fh *fh, unsigned int hashval)
-{
-	struct nfs4_file *fp;
+/*
+ * On hash insertion, identify any inode aliases. They will
+ * all be in the same hash bucket because we hash by the
+ * address of the struct inode.
+ */
+static void check_nfs4_file_aliases(struct nfs4_file *new,
+				    const struct svc_fh *fhp)
+{
+	struct rhashtable *ht = &nfs4_file_rhashtbl;
+	struct rhash_lock_head __rcu *const *bkt;
+	struct rhashtable_compare_arg arg = {
+		.ht = ht,
+		.key = fhp,
+	};
+	struct bucket_table *tbl;
+	struct rhash_head *he;
+	unsigned int hash;
 
-	hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
-				 lockdep_is_held(&state_lock)) {
-		if (fh_match(&fp->fi_fhandle, &fh->fh_handle)) {
-			if (refcount_inc_not_zero(&fp->fi_ref))
-				return fp;
+	/*
+	 * rhashtable avoids large buckets, thus this loop stays
+	 * efficient.
+	 */
+	rcu_read_lock();
+	tbl = rht_dereference_rcu(ht->tbl, ht);
+	hash = rht_key_hashfn(ht, tbl, fhp, nfs4_file_rhash_params);
+	bkt = rht_bucket(tbl, hash);
+	rht_for_each_rcu_from(he, rht_ptr_rcu(bkt), tbl, hash) {
+		struct nfs4_file *fi;
+
+		fi = rht_obj(ht, he);
+		if (nfs4_file_obj_cmpfn(&arg, fi) == 0)
+			continue;
+		if (d_inode(fhp->fh_dentry) == fi->fi_inode) {
+			fi->fi_aliased = true;
+			new->fi_aliased = true;
 		}
 	}
-	return NULL;
-}
-
-static struct nfs4_file *insert_file(struct nfs4_file *new, struct svc_fh *fh,
-				     unsigned int hashval)
-{
-	struct nfs4_file *fp;
-	struct nfs4_file *ret = NULL;
-	bool alias_found = false;
-
-	spin_lock(&state_lock);
-	hlist_for_each_entry_rcu(fp, &file_hashtbl[hashval], fi_hash,
-				 lockdep_is_held(&state_lock)) {
-		if (fh_match(&fp->fi_fhandle, &fh->fh_handle)) {
-			if (refcount_inc_not_zero(&fp->fi_ref))
-				ret = fp;
-		} else if (d_inode(fh->fh_dentry) == fp->fi_inode)
-			fp->fi_aliased = alias_found = true;
-	}
-	if (likely(ret == NULL)) {
-		nfsd4_init_file(fh, hashval, new);
-		new->fi_aliased = alias_found;
-		ret = new;
-	}
-	spin_unlock(&state_lock);
-	return ret;
+	rcu_read_unlock();
 }
 
-static struct nfs4_file * find_file(struct svc_fh *fh)
+static struct nfs4_file *find_nfs4_file(const struct svc_fh *fhp)
 {
-	struct nfs4_file *fp;
-	unsigned int hashval = file_hashval(fh);
+	struct nfs4_file *fi;
 
 	rcu_read_lock();
-	fp = find_file_locked(fh, hashval);
+	fi = rhashtable_lookup(&nfs4_file_rhashtbl, fhp,
+			       nfs4_file_rhash_params);
+	if (fi)
+		if (!refcount_inc_not_zero(&fi->fi_ref))
+			fi = NULL;
 	rcu_read_unlock();
-	return fp;
+	return fi;
 }
 
 static struct nfs4_file *
-find_or_add_file(struct nfs4_file *new, struct svc_fh *fh)
+find_or_insert_nfs4_file(struct nfs4_file *new, const struct svc_fh *fhp)
 {
-	struct nfs4_file *fp;
-	unsigned int hashval = file_hashval(fh);
+	struct nfs4_file *fi;
+	int err;
 
-	rcu_read_lock();
-	fp = find_file_locked(fh, hashval);
-	rcu_read_unlock();
-	if (fp)
-		return fp;
+	while (true) {
+		fi = find_nfs4_file(fhp);
+		if (fi)
+			break;
+
+		init_nfs4_file(fhp, new);
+		err = rhashtable_lookup_insert_key(&nfs4_file_rhashtbl, fhp,
+						   &new->fi_rhash,
+						   nfs4_file_rhash_params);
+		if (!err) {
+			fi = new;
+			check_nfs4_file_aliases(fi, fhp);
+			break;
+		}
+		if (err != -EEXIST) {
+			fi = NULL;
+			break;
+		}
+	}
+	return fi;
+}
 
-	return insert_file(new, fh, hashval);
+static void remove_nfs4_file(struct nfs4_file *fi)
+{
+	rhashtable_remove_fast(&nfs4_file_rhashtbl, &fi->fi_rhash,
+			       nfs4_file_rhash_params);
 }
 
 /*
@@ -4703,9 +4786,10 @@ nfs4_share_conflict(struct svc_fh *current_fh, unsigned int deny_type)
 	struct nfs4_file *fp;
 	__be32 ret = nfs_ok;
 
-	fp = find_file(current_fh);
+	fp = find_nfs4_file(current_fh);
 	if (!fp)
 		return ret;
+
 	/* Check for conflicting share reservations */
 	spin_lock(&fp->fi_lock);
 	if (fp->fi_share_deny & deny_type)
@@ -5548,7 +5632,9 @@ nfsd4_process_open2(struct svc_rqst *rqstp, struct svc_fh *current_fh, struct nf
 	 * and check for delegations in the process of being recalled.
 	 * If not found, create the nfs4_file struct
 	 */
-	fp = find_or_add_file(open->op_file, current_fh);
+	fp = find_or_insert_nfs4_file(open->op_file, current_fh);
+	if (unlikely(!fp))
+		return nfserr_jukebox;
 	if (fp != open->op_file) {
 		status = nfs4_check_deleg(cl, open, &dp);
 		if (status)
@@ -7905,10 +7991,16 @@ nfs4_state_start(void)
 {
 	int ret;
 
-	ret = nfsd4_create_callback_queue();
+	ret = rhashtable_init(&nfs4_file_rhashtbl, &nfs4_file_rhash_params);
 	if (ret)
 		return ret;
 
+	ret = nfsd4_create_callback_queue();
+	if (ret) {
+		rhashtable_destroy(&nfs4_file_rhashtbl);
+		return ret;
+	}
+
 	set_max_delegations();
 	return 0;
 }
@@ -7939,6 +8031,7 @@ nfs4_state_shutdown_net(struct net *net)
 
 	nfsd4_client_tracking_exit(net);
 	nfs4_state_destroy_net(net);
+	rhashtable_destroy(&nfs4_file_rhashtbl);
 #ifdef CONFIG_NFSD_V4_2_INTER_SSC
 	nfsd4_ssc_shutdown_umount(nn);
 #endif
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index ae596dbf8667..879f085bc39e 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -536,16 +536,13 @@ struct nfs4_clnt_odstate {
  * inode can have multiple filehandles associated with it, so there is
  * (potentially) a many to one relationship between this struct and struct
  * inode.
- *
- * These are hashed by filehandle in the file_hashtbl, which is protected by
- * the global state_lock spinlock.
  */
 struct nfs4_file {
 	refcount_t		fi_ref;
 	struct inode		*fi_inode;
 	bool			fi_aliased;
 	spinlock_t		fi_lock;
-	struct hlist_node	fi_hash;	/* hash on fi_fhandle */
+	struct rhash_head	fi_rhash;
 	struct list_head	fi_stateids;
 	union {
 		struct list_head	fi_delegations;
fh_match() is expensive to use for hash chains that contain more than
a few objects. With common workloads, I see multiple thousands of
objects stored in file_hashtbl[], which always has only 256 buckets.

Replace it with an rhashtable, which dynamically resizes its bucket
array to keep hash chains short. This also makes it possible to stop
using state_lock to serialize operations on the new hash table.

The result is lower latency for NFSv4 operations and reduced nfsd CPU
utilization, since walking the long hash chains in file_hashtbl caused
frequent CPU cache misses.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfs4state.c | 235 ++++++++++++++++++++++++++++++++++++---------------
 fs/nfsd/state.h     |   5 -
 2 files changed, 165 insertions(+), 75 deletions(-)