
[RFC] vfs: add rcu-based find_inode variants for iget ops

Message ID 20240606140515.216424-1-mjguzik@gmail.com (mailing list archive)
State New
Series [RFC] vfs: add rcu-based find_inode variants for iget ops

Commit Message

Mateusz Guzik June 6, 2024, 2:05 p.m. UTC
Instantiating a new inode normally takes the global inode hash lock
twice:
1. once to check if it happens to already be present
2. once to add it to the hash
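
In pseudo-C, the current flow is roughly the following (a simplified
sketch of the iget_locked-style path, not the actual fs/inode.c code):

	/* 1. first acquire: check whether the inode is already hashed */
	spin_lock(&inode_hash_lock);
	inode = find_inode_fast(sb, head, ino);
	spin_unlock(&inode_hash_lock);
	if (inode)
		return inode;

	inode = alloc_inode(sb);

	/* 2. second acquire: re-check under the lock and insert */
	spin_lock(&inode_hash_lock);
	old = find_inode_fast(sb, head, ino);
	if (!old)
		hlist_add_head(&inode->i_hash, head);
	spin_unlock(&inode_hash_lock);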

The back-to-back lock/unlock pattern is known to degrade performance
significantly, which is further exacerbated if the hash is heavily
populated (long chains to walk, extending hold time). Arguably the hash
sizing and hashing algorithm need to be revisited, but that's beyond
the scope of this patch.

A long-term fix would introduce fine-grained locking; this was
attempted in [1], but that patchset has already been posted several
times and appears stalled.

A simpler idea, which solves the majority of the problem and which may
be good enough for the time being, is to use RCU for the initial
lookup. Basic RCU support is already present in the hash; it is just
not being used for the lookup on inode creation.

iget_locked consumers (notably ext4) get away without any changes
because the inode comparison is built in.

iget5_locked and ilookup5_nowait consumers pass a custom callback.
Since removing the locking introduces new problems (the inode can be
changing under the callback), it is not safe to assume all filesystems
cope. Thus iget5_locked_rcu and ilookup5_nowait_rcu are added,
requiring manual conversion.
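
As a sketch of what a conversion has to guarantee, a @test callback
passed to the _rcu variants may now run against an inode that is
concurrently being torn down (the struct, field and helper names below
are made up for illustration, they do not come from any real
filesystem):

	struct myfs_iget_args {
		u64 objectid;
	};

	/*
	 * May be called under rcu_read_lock() only, against an unstable
	 * inode: it must only look at fields which remain valid under
	 * RCU and must tolerate the inode being mid-teardown; matches
	 * are re-validated under inode->i_lock by the caller.
	 */
	static int myfs_test_inode(struct inode *inode, void *data)
	{
		struct myfs_iget_args *args = data;

		return MYFS_I(inode)->objectid == args->objectid;
	}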

In order to reduce code duplication, find_inode and find_inode_fast
grow an argument indicating whether the inode hash lock is held, which
is passed down should sleeping be necessary. They always rcu_read_lock,
which is redundant when the hash lock is held but harmless; doing it
conditionally would reduce readability for no real gain that I can see.
RCU-like restrictions were already imposed on the callbacks by the hash
spinlock being held.

Benchmarked as follows: a 32-core vm with 24GB of RAM and a dedicated
fs partition; 20 separate trees, each with 1000 directories * 1000
files; the trees are then walked by 20 processes issuing stat on the
files, each process on a dedicated tree. The testcase is at [2].
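
A minimal sketch of the per-process walk (just an approximation of the
described access pattern, not the actual fstree testcase from [2]):

	#include <stdio.h>
	#include <sys/stat.h>

	/* each of the 20 processes runs this over its own tree */
	static void walk_tree(const char *root)
	{
		char path[4096];
		struct stat st;

		for (int dir = 0; dir < 1000; dir++) {
			for (int file = 0; file < 1000; file++) {
				snprintf(path, sizeof(path), "%s/d%d/f%d",
					 root, dir, file);
				stat(path, &st);
			}
		}
	}

	int main(int argc, char **argv)
	{
		if (argc > 1)
			walk_tree(argv[1]);
		return 0;
	}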

In this particular workload, mimicking a real-world setup $elsewhere,
the initial lookup is guaranteed to fail, guaranteeing the 2 lock
acquires. At the same time RAM is scarce enough compared to the demand
that inodes keep needing to be recycled.

Total real time fluctuates by 1-2s, sample results:

ext4 (needed mkfs.ext4 -N 24000000):
before:	3.77s user 890.90s system 1939% cpu 46.118 total
after:  3.24s user 397.73s system 1858% cpu 21.581 total (-53%)

btrfs (s/iget5_locked/iget5_locked_rcu in fs/btrfs/inode.c):
before: 3.54s user 892.30s system 1966% cpu 45.549 total
after:  3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)

btrfs is heavily bottlenecked on its own locks, so the improvement is
small in comparison.

[1] https://lore.kernel.org/all/20231206060629.2827226-1-david@fromorbit.com/
[2] https://people.freebsd.org/~mjg/fstree.tgz

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

This is an initial submission to gauge interest.

I do claim this provides great bang for the buck; I don't claim it
solves the problem overall. *something* finer-grained will need to
land.

I wanted to add bcachefs to the list, but I ran into memory reclamation
issues again (first time here:
https://lore.kernel.org/all/CAGudoHGenxzk0ZqPXXi1_QDbfqQhGHu+wUwzyS6WmfkUZ1HiXA@mail.gmail.com/),
and did not have time to mess with diagnostics to write a report yet.

I'll post a patchset with this (+ tidy-ups to comments and whatnot) +
the btrfs and bcachefs conversions after the above gets reported and
sorted out.

Also, interestingly, things have improved since last year, when Linux
needed about a minute.

 fs/inode.c         | 106 +++++++++++++++++++++++++++++++++++++--------
 include/linux/fs.h |  10 ++++-
 2 files changed, 98 insertions(+), 18 deletions(-)

Comments

Jan Kara June 6, 2024, 4:31 p.m. UTC | #1
On Thu 06-06-24 16:05:15, Mateusz Guzik wrote:
> Instantiating a new inode normally takes the global inode hash lock
> twice:
> 1. once to check if it happens to already be present
> 2. once to add it to the hash
> 
> The back-to-back lock/unlock pattern is known to degrade performance
> significantly, which is further exacerbated if the hash is heavily
> populated (long chains to walk, extending hold time). Arguably hash
> sizing and hashing algo need to be revisited, but that's beyond the
> scope of this patch.
> 
> A long term fix would introduce fine-grained locking, this was attempted
> in [1], but that patchset was already posted several times and appears
> stalled.
> 
> A simpler idea which solves majority of the problem and which may be
> good enough for the time being is to use RCU for the initial lookup.
> Basic RCU support is already present in the hash, it is just not being
> used for lookup on inode creation.
> 
> iget_locked consumers (notably ext4) get away without any changes
> because inode comparison method is built-in.
> 
> iget5_locked and ilookup5_nowait consumers pass a custom callback. Since
> removal of locking adds more problems (inode can be changing) it's not
> safe to assume all filesystems happen to cope.  Thus iget5_locked_rcu
> ilookup5_nowait_rcu get added, requiring manual conversion.

BTW, why not ilookup5_rcu() as well? It would keep symmetry with the
non-RCU APIs, and iget5_locked_rcu() could then use ilookup5_rcu(). I
presume eventually we'd like to transition everything to these
RCU-based methods?

> In order to reduce code duplication find_inode and find_inode_fast grow
> an argument indicating whether inode hash lock is held, which is passed
> down should sleeping be necessary. They always rcu_read_lock, which is
> redundant but harmless. Doing it conditionally reduces readability for
> no real gain that I can see. RCU-alike restrictions were already put on
> callbacks due to the hash spinlock being held.
> 
> Benchmarked with the following: a 32-core vm with 24GB of RAM, a
> dedicated fs partition. 20 separate trees with 1000 directories * 1000
> files.  Then walked by 20 processes issuing stat on files, each on a
> dedicated tree. Testcase is at [2].
> 
> In this particular workload, mimicking a real-world setup $elsewhere,
> the initial lookup is guaranteed to fail, guaranteeing the 2 lock
> acquires. At the same time RAM is scarce enough compared to the
> demand that inodes keep needing to be recycled.
> 
> Total real time fluctuates by 1-2s, sample results:
> 
> ext4 (needed mkfs.ext4 -N 24000000):
> before:	3.77s user 890.90s system 1939% cpu 46.118 total
> after:  3.24s user 397.73s system 1858% cpu 21.581 total (-53%)
> 
> btrfs (s/iget5_locked/iget5_locked_rcu in fs/btrfs/inode.c):
> before: 3.54s user 892.30s system 1966% cpu 45.549 total
> after:  3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
> 
> btrfs is heavily bottlenecked on its own locks, so the improvement is
> small in comparison.
> 
> [1] https://lore.kernel.org/all/20231206060629.2827226-1-david@fromorbit.com/
> [2] https://people.freebsd.org/~mjg/fstree.tgz

Nice results. I've looked through the patch and otherwise I didn't find any
issue.

								Honza

> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
> 
> This is an initial submission to gauge interest.
> 
> I do claim this provides great bang for the buck, I don't claim it
> solves the problem overall. *something* finer-grained will need to
> land.
> 
> I wanted to add bcachefs to the list, but I ran into memory reclamation
> issues again (first time here:
> https://lore.kernel.org/all/CAGudoHGenxzk0ZqPXXi1_QDbfqQhGHu+wUwzyS6WmfkUZ1HiXA@mail.gmail.com/),
> did not have time to mess with diagnostic to write a report yet.
> 
> I'll post a patchset with this (+ tidy ups to comments and whatnot) +
> btrfs + bcachefs conversion after the above gets reported and sorted
> out.
> 
> Also interestingly things improved since last year, when Linux needed
> about a minute.
> 
>  fs/inode.c         | 106 +++++++++++++++++++++++++++++++++++++--------
>  include/linux/fs.h |  10 ++++-
>  2 files changed, 98 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 3a41f83a4ba5..f40b868f491f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -886,36 +886,43 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
>  	return freed;
>  }
>  
> -static void __wait_on_freeing_inode(struct inode *inode);
> +static void __wait_on_freeing_inode(struct inode *inode, bool locked);
>  /*
>   * Called with the inode lock held.
>   */
>  static struct inode *find_inode(struct super_block *sb,
>  				struct hlist_head *head,
>  				int (*test)(struct inode *, void *),
> -				void *data)
> +				void *data, bool locked)
>  {
>  	struct inode *inode = NULL;
>  
> +	if (locked)
> +		lockdep_assert_held(&inode_hash_lock);
> +
> +	rcu_read_lock();
>  repeat:
> -	hlist_for_each_entry(inode, head, i_hash) {
> +	hlist_for_each_entry_rcu(inode, head, i_hash) {
>  		if (inode->i_sb != sb)
>  			continue;
>  		if (!test(inode, data))
>  			continue;
>  		spin_lock(&inode->i_lock);
>  		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
> -			__wait_on_freeing_inode(inode);
> +			__wait_on_freeing_inode(inode, locked);
>  			goto repeat;
>  		}
>  		if (unlikely(inode->i_state & I_CREATING)) {
>  			spin_unlock(&inode->i_lock);
> +			rcu_read_unlock();
>  			return ERR_PTR(-ESTALE);
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> +		rcu_read_unlock();
>  		return inode;
>  	}
> +	rcu_read_unlock();
>  	return NULL;
>  }
>  
> @@ -924,29 +931,37 @@ static struct inode *find_inode(struct super_block *sb,
>   * iget_locked for details.
>   */
>  static struct inode *find_inode_fast(struct super_block *sb,
> -				struct hlist_head *head, unsigned long ino)
> +				struct hlist_head *head, unsigned long ino,
> +				bool locked)
>  {
>  	struct inode *inode = NULL;
>  
> +	if (locked)
> +		lockdep_assert_held(&inode_hash_lock);
> +
> +	rcu_read_lock();
>  repeat:
> -	hlist_for_each_entry(inode, head, i_hash) {
> +	hlist_for_each_entry_rcu(inode, head, i_hash) {
>  		if (inode->i_ino != ino)
>  			continue;
>  		if (inode->i_sb != sb)
>  			continue;
>  		spin_lock(&inode->i_lock);
>  		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
> -			__wait_on_freeing_inode(inode);
> +			__wait_on_freeing_inode(inode, locked);
>  			goto repeat;
>  		}
>  		if (unlikely(inode->i_state & I_CREATING)) {
>  			spin_unlock(&inode->i_lock);
> +			rcu_read_unlock();
>  			return ERR_PTR(-ESTALE);
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> +		rcu_read_unlock();
>  		return inode;
>  	}
> +	rcu_read_unlock();
>  	return NULL;
>  }
>  
> @@ -1161,7 +1176,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
>  
>  again:
>  	spin_lock(&inode_hash_lock);
> -	old = find_inode(inode->i_sb, head, test, data);
> +	old = find_inode(inode->i_sb, head, test, data, true);
>  	if (unlikely(old)) {
>  		/*
>  		 * Uhhuh, somebody else created the same inode under us.
> @@ -1245,6 +1260,43 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
>  }
>  EXPORT_SYMBOL(iget5_locked);
>  
> +/**
> + * iget5_locked_rcu - obtain an inode from a mounted file system
> + *
> + * This is equivalent to iget5_locked, except the @test callback must
> + * tolerate inode not being stable, including being mid-teardown.
> + */
> +struct inode *iget5_locked_rcu(struct super_block *sb, unsigned long hashval,
> +		int (*test)(struct inode *, void *),
> +		int (*set)(struct inode *, void *), void *data)
> +{
> +	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
> +	struct inode *inode, *new;
> +
> +again:
> +	inode = find_inode(sb, head, test, data, false);
> +	if (inode) {
> +		if (IS_ERR(inode))
> +			return NULL;
> +		wait_on_inode(inode);
> +		if (unlikely(inode_unhashed(inode))) {
> +			iput(inode);
> +			goto again;
> +		}
> +		return inode;
> +	}
> +
> +	new = alloc_inode(sb);
> +	if (new) {
> +		new->i_state = 0;
> +		inode = inode_insert5(new, hashval, test, set, data);
> +		if (unlikely(inode != new))
> +			destroy_inode(new);
> +	}
> +	return inode;
> +}
> +EXPORT_SYMBOL(iget5_locked_rcu);
> +
>  /**
>   * iget_locked - obtain an inode from a mounted file system
>   * @sb:		super block of file system
> @@ -1263,9 +1315,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
>  	struct hlist_head *head = inode_hashtable + hash(sb, ino);
>  	struct inode *inode;
>  again:
> -	spin_lock(&inode_hash_lock);
> -	inode = find_inode_fast(sb, head, ino);
> -	spin_unlock(&inode_hash_lock);
> +	inode = find_inode_fast(sb, head, ino, false);
>  	if (inode) {
>  		if (IS_ERR(inode))
>  			return NULL;
> @@ -1283,7 +1333,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
>  
>  		spin_lock(&inode_hash_lock);
>  		/* We released the lock, so.. */
> -		old = find_inode_fast(sb, head, ino);
> +		old = find_inode_fast(sb, head, ino, true);
>  		if (!old) {
>  			inode->i_ino = ino;
>  			spin_lock(&inode->i_lock);
> @@ -1419,13 +1469,31 @@ struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
>  	struct inode *inode;
>  
>  	spin_lock(&inode_hash_lock);
> -	inode = find_inode(sb, head, test, data);
> +	inode = find_inode(sb, head, test, data, true);
>  	spin_unlock(&inode_hash_lock);
>  
>  	return IS_ERR(inode) ? NULL : inode;
>  }
>  EXPORT_SYMBOL(ilookup5_nowait);
>  
> +/**
> + * ilookup5_nowait_rcu - search for an inode in the inode cache
> + *
> + * This is equivalent to ilookup5_nowait, except the @test callback must
> + * tolerate inode not being stable, including being mid-teardown.
> + */
> +struct inode *ilookup5_nowait_rcu(struct super_block *sb, unsigned long hashval,
> +		int (*test)(struct inode *, void *), void *data)
> +{
> +	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
> +	struct inode *inode;
> +
> +	inode = find_inode(sb, head, test, data, false);
> +
> +	return IS_ERR(inode) ? NULL : inode;
> +}
> +EXPORT_SYMBOL(ilookup5_nowait_rcu);
> +
>  /**
>   * ilookup5 - search for an inode in the inode cache
>   * @sb:		super block of file system to search
> @@ -1474,7 +1542,7 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
>  	struct inode *inode;
>  again:
>  	spin_lock(&inode_hash_lock);
> -	inode = find_inode_fast(sb, head, ino);
> +	inode = find_inode_fast(sb, head, ino, true);
>  	spin_unlock(&inode_hash_lock);
>  
>  	if (inode) {
> @@ -2235,17 +2303,21 @@ EXPORT_SYMBOL(inode_needs_sync);
>   * wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
>   * will DTRT.
>   */
> -static void __wait_on_freeing_inode(struct inode *inode)
> +static void __wait_on_freeing_inode(struct inode *inode, bool locked)
>  {
>  	wait_queue_head_t *wq;
>  	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
>  	wq = bit_waitqueue(&inode->i_state, __I_NEW);
>  	prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
>  	spin_unlock(&inode->i_lock);
> -	spin_unlock(&inode_hash_lock);
> +	rcu_read_unlock();
> +	if (locked)
> +		spin_unlock(&inode_hash_lock);
>  	schedule();
>  	finish_wait(wq, &wait.wq_entry);
> -	spin_lock(&inode_hash_lock);
> +	if (locked)
> +		spin_lock(&inode_hash_lock);
> +	rcu_read_lock();
>  }
>  
>  static __initdata unsigned long ihash_entries;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0283cf366c2a..2817c915d355 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3021,6 +3021,9 @@ extern void d_mark_dontcache(struct inode *inode);
>  extern struct inode *ilookup5_nowait(struct super_block *sb,
>  		unsigned long hashval, int (*test)(struct inode *, void *),
>  		void *data);
> +extern struct inode *ilookup5_nowait_rcu(struct super_block *sb,
> +		unsigned long hashval, int (*test)(struct inode *, void *),
> +		void *data);
>  extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
>  		int (*test)(struct inode *, void *), void *data);
>  extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
> @@ -3029,7 +3032,12 @@ extern struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
>  		int (*test)(struct inode *, void *),
>  		int (*set)(struct inode *, void *),
>  		void *data);
> -extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
> +extern struct inode * iget5_locked(struct super_block *, unsigned long,
> +				   int (*test)(struct inode *, void *),
> +				   int (*set)(struct inode *, void *), void *);
> +extern struct inode * iget5_locked_rcu(struct super_block *, unsigned long,
> +				       int (*test)(struct inode *, void *),
> +				       int (*set)(struct inode *, void *), void *);
>  extern struct inode * iget_locked(struct super_block *, unsigned long);
>  extern struct inode *find_inode_nowait(struct super_block *,
>  				       unsigned long,
> -- 
> 2.43.0
>
Mateusz Guzik June 6, 2024, 6:45 p.m. UTC | #2
On Thu, Jun 6, 2024 at 6:31 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 06-06-24 16:05:15, Mateusz Guzik wrote:
> > Instantiating a new inode normally takes the global inode hash lock
> > twice:
> > 1. once to check if it happens to already be present
> > 2. once to add it to the hash
> >
> > The back-to-back lock/unlock pattern is known to degrade performance
> > significantly, which is further exacerbated if the hash is heavily
> > populated (long chains to walk, extending hold time). Arguably hash
> > sizing and hashing algo need to be revisited, but that's beyond the
> > scope of this patch.
> >
> > A long term fix would introduce fine-grained locking, this was attempted
> > in [1], but that patchset was already posted several times and appears
> > stalled.
> >
> > A simpler idea which solves majority of the problem and which may be
> > good enough for the time being is to use RCU for the initial lookup.
> > Basic RCU support is already present in the hash, it is just not being
> > used for lookup on inode creation.
> >
> > iget_locked consumers (notably ext4) get away without any changes
> > because inode comparison method is built-in.
> >
> > iget5_locked and ilookup5_nowait consumers pass a custom callback. Since
> > removal of locking adds more problems (inode can be changing) it's not
> > safe to assume all filesystems happen to cope.  Thus iget5_locked_rcu
> > ilookup5_nowait_rcu get added, requiring manual conversion.
>
> BTW, why not ilookup5_rcu() as well? To keep symmetry with non-RCU APIs and
> iget5_locked_rcu() could then use ilookup5_rcu().

I don't have a strong opinion. Note that the routine as implemented
right now mimics iget_locked.

> I presume eventually we'd like to transition everything to these RCU based methods?
>

That is up in the air, but I would not go for it. Note that every
iget5_locked consumer would have to be reviewed for callback safety,
which is a lot of work for no real gain for the vast majority of
filesystems out there.

Also note the RCU variants are only used in cases which tolerate a
false negative -- if the inode fails to match, the caller is expected
to cope by creating a new inode and performing a locked lookup. It does
not have to be this way, but I find it less error-prone.

> > In order to reduce code duplication find_inode and find_inode_fast grow
> > an argument indicating whether inode hash lock is held, which is passed
> > down should sleeping be necessary. They always rcu_read_lock, which is
> > redundant but harmless. Doing it conditionally reduces readability for
> > no real gain that I can see. RCU-alike restrictions were already put on
> > callbacks due to the hash spinlock being held.
> >
> > Benchmarked with the following: a 32-core vm with 24GB of RAM, a
> > dedicated fs partition. 20 separate trees with 1000 directories * 1000
> > files.  Then walked by 20 processes issuing stat on files, each on a
> > dedicated tree. Testcase is at [2].
> >
> > In this particular workload, mimicking a real-world setup $elsewhere,
> > the initial lookup is guaranteed to fail, guaranteeing the 2 lock
> > acquires. At the same time RAM is scarce enough compared to the
> > demand that inodes keep needing to be recycled.
> >
> > Total real time fluctuates by 1-2s, sample results:
> >
> > ext4 (needed mkfs.ext4 -N 24000000):
> > before:       3.77s user 890.90s system 1939% cpu 46.118 total
> > after:  3.24s user 397.73s system 1858% cpu 21.581 total (-53%)
> >
> > btrfs (s/iget5_locked/iget5_locked_rcu in fs/btrfs/inode.c):
> > before: 3.54s user 892.30s system 1966% cpu 45.549 total
> > after:  3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
> >
> > btrfs is heavily bottlenecked on its own locks, so the improvement is
> > small in comparison.
> >
> > [1] https://lore.kernel.org/all/20231206060629.2827226-1-david@fromorbit.com/
> > [2] https://people.freebsd.org/~mjg/fstree.tgz
>
> Nice results. I've looked through the patch and otherwise I didn't find any
> issue.
>
>                                                                 Honza
>
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > ---
> >
> > This is an initial submission to gauge interest.
> >
> > I do claim this provides great bang for the buck, I don't claim it
> > solves the problem overall. *something* finer-grained will need to
> > land.
> >
> > I wanted to add bcachefs to the list, but I ran into memory reclamation
> > issues again (first time here:
> > https://lore.kernel.org/all/CAGudoHGenxzk0ZqPXXi1_QDbfqQhGHu+wUwzyS6WmfkUZ1HiXA@mail.gmail.com/),
> > did not have time to mess with diagnostic to write a report yet.
> >
> > I'll post a patchset with this (+ tidy ups to comments and whatnot) +
> > btrfs + bcachefs conversion after the above gets reported and sorted
> > out.
> >
> > Also interestingly things improved since last year, when Linux needed
> > about a minute.
> >
> >  fs/inode.c         | 106 +++++++++++++++++++++++++++++++++++++--------
> >  include/linux/fs.h |  10 ++++-
> >  2 files changed, 98 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 3a41f83a4ba5..f40b868f491f 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -886,36 +886,43 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
> >       return freed;
> >  }
> >
> > -static void __wait_on_freeing_inode(struct inode *inode);
> > +static void __wait_on_freeing_inode(struct inode *inode, bool locked);
> >  /*
> >   * Called with the inode lock held.
> >   */
> >  static struct inode *find_inode(struct super_block *sb,
> >                               struct hlist_head *head,
> >                               int (*test)(struct inode *, void *),
> > -                             void *data)
> > +                             void *data, bool locked)
> >  {
> >       struct inode *inode = NULL;
> >
> > +     if (locked)
> > +             lockdep_assert_held(&inode_hash_lock);
> > +
> > +     rcu_read_lock();
> >  repeat:
> > -     hlist_for_each_entry(inode, head, i_hash) {
> > +     hlist_for_each_entry_rcu(inode, head, i_hash) {
> >               if (inode->i_sb != sb)
> >                       continue;
> >               if (!test(inode, data))
> >                       continue;
> >               spin_lock(&inode->i_lock);
> >               if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
> > -                     __wait_on_freeing_inode(inode);
> > +                     __wait_on_freeing_inode(inode, locked);
> >                       goto repeat;
> >               }
> >               if (unlikely(inode->i_state & I_CREATING)) {
> >                       spin_unlock(&inode->i_lock);
> > +                     rcu_read_unlock();
> >                       return ERR_PTR(-ESTALE);
> >               }
> >               __iget(inode);
> >               spin_unlock(&inode->i_lock);
> > +             rcu_read_unlock();
> >               return inode;
> >       }
> > +     rcu_read_unlock();
> >       return NULL;
> >  }
> >
> > @@ -924,29 +931,37 @@ static struct inode *find_inode(struct super_block *sb,
> >   * iget_locked for details.
> >   */
> >  static struct inode *find_inode_fast(struct super_block *sb,
> > -                             struct hlist_head *head, unsigned long ino)
> > +                             struct hlist_head *head, unsigned long ino,
> > +                             bool locked)
> >  {
> >       struct inode *inode = NULL;
> >
> > +     if (locked)
> > +             lockdep_assert_held(&inode_hash_lock);
> > +
> > +     rcu_read_lock();
> >  repeat:
> > -     hlist_for_each_entry(inode, head, i_hash) {
> > +     hlist_for_each_entry_rcu(inode, head, i_hash) {
> >               if (inode->i_ino != ino)
> >                       continue;
> >               if (inode->i_sb != sb)
> >                       continue;
> >               spin_lock(&inode->i_lock);
> >               if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
> > -                     __wait_on_freeing_inode(inode);
> > +                     __wait_on_freeing_inode(inode, locked);
> >                       goto repeat;
> >               }
> >               if (unlikely(inode->i_state & I_CREATING)) {
> >                       spin_unlock(&inode->i_lock);
> > +                     rcu_read_unlock();
> >                       return ERR_PTR(-ESTALE);
> >               }
> >               __iget(inode);
> >               spin_unlock(&inode->i_lock);
> > +             rcu_read_unlock();
> >               return inode;
> >       }
> > +     rcu_read_unlock();
> >       return NULL;
> >  }
> >
> > @@ -1161,7 +1176,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
> >
> >  again:
> >       spin_lock(&inode_hash_lock);
> > -     old = find_inode(inode->i_sb, head, test, data);
> > +     old = find_inode(inode->i_sb, head, test, data, true);
> >       if (unlikely(old)) {
> >               /*
> >                * Uhhuh, somebody else created the same inode under us.
> > @@ -1245,6 +1260,43 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
> >  }
> >  EXPORT_SYMBOL(iget5_locked);
> >
> > +/**
> > + * iget5_locked_rcu - obtain an inode from a mounted file system
> > + *
> > + * This is equivalent to iget5_locked, except the @test callback must
> > + * tolerate inode not being stable, including being mid-teardown.
> > + */
> > +struct inode *iget5_locked_rcu(struct super_block *sb, unsigned long hashval,
> > +             int (*test)(struct inode *, void *),
> > +             int (*set)(struct inode *, void *), void *data)
> > +{
> > +     struct hlist_head *head = inode_hashtable + hash(sb, hashval);
> > +     struct inode *inode, *new;
> > +
> > +again:
> > +     inode = find_inode(sb, head, test, data, false);
> > +     if (inode) {
> > +             if (IS_ERR(inode))
> > +                     return NULL;
> > +             wait_on_inode(inode);
> > +             if (unlikely(inode_unhashed(inode))) {
> > +                     iput(inode);
> > +                     goto again;
> > +             }
> > +             return inode;
> > +     }
> > +
> > +     new = alloc_inode(sb);
> > +     if (new) {
> > +             new->i_state = 0;
> > +             inode = inode_insert5(new, hashval, test, set, data);
> > +             if (unlikely(inode != new))
> > +                     destroy_inode(new);
> > +     }
> > +     return inode;
> > +}
> > +EXPORT_SYMBOL(iget5_locked_rcu);
> > +
> >  /**
> >   * iget_locked - obtain an inode from a mounted file system
> >   * @sb:              super block of file system
> > @@ -1263,9 +1315,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> >       struct hlist_head *head = inode_hashtable + hash(sb, ino);
> >       struct inode *inode;
> >  again:
> > -     spin_lock(&inode_hash_lock);
> > -     inode = find_inode_fast(sb, head, ino);
> > -     spin_unlock(&inode_hash_lock);
> > +     inode = find_inode_fast(sb, head, ino, false);
> >       if (inode) {
> >               if (IS_ERR(inode))
> >                       return NULL;
> > @@ -1283,7 +1333,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> >
> >               spin_lock(&inode_hash_lock);
> >               /* We released the lock, so.. */
> > -             old = find_inode_fast(sb, head, ino);
> > +             old = find_inode_fast(sb, head, ino, true);
> >               if (!old) {
> >                       inode->i_ino = ino;
> >                       spin_lock(&inode->i_lock);
> > @@ -1419,13 +1469,31 @@ struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
> >       struct inode *inode;
> >
> >       spin_lock(&inode_hash_lock);
> > -     inode = find_inode(sb, head, test, data);
> > +     inode = find_inode(sb, head, test, data, true);
> >       spin_unlock(&inode_hash_lock);
> >
> >       return IS_ERR(inode) ? NULL : inode;
> >  }
> >  EXPORT_SYMBOL(ilookup5_nowait);
> >
> > +/**
> > + * ilookup5_nowait_rcu - search for an inode in the inode cache
> > + *
> > + * This is equivalent to ilookup5_nowait, except the @test callback must
> > + * tolerate inode not being stable, including being mid-teardown.
> > + */
> > +struct inode *ilookup5_nowait_rcu(struct super_block *sb, unsigned long hashval,
> > +             int (*test)(struct inode *, void *), void *data)
> > +{
> > +     struct hlist_head *head = inode_hashtable + hash(sb, hashval);
> > +     struct inode *inode;
> > +
> > +     inode = find_inode(sb, head, test, data, false);
> > +
> > +     return IS_ERR(inode) ? NULL : inode;
> > +}
> > +EXPORT_SYMBOL(ilookup5_nowait_rcu);
> > +
> >  /**
> >   * ilookup5 - search for an inode in the inode cache
> >   * @sb:              super block of file system to search
> > @@ -1474,7 +1542,7 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
> >       struct inode *inode;
> >  again:
> >       spin_lock(&inode_hash_lock);
> > -     inode = find_inode_fast(sb, head, ino);
> > +     inode = find_inode_fast(sb, head, ino, true);
> >       spin_unlock(&inode_hash_lock);
> >
> >       if (inode) {
> > @@ -2235,17 +2303,21 @@ EXPORT_SYMBOL(inode_needs_sync);
> >   * wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
> >   * will DTRT.
> >   */
> > -static void __wait_on_freeing_inode(struct inode *inode)
> > +static void __wait_on_freeing_inode(struct inode *inode, bool locked)
> >  {
> >       wait_queue_head_t *wq;
> >       DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
> >       wq = bit_waitqueue(&inode->i_state, __I_NEW);
> >       prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> >       spin_unlock(&inode->i_lock);
> > -     spin_unlock(&inode_hash_lock);
> > +     rcu_read_unlock();
> > +     if (locked)
> > +             spin_unlock(&inode_hash_lock);
> >       schedule();
> >       finish_wait(wq, &wait.wq_entry);
> > -     spin_lock(&inode_hash_lock);
> > +     if (locked)
> > +             spin_lock(&inode_hash_lock);
> > +     rcu_read_lock();
> >  }
> >
> >  static __initdata unsigned long ihash_entries;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 0283cf366c2a..2817c915d355 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3021,6 +3021,9 @@ extern void d_mark_dontcache(struct inode *inode);
> >  extern struct inode *ilookup5_nowait(struct super_block *sb,
> >               unsigned long hashval, int (*test)(struct inode *, void *),
> >               void *data);
> > +extern struct inode *ilookup5_nowait_rcu(struct super_block *sb,
> > +             unsigned long hashval, int (*test)(struct inode *, void *),
> > +             void *data);
> >  extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
> >               int (*test)(struct inode *, void *), void *data);
> >  extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
> > @@ -3029,7 +3032,12 @@ extern struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
> >               int (*test)(struct inode *, void *),
> >               int (*set)(struct inode *, void *),
> >               void *data);
> > -extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
> > +extern struct inode * iget5_locked(struct super_block *, unsigned long,
> > +                                int (*test)(struct inode *, void *),
> > +                                int (*set)(struct inode *, void *), void *);
> > +extern struct inode * iget5_locked_rcu(struct super_block *, unsigned long,
> > +                                    int (*test)(struct inode *, void *),
> > +                                    int (*set)(struct inode *, void *), void *);
> >  extern struct inode * iget_locked(struct super_block *, unsigned long);
> >  extern struct inode *find_inode_nowait(struct super_block *,
> >                                      unsigned long,
> > --
> > 2.43.0
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Dave Chinner June 7, 2024, 2:04 a.m. UTC | #3
On Thu, Jun 06, 2024 at 04:05:15PM +0200, Mateusz Guzik wrote:
> Instantiating a new inode normally takes the global inode hash lock
> twice:
> 1. once to check if it happens to already be present
> 2. once to add it to the hash
> 
> The back-to-back lock/unlock pattern is known to degrade performance
> significantly, which is further exacerbated if the hash is heavily
> populated (long chains to walk, extending hold time). Arguably hash
> sizing and hashing algo need to be revisited, but that's beyond the
> scope of this patch.
> 
> A long term fix would introduce fine-grained locking, this was attempted
> in [1], but that patchset was already posted several times and appears
> stalled.

Why not just pick up those patches and drive them to completion?

I have no issues with somebody else doing the finishing work for
that code; several of the patches in that series were originally
written by other people in the first place...

-Dave.
Mateusz Guzik June 7, 2024, 7:51 a.m. UTC | #4
On Fri, Jun 07, 2024 at 12:04:58PM +1000, Dave Chinner wrote:
> On Thu, Jun 06, 2024 at 04:05:15PM +0200, Mateusz Guzik wrote:
> > Instantiating a new inode normally takes the global inode hash lock
> > twice:
> > 1. once to check if it happens to already be present
> > 2. once to add it to the hash
> > 
> > The back-to-back lock/unlock pattern is known to degrade performance
> > significantly, which is further exacerbated if the hash is heavily
> > populated (long chains to walk, extending hold time). Arguably hash
> > sizing and hashing algo need to be revisited, but that's beyond the
> > scope of this patch.
> > 
> > A long term fix would introduce fine-grained locking, this was attempted
> > in [1], but that patchset was already posted several times and appears
> > stalled.
> 
> Why not just pick up those patches and drive them to completion?
> 

Time constraints on my end aside.

From your own e-mail [1] last year, the problems are:

> - A lack of recent validation against ext4, btrfs and other
> filesystems.
> - the loss of lockdep coverage by moving to bit locks
> - it breaks CONFIG_PREEMPT_RT=y because we nest other spinlocks
>   inside the inode_hash_lock and we can't do that if we convert the
>   inode hash to bit locks because RT makes spinlocks sleeping locks.
> - There's been additions for lockless RCU inode hash lookups from
>   AFS and ext4 in weird, uncommon corner cases and I have no idea
>   how to validate they still work correctly with hash-bl. I suspect
>   they should just go away with hash-bl, but....

> There's more, but these are the big ones.

I did see that the lockdep and PREEMPT_RT problems were patched up in a
later iteration ([2]).

What we both agree on is that the patchset adds enough complexity that
it needs solid justification. I assumed one was there on your end when
you posted it.

For that entire patchset I don't have one. I can however justify the
comparatively trivial thing I posted in this thread.

That aside, if I had to make the entire thing scale, I would approach
things differently, most notably in terms of locking granularity. Per
your own statement, things can be made to look great in
microbenchmarks, but that does not necessarily mean they help. A lot of
it is a tradeoff, and making everything per-cpu for this particular
problem may be taking it too far.

For the inode hash, the old hack of having a lock array may do it more
than well enough -- say 32 locks (or some other number dependent on the
hash size), each covering a dedicated subset of buckets. This very
plausibly would scale well enough even in a microbenchmark with a high
core count. Also note that having regular spinlocks there would retain
all lockdep and whatnot coverage, while having a smaller memory
footprint than the variant coming in [2], which adds a regular spinlock
for each bucket.
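
A rough sketch of what such a sharded lock scheme could look like
(illustrative only, not code from any posted patch):

	#include <linux/spinlock.h>

	#define INODE_HASH_LOCK_COUNT	32	/* or scaled with hash size */

	static spinlock_t inode_hash_locks[INODE_HASH_LOCK_COUNT];

	/* pick the lock covering the subset a given bucket falls into */
	static inline spinlock_t *inode_hash_lock_for(unsigned long hashval)
	{
		return &inode_hash_locks[hashval & (INODE_HASH_LOCK_COUNT - 1)];
	}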

In a similar spirit I'm not convinced per-cpu lists for super blocks
are warranted either. Instead, some number per NUMA domain or scaled
with the core count, with the backing area allocated from the
respective domain, would probably do the trick well enough. Even if
per-cpu granularity is needed, the plain array allocation as seen in
the dlist patch warrants adjustment:
+	dlist->heads = kcalloc(nr_cpu_ids, sizeof(struct dlock_list_head),
+			       GFP_KERNEL);

That is, it should allocate separate arrays per domain, but then
everything else has to be adjusted to also know which array to index
into. Maybe it should use the alloc_percpu machinery, as sketched
below.
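
For instance, going through the percpu allocator could look roughly
like this (a sketch; the dlock_list_head layout is assumed to resemble
the one from the dlist patch, this is not the actual posted code):

	#include <linux/percpu.h>
	#include <linux/list.h>
	#include <linux/spinlock.h>

	struct dlock_list_head {
		struct list_head list;
		spinlock_t lock;
	};

	/*
	 * Allocate the heads with the percpu allocator so each CPU's
	 * head lives in memory local to that CPU, instead of one
	 * kcalloc'd array indexed by cpu id.
	 */
	static struct dlock_list_head __percpu *alloc_dlist_heads(void)
	{
		int cpu;
		struct dlock_list_head __percpu *heads;

		heads = alloc_percpu(struct dlock_list_head);
		if (!heads)
			return NULL;

		for_each_possible_cpu(cpu) {
			struct dlock_list_head *h = per_cpu_ptr(heads, cpu);

			INIT_LIST_HEAD(&h->list);
			spin_lock_init(&h->lock);
		}
		return heads;
	}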

All that said, if someone(tm) wants to pick up your patchset and even
commit it the way it is right now I'm not going to protest anything.

I don't have time nor justification to do full work My Way(tm).

I did have time to hack up the initial RCU lookup, which imo does not
require much justification for inclusion and does help the case I'm
interested in.

[1] https://lore.kernel.org/linux-fsdevel/ZUautmLUcRyUqZZ+@dread.disaster.area/
[2] https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-11-david@fromorbit.com/
Dave Chinner June 11, 2024, 10:17 a.m. UTC | #5
On Fri, Jun 07, 2024 at 09:51:51AM +0200, Mateusz Guzik wrote:
> On Fri, Jun 07, 2024 at 12:04:58PM +1000, Dave Chinner wrote:
> > On Thu, Jun 06, 2024 at 04:05:15PM +0200, Mateusz Guzik wrote:
> > > Instantiating a new inode normally takes the global inode hash lock
> > > twice:
> > > 1. once to check if it happens to already be present
> > > 2. once to add it to the hash
> > > 
> > > The back-to-back lock/unlock pattern is known to degrade performance
> > > significantly, which is further exacerbated if the hash is heavily
> > > populated (long chains to walk, extending hold time). Arguably hash
> > > sizing and hashing algo need to be revisited, but that's beyond the
> > > scope of this patch.
> > > 
> > > A long term fix would introduce fine-grained locking, this was attempted
> > > in [1], but that patchset was already posted several times and appears
> > > stalled.
> > 
> > Why not just pick up those patches and drive them to completion?
> > 
> 
> Time constraints on my end aside.
> 
> From your own e-mail [1] last year problems are:
> 
> > - A lack of recent validation against ext4, btrfs and other
> > filesystems.
> > - the loss of lockdep coverage by moving to bit locks
> > - it breaks CONFIG_PREEMPT_RT=y because we nest other spinlocks
> >   inside the inode_hash_lock and we can't do that if we convert the
> >   inode hash to bit locks because RT makes spinlocks sleeping locks.
> > - There's been additions for lockless RCU inode hash lookups from
> >   AFS and ext4 in weird, uncommon corner cases and I have no idea
> >   how to validate they still work correctly with hash-bl. I suspect
> >   they should just go away with hash-bl, but....
> 
> > There's more, but these are the big ones.
> 
> I did see the lockdep and preempt_rt problem were patched up in a later
> iteration ([2]).
> 
> What we both agree on is that the patchset adds enough complexity that
> it needs solid justification. I assumed one was there on your end when
> you posted it.
> 
> For that entire patchset I don't have one. I can however justify the
> comparatively trivial thing I posted in this thread.

I didn't suggest you pick up the entire patchset; I suggested you pull
the inode cache conversion to hash-bl from it and use that instead of
hacking RCU lookups into the inode cache.

> That aside if I had to make the entire thing scale I would approach
> things differently, most notably in terms of locking granularity. Per
> your own statement things can be made to look great in microbenchmarks,
> but that does not necessarily mean they help. A lot of it is a tradeoff
> and making everything per-cpu for this particular problem may be taking
> it too far.

The hash-bl conversion doesn't make anything per-cpu, so I don't
know what you are complaining about here.

> For the inode hash it may be the old hack of having a lock array would
> do it more than well enough -- say 32 locks (or some other number
> dependent on hash size) each covering a dedicated subset.

No, that's a mid-1990s-era hack that was done because we didn't know
any better ways to scale global algorithms back then.
It's not a scalable algorithm; it's a global algorithm that has been
sharded a few times to try to keep global-scope operations apart. The
moment you get subset collisions because of a specific workload
pattern, the scalability problem comes straight back.

We have been doing better than this for years, and it's not a viable
solution for things like core OS cache indexing. We need a construct
that is inherently scalable and has little cost to that inherent
scalability.

We have an inherently scalable solution already - hash-bl adds
fine-grained locking without changing locking behaviour or consuming
extra memory. And it scales lookup, insert and remove with a common
locking scheme across all operations.

Your patch, however, just converts *some* of the lookup API
operations to use RCU. It adds complexity for things like inserts
which are going to need inode hash locking if the RCU lookup fails,
anyway.

Hence your patch optimises the case where the inode is in cache but
the dentry isn't, yet we'll still get massive contention on lookup
whenever the RCU lookup on the inode cache misses and an insert is
going to be required anyway.

IOWs, even RCU lookups are not going to prevent inode hash lock
contention for parallel cold cache lookups. Hence, with RCU,
applications are going to see unpredictable contention behaviour
dependent on the memory footprint of the caches at the time of the
lookup. Users will have no way of predicting when the behaviour will
change, let alone have any way of mitigating it. Unpredictable
variable behaviour is the thing we want to avoid the most with core
OS caches.

In comparison, hash-bl based inode cache lookups will have the same
behaviour for both cache hits and cache misses. There will be nearly
zero contention for both types of lookups, as the insert can be done
immediately under the same lock that is held for the lookup. Hence
performance will be predictable and largely deterministic under a wide
variety of workloads and changing memory pressure. This is how we want
the inode cache to behave.

IOWs, the hash-bl solution is superior to RCU lookups for many
reasons. RCU lookups might be fast, but it's not the best solution
to every problem that requires scalable behaviour.

> All that said, if someone(tm) wants to pick up your patchset and even
> commit it the way it is right now I'm not going to protest anything.
>
> I don't have time nor justification to do full work My Way(tm).

You *always* say this sort of thing when someone asks you to do
something different.

If you don't agree with the reviewer's comments, you make some
statement about how you "don't care enough to fix it properly" or that
you don't have time to fix it properly, or you think the reviewer is
asking you to "fix everything" rather than just one line in your patch
so you reject all the reviewer's comments, or you say "if the
maintainers want something" as a way of saying "you don't have any
authority, so I'm not going to listen to you at all", or ....

> I did have time to hack up the initial rcu lookup, which imo does not
> require much to justify inclusion and does help the case I'm interested
> in.

I think it requires a lot more justification than you have given,
especially given that changing memory pressure, access patterns and
cache balance will have a great impact on whether the RCU path makes
any difference to inode hash lock contention at all.

The hash-bl conversion is a much better solution and the work is
mostly done. If you want your workload case to perform better, then
please pick up the inode hash lock conversion to hash-bl.

-Dave.
Mateusz Guzik June 11, 2024, 11:13 a.m. UTC | #6
On Tue, Jun 11, 2024 at 08:17:16PM +1000, Dave Chinner wrote:
> On Fri, Jun 07, 2024 at 09:51:51AM +0200, Mateusz Guzik wrote:
> > On Fri, Jun 07, 2024 at 12:04:58PM +1000, Dave Chinner wrote:
> > > On Thu, Jun 06, 2024 at 04:05:15PM +0200, Mateusz Guzik wrote:
> > > > Instantiating a new inode normally takes the global inode hash lock
> > > > twice:
> > > > 1. once to check if it happens to already be present
> > > > 2. once to add it to the hash
> > > > 
> > > > The back-to-back lock/unlock pattern is known to degrade performance
> > > > significantly, which is further exacerbated if the hash is heavily
> > > > populated (long chains to walk, extending hold time). Arguably hash
> > > > sizing and hashing algo need to be revisited, but that's beyond the
> > > > scope of this patch.
> > > > 
> > > > A long term fix would introduce fine-grained locking, this was attempted
> > > > in [1], but that patchset was already posted several times and appears
> > > > stalled.
> > > 
> > > Why not just pick up those patches and drive them to completion?
> > > 
> > 
> > Time constraints on my end aside.
> > 
> > From your own e-mail [1] last year problems are:
> > 
> > > - A lack of recent validation against ext4, btrfs and other
> > > filesystems.
> > > - the loss of lockdep coverage by moving to bit locks
> > > - it breaks CONFIG_PREEMPT_RT=y because we nest other spinlocks
> > >   inside the inode_hash_lock and we can't do that if we convert the
> > >   inode hash to bit locks because RT makes spinlocks sleeping locks.
> > > - There's been additions for lockless RCU inode hash lookups from
> > >   AFS and ext4 in weird, uncommon corner cases and I have no idea
> > >   how to validate they still work correctly with hash-bl. I suspect
> > >   they should just go away with hash-bl, but....
> > 
> > > There's more, but these are the big ones.
> > 
> > I did see the lockdep and preempt_rt problem were patched up in a later
> > iteration ([2]).
> > 
> > What we both agree on is that the patchset adds enough complexity that
> > it needs solid justification. I assumed one was there on your end when
> > you posted it.
> > 
> > For that entire patchset I don't have one. I can however justify the
> > comparatively trivial thing I posted in this thread.
> 
> I didn't suggest you pick up the entire patchset, I suggested you
> pull the inode cache conversion to hash-bl from it and use that
> instead of hacking RCU lookups into the inode cache.
> 

That's still a significantly more complicated change than my proposal.

> > That aside if I had to make the entire thing scale I would approach
> > things differently, most notably in terms of locking granularity. Per
> > your own statement things can be made to look great in microbenchmarks,
> > but that does not necessarily mean they help. A lot of it is a tradeoff
> > and making everything per-cpu for this particular problem may be taking
> > it too far.
> 
> The hash-bl conversion doesn't make anything per-cpu, so I don't
> know what you are complaining about here.
> 

I crossed the wires with another patchset and wrote some garbage here,
but it is not hard to error-correct to what I meant given the context:
per-chain locking is not necessarily warranted, in which case bitlocks
and associated trouble can be avoided.

> > For the inode hash it may be the old hack of having a lock array would
> > do it more than well enough -- say 32 locks (or some other number
> > dependent on hash size) each covering a dedicated subset.
> 
> No, that's a mid 1990s-era hack that was done because we didn't know
> any better ways to scale global algorithms back then. 
> It's not a scalable algorithm, it's a global algorithm that
> has been sharded a few times to try to keep global scope operations
> apart. The moment you have subset collisions because of a specific
> workload pattern, then the scalability problem comes straight back.
> 

It is a tradeoff, as I said. Avoidance of bitlock-associated trouble is
the only reason I considered it.

> Your patch, however, just converts *some* of the lookup API
> operations to use RCU. It adds complexity for things like inserts
> which are going to need inode hash locking if the RCU lookup fails,
> anyway.
> 

You may notice the patch only trivially alters the insertion code,
because RCU awareness in the hash was already present.

> Hence your patch optimises the case where the inode is in cache but
> the dentry isn't, but we'll still get massive contention on lookup
> when the RCU lookup on the inode cache and inserts are always going
> to be required.

While it is true that the particular case gets sped up, that's not the
only thing that happens, and should you take a look at the synthetic
benchmark I'm running you will see it does not even test that -- it is
very heavily cache-busting.

The win comes from going from a back-to-back lock acquire (a well
known, terribly performing pattern) to just one.

> IOWs, even RCU lookups are not going to prevent inode hash lock
> contention for parallel cold cache lookups. Hence, with RCU,
> applications are going to see unpredictable contention behaviour
> dependent on the memory footprint of the caches at the time of the
> lookup. Users will have no way of predicting when the behaviour will
> change, let alone have any way of mitigating it. Unpredictable
> variable behaviour is the thing we want to avoid the most with core
> OS caches.

Of course it's still contended; I just claim it happens less.

> 
> IOWs, the hash-bl solution is superior to RCU lookups for many
> reasons. RCU lookups might be fast, but it's not the best solution
> to every problem that requires scalable behaviour.
> 

Of course hash-bl is faster, I never claimed otherwise.

I did claim my patch is trivial and provides a nice win for little work
in exchange. See below for a continuation.

> > All that said, if someone(tm) wants to pick up your patchset and even
> > commit it the way it is right now I'm not going to protest anything.
> >
> > I don't have time nor justification to do full work My Way(tm).
> 
> You *always* say this sort of thing when someone asks you to do
> something different.
> 
> If you don't agree with the reviewer's comments, you make some
> statement about how you "don't care enough to fix it properly" or
> that you don't have time to fix it properly, or that you think the
> reviewing is asking you to "fix everything" rather than just one
> line in your patch so you reject all the reviewers comment, or you
> say "if the maintainers want something" as a way of saying "you
> don't have any authority, so I'm not going to listen to you at all",
> or ....
> 

By any chance, is your idea of me claiming the reviewer is asking to
"fix everything" based on my exchange with Christoph Hellwig concerning
v2 of the patch? Because I pretty clearly did not say anything of the
sort, even though he might have taken it that way.

Anyhow, you do understand I'm not getting paid to do this work?

So here is how I see it: the inode hash is a problem; there is one
patchset which solves it, but it is stalled. Seeing that nobody is
working on the problem and that there is an easy-to-code speedup, I
hack it up, bench it and submit.

This is where you come in suggesting I carry hash-bl across the finish
line instead, despite my submission explicitly stating limited interest
in devoting time to the area.

I write an explicit explanation why I'm not interested in doing it.

This is where you should stop instead of complaining I'm not picking up
your thing.

> > I did have time to hack up the initial rcu lookup, which imo does not
> > require much to justify inclusion and does help the case I'm interested
> > in.
> 
> I thin kit requires a lot more justification than you have given,
> especially given the fact changing memory pressure, access patterns
> and cache balance will have a great impact on whether the RCU path
> makes any different to inode hash lock contention at all.
> 

Per my explanation above, the win is not from iget getting away
without taking the hash lock, but from taking it only once instead of
twice. The commit message explicitly states this at the start.

New inodes continuously land in the hash in my benchmark.

> The hash-bl conversion is a much a better solution and the
> work is mostly done. If you want your workload case to perform
> better, then please pick up the inode hash lock conversion to
> hash-bl.
> 

It is a much faster thing, but it also requires much more work to get in.
I am not interested in putting in that work at this moment.

That aside, one mailing list over, there is another person arguing
that the hash in its current form needs to be eliminated altogether in
favor of rhashtables, and they may very well be right. Whoever picks
this up will probably have to argue against that too.

All that said, my willingness to touch the ordeal for the time being
is limited to maybe one round of cosmetic fixups to post a v4 of my
thing.

If the patch is not wanted, or someone decides to get hash-bl in, or
rhashtable, or something completely different, and my patch is whacked,
I'm going to have 0 complaints. I can't stress enough how side-questy
this particular thing is at the moment.
Jan Kara June 11, 2024, 11:28 a.m. UTC | #7
On Tue 11-06-24 20:17:16, Dave Chinner wrote:
> Your patch, however, just converts *some* of the lookup API
> operations to use RCU. It adds complexity for things like inserts
> which are going to need inode hash locking if the RCU lookup fails,
> anyway.
> 
> Hence your patch optimises the case where the inode is in cache but
> the dentry isn't, but we'll still get massive contention on lookup
> when the RCU lookup on the inode cache and inserts are always going
> to be required.
> 
> IOWs, even RCU lookups are not going to prevent inode hash lock
> contention for parallel cold cache lookups. Hence, with RCU,
> applications are going to see unpredictable contention behaviour
> dependent on the memory footprint of the caches at the time of the
> lookup. Users will have no way of predicting when the behaviour will
> change, let alone have any way of mitigating it. Unpredictable
> variable behaviour is the thing we want to avoid the most with core
> OS caches.

I don't believe this is what Mateusz's patches do (but maybe I've terribly
misread them). iget_locked() does:

	spin_lock(&inode_hash_lock);
	inode = find_inode_fast(...);
	spin_unlock(&inode_hash_lock);
	if (inode)
		we are happy and return
	inode = alloc_inode(sb);
	spin_lock(&inode_hash_lock);
	old = find_inode_fast(...)
	the rest of insert code
	spin_unlock(&inode_hash_lock);

And Mateusz got rid of the first lock-unlock pair by teaching
find_inode_fast() to *also* operate under RCU. The second lookup &
insertion stays under inode_hash_lock as it is now. So his
optimization is orthogonal to your hash bit lock improvements AFAICT.
Sure, his optimization just ~halves the lock hold time for the uncached
case (for the cached case it completely eliminates the lock
acquisition, but I agree that is not that interesting), so it is not a
fundamental scalability improvement, but it is still a nice win for a
contended lock AFAICT.

								Honza
Dave Chinner June 11, 2024, 10 p.m. UTC | #8
On Tue, Jun 11, 2024 at 01:28:46PM +0200, Jan Kara wrote:
> On Tue 11-06-24 20:17:16, Dave Chinner wrote:
> > Your patch, however, just converts *some* of the lookup API
> > operations to use RCU. It adds complexity for things like inserts
> > which are going to need inode hash locking if the RCU lookup fails,
> > anyway.
> > 
> > Hence your patch optimises the case where the inode is in cache but
> > the dentry isn't, but we'll still get massive contention on lookup
> > when the RCU lookup on the inode cache and inserts are always going
> > to be required.
> > 
> > IOWs, even RCU lookups are not going to prevent inode hash lock
> > contention for parallel cold cache lookups. Hence, with RCU,
> > applications are going to see unpredictable contention behaviour
> > dependent on the memory footprint of the caches at the time of the
> > lookup. Users will have no way of predicting when the behaviour will
> > change, let alone have any way of mitigating it. Unpredictable
> > variable behaviour is the thing we want to avoid the most with core
> > OS caches.
> 
> I don't believe this is what Mateusz's patches do (but maybe I've terribly
> misread them). iget_locked() does:
> 
> 	spin_lock(&inode_hash_lock);
> 	inode = find_inode_fast(...);
> 	spin_unlock(&inode_hash_lock);
> 	if (inode)
> 		we are happy and return
> 	inode = alloc_inode(sb);
> 	spin_lock(&inode_hash_lock);
> 	old = find_inode_fast(...)
> 	the rest of insert code
> 	spin_unlock(&inode_hash_lock);
> 
> And Mateusz got rid of the first lock-unlock pair by teaching
> find_inode_fast() to *also* operate under RCU. The second lookup &
> insertion stays under inode_hash_lock as it is now.

Yes, I understand that. I also understand what that does to
performance characteristics when memory pressure causes the working
set cache footprint to change. This will result in workloads that
currently hit the fast path falling off it, hitting contention, and
performing no better than they do today.

Remember, the inode hash lock is taken when inodes are evicted from
memory, too, so contention on the inode hash lock will be much worse
when we are cycling inodes through the cache than when we are just
doing hot cache lookups.

> So his optimization is
> orthogonal to your hash bit lock improvements AFAICT.

Not really. RCU for lookups is not necessary when hash-bl is used.
The new APIs and conditional locking changes needed for RCU to work
are not needed with hash-bl. hash-bl scales and performs the same
regardless of whether the workload is cache-hot or cache-cold.

And the work is almost all done already - it just needs someone with
time to polish it for merge.

> Sure his optimization
> just ~halves the lock hold time for uncached cases (for cached it
> completely eliminates the lock acquisition but I agree these are not that
> interesting) so it is not a fundamental scalability improvement but still
> it is a nice win for a contended lock AFAICT.

Yes, but my point is that it doesn't get rid of the scalability
problem - it just kicks it down the road for small machines, and
people with big machines will continue to suffer from the global
lock contention that cycling inodes through the inode cache causes...

That's kinda my point - adding RCU doesn't fix the scalability
problem, it simply hides a specific symptom of the problem *in some
environments* for *some workloads*. hash-bl works for everyone,
everywhere, and for all workloads, doesn't require new APIs or
conditional locking changes, and the work is mostly done. Why take a
poor solution when the same amount of verification effort would
finish off a complete solution?

The hash-bl patchset is out there - I don't have time to finish it,
so anyone who has time to work on inode hash lock scalability issues
is free to take it and work on it. I may have written it, but I
don't care who gets credit for getting it merged. Again: why take
a poor solution just because the person who wants the scalability
problem fixed won't pick up the almost-finished work that has
already been done?

-Dave.
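
For reference, the eviction-side contention mentioned above comes from
the hash removal path, which (in current mainline, and unchanged by this
RFC) still takes the global lock; __remove_inode_hash() is roughly:

	void __remove_inode_hash(struct inode *inode)
	{
		spin_lock(&inode_hash_lock);
		spin_lock(&inode->i_lock);
		hlist_del_init_rcu(&inode->i_hash);
		spin_unlock(&inode->i_lock);
		spin_unlock(&inode_hash_lock);
	}

So with the RCU lookup applied, cold-cache cycling still serializes on
inode_hash_lock for both the insert recheck and the removal.
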
Mateusz Guzik June 13, 2024, 10:50 a.m. UTC | #9
On Wed, Jun 12, 2024 at 08:00:42AM +1000, Dave Chinner wrote:
> On Tue, Jun 11, 2024 at 01:28:46PM +0200, Jan Kara wrote:
> > On Tue 11-06-24 20:17:16, Dave Chinner wrote:
> > > Your patch, however, just converts *some* of the lookup API
> > > operations to use RCU. It adds complexity for things like inserts
> > > which are going to need inode hash locking if the RCU lookup fails,
> > > anyway.
> > > 
> > > Hence your patch optimises the case where the inode is in cache but
> > > the dentry isn't, but we'll still get massive contention on lookup
> > > when the RCU lookup on the inode cache and inserts are always going
> > > to be required.
> > > 
> > > IOWs, even RCU lookups are not going to prevent inode hash lock
> > > contention for parallel cold cache lookups. Hence, with RCU,
> > > applications are going to see unpredictable contention behaviour
> > > dependent on the memory footprint of the caches at the time of the
> > > lookup. Users will have no way of predicting when the behaviour will
> > > change, let alone have any way of mitigating it. Unpredictable
> > > variable behaviour is the thing we want to avoid the most with core
> > > OS caches.
> > 
> > I don't believe this is what Mateusz's patches do (but maybe I've terribly
> > misread them). iget_locked() does:
> > 
> > 	spin_lock(&inode_hash_lock);
> > 	inode = find_inode_fast(...);
> > 	spin_unlock(&inode_hash_lock);
> > 	if (inode)
> > 		we are happy and return
> > 	inode = alloc_inode(sb);
> > 	spin_lock(&inode_hash_lock);
> > 	old = find_inode_fast(...)
> > 	the rest of insert code
> > 	spin_unlock(&inode_hash_lock);
> > 
> > And Mateusz got rid of the first lock-unlock pair by teaching
> > find_inode_fast() to *also* operate under RCU. The second lookup &
> > insertion stays under inode_hash_lock as it is now.
> 
> Yes, I understand that. I also understand what that does to
> performance characteristics when memory pressure causes the working
> set cache footprint to change. This will result in workloads that
> currently hit the fast path falling off it, hitting contention, and
> performing no better than they do today.
> 
> Remember, the inode hash lock is taken when inodes are evicted from
> memory, too, so contention on the inode hash lock will be much worse
> when we are cycling inodes through the cache than when we are just
> doing hot cache lookups.
> 
> > So his optimization is
> > orthogonal to your hash bit lock improvements AFAICT.
> 
> Not really. RCU for lookups is not necessary when hash-bl is used.
> The new APIs and conditional locking changes needed for RCU to work
> are not needed with hash-bl. hash-bl scales and performs the same
> regardless of whether the workload is cache-hot or cache-cold.
> 
> And the work is almost all done already - it just needs someone with
> time to polish it for merge.
> 
> > Sure his optimization
> > just ~halves the lock hold time for uncached cases (for cached it
> > completely eliminates the lock acquisition but I agree these are not that
> > interesting) so it is not a fundamental scalability improvement but still
> > it is a nice win for a contended lock AFAICT.
> 
> Yes, but my point is that it doesn't get rid of the scalability
> problem - it just kicks it down the road for small machines, and
> people with big machines will continue to suffer from the global
> lock contention that cycling inodes through the inode cache causes...
> 
> That's kinda my point - adding RCU doesn't fix the scalability
> problem, it simply hides a specific symptom of the problem *in some
> environments* for *some workloads*. hash-bl works for everyone,
> everywhere, and for all workloads, doesn't require new APIs or
> conditional locking changes, and the work is mostly done. Why take a
> poor solution when the same amount of verification effort would
> finish off a complete solution?
> 
> The hash-bl patchset is out there - I don't have time to finish it,
> so anyone who has time to work on inode hash lock scalability issues
> is free to take it and work on it. I may have written it, but I
> don't care who gets credit for getting it merged. Again: why take
> a poor solution just because the person who wants the scalability
> problem fixed won't pick up the almost-finished work that has
> already been done?
> 

My patch submission explicitly states that it does not fix the
scalability problem but merely reduces it, so we are in agreement on
this bit. My primary point is that, absent a full solution, the
benefit/effort ratio is pretty good, but people are free to disagree
with that.

This conversation is going in circles, so how about this as a way
forward:

1. you could isolate the hash-bl parts from your original patchset (+
rebase) and write a quick rundown of what needs to be done for them to
be committable
2. if you think you can find the time to do the work yourself in the
foreseeable future, you could state a rough timeline (the thing will
probably want to miss this merge cycle anyway, so there is plenty of
time)
3. if you can't commit to the work yourself, you could look for someone
to do it for you. While you suggested that on the list, there were no
takers (for example, someone else could have stepped in after I said
I'm not going to do it, but that did not happen). Perhaps you can prod
people at your dayjob and in whatever non-list spots.

If you can't do the work in the foreseeable future (understandable) and
there are no takers (realistically I suspect there won't be), then you
are going to have to stop opposing my patch on the grounds that hash-bl
exists.

I don't know how much work is needed to sort it out; it is definitely
much more than what was needed for my thing, which is part of why I did
not go for hash-bl myself.

Of course there may be reasons not to include my patch regardless of the
state of hash-bl. If you have any, then I suggest you state them for the
vfs folk to consider. If you intend to write that it should not go in
because it does not fully fix the problem, then I note the commit
message both concedes there is a limitation and provides a justification
for inclusion despite it. Merely stating there is still a scalability
ceiling does not address that. Claiming it adds too much complexity for
the reported benefit is imo not legit, but again it is a judgment call
for the vfs folk to make.

Right now the v4 landed in a vfs.inode.rcu branch. It really does not
have to reach master if someone gets hash-bl to a state where the vfs
cabal is happy with it. I don't know if Christian intends to submit it
to master in the upcoming merge cycle; it is perfectly fine with me if
it does not happen. Perhaps it even should not happen if hash-bl gets a
sensible timeline. Even so, should my patch land in master and the
hash-bl work get done at a much later date, there is no difficulty added
to it stemming from my thing -- at worst some trivial editing to resolve
a merge conflict.

And with this message I'm done with the entire ordeal, with two exceptions:
- if there is a bug reported against my patch, I'm going to investigate
- if my patch is stalled in the vfs.inode.rcu branch for weeks and there
  is no replacement in sight (hash-bl, rhashtable, ${whatever_else}), I'm
  going to prod about it

Cheers.
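
To illustrate the contract of the new API before the patch itself
(hypothetical example; the foofs_* names below are made up and not from
any real filesystem): converting a caller to iget5_locked_rcu() means
the @test callback may now run under RCU on an inode that is
concurrently being torn down, so it should only compare fields whose
memory stays valid under RCU and should not assume multi-field
consistency:

	#include <linux/fs.h>

	/* hypothetical fs-private inode */
	struct foofs_inode {
		unsigned long objectid;
		struct inode vfs_inode;
	};

	static inline struct foofs_inode *FOOFS_I(struct inode *inode)
	{
		return container_of(inode, struct foofs_inode, vfs_inode);
	}

	/* may be called under RCU on an inode that is mid-teardown */
	static int foofs_test(struct inode *inode, void *data)
	{
		return READ_ONCE(FOOFS_I(inode)->objectid) == *(unsigned long *)data;
	}

	static int foofs_set(struct inode *inode, void *data)
	{
		FOOFS_I(inode)->objectid = *(unsigned long *)data;
		return 0;
	}

	struct inode *foofs_iget(struct super_block *sb, unsigned long objectid)
	{
		/* was: iget5_locked(sb, objectid, foofs_test, foofs_set, &objectid) */
		return iget5_locked_rcu(sb, objectid, foofs_test, foofs_set,
					&objectid);
	}
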
diff mbox series

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 3a41f83a4ba5..f40b868f491f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -886,36 +886,43 @@  long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
 	return freed;
 }
 
-static void __wait_on_freeing_inode(struct inode *inode);
+static void __wait_on_freeing_inode(struct inode *inode, bool locked);
 /*
  * Called with the inode lock held.
  */
 static struct inode *find_inode(struct super_block *sb,
 				struct hlist_head *head,
 				int (*test)(struct inode *, void *),
-				void *data)
+				void *data, bool locked)
 {
 	struct inode *inode = NULL;
 
+	if (locked)
+		lockdep_assert_held(&inode_hash_lock);
+
+	rcu_read_lock();
 repeat:
-	hlist_for_each_entry(inode, head, i_hash) {
+	hlist_for_each_entry_rcu(inode, head, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		if (!test(inode, data))
 			continue;
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			__wait_on_freeing_inode(inode);
+			__wait_on_freeing_inode(inode, locked);
 			goto repeat;
 		}
 		if (unlikely(inode->i_state & I_CREATING)) {
 			spin_unlock(&inode->i_lock);
+			rcu_read_unlock();
 			return ERR_PTR(-ESTALE);
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
 		return inode;
 	}
+	rcu_read_unlock();
 	return NULL;
 }
 
@@ -924,29 +931,37 @@  static struct inode *find_inode(struct super_block *sb,
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				struct hlist_head *head, unsigned long ino,
+				bool locked)
 {
 	struct inode *inode = NULL;
 
+	if (locked)
+		lockdep_assert_held(&inode_hash_lock);
+
+	rcu_read_lock();
 repeat:
-	hlist_for_each_entry(inode, head, i_hash) {
+	hlist_for_each_entry_rcu(inode, head, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			__wait_on_freeing_inode(inode);
+			__wait_on_freeing_inode(inode, locked);
 			goto repeat;
 		}
 		if (unlikely(inode->i_state & I_CREATING)) {
 			spin_unlock(&inode->i_lock);
+			rcu_read_unlock();
 			return ERR_PTR(-ESTALE);
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
 		return inode;
 	}
+	rcu_read_unlock();
 	return NULL;
 }
 
@@ -1161,7 +1176,7 @@  struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
 
 again:
 	spin_lock(&inode_hash_lock);
-	old = find_inode(inode->i_sb, head, test, data);
+	old = find_inode(inode->i_sb, head, test, data, true);
 	if (unlikely(old)) {
 		/*
 		 * Uhhuh, somebody else created the same inode under us.
@@ -1245,6 +1260,43 @@  struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 }
 EXPORT_SYMBOL(iget5_locked);
 
+/**
+ * iget5_locked_rcu - obtain an inode from a mounted file system
+ *
+ * This is equivalent to iget5_locked, except the @test callback must
+ * tolerate inode not being stable, including being mid-teardown.
+ */
+struct inode *iget5_locked_rcu(struct super_block *sb, unsigned long hashval,
+		int (*test)(struct inode *, void *),
+		int (*set)(struct inode *, void *), void *data)
+{
+	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct inode *inode, *new;
+
+again:
+	inode = find_inode(sb, head, test, data, false);
+	if (inode) {
+		if (IS_ERR(inode))
+			return NULL;
+		wait_on_inode(inode);
+		if (unlikely(inode_unhashed(inode))) {
+			iput(inode);
+			goto again;
+		}
+		return inode;
+	}
+
+	new = alloc_inode(sb);
+	if (new) {
+		new->i_state = 0;
+		inode = inode_insert5(new, hashval, test, set, data);
+		if (unlikely(inode != new))
+			destroy_inode(new);
+	}
+	return inode;
+}
+EXPORT_SYMBOL(iget5_locked_rcu);
+
 /**
  * iget_locked - obtain an inode from a mounted file system
  * @sb:		super block of file system
@@ -1263,9 +1315,7 @@  struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 	struct hlist_head *head = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
 again:
-	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino);
-	spin_unlock(&inode_hash_lock);
+	inode = find_inode_fast(sb, head, ino, false);
 	if (inode) {
 		if (IS_ERR(inode))
 			return NULL;
@@ -1283,7 +1333,7 @@  struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 
 		spin_lock(&inode_hash_lock);
 		/* We released the lock, so.. */
-		old = find_inode_fast(sb, head, ino);
+		old = find_inode_fast(sb, head, ino, true);
 		if (!old) {
 			inode->i_ino = ino;
 			spin_lock(&inode->i_lock);
@@ -1419,13 +1469,31 @@  struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 	struct inode *inode;
 
 	spin_lock(&inode_hash_lock);
-	inode = find_inode(sb, head, test, data);
+	inode = find_inode(sb, head, test, data, true);
 	spin_unlock(&inode_hash_lock);
 
 	return IS_ERR(inode) ? NULL : inode;
 }
 EXPORT_SYMBOL(ilookup5_nowait);
 
+/**
+ * ilookup5_nowait_rcu - search for an inode in the inode cache
+ *
+ * This is equivalent to ilookup5_nowait, except the @test callback must
+ * tolerate inode not being stable, including being mid-teardown.
+ */
+struct inode *ilookup5_nowait_rcu(struct super_block *sb, unsigned long hashval,
+		int (*test)(struct inode *, void *), void *data)
+{
+	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct inode *inode;
+
+	inode = find_inode(sb, head, test, data, false);
+
+	return IS_ERR(inode) ? NULL : inode;
+}
+EXPORT_SYMBOL(ilookup5_nowait_rcu);
+
 /**
  * ilookup5 - search for an inode in the inode cache
  * @sb:		super block of file system to search
@@ -1474,7 +1542,7 @@  struct inode *ilookup(struct super_block *sb, unsigned long ino)
 	struct inode *inode;
 again:
 	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino);
+	inode = find_inode_fast(sb, head, ino, true);
 	spin_unlock(&inode_hash_lock);
 
 	if (inode) {
@@ -2235,17 +2303,21 @@  EXPORT_SYMBOL(inode_needs_sync);
  * wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
  * will DTRT.
  */
-static void __wait_on_freeing_inode(struct inode *inode)
+static void __wait_on_freeing_inode(struct inode *inode, bool locked)
 {
 	wait_queue_head_t *wq;
 	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_hash_lock);
+	rcu_read_unlock();
+	if (locked)
+		spin_unlock(&inode_hash_lock);
 	schedule();
 	finish_wait(wq, &wait.wq_entry);
-	spin_lock(&inode_hash_lock);
+	if (locked)
+		spin_lock(&inode_hash_lock);
+	rcu_read_lock();
 }
 
 static __initdata unsigned long ihash_entries;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0283cf366c2a..2817c915d355 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3021,6 +3021,9 @@  extern void d_mark_dontcache(struct inode *inode);
 extern struct inode *ilookup5_nowait(struct super_block *sb,
 		unsigned long hashval, int (*test)(struct inode *, void *),
 		void *data);
+extern struct inode *ilookup5_nowait_rcu(struct super_block *sb,
+		unsigned long hashval, int (*test)(struct inode *, void *),
+		void *data);
 extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data);
 extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
@@ -3029,7 +3032,12 @@  extern struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
 		int (*test)(struct inode *, void *),
 		int (*set)(struct inode *, void *),
 		void *data);
-extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
+extern struct inode * iget5_locked(struct super_block *, unsigned long,
+				   int (*test)(struct inode *, void *),
+				   int (*set)(struct inode *, void *), void *);
+extern struct inode * iget5_locked_rcu(struct super_block *, unsigned long,
+				       int (*test)(struct inode *, void *),
+				       int (*set)(struct inode *, void *), void *);
 extern struct inode * iget_locked(struct super_block *, unsigned long);
 extern struct inode *find_inode_nowait(struct super_block *,
 				       unsigned long,