vfs: move d_lockref out of the area used by RCU lookup

Message ID: 20240612164715.614843-1-mjguzik@gmail.com (mailing list archive)
State: New
Series: vfs: move d_lockref out of the area used by RCU lookup

Headers

Received: from mail-lf1-f48.google.com (mail-lf1-f48.google.com [209.85.167.48])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id CDB9C17FAB2;
	Wed, 12 Jun 2024 16:47:23 +0000 (UTC)
From: Mateusz Guzik <mjguzik@gmail.com>
To: brauner@kernel.org
Cc: viro@zeniv.linux.org.uk,
	jack@suse.cz,
	linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	Mateusz Guzik <mjguzik@gmail.com>
Subject: [PATCH] vfs: move d_lockref out of the area used by RCU lookup
Date: Wed, 12 Jun 2024 18:47:15 +0200
Message-ID: <20240612164715.614843-1-mjguzik@gmail.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Commit Message

Mateusz Guzik June 12, 2024, 4:47 p.m. UTC

Stock kernel scales worse than FreeBSD when doing a 20-way stat(2) on
the same tmpfs-backed file.

According to perf top:
  38.09%  [kernel]              [k] lockref_put_return
  26.08%  [kernel]              [k] lockref_get_not_dead
  25.60%  [kernel]              [k] __d_lookup_rcu
   0.89%  [kernel]              [k] clear_bhb_loop

__d_lookup_rcu is participating in cacheline ping pong due to the
embedded name sharing a cacheline with lockref.
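
To make the sharing concrete, here is a minimal sketch that checks whether
two fields of a struct touch a common cacheline. The structs are hypothetical
stand-ins, not the real struct dentry: the field sizes are picked only so
that the offsets agree with the cacheline comments in the diff (inline name
ending 32 bytes past the 64-byte boundary, lockref landing at 128 after the
move). It also assumes 64-byte cachelines and a cacheline-aligned object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64	/* assumption: 64-byte lines, as in the patch comments */

/* Hypothetical stand-ins for the layout, NOT the real struct dentry. */
struct toy_before {
	unsigned char head[56];	/* fields ahead of the inline name */
	unsigned char name[40];	/* inline name, read by RCU lookup */
	uint64_t lockref;	/* written on every ref get/put */
	uint64_t other[4];	/* stand-ins for d_op, d_sb, d_time, d_fsdata */
};

struct toy_after {
	unsigned char head[56];
	unsigned char name[40];
	uint64_t other[4];
	uint64_t lockref;	/* moved behind the read-mostly fields */
};

/* The sizes above are chosen so the offsets match the patch comments. */
static_assert(offsetof(struct toy_before, lockref) == 96,
	      "name ends 32 bytes past the 64-byte boundary");
static_assert(offsetof(struct toy_after, lockref) == 128,
	      "lockref starts at the 128-byte boundary");

/* Return 1 if byte ranges [a, a+alen) and [b, b+blen) touch a common line. */
static int shares_line(size_t a, size_t alen, size_t b, size_t blen)
{
	size_t a_first = a / CACHELINE, a_last = (a + alen - 1) / CACHELINE;
	size_t b_first = b / CACHELINE, b_last = (b + blen - 1) / CACHELINE;

	return a_first <= b_last && b_first <= a_last;
}

/* Does the inline name share a line with lockref in the old layout? */
static int name_shares_line_before(void)
{
	return shares_line(offsetof(struct toy_before, name), 40,
			   offsetof(struct toy_before, lockref), 8);
}

/* Same question for the layout with lockref moved. */
static int name_shares_line_after(void)
{
	return shares_line(offsetof(struct toy_after, name), 40,
			   offsetof(struct toy_after, lockref), 8);
}
```

In the "before" toy layout the tail of the name and the lockref both sit in
the second cacheline, so every refcount write invalidates a line the lookup
has to read. In the "after" layout they no longer overlap.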

Moving it out resolves the problem:
  41.50%  [kernel]                  [k] lockref_put_return
  41.03%  [kernel]                  [k] lockref_get_not_dead
   1.54%  [kernel]                  [k] clear_bhb_loop

benchmark (will-it-scale, Sapphire Rapids, tmpfs, ops/s):
FreeBSD:	7219334
before:	5038006
after:	7842883 (+55%)

One minor remark: the 'after' result is unstable, fluctuating between
~7.8 million and ~9 million ops/s across restarts of the test. I picked
the lower bound.

An important remark: the lockref API has a deficiency where, if the
spinlock is taken for any reason while there is a continuous stream of
incs/decs, it never recovers back to the atomic op path -- everyone stays
stuck taking the lock. I used to run into it on occasion when spawning
'perf top' while benchmarking, but now that the pressure on lockref itself
has increased I see it at random during plain benchmarking.

It looks like this:
min:308703 max:429561 total:8217844	<-- nice start
min:152207 max:178380 total:3501879	<-- things are degrading
min:65563 max:70106 total:1349677	<-- everyone is stuck locking
min:69001 max:72873 total:1424714
min:68993 max:73084 total:1425902

The fix would be to add a variant which waits for the lock to be released
for some number of spins and only takes it after that, which still
guarantees forward progress. I'm going to look into it. I mention it in
the commit message in case someone runs into it as is.
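
A rough userspace sketch of that idea, using C11 atomics. This is not the
kernel's lockref code and not the eventual fix: the toy_* names, the
single-word layout with the lock in the top bit, and the spin limit are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_LOCKED	(UINT64_C(1) << 63)	/* hypothetical: top bit = lock held */
#define TOY_SPIN_LIMIT	128			/* hypothetical tunable */

struct toy_lockref {
	_Atomic uint64_t word;	/* lock bit and count packed in one word */
};

/*
 * Fast path: bump the count with cmpxchg while the lock bit is clear.
 * If the lock is held, wait a bounded number of spins for it to be
 * released instead of giving up immediately. Returns false if the lock
 * was still held once the spin budget ran out.
 */
static bool toy_get_fast(struct toy_lockref *ref)
{
	uint64_t old = atomic_load_explicit(&ref->word, memory_order_relaxed);
	int spins = 0;

	for (;;) {
		if (old & TOY_LOCKED) {
			if (++spins > TOY_SPIN_LIMIT)
				return false;
			/* real code would insert a pause/cpu_relax hint here */
			old = atomic_load_explicit(&ref->word,
						   memory_order_relaxed);
			continue;
		}
		if (atomic_compare_exchange_weak_explicit(&ref->word, &old,
				old + 1, memory_order_acquire,
				memory_order_relaxed))
			return true;
		/* a failed cmpxchg refreshed 'old'; go around again */
	}
}

/* Slow path: take the lock bit, bump the count, release the lock. */
static void toy_get_slow(struct toy_lockref *ref)
{
	uint64_t old;

	for (;;) {
		/* expect the unlocked form of the current value */
		old = atomic_load_explicit(&ref->word, memory_order_relaxed)
		      & ~TOY_LOCKED;
		if (atomic_compare_exchange_weak_explicit(&ref->word, &old,
				old | TOY_LOCKED, memory_order_acquire,
				memory_order_relaxed))
			break;
	}
	assert(((old + 1) & TOY_LOCKED) == 0);	/* count must not reach the lock bit */
	/* storing the new count with the lock bit clear also unlocks */
	atomic_store_explicit(&ref->word, old + 1, memory_order_release);
}

/* Bounded wait first, lock only afterwards so progress is guaranteed. */
static void toy_get(struct toy_lockref *ref)
{
	if (!toy_get_fast(ref))
		toy_get_slow(ref);
}
```

The point of the bounded wait is that a short-lived lock holder no longer
pushes every concurrent caller into the locked path; callers fall back to
the lock only if the holder does not go away within the spin budget.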

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 include/linux/dcache.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Mateusz Guzik June 12, 2024, 6:27 p.m. UTC | #1

While I 100% stand behind the patch, I found that the lockref issue
mentioned below reproduces almost reliably on my box if given enough
time, so I'm going to need to fix it first.

As such, consider this posting a heads-up; don't pull it.

On Wed, Jun 12, 2024 at 6:47 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Stock kernel scales worse than FreeBSD when doing a 20-way stat(2) on
> the same tmpfs-backed file.
>
> According to perf top:
>   38.09%  [kernel]              [k] lockref_put_return
>   26.08%  [kernel]              [k] lockref_get_not_dead
>   25.60%  [kernel]              [k] __d_lookup_rcu
>    0.89%  [kernel]              [k] clear_bhb_loop
>
> __d_lookup_rcu is participating in cacheline ping pong due to the
> embedded name sharing a cacheline with lockref.
>
> Moving it out resolves the problem:
>   41.50%  [kernel]                  [k] lockref_put_return
>   41.03%  [kernel]                  [k] lockref_get_not_dead
>    1.54%  [kernel]                  [k] clear_bhb_loop
>
> benchmark (will-it-scale, Sapphire Rapids, tmpfs, ops/s):
> FreeBSD:7219334
> before: 5038006
> after:  7842883 (+55%)
>
> One minor remark: the 'after' result is unstable, fluctuating between
> ~7.8 mln and ~9 mln between restarts of the test. I picked the lower
> bound.
>
> An important remark: lockref API has a deficiency where if the spinlock
> is taken for any reason and there is a continuous stream of incs/decs,
> it will never recover back to atomic op -- everyone will be stuck taking
> the lock. I used to run into it on occasion when spawning 'perf top'
> while benchmarking, but now that the pressure on lockref itself is
> increased I randomly see it merely when benchmarking.
>
> It looks like this:
> min:308703 max:429561 total:8217844     <-- nice start
> min:152207 max:178380 total:3501879     <-- things are degrading
> min:65563 max:70106 total:1349677       <-- everyone is stuck locking
> min:69001 max:72873 total:1424714
> min:68993 max:73084 total:1425902
>
> The fix would be to add a variant which will wait for the lock to be
> released for some number of spins, and only take it after to still
> guarantee forward progress. I'm going to look into it. Mentioned in the
> commit message if someone runs into it as is.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>  include/linux/dcache.h | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index bf53e3894aae..326dbccc3736 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -89,13 +89,18 @@ struct dentry {
>         struct inode *d_inode;          /* Where the name belongs to - NULL is
>                                          * negative */
>         unsigned char d_iname[DNAME_INLINE_LEN];        /* small names */
> +       /* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
>
>         /* Ref lookup also touches following */
> -       struct lockref d_lockref;       /* per-dentry lock and refcount */
>         const struct dentry_operations *d_op;
>         struct super_block *d_sb;       /* The root of the dentry tree */
>         unsigned long d_time;           /* used by d_revalidate */
>         void *d_fsdata;                 /* fs-specific data */
> +       /* --- cacheline 2 boundary (128 bytes) --- */
> +       struct lockref d_lockref;       /* per-dentry lock and refcount
> +                                        * keep separate from RCU lookup area if
> +                                        * possible!
> +                                        */
>
>         union {
>                 struct list_head d_lru;         /* LRU list */
> --
> 2.43.0
>

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index bf53e3894aae..326dbccc3736 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -89,13 +89,18 @@  struct dentry {
 	struct inode *d_inode;		/* Where the name belongs to - NULL is
 					 * negative */
 	unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */
+	/* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */
 
 	/* Ref lookup also touches following */
-	struct lockref d_lockref;	/* per-dentry lock and refcount */
 	const struct dentry_operations *d_op;
 	struct super_block *d_sb;	/* The root of the dentry tree */
 	unsigned long d_time;		/* used by d_revalidate */
 	void *d_fsdata;			/* fs-specific data */
+	/* --- cacheline 2 boundary (128 bytes) --- */
+	struct lockref d_lockref;	/* per-dentry lock and refcount
+					 * keep separate from RCU lookup area if
+					 * possible!
+					 */
 
 	union {
 		struct list_head d_lru;		/* LRU list */
