diff mbox series

fs: place f_ref to 3rd cache line in struct file to resolve false sharing

Message ID 20250228020059.3023375-1-pan.deng@intel.com (mailing list archive)
State New
Headers show
Series fs: place f_ref to 3rd cache line in struct file to resolve false sharing | expand

Commit Message

Pan Deng Feb. 28, 2025, 2 a.m. UTC
When running syscall pread in a high core count system, f_ref contends
with the reading of f_mode, f_op, f_mapping, f_inode, f_flags in the
same cache line.

This change places f_ref to the 3rd cache line where fields are not
updated as frequently as the 1st cache line, and the contention is
grealy reduced according to tests. In addition, the size of file
object is kept in 3 cache lines.

This change has been tested with rocksdb benchmark readwhilewriting case
in 1 socket 64 physical core 128 logical core baremetal machine, with
build config CONFIG_RANDSTRUCT_NONE=y
Command:
./db_bench --benchmarks="readwhilewriting" --threads $cnt --duration 60
The throughput(ops/s) is improved up to ~21%.
=====
thread		baseline	compare
16		 100%		 +1.3%
32		 100%		 +2.2%
64		 100%		 +7.2%
128		 100%		 +20.9%

It was also tested with UnixBench: syscall, fsbuffer, fstime,
fsdisk cases that has been used for file struct layout tuning, no
regression was observed.

Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Lipeng Zhu <lipeng.zhu@intel.com>
---
 include/linux/fs.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Christian Brauner Feb. 28, 2025, 10:15 a.m. UTC | #1
On Fri, 28 Feb 2025 10:00:59 +0800, Pan Deng wrote:
> When running syscall pread in a high core count system, f_ref contends
> with the reading of f_mode, f_op, f_mapping, f_inode, f_flags in the
> same cache line.
> 
> This change places f_ref to the 3rd cache line where fields are not
> updated as frequently as the 1st cache line, and the contention is
> grealy reduced according to tests. In addition, the size of file
> object is kept in 3 cache lines.
> 
> [...]

Applied to the vfs-6.15.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.15.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.15.misc

[1/1] fs: place f_ref to 3rd cache line in struct file to resolve false sharing
      https://git.kernel.org/vfs/vfs/c/6a9ebf00c3be
Jan Kara Feb. 28, 2025, 7:51 p.m. UTC | #2
On Fri 28-02-25 10:00:59, Pan Deng wrote:
> When running syscall pread in a high core count system, f_ref contends
> with the reading of f_mode, f_op, f_mapping, f_inode, f_flags in the
> same cache line.

Well, but you have to have mulithreaded process using the same struct file
for the IO, don't you? Otherwise f_ref is not touched...

> This change places f_ref to the 3rd cache line where fields are not
> updated as frequently as the 1st cache line, and the contention is
> grealy reduced according to tests. In addition, the size of file
> object is kept in 3 cache lines.
> 
> This change has been tested with rocksdb benchmark readwhilewriting case
> in 1 socket 64 physical core 128 logical core baremetal machine, with
> build config CONFIG_RANDSTRUCT_NONE=y
> Command:
> ./db_bench --benchmarks="readwhilewriting" --threads $cnt --duration 60
> The throughput(ops/s) is improved up to ~21%.
> =====
> thread		baseline	compare
> 16		 100%		 +1.3%
> 32		 100%		 +2.2%
> 64		 100%		 +7.2%
> 128		 100%		 +20.9%
> 
> It was also tested with UnixBench: syscall, fsbuffer, fstime,
> fsdisk cases that has been used for file struct layout tuning, no
> regression was observed.

So overall keeping the first cacheline read mostly with important stuff
makes sense to limit cache traffic. But:

>  struct file {
> -	file_ref_t			f_ref;
>  	spinlock_t			f_lock;
>  	fmode_t				f_mode;
>  	const struct file_operations	*f_op;
> @@ -1102,6 +1101,7 @@ struct file {
>  	unsigned int			f_flags;
>  	unsigned int			f_iocb_flags;
>  	const struct cred		*f_cred;
> +	u8				padding[8];
>  	/* --- cacheline 1 boundary (64 bytes) --- */
>  	struct path			f_path;
>  	union {
> @@ -1127,6 +1127,7 @@ struct file {
>  		struct file_ra_state	f_ra;
>  		freeptr_t		f_freeptr;
>  	};
> +	file_ref_t			f_ref;
>  	/* --- cacheline 3 boundary (192 bytes) --- */
>  } __randomize_layout

This keeps struct file within 3 cachelines but it actually grows it from
184 to 192 bytes (and yes, that changes how many file structs we can fit in
a slab). So instead of adding 8 bytes of padding, just pick some
read-mostly element and move it into the hole - f_owner looks like one
possible candidate.

Also did you test how moving f_ref to the second cache line instead of the
third one behaves?

								Honza
Christian Brauner March 1, 2025, 10:07 a.m. UTC | #3
On Fri, Feb 28, 2025 at 08:51:27PM +0100, Jan Kara wrote:
> On Fri 28-02-25 10:00:59, Pan Deng wrote:
> > When running syscall pread in a high core count system, f_ref contends
> > with the reading of f_mode, f_op, f_mapping, f_inode, f_flags in the
> > same cache line.
> 
> Well, but you have to have mulithreaded process using the same struct file
> for the IO, don't you? Otherwise f_ref is not touched...

Yes, it's specifically designed to scale better under high contention.

> 
> > This change places f_ref to the 3rd cache line where fields are not
> > updated as frequently as the 1st cache line, and the contention is
> > grealy reduced according to tests. In addition, the size of file
> > object is kept in 3 cache lines.
> > 
> > This change has been tested with rocksdb benchmark readwhilewriting case
> > in 1 socket 64 physical core 128 logical core baremetal machine, with
> > build config CONFIG_RANDSTRUCT_NONE=y
> > Command:
> > ./db_bench --benchmarks="readwhilewriting" --threads $cnt --duration 60
> > The throughput(ops/s) is improved up to ~21%.
> > =====
> > thread		baseline	compare
> > 16		 100%		 +1.3%
> > 32		 100%		 +2.2%
> > 64		 100%		 +7.2%
> > 128		 100%		 +20.9%
> > 
> > It was also tested with UnixBench: syscall, fsbuffer, fstime,
> > fsdisk cases that has been used for file struct layout tuning, no
> > regression was observed.
> 
> So overall keeping the first cacheline read mostly with important stuff
> makes sense to limit cache traffic. But:
> 
> >  struct file {
> > -	file_ref_t			f_ref;
> >  	spinlock_t			f_lock;
> >  	fmode_t				f_mode;
> >  	const struct file_operations	*f_op;
> > @@ -1102,6 +1101,7 @@ struct file {
> >  	unsigned int			f_flags;
> >  	unsigned int			f_iocb_flags;
> >  	const struct cred		*f_cred;
> > +	u8				padding[8];
> >  	/* --- cacheline 1 boundary (64 bytes) --- */
> >  	struct path			f_path;
> >  	union {
> > @@ -1127,6 +1127,7 @@ struct file {
> >  		struct file_ra_state	f_ra;
> >  		freeptr_t		f_freeptr;
> >  	};
> > +	file_ref_t			f_ref;
> >  	/* --- cacheline 3 boundary (192 bytes) --- */
> >  } __randomize_layout
> 
> This keeps struct file within 3 cachelines but it actually grows it from
> 184 to 192 bytes (and yes, that changes how many file structs we can fit in
> a slab). So instead of adding 8 bytes of padding, just pick some
> read-mostly element and move it into the hole - f_owner looks like one
> possible candidate.

This is what I did. See vfs-6.15.misc! Thanks!

> 
> Also did you test how moving f_ref to the second cache line instead of the
> third one behaves?
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
diff mbox series

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c3b2f8a621f..63b402cc9219 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1067,7 +1067,6 @@  static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
 
 /**
  * struct file - Represents a file
- * @f_ref: reference count
  * @f_lock: Protects f_ep, f_flags. Must not be taken from IRQ context.
  * @f_mode: FMODE_* flags often used in hotpaths
  * @f_op: file operations
@@ -1090,9 +1089,9 @@  static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
  * @f_llist: work queue entrypoint
  * @f_ra: file's readahead state
  * @f_freeptr: Pointer used by SLAB_TYPESAFE_BY_RCU file cache (don't touch.)
+ * @f_ref: reference count
  */
 struct file {
-	file_ref_t			f_ref;
 	spinlock_t			f_lock;
 	fmode_t				f_mode;
 	const struct file_operations	*f_op;
@@ -1102,6 +1101,7 @@  struct file {
 	unsigned int			f_flags;
 	unsigned int			f_iocb_flags;
 	const struct cred		*f_cred;
+	u8				padding[8];
 	/* --- cacheline 1 boundary (64 bytes) --- */
 	struct path			f_path;
 	union {
@@ -1127,6 +1127,7 @@  struct file {
 		struct file_ra_state	f_ra;
 		freeptr_t		f_freeptr;
 	};
+	file_ref_t			f_ref;
 	/* --- cacheline 3 boundary (192 bytes) --- */
 } __randomize_layout
   __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */