
[v3] btrfs: try to search for data csums in commit root

Message ID 01721e6680b4a05c06cea8afc98b1726102ba5f5.1728947030.git.boris@bur.io (mailing list archive)
State New, archived
Series [v3] btrfs: try to search for data csums in commit root

Commit Message

Boris Burkov Oct. 14, 2024, 11:08 p.m. UTC
If you run a workload like:
- a cgroup that does tons of data reading, with a harsh memory limit
- a second cgroup that tries to write new files
e.g.:
https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh

then what quickly occurs is:
- a high degree of contention on the csum root node eb rwsem
- memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
  as readers, but not scheduling promptly. i.e., task __state == 0, but
  not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.

This results in VERY long transactions. On my test system, that script
produces 20-30s long transaction commits. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).

This reproducer is a bit silly, as the villainous cgroup is using almost
all of its memory.max for kernel memory (specifically pagetables), but it
sort of doesn't matter, as I am most interested in the btrfs locking
behavior. It isn't an academic problem, as we see this exact problem in
production at Meta with one cgroup over memory.max ruining btrfs
performance for the whole system.

The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known. e.g.
https://lpc.events/event/18/contributions/1883/

As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems. In the case of the csum tree, we can
either redesign btree locking (hard...) or try to use the commit root
when we can. Luckily, it seems likely that many reads are for old extents
written many transactions ago, and that for those we *can* in fact
search the commit root!

This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. Luckily, we don't need this flag in the bio after
submitting, so we can save space by setting it in bbio->bio.bi_flags
and clearing it before submitting, leaving the block layer unaffected.

When we go to look up the csums in btrfs_lookup_bio_sums() we can check this
condition on the btrfs_bio and do the commit root lookup accordingly.

With the fix, on that same test case, commit latencies no longer exceed
~400ms on my system.

Signed-off-by: Boris Burkov <boris@bur.io>
---
Changelog:
v3:
- add some simple machinery for setting/getting/clearing btrfs private
  flags in bi_flags
- clear those flags before bio_submit (ensure no-op wrt block layer)
- store the csum commit root flag there to save space
v2:
- hold the commit_root_sem for the duration of the entire lookup, not
  just per btrfs_search_slot. Note that we can't naively do the thing
  where we release the lock every loop as that is exactly what we are
  trying to avoid. Theoretically, we could re-grab the lock and fully
  start over if the lock is write contended or something. I suspect the
  rwsem fairness features will let the commit writer get it fast enough
  anyway.


---
 fs/btrfs/bio.c       | 20 ++++++++++++++++++++
 fs/btrfs/bio.h       |  7 +++++++
 fs/btrfs/extent_io.c | 21 +++++++++++++++++++++
 fs/btrfs/file-item.c | 10 ++++++++++
 4 files changed, 58 insertions(+)

Comments

David Sterba Oct. 15, 2024, 3:43 p.m. UTC | #1
On Mon, Oct 14, 2024 at 04:08:31PM -0700, Boris Burkov wrote:
> If you run a workload like:
> - a cgroup that does tons of data reading, with a harsh memory limit
> - a second cgroup that tries to write new files
> e.g.:
> https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh
> 
> then what quickly occurs is:
> - a high degree of contention on the csum root node eb rwsem
> - memory starved cgroup doing tons of reclaim on CPU.
> - many reader threads in the memory starved cgroup "holding" the sem
>   as readers, but not scheduling promptly. i.e., task __state == 0, but
>   not running on a cpu.
> - btrfs_commit_transaction stuck trying to acquire the sem as a writer.
> 
> This results in VERY long transactions. On my test system, that script
> produces 20-30s long transaction commits. This then results in
> seriously degraded performance for any cgroup using the filesystem (the
> victim cgroup in the script).
> 
> This reproducer is a bit silly, as the villainous cgroup is using almost
> all of its memory.max for kernel memory (specifically pagetables) but it
> sort of doesn't matter, as I am most interested in the btrfs locking
> behavior. It isn't an academic problem, as we see this exact problem in
> production at Meta with one cgroup over memory.max ruining btrfs
> performance for the whole system.
> 
> The underlying scheduling "problem" with global rwsems is sort of thorny
> and apparently well known. e.g.
> https://lpc.events/event/18/contributions/1883/
> 
> As a result, our main lever in the short term is just trying to reduce
> contention on our various rwsems. In the case of the csum tree, we can
> either redesign btree locking (hard...) or try to use the commit root
> when we can. Luckily, it seems likely that many reads are for old extents
> written many transactions ago, and that for those we *can* in fact
> search the commit root!
> 
> This change detects when we are trying to read an old extent (according
> to extent map generation) and then wires that through bio_ctrl to the
> btrfs_bio, which unfortunately isn't allocated yet when we have this
> information. Luckily, we don't need this flag in the bio after
> submitting, so we can save space by setting it on bbio->bio.bi_flags
> and clear before submitting, so the block layer is unaffected.
> 
> When we go to lookup the csums in lookup_bio_sums we can check this
> condition on the btrfs_bio and do the commit root lookup accordingly.
> 
> With the fix, on that same test case, commit latencies no longer exceed
> ~400ms on my system.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> Changelog:
> v3:
> - add some simple machinery for setting/getting/clearing btrfs private
>   flags in bi_flags
> - clear those flags before bio_submit (ensure no-op wrt block layer)
> - store the csum commit root flag there to save space
> v2:
> - hold the commit_root_sem for the duration of the entire lookup, not
>   just per btrfs_search_slot. Note that we can't naively do the thing
>   where we release the lock every loop as that is exactly what we are
>   trying to avoid. Theoretically, we could re-grab the lock and fully
>   start over if the lock is write contended or something. I suspect the
>   rwsem fairness features will let the commit writer get it fast enough
>   anyway.
> 
> 
> ---
>  fs/btrfs/bio.c       | 20 ++++++++++++++++++++
>  fs/btrfs/bio.h       |  7 +++++++
>  fs/btrfs/extent_io.c | 21 +++++++++++++++++++++
>  fs/btrfs/file-item.c | 10 ++++++++++
>  4 files changed, 58 insertions(+)
> 
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 5d3f8bd406d9..24c159ef3854 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -71,6 +71,25 @@ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
>  	return bbio;
>  }
>  
> +void btrfs_bio_set_private_flag(struct btrfs_bio *bbio, unsigned short flag) {
> +	ASSERT(flag & BTRFS_BIO_PRIVATE_FLAG_MASK);
> +	bbio->bio.bi_flags |= flag;
> +}
> +
> +void btrfs_bio_clear_private_flag(struct btrfs_bio *bbio, unsigned short flag) {
> +	ASSERT(flag & BTRFS_BIO_PRIVATE_FLAG_MASK);
> +	bbio->bio.bi_flags &= ~flag;
> +}
> +
> +void btrfs_bio_clear_private_flags(struct btrfs_bio *bbio) {
> +	bbio->bio.bi_flags &= ~BTRFS_BIO_PRIVATE_FLAG_MASK;
> +}
> +
> +bool btrfs_bio_private_flagged(struct btrfs_bio *bbio, unsigned short flag) {
> +	ASSERT(flag & BTRFS_BIO_PRIVATE_FLAG_MASK);
> +	return bbio->bio.bi_flags & flag;
> +}

This is good, open coding the flag updates is trivial but the assertions
make it much better.

Though using two almost identical names for clearing the private flag,
one with an assertion and one without, is confusing and hard to spot in
the code. Also, I don't see where btrfs_bio_clear_private_flag() is used
at all.

> +
>  static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
>  					 struct btrfs_bio *orig_bbio,
>  					 u64 map_length)
> @@ -493,6 +512,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
>  static void btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
>  			     struct btrfs_io_stripe *smap, int mirror_num)
>  {
> +	btrfs_bio_clear_private_flags(btrfs_bio(bio));
>  	if (!bioc) {
>  		/* Single mirror read/write fast path. */
>  		btrfs_bio(bio)->mirror_num = mirror_num;
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index e48612340745..749004ffdc1c 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -101,6 +101,13 @@ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
>  				  btrfs_bio_end_io_t end_io, void *private);
>  void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status);
>  
> +#define BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT	1U << (BIO_FLAG_LAST + 1)

All expressions in macros should be in ( ), namely when there's a
potential for operator precedence change like with "<<" after the macro
is expanded in some code.
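
For example, fully parenthesized the definition would read:

#define BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT	(1U << (BIO_FLAG_LAST + 1))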

> +#define BTRFS_BIO_PRIVATE_FLAG_MASK	(BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT)

Do you plan to add more such flags to private? This looks like one more
level of abstraction than we need for this optimization. This could be
simply used as BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT. And for that
reason the helpers do not need to sound generic that it manipulates
'private', the names can be btrfs_bio_set_csum_search_commit_root(),
which is IMHO expressing the same semantics.

> +void btrfs_bio_set_private_flag(struct btrfs_bio *bbio, unsigned short flag);
> +void btrfs_bio_clear_private_flag(struct btrfs_bio *bbio, unsigned short flag);
> +void btrfs_bio_clear_private_flags(struct btrfs_bio *bbio);
> +bool btrfs_bio_private_flagged(struct btrfs_bio *bbio, unsigned short flag);
> +
>  /* Submit using blkcg_punt_bio_submit. */
>  #define REQ_BTRFS_CGROUP_PUNT			REQ_FS_PRIVATE
>  
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 09c0d18a7b5a..b1b5dce05728 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -108,6 +108,21 @@ struct btrfs_bio_ctrl {
>  	 * This is to avoid touching ranges covered by compression/inline.
>  	 */
>  	unsigned long submit_bitmap;
> +	/*
> +	 * If this is a data read bio, we set this to true if it is safe to
> +	 * search for csums in the commit root. Otherwise, it is set to false.
> +	 *
> +	 * This is an optimization to reduce the contention on the csum tree
> +	 * root rwsem. Due to how rwsem is implemented, there is a possible
> +	 * priority inversion where the readers holding the lock don't get
> +	 * scheduled (say they're in a cgroup stuck in heavy reclaim) which
> +	 * then blocks btrfs transactions. The only real help is to try to
> +	 * reduce the contention on the lock as much as we can.
> +	 *
> +	 * Do this by detecting when a data read is reading data from an old
> +	 * transaction so it's safe to look in the commit root.
> +	 */
> +	bool commit_root_csum;

There's a 4 byte hole after 'opf', you can move it there for better
struct packing.
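
As a rough sketch (assuming the surrounding fields quoted above; the exact
layout may differ), the packed placement would look like:

	struct btrfs_bio_ctrl {
		/* ... */
		blk_opf_t opf;
		/* Fits in the 4 byte hole after opf on 64-bit builds. */
		bool commit_root_csum;
		/* ... */
		unsigned long submit_bitmap;
	};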
Boris Burkov Oct. 15, 2024, 6:01 p.m. UTC | #2
On Tue, Oct 15, 2024 at 05:43:20PM +0200, David Sterba wrote:
> On Mon, Oct 14, 2024 at 04:08:31PM -0700, Boris Burkov wrote:
> > If you run a workload like:
> > - a cgroup that does tons of data reading, with a harsh memory limit
> > - a second cgroup that tries to write new files
> > e.g.:
> > https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh
> > 
> > then what quickly occurs is:
> > - a high degree of contention on the csum root node eb rwsem
> > - memory starved cgroup doing tons of reclaim on CPU.
> > - many reader threads in the memory starved cgroup "holding" the sem
> >   as readers, but not scheduling promptly. i.e., task __state == 0, but
> >   not running on a cpu.
> > - btrfs_commit_transaction stuck trying to acquire the sem as a writer.
> > 
> > This results in VERY long transactions. On my test system, that script
> > produces 20-30s long transaction commits. This then results in
> > seriously degraded performance for any cgroup using the filesystem (the
> > victim cgroup in the script).
> > 
> > This reproducer is a bit silly, as the villainous cgroup is using almost
> > all of its memory.max for kernel memory (specifically pagetables) but it
> > sort of doesn't matter, as I am most interested in the btrfs locking
> > behavior. It isn't an academic problem, as we see this exact problem in
> > production at Meta with one cgroup over memory.max ruining btrfs
> > performance for the whole system.
> > 
> > The underlying scheduling "problem" with global rwsems is sort of thorny
> > and apparently well known. e.g.
> > https://lpc.events/event/18/contributions/1883/
> > 
> > As a result, our main lever in the short term is just trying to reduce
> > contention on our various rwsems. In the case of the csum tree, we can
> > either redesign btree locking (hard...) or try to use the commit root
> > when we can. Luckily, it seems likely that many reads are for old extents
> > written many transactions ago, and that for those we *can* in fact
> > search the commit root!
> > 
> > This change detects when we are trying to read an old extent (according
> > to extent map generation) and then wires that through bio_ctrl to the
> > btrfs_bio, which unfortunately isn't allocated yet when we have this
> > information. Luckily, we don't need this flag in the bio after
> > submitting, so we can save space by setting it on bbio->bio.bi_flags
> > and clear before submitting, so the block layer is unaffected.
> > 
> > When we go to lookup the csums in lookup_bio_sums we can check this
> > condition on the btrfs_bio and do the commit root lookup accordingly.
> > 
> > With the fix, on that same test case, commit latencies no longer exceed
> > ~400ms on my system.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > Changelog:
> > v3:
> > - add some simple machinery for setting/getting/clearing btrfs private
> >   flags in bi_flags
> > - clear those flags before bio_submit (ensure no-op wrt block layer)
> > - store the csum commit root flag there to save space
> > v2:
> > - hold the commit_root_sem for the duration of the entire lookup, not
> >   just per btrfs_search_slot. Note that we can't naively do the thing
> >   where we release the lock every loop as that is exactly what we are
> >   trying to avoid. Theoretically, we could re-grab the lock and fully
> >   start over if the lock is write contended or something. I suspect the
> >   rwsem fairness features will let the commit writer get it fast enough
> >   anyway.
> > 
> > 
> > ---
> >  fs/btrfs/bio.c       | 20 ++++++++++++++++++++
> >  fs/btrfs/bio.h       |  7 +++++++
> >  fs/btrfs/extent_io.c | 21 +++++++++++++++++++++
> >  fs/btrfs/file-item.c | 10 ++++++++++
> >  4 files changed, 58 insertions(+)
> > 
> > diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> > index 5d3f8bd406d9..24c159ef3854 100644
> > --- a/fs/btrfs/bio.c
> > +++ b/fs/btrfs/bio.c
> > @@ -71,6 +71,25 @@ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
> >  	return bbio;
> >  }
> >  
> > +void btrfs_bio_set_private_flag(struct btrfs_bio *bbio, unsigned short flag) {
> > +	ASSERT(flag & BTRFS_BIO_PRIVATE_FLAG_MASK);
> > +	bbio->bio.bi_flags |= flag;
> > +}
> > +
> > +void btrfs_bio_clear_private_flag(struct btrfs_bio *bbio, unsigned short flag) {
> > +	ASSERT(flag & BTRFS_BIO_PRIVATE_FLAG_MASK);
> > +	bbio->bio.bi_flags &= ~flag;
> > +}
> > +
> > +void btrfs_bio_clear_private_flags(struct btrfs_bio *bbio) {
> > +	bbio->bio.bi_flags &= ~BTRFS_BIO_PRIVATE_FLAG_MASK;
> > +}
> > +
> > +bool btrfs_bio_private_flagged(struct btrfs_bio *bbio, unsigned short flag) {
> > +	ASSERT(flag & BTRFS_BIO_PRIVATE_FLAG_MASK);
> > +	return bbio->bio.bi_flags & flag;
> > +}
> 
> This is good, open coding the flag updates is trivial but the assertions
> make it much better.
> 
> Though using two almost identical names for clearing the private flag,
> one with assertion and one without is confusing and hard to spot in the
> code. Also I don't see where btrfs_bio_clear_private_flag() is used at
> all.
> 

There's no use of it, I just figured if I wrote the rest, I should write
that one too, with the checking and stuff, so that it's a complete API.

> > +
> >  static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
> >  					 struct btrfs_bio *orig_bbio,
> >  					 u64 map_length)
> > @@ -493,6 +512,7 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
> >  static void btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
> >  			     struct btrfs_io_stripe *smap, int mirror_num)
> >  {
> > +	btrfs_bio_clear_private_flags(btrfs_bio(bio));
> >  	if (!bioc) {
> >  		/* Single mirror read/write fast path. */
> >  		btrfs_bio(bio)->mirror_num = mirror_num;
> > diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> > index e48612340745..749004ffdc1c 100644
> > --- a/fs/btrfs/bio.h
> > +++ b/fs/btrfs/bio.h
> > @@ -101,6 +101,13 @@ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
> >  				  btrfs_bio_end_io_t end_io, void *private);
> >  void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status);
> >  
> > +#define BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT	1U << (BIO_FLAG_LAST + 1)
> 
> All expressions in macros should be in ( ), namely when there's a
> potential for operator precedence change like with "<<" after the macro
> is expanded in some code.
> 
> > +#define BTRFS_BIO_PRIVATE_FLAG_MASK	(BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT)
> 
> Do you plan to add more such flags to private? This looks like one more
> level of abstraction than we need for this optimization. This could be
> simply used as BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT. And for that
> reason the helpers do not need to sound generic that it manipulates
> 'private', the names can be btrfs_bio_set_csum_search_commit_root(),
> which is IMHO expressing the same semantics.
> 

I don't plan to add any more btrfs private bio flags. I made the
judgment that doing it generically with checking (particularly for the
important and explicit clear before submit) was better than a one-off,
because it made what was being done as clear as possible. It's still not
perfectly clear, because of the whole "private flags" name not being
perfect.

Since there is only one user, I totally see the argument for just doing
it as a one-off. Would you like me to rewrite it to
btrfs_bio_set_csum_search_commit_root?

> > +void btrfs_bio_set_private_flag(struct btrfs_bio *bbio, unsigned short flag);
> > +void btrfs_bio_clear_private_flag(struct btrfs_bio *bbio, unsigned short flag);
> > +void btrfs_bio_clear_private_flags(struct btrfs_bio *bbio);
> > +bool btrfs_bio_private_flagged(struct btrfs_bio *bbio, unsigned short flag);
> > +
> >  /* Submit using blkcg_punt_bio_submit. */
> >  #define REQ_BTRFS_CGROUP_PUNT			REQ_FS_PRIVATE
> >  
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 09c0d18a7b5a..b1b5dce05728 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -108,6 +108,21 @@ struct btrfs_bio_ctrl {
> >  	 * This is to avoid touching ranges covered by compression/inline.
> >  	 */
> >  	unsigned long submit_bitmap;
> > +	/*
> > +	 * If this is a data read bio, we set this to true if it is safe to
> > +	 * search for csums in the commit root. Otherwise, it is set to false.
> > +	 *
> > +	 * This is an optimization to reduce the contention on the csum tree
> > +	 * root rwsem. Due to how rwsem is implemented, there is a possible
> > +	 * priority inversion where the readers holding the lock don't get
> > +	 * scheduled (say they're in a cgroup stuck in heavy reclaim) which
> > +	 * then blocks btrfs transactions. The only real help is to try to
> > +	 * reduce the contention on the lock as much as we can.
> > +	 *
> > +	 * Do this by detecting when a data read is reading data from an old
> > +	 * transaction so it's safe to look in the commit root.
> > +	 */
> > +	bool commit_root_csum;
> 
> There's a 4 byte hole after 'opf', you can move it there for better
> struct packing.
David Sterba Oct. 16, 2024, 12:25 a.m. UTC | #3
On Tue, Oct 15, 2024 at 11:01:44AM -0700, Boris Burkov wrote:
> > > +#define BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT	1U << (BIO_FLAG_LAST + 1)
> > 
> > All expressions in macros should be in ( ), namely when there's a
> > potential for operator precedence change like with "<<" after the macro
> > is expanded in some code.
> > 
> > > +#define BTRFS_BIO_PRIVATE_FLAG_MASK	(BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT)
> > 
> > Do you plan to add more such flags to private? This looks like one more
> > level of abstraction than we need for this optimization. This could be
> > simply used as BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT. And for that
> > reason the helpers do not need to sound generic that it manipulates
> > 'private', the names can be btrfs_bio_set_csum_search_commit_root(),
> > which is IMHO expressing the same semantics.
> 
> I don't plan to add any more btrfs private bio flags. I made the
> judgment that doing it generically with checking (particularly for the
> important and explicit clear before submit) was better than the one off,
> because it made what was being done as clear as possible. It's still not
> perfectly clear, because of the whole "private flags" name not being
> perfect.
> 
> Since there is only one user, I totally see the argument for just doing
> it as a one-off. Would you like me to rewrite it to
> btrfs_bio_set_csum_search_commit_root?

Yes please, make it specific for the csum search root. Making it generic
works when we have at least two, in use or soon to be used, but as you
say no plans for more, so it's best to keep it suited to what we need now.
If it is for any reason needed in the future we know how to extend the
interface, so no big deal.
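
As a minimal sketch (not the actual v4; names taken from the suggestion
above), the csum-search specific variants could look something like:

#define BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT	(1U << (BIO_FLAG_LAST + 1))

static inline void btrfs_bio_set_csum_search_commit_root(struct btrfs_bio *bbio)
{
	bbio->bio.bi_flags |= BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT;
}

static inline bool btrfs_bio_csum_search_commit_root(const struct btrfs_bio *bbio)
{
	return bbio->bio.bi_flags & BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT;
}

/* Still cleared before the bio is handed to the block layer. */
static inline void btrfs_bio_clear_csum_search_commit_root(struct btrfs_bio *bbio)
{
	bbio->bio.bi_flags &= ~BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT;
}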

Patch

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 5d3f8bd406d9..24c159ef3854 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -71,6 +71,25 @@  struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
 	return bbio;
 }
 
+void btrfs_bio_set_private_flag(struct btrfs_bio *bbio, unsigned short flag) {
+	ASSERT(flag & BTRFS_BIO_PRIVATE_FLAG_MASK);
+	bbio->bio.bi_flags |= flag;
+}
+
+void btrfs_bio_clear_private_flag(struct btrfs_bio *bbio, unsigned short flag) {
+	ASSERT(flag & BTRFS_BIO_PRIVATE_FLAG_MASK);
+	bbio->bio.bi_flags &= ~flag;
+}
+
+void btrfs_bio_clear_private_flags(struct btrfs_bio *bbio) {
+	bbio->bio.bi_flags &= ~BTRFS_BIO_PRIVATE_FLAG_MASK;
+}
+
+bool btrfs_bio_private_flagged(struct btrfs_bio *bbio, unsigned short flag) {
+	ASSERT(flag & BTRFS_BIO_PRIVATE_FLAG_MASK);
+	return bbio->bio.bi_flags & flag;
+}
+
 static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
 					 struct btrfs_bio *orig_bbio,
 					 u64 map_length)
@@ -493,6 +512,7 @@  static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
 static void btrfs_submit_bio(struct bio *bio, struct btrfs_io_context *bioc,
 			     struct btrfs_io_stripe *smap, int mirror_num)
 {
+	btrfs_bio_clear_private_flags(btrfs_bio(bio));
 	if (!bioc) {
 		/* Single mirror read/write fast path. */
 		btrfs_bio(bio)->mirror_num = mirror_num;
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index e48612340745..749004ffdc1c 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -101,6 +101,13 @@  struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
 				  btrfs_bio_end_io_t end_io, void *private);
 void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status);
 
+#define BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT	1U << (BIO_FLAG_LAST + 1)
+#define BTRFS_BIO_PRIVATE_FLAG_MASK	(BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT)
+void btrfs_bio_set_private_flag(struct btrfs_bio *bbio, unsigned short flag);
+void btrfs_bio_clear_private_flag(struct btrfs_bio *bbio, unsigned short flag);
+void btrfs_bio_clear_private_flags(struct btrfs_bio *bbio);
+bool btrfs_bio_private_flagged(struct btrfs_bio *bbio, unsigned short flag);
+
 /* Submit using blkcg_punt_bio_submit. */
 #define REQ_BTRFS_CGROUP_PUNT			REQ_FS_PRIVATE
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 09c0d18a7b5a..b1b5dce05728 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -108,6 +108,21 @@  struct btrfs_bio_ctrl {
 	 * This is to avoid touching ranges covered by compression/inline.
 	 */
 	unsigned long submit_bitmap;
+	/*
+	 * If this is a data read bio, we set this to true if it is safe to
+	 * search for csums in the commit root. Otherwise, it is set to false.
+	 *
+	 * This is an optimization to reduce the contention on the csum tree
+	 * root rwsem. Due to how rwsem is implemented, there is a possible
+	 * priority inversion where the readers holding the lock don't get
+	 * scheduled (say they're in a cgroup stuck in heavy reclaim) which
+	 * then blocks btrfs transactions. The only real help is to try to
+	 * reduce the contention on the lock as much as we can.
+	 *
+	 * Do this by detecting when a data read is reading data from an old
+	 * transaction so it's safe to look in the commit root.
+	 */
+	bool commit_root_csum;
 };
 
 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
@@ -770,6 +785,9 @@  static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
 			alloc_new_bio(inode, bio_ctrl, disk_bytenr,
 				      folio_pos(folio) + pg_offset);
 		}
+		if (btrfs_op(&bio_ctrl->bbio->bio) == BTRFS_MAP_READ &&
+		    is_data_inode(inode) && bio_ctrl->commit_root_csum)
+			btrfs_bio_set_private_flag(bio_ctrl->bbio, BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT);
 
 		/* Cap to the current ordered extent boundary if there is one. */
 		if (len > bio_ctrl->len_to_oe_boundary) {
@@ -1048,6 +1066,9 @@  static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 		if (prev_em_start)
 			*prev_em_start = em->start;
 
+		if (em->generation < btrfs_get_fs_generation(fs_info))
+			bio_ctrl->commit_root_csum = true;
+
 		free_extent_map(em);
 		em = NULL;
 
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 886749b39672..52db43bdd623 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -401,6 +401,13 @@  blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
 		path->skip_locking = 1;
 	}
 
+	/* See the comment on btrfs_bio_ctrl->commit_root_csum. */
+	if (btrfs_bio_private_flagged(bbio, BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT)) {
+		path->search_commit_root = 1;
+		path->skip_locking = 1;
+		down_read(&fs_info->commit_root_sem);
+	}
+
 	while (bio_offset < orig_len) {
 		int count;
 		u64 cur_disk_bytenr = orig_disk_bytenr + bio_offset;
@@ -446,6 +453,9 @@  blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
 		bio_offset += count * sectorsize;
 	}
 
+	if (btrfs_bio_private_flagged(bbio, BTRFS_BIO_FLAG_CSUM_SEARCH_COMMIT_ROOT))
+		up_read(&fs_info->commit_root_sem);
+
 	btrfs_free_path(path);
 	return ret;
 }