
[v2] btrfs: try to search for data csums in commit root

Message ID 0a59693a70f542120a0302e9864e7f9b86e1cb4c.1728415983.git.boris@bur.io (mailing list archive)
State New, archived
Series [v2] btrfs: try to search for data csums in commit root

Commit Message

Boris Burkov Oct. 8, 2024, 7:36 p.m. UTC
If you run a workload like:
- a cgroup that does tons of data reading, with a harsh memory limit
- a second cgroup that tries to write new files
e.g.:
https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh

then what quickly occurs is:
- a high degree of contention on the csum root node eb rwsem
- memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
  as readers, but not scheduling promptly. i.e., task __state == 0, but
  not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.

This results in VERY long transactions. On my test system, that script
produces 20-30s long transaction commits. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).

This reproducer is a bit silly, as the villainous cgroup is using almost
all of its memory.max for kernel memory (specifically pagetables), but it
sort of doesn't matter, as I am most interested in the btrfs locking
behavior. It isn't an academic problem, as we see this exact problem in
production at Meta with one cgroup over memory.max ruining btrfs
performance for the whole system.

The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known. e.g.
https://lpc.events/event/18/contributions/1883/

As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems. In the case of the csum tree, we can
either redesign btree locking (hard...) or try to use the commit root
when we can. Luckily, it seems likely that many reads are for old extents
written many transactions ago, and that for those we *can* in fact
search the commit root!

This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to lookup the csums in lookup_bio_sums we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.
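
The core of the detection is a simple generation comparison when building
the read bio (condensed from the patch below):

	/* In btrfs_do_readpage(): the extent was written in an earlier
	 * transaction (per the extent map generation), so it is safe to
	 * look up its csums in the commit root. */
	if (em->generation < btrfs_get_fs_generation(fs_info))
		bio_ctrl->commit_root_csum = true;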

With the fix, on that same test case, commit latencies no longer exceed
~400ms on my system.

Signed-off-by: Boris Burkov <boris@bur.io>
---
Changelog:
v2:
- hold the commit_root_sem for the duration of the entire lookup, not
  just per btrfs_search_slot. Note that we can't naively do the thing
  where we release the lock every loop as that is exactly what we are
  trying to avoid. Theoretically, we could re-grab the lock and fully
  start over if the lock is write contended or something. I suspect the
  rwsem fairness features will let the commit writer get it fast enough
  anyway.

---
 fs/btrfs/bio.h       |  1 +
 fs/btrfs/extent_io.c | 21 +++++++++++++++++++++
 fs/btrfs/file-item.c | 10 ++++++++++
 3 files changed, 32 insertions(+)

Comments

David Sterba Oct. 11, 2024, 5:46 p.m. UTC | #1
On Tue, Oct 08, 2024 at 12:36:34PM -0700, Boris Burkov wrote:
> If you run a workload like:
> - a cgroup that does tons of data reading, with a harsh memory limit
> - a second cgroup that tries to write new files
> e.g.:
> https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh
> 
> then what quickly occurs is:
> - a high degree of contention on the csum root node eb rwsem
> - memory starved cgroup doing tons of reclaim on CPU.
> - many reader threads in the memory starved cgroup "holding" the sem
>   as readers, but not scheduling promptly. i.e., task __state == 0, but
>   not running on a cpu.
> - btrfs_commit_transaction stuck trying to acquire the sem as a writer.
> 
> This results in VERY long transactions. On my test system, that script
> produces 20-30s long transaction commits. This then results in
> seriously degraded performance for any cgroup using the filesystem (the
> victim cgroup in the script).
> 
> This reproducer is a bit silly, as the villainous cgroup is using almost
> all of its memory.max for kernel memory (specifically pagetables), but it
> sort of doesn't matter, as I am most interested in the btrfs locking
> behavior. It isn't an academic problem, as we see this exact problem in
> production at Meta with one cgroup over memory.max ruining btrfs
> performance for the whole system.
> 
> The underlying scheduling "problem" with global rwsems is sort of thorny
> and apparently well known. e.g.
> https://lpc.events/event/18/contributions/1883/
> 
> As a result, our main lever in the short term is just trying to reduce
> contention on our various rwsems. In the case of the csum tree, we can
> either redesign btree locking (hard...) or try to use the commit root
> when we can. Luckily, it seems likely that many reads are for old extents
> written many transactions ago, and that for those we *can* in fact
> search the commit root!
> 
> This change detects when we are trying to read an old extent (according
> to extent map generation) and then wires that through bio_ctrl to the
> btrfs_bio, which unfortunately isn't allocated yet when we have this
> information. When we go to lookup the csums in lookup_bio_sums we can
> check this condition on the btrfs_bio and do the commit root lookup
> accordingly.
> 
> With the fix, on that same test case, commit latencies no longer exceed
> ~400ms on my system.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
> Changelog:
> v2:
> - hold the commit_root_sem for the duration of the entire lookup, not
>   just per btrfs_search_slot. Note that we can't naively do the thing
>   where we release the lock every loop as that is exactly what we are
>   trying to avoid. Theoretically, we could re-grab the lock and fully
>   start over if the lock is write contended or something. I suspect the
>   rwsem fairness features will let the commit writer get it fast enough
>   anyway.

I'd rather let the locking primitives do the fairness, mutex spinning or
other optimizations, unless you'd find a good reason to add some try
locks or contention checks. But such things are hard to predict just
from the code, this depends on data, how many threads are involved.
Regrabbing could be left as an option when there is some corner case.

> 
> ---
>  fs/btrfs/bio.h       |  1 +
>  fs/btrfs/extent_io.c | 21 +++++++++++++++++++++
>  fs/btrfs/file-item.c | 10 ++++++++++
>  3 files changed, 32 insertions(+)
> 
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index e48612340745..159f6a4425a6 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -48,6 +48,7 @@ struct btrfs_bio {
>  			u8 *csum;
>  			u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
>  			struct bvec_iter saved_iter;
> +			bool commit_root_csum;

Can you please find another way how to store this? Maybe the bio flags
have some bits free. Otherwise this adds 8 bytes to btrfs_bio, while
this structure should be optimized for size.  It's ok to use bool for
simplicity in the first versions when you're testing the locking.

Boris Burkov Oct. 11, 2024, 7:48 p.m. UTC | #2
On Fri, Oct 11, 2024 at 07:46:03PM +0200, David Sterba wrote:
> On Tue, Oct 08, 2024 at 12:36:34PM -0700, Boris Burkov wrote:
> > If you run a workload like:
> > - a cgroup that does tons of data reading, with a harsh memory limit
> > - a second cgroup that tries to write new files
> > e.g.:
> > https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh
> > 
> > then what quickly occurs is:
> > - a high degree of contention on the csum root node eb rwsem
> > - memory starved cgroup doing tons of reclaim on CPU.
> > - many reader threads in the memory starved cgroup "holding" the sem
> >   as readers, but not scheduling promptly. i.e., task __state == 0, but
> >   not running on a cpu.
> > - btrfs_commit_transaction stuck trying to acquire the sem as a writer.
> > 
> > This results in VERY long transactions. On my test system, that script
> > produces 20-30s long transaction commits. This then results in
> > seriously degraded performance for any cgroup using the filesystem (the
> > victim cgroup in the script).
> > 
> > This reproducer is a bit silly, as the villainous cgroup is using almost
> > all of its memory.max for kernel memory (specifically pagetables), but it
> > sort of doesn't matter, as I am most interested in the btrfs locking
> > behavior. It isn't an academic problem, as we see this exact problem in
> > production at Meta with one cgroup over memory.max ruining btrfs
> > performance for the whole system.
> > 
> > The underlying scheduling "problem" with global rwsems is sort of thorny
> > and apparently well known. e.g.
> > https://lpc.events/event/18/contributions/1883/
> > 
> > As a result, our main lever in the short term is just trying to reduce
> > contention on our various rwsems. In the case of the csum tree, we can
> > either redesign btree locking (hard...) or try to use the commit root
> > when we can. Luckily, it seems likely that many reads are for old extents
> > written many transactions ago, and that for those we *can* in fact
> > search the commit root!
> > 
> > This change detects when we are trying to read an old extent (according
> > to extent map generation) and then wires that through bio_ctrl to the
> > btrfs_bio, which unfortunately isn't allocated yet when we have this
> > information. When we go to lookup the csums in lookup_bio_sums we can
> > check this condition on the btrfs_bio and do the commit root lookup
> > accordingly.
> > 
> > With the fix, on that same test case, commit latencies no longer exceed
> > ~400ms on my system.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> > Changelog:
> > v2:
> > - hold the commit_root_sem for the duration of the entire lookup, not
> >   just per btrfs_search_slot. Note that we can't naively do the thing
> >   where we release the lock every loop as that is exactly what we are
> >   trying to avoid. Theoretically, we could re-grab the lock and fully
> >   start over if the lock is write contended or something. I suspect the
> >   rwsem fairness features will let the commit writer get it fast enough
> >   anyway.
> 
> I'd rather let the locking primitives do the fairness, mutex spinning or
> other optimizations, unless you'd find a good reason to add some try
> locks or contention checks. But such things are hard to predict just
> from the code, this depends on data, how many threads are involved.
> Regrabbing could be left as an option when there is some corner case.
> 

Agreed.

> > 
> > ---
> >  fs/btrfs/bio.h       |  1 +
> >  fs/btrfs/extent_io.c | 21 +++++++++++++++++++++
> >  fs/btrfs/file-item.c | 10 ++++++++++
> >  3 files changed, 32 insertions(+)
> > 
> > diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> > index e48612340745..159f6a4425a6 100644
> > --- a/fs/btrfs/bio.h
> > +++ b/fs/btrfs/bio.h
> > @@ -48,6 +48,7 @@ struct btrfs_bio {
> >  			u8 *csum;
> >  			u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> >  			struct bvec_iter saved_iter;
> > +			bool commit_root_csum;
> 
> Can you please find another way how to store this? Maybe the bio flags
> have some bits free. Otherwise this adds 8 bytes to btrfs_bio, while
> this structure should be optimized for size.  It's ok to use bool for
> simplicity in the first versions when you're testing the locking.

Ooh, good point!

I started looking into it some more and it's tricky but I have a few
ideas, curious what you think:

1. Export the btrfs_bio_ctrl and optionally wire it through the
   callstack. For data reads it is still live on the stack, we just can't
   be sure that container_of() will work in general. Or just wire the bool
   through the calls. This is pretty "trivial" but also ugly.
2. We already allocate an io context in multi-device scenarios. We could
   allocate a smaller one for the single-device case. That doesn't seem
   that different from adding a flag to btrfs_bio.
3. Figure out how to stuff it into struct btrfs_bio. For example, I
   think we aren't using btrfs_bio->private until later, so we could put
   a to-be-overwritten submit control struct in there.
4. Figure out how to stuff it into struct bio. This would be your
   bi_flags idea. However, I'm confused how to do that safely. Doesn't
   the block layer own those bits? It seems aggressive for me to try
   to use them. bio has a bi_private as well, which might be unset in
   the context we care about.

I'm leaning towards option 3: taking advantage of the yet unset
btrfs_bio->private

Thanks for the review,
Boris

David Sterba Oct. 14, 2024, 3:16 p.m. UTC | #3
On Fri, Oct 11, 2024 at 12:48:31PM -0700, Boris Burkov wrote:
> > Can you please find another way how to store this? Maybe the bio flags
> > have some bits free. Otherwise this adds 8 bytes to btrfs_bio, while
> > this structure should be optimized for size.  It's ok to use bool for
> > simplicity in the first versions when you're testing the locking.
> 
> Ooh, good point!
> 
> I started looking into it some more and it's tricky but I have a few
> ideas, curious what you think:
> 
> 1. Export the btrfs_bio_ctrl and optionally wire it through the
>    callstack. For data reads it is still live on the stack, we just can't
>    be sure that container_of() will work in general. Or just wire the bool
>    through the calls. This is pretty "trivial" but also ugly.
> 2. We already allocate an io context in multi-device scenarios. We could
>    allocate a smaller one for the single-device case. That doesn't seem
>    that different from adding a flag to btrfs_bio.
> 3. Figure out how to stuff it into struct btrfs_bio. For example, I
>    think we aren't using btrfs_bio->private until later, so we could put
>    a to-be-overwritten submit control struct in there.

Maybe this could work but it seems fragile, as the private pointer is
used for other structures and purposes, all in a void pointer. We need
to store only a single bit of information, so this can work with some
stub pointer type that can at least be recognized when mistakenly
dereferenced.

> 4. Figure out how to stuff it into struct bio. This would be your
>    bi_flags idea. However, I'm confused how to do that safely. Doesn't
>    the block layer own those bits? It seems aggressive for me to try
>    to use them. bio has a bi_private as well, which might be unset in
>    the context we care about.

#define BTRFS_BIO_FLAG_SEARCH_COMMIT_ROOT	(1U << (BIO_FLAG_LAST + 1))

I don't see any code in block/* that would verify the bio bits. Also the
REQ_ bits could be used, we already do that with REQ_BTRFS_CGROUP_PUNT
and at some point the REQ_DRV was used (removed in a39da514eba81e687db).

> I'm leaning towards option 3: taking advantage of the yet unset
> btrfs_bio->private

Boris Burkov Oct. 14, 2024, 5:02 p.m. UTC | #4
On Mon, Oct 14, 2024 at 05:16:03PM +0200, David Sterba wrote:
> On Fri, Oct 11, 2024 at 12:48:31PM -0700, Boris Burkov wrote:
> > > Can you please find another way how to store this? Maybe the bio flags
> > > have some bits free. Otherwise this adds 8 bytes to btrfs_bio, while
> > > this structure should be optimized for size.  It's ok to use bool for
> > > simplicity in the first versions when you're testing the locking.
> > 
> > Ooh, good point!
> > 
> > I started looking into it some more and it's tricky but I have a few
> > ideas, curious what you think:
> > 
> > 1. Export the btrfs_bio_ctrl and optionally wire it through the
> >    callstack. For data reads it is still live on the stack, we just can't
> >    be sure that container_of() will work in general. Or just wire the bool
> >    through the calls. This is pretty "trivial" but also ugly.
> > 2. We already allocate an io context in multi-device scenarios. We could
> >    allocate a smaller one for the single-device case. That doesn't seem
> >    that different from adding a flag to btrfs_bio.
> > 3. Figure out how to stuff it into struct btrfs_bio. For example, I
> >    think we aren't using btrfs_bio->private until later, so we could put
> >    a to-be-overwritten submit control struct in there.
> 
> Maybe this could work but it seems fragile, as the private pointer is
> used for other structures and purposes, all in a void pointer. We need
> to store only a single bit of information, so this can work with some
> stub pointer type that can at least be recognized when mistakenly
> dereferenced.
> 

Agreed, I tried this approach first and it was very ugly/fragile.

> > 4. Figure out how to stuff it into struct bio. This would be your
> >    bi_flags idea. However, I'm confused how to do that safely. Doesn't
> >    the block layer own those bits? It seems aggressive for me to try
> >    to use them. bio has a bi_private as well, which might be unset in
> >    the context we care about.
> 
> #define BTRFS_BIO_FLAG_SEARCH_COMMIT_ROOT	(1U << (BIO_FLAG_LAST + 1))
> 
> I don't see any code in block/* that would verify the bio bits. Also the
> REQ_ bits could be used, we already do that with REQ_BTRFS_CGROUP_PUNT
> and at some point the REQ_DRV was used (removed in a39da514eba81e687db).
> 

Oh duh, this isn't persisted. Thanks for the extra explanation, that was
helpful. Trying something along these lines now and it looks decent.
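
For reference, the shape I'm experimenting with is roughly the following
(illustrative only, not the final version; the flag name is a placeholder
and the exact plumbing may still change):

/* fs/btrfs/bio.h: borrow a driver-private request flag, the same way
 * REQ_BTRFS_CGROUP_PUNT aliases REQ_FS_PRIVATE. */
#define REQ_BTRFS_CSUM_SEARCH_COMMIT	REQ_DRV

	/* extent_io.c, btrfs_do_readpage(): tag reads of old extents. */
	if (em->generation < btrfs_get_fs_generation(fs_info))
		bio_ctrl->opf |= REQ_BTRFS_CSUM_SEARCH_COMMIT;

	/* file-item.c, btrfs_lookup_bio_sums(): test bi_opf instead of
	 * growing struct btrfs_bio by another member. */
	if (bbio->bio.bi_opf & REQ_BTRFS_CSUM_SEARCH_COMMIT) {
		path->search_commit_root = 1;
		path->skip_locking = 1;
		down_read(&fs_info->commit_root_sem);
	}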

> > I'm leaning towards option 3: taking advantage of the yet unset
> > btrfs_bio->private

Patch

diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index e48612340745..159f6a4425a6 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -48,6 +48,7 @@  struct btrfs_bio {
 			u8 *csum;
 			u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
 			struct bvec_iter saved_iter;
+			bool commit_root_csum;
 		};
 
 		/*
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9fbc83c76b94..b1c0a531b26a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -108,6 +108,21 @@  struct btrfs_bio_ctrl {
 	 * This is to avoid touching ranges covered by compression/inline.
 	 */
 	unsigned long submit_bitmap;
+	/*
+	 * If this is a data read bio, we set this to true if it is safe to
+	 * search for csums in the commit root. Otherwise, it is set to false.
+	 *
+	 * This is an optimization to reduce the contention on the csum tree
+	 * root rwsem. Due to how rwsem is implemented, there is a possible
+	 * priority inversion where the readers holding the lock don't get
+	 * scheduled (say they're in a cgroup stuck in heavy reclaim) which
+	 * then blocks btrfs transactions. The only real help is to try to
+	 * reduce the contention on the lock as much as we can.
+	 *
+	 * Do this by detecting when a data read is reading data from an old
+	 * transaction so it's safe to look in the commit root.
+	 */
+	bool commit_root_csum;
 };
 
 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
@@ -770,6 +785,9 @@  static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
 			alloc_new_bio(inode, bio_ctrl, disk_bytenr,
 				      folio_pos(folio) + pg_offset);
 		}
+		if (btrfs_op(&bio_ctrl->bbio->bio) == BTRFS_MAP_READ && is_data_inode(inode))
+			bio_ctrl->bbio->commit_root_csum = bio_ctrl->commit_root_csum;
+
 
 		/* Cap to the current ordered extent boundary if there is one. */
 		if (len > bio_ctrl->len_to_oe_boundary) {
@@ -1048,6 +1066,9 @@  static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 		if (prev_em_start)
 			*prev_em_start = em->start;
 
+		if (em->generation < btrfs_get_fs_generation(fs_info))
+			bio_ctrl->commit_root_csum = true;
+
 		free_extent_map(em);
 		em = NULL;
 
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 886749b39672..c59f9e84e314 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -401,6 +401,13 @@  blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
 		path->skip_locking = 1;
 	}
 
+	/* See the comment on btrfs_bio_ctrl->commit_root_csum. */
+	if (bbio->commit_root_csum) {
+		path->search_commit_root = 1;
+		path->skip_locking = 1;
+		down_read(&fs_info->commit_root_sem);
+	}
+
 	while (bio_offset < orig_len) {
 		int count;
 		u64 cur_disk_bytenr = orig_disk_bytenr + bio_offset;
@@ -446,6 +453,9 @@  blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
 		bio_offset += count * sectorsize;
 	}
 
+	if (bbio->commit_root_csum)
+		up_read(&fs_info->commit_root_sem);
+
 	btrfs_free_path(path);
 	return ret;
 }