
btrfs: try to search for data csums in commit root

Message ID 9d12c373a49184e84897ff2d6df601f2c7c66a32.1728084164.git.boris@bur.io (mailing list archive)
State New, archived
Series: btrfs: try to search for data csums in commit root

Commit Message

Boris Burkov Oct. 4, 2024, 11:23 p.m. UTC
If you run a workload like:
- a cgroup that does tons of data reading, with a harsh memory limit
- a second cgroup that tries to write new files
e.g.:
https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh

then what quickly occurs is:
- a high degree of contention on the csum root node eb rwsem
- memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
  as readers, but not scheduling promptly. i.e., task __state == 0, but
  not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.

This results in VERY long transactions. On my test system, that script
produces 20-30s long transaction commits. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).

This reproducer is a bit silly, as the villainous cgroup is using almost
all of its memory.max for kernel memory (specifically pagetables) but it
sort of doesn't matter, as I am most interested in the btrfs locking
behavior. It isn't an academic problem, as we see this exact problem in
production at Meta with one cgroup over memory.max ruining btrfs
performance for the whole system.

The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known. e.g.
https://lpc.events/event/18/contributions/1883/

As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems. In the case of the csum tree, we can
either redesign btree locking (hard...) or try to use the commit root
when we can. Luckily, it seems likely that many reads are for old extents
written many transactions ago, and that for those we *can* in fact
search the commit root!

This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to look up the csums in lookup_bio_sums, we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.

With the fix, on that same test case, commit latencies no longer exceed
~400ms on my system.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/bio.h       |  1 +
 fs/btrfs/extent_io.c | 21 +++++++++++++++++++++
 fs/btrfs/file-item.c |  7 +++++++
 3 files changed, 29 insertions(+)

Comments

Qu Wenruo Oct. 5, 2024, 7 a.m. UTC | #1
On 2024/10/5 08:53, Boris Burkov wrote:
> If you run a workload like:
> - a cgroup that does tons of data reading, with a harsh memory limit
> - a second cgroup that tries to write new files
> e.g.:
> https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh
> 
> then what quickly occurs is:
> - a high degree of contention on the csum root node eb rwsem
> - memory starved cgroup doing tons of reclaim on CPU.
> - many reader threads in the memory starved cgroup "holding" the sem
>    as readers, but not scheduling promptly. i.e., task __state == 0, but
>    not running on a cpu.
> - btrfs_commit_transaction stuck trying to acquire the sem as a writer.
> 
> This results in VERY long transactions. On my test system, that script
> produces 20-30s long transaction commits. This then results in
> seriously degraded performance for any cgroup using the filesystem (the
> victim cgroup in the script).
> 
> This reproducer is a bit silly, as the villainous cgroup is using almost
> all of its memory.max for kernel memory (specifically pagetables) but it
> sort of doesn't matter, as I am most interested in the btrfs locking
> behavior. It isn't an academic problem, as we see this exact problem in
> production at Meta with one cgroup over memory.max ruining btrfs
> performance for the whole system.
> 
> The underlying scheduling "problem" with global rwsems is sort of thorny
> and apparently well known. e.g.
> https://lpc.events/event/18/contributions/1883/
> 
> As a result, our main lever in the short term is just trying to reduce
> contention on our various rwsems. In the case of the csum tree, we can
> either redesign btree locking (hard...) or try to use the commit root
> when we can. Luckily, it seems likely that many reads are for old extents
> written many transactions ago, and that for those we *can* in fact
> search the commit root!

The idea looks good to me.

The extent_map::generation is updated to the larger one during a merge, so 
if we get an em whose generation is smaller than the current generation, 
it is definitely older.

And since data extents in the commit root won't be overwritten until the 
current transaction commits, this should also be fine.


But my concern is that path->need_commit_sem only blocks a transaction 
from happening while the path is holding something.
And inside search_csum_tree() we can release the path halfway; could 
that allow two transactions to be committed during that release window?

Shouldn't we hold the semaphore manually inside btrfs_lookup_bio_sums() 
rather than relying on btrfs_path::need_commit_sem?

Thanks,
Qu
> 
> This change detects when we are trying to read an old extent (according
> to extent map generation) and then wires that through bio_ctrl to the
> btrfs_bio, which unfortunately isn't allocated yet when we have this
> information. When we go to lookup the csums in lookup_bio_sums we can
> check this condition on the btrfs_bio and do the commit root lookup
> accordingly.
> 
> With the fix, on that same test case, commit latencies no longer exceed
> ~400ms on my system.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>   fs/btrfs/bio.h       |  1 +
>   fs/btrfs/extent_io.c | 21 +++++++++++++++++++++
>   fs/btrfs/file-item.c |  7 +++++++
>   3 files changed, 29 insertions(+)
> 
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index e48612340745..159f6a4425a6 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -48,6 +48,7 @@ struct btrfs_bio {
>   			u8 *csum;
>   			u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
>   			struct bvec_iter saved_iter;
> +			bool commit_root_csum;
>   		};
>   
>   		/*
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index cb0a39370d30..8544fe2302ff 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -108,6 +108,21 @@ struct btrfs_bio_ctrl {
>   	 * This is to avoid touching ranges covered by compression/inline.
>   	 */
>   	unsigned long submit_bitmap;
> +	/*
> +	 * If this is a data read bio, we set this to true if it is safe to
> +	 * search for csums in the commit root. Otherwise, it is set to false.
> +	 *
> +	 * This is an optimization to reduce the contention on the csum tree
> +	 * root rwsem. Due to how rwsem is implemented, there is a possible
> +	 * priority inversion where the readers holding the lock don't get
> +	 * scheduled (say they're in a cgroup stuck in heavy reclaim) which
> +	 * then blocks btrfs transactions. The only real help is to try to
> +	 * reduce the contention on the lock as much as we can.
> +	 *
> +	 * Do this by detecting when a data read is reading data from an old
> +	 * transaction so it's safe to look in the commit root.
> +	 */
> +	bool commit_root_csum;
>   };
>   
>   static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
> @@ -770,6 +785,9 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
>   			alloc_new_bio(inode, bio_ctrl, disk_bytenr,
>   				      folio_pos(folio) + pg_offset);
>   		}
> +		if (btrfs_op(&bio_ctrl->bbio->bio) == BTRFS_MAP_READ && is_data_inode(inode))
> +			bio_ctrl->bbio->commit_root_csum = bio_ctrl->commit_root_csum;
> +
>   
>   		/* Cap to the current ordered extent boundary if there is one. */
>   		if (len > bio_ctrl->len_to_oe_boundary) {
> @@ -1048,6 +1066,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>   		if (prev_em_start)
>   			*prev_em_start = em->start;
>   
> +		if (em->generation < btrfs_get_fs_generation(fs_info))
> +			bio_ctrl->commit_root_csum = true;
> +
>   		free_extent_map(em);
>   		em = NULL;
>   
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 886749b39672..2433b169a4e6 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -401,6 +401,13 @@ blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
>   		path->skip_locking = 1;
>   	}
>   
> +	/* See the comment on btrfs_bio_ctrl->commit_root_csum. */
> +	if (bbio->commit_root_csum) {
> +		path->search_commit_root = 1;
> +		path->skip_locking = 1;
> +		path->need_commit_sem = 1;
> +	}
> +
>   	while (bio_offset < orig_len) {
>   		int count;
>   		u64 cur_disk_bytenr = orig_disk_bytenr + bio_offset;
Boris Burkov Oct. 7, 2024, 10:30 p.m. UTC | #2
On Sat, Oct 05, 2024 at 04:30:29PM +0930, Qu Wenruo wrote:
> 
> 
> On 2024/10/5 08:53, Boris Burkov wrote:
> > If you run a workload like:
> > - a cgroup that does tons of data reading, with a harsh memory limit
> > - a second cgroup that tries to write new files
> > e.g.:
> > https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh
> > 
> > then what quickly occurs is:
> > - a high degree of contention on the csum root node eb rwsem
> > - memory starved cgroup doing tons of reclaim on CPU.
> > - many reader threads in the memory starved cgroup "holding" the sem
> >    as readers, but not scheduling promptly. i.e., task __state == 0, but
> >    not running on a cpu.
> > - btrfs_commit_transaction stuck trying to acquire the sem as a writer.
> > 
> > This results in VERY long transactions. On my test system, that script
> > produces 20-30s long transaction commits. This then results in
> > seriously degraded performance for any cgroup using the filesystem (the
> > victim cgroup in the script).
> > 
> > This reproducer is a bit silly, as the villainous cgroup is using almost
> > all of its memory.max for kernel memory (specifically pagetables) but it
> > sort of doesn't matter, as I am most interested in the btrfs locking
> > behavior. It isn't an academic problem, as we see this exact problem in
> > production at Meta with one cgroup over memory.max ruining btrfs
> > performance for the whole system.
> > 
> > The underlying scheduling "problem" with global rwsems is sort of thorny
> > and apparently well known. e.g.
> > https://lpc.events/event/18/contributions/1883/
> > 
> > As a result, our main lever in the short term is just trying to reduce
> > contention on our various rwsems. In the case of the csum tree, we can
> > either redesign btree locking (hard...) or try to use the commit root
> > when we can. Luckily, it seems likely that many reads are for old extents
> > written many transactions ago, and that for those we *can* in fact
> > search the commit root!
> 
> The idea looks good to me.
> 
> The extent_map::generation is updated to the larger one during a merge, so if
> we get an em whose generation is smaller than the current generation, it is
> definitely older.
> 
> And since data extents in the commit root won't be overwritten until the
> current transaction commits, this should also be fine.
> 
> 
> But my concern is that path->need_commit_sem only blocks a transaction
> from happening while the path is holding something.
> And inside search_csum_tree() we can release the path halfway; could that
> allow two transactions to be committed during that release window?
> 
> Shouldn't we hold the semaphore manually inside btrfs_lookup_bio_sums()
> rather than relying on btrfs_path::need_commit_sem?

Yes, I think you are right. Good catch! I will test that version and
re-send, assuming it still works well.

> 
> Thanks,
> Qu
> > 
> > This change detects when we are trying to read an old extent (according
> > to extent map generation) and then wires that through bio_ctrl to the
> > btrfs_bio, which unfortunately isn't allocated yet when we have this
> > information. When we go to lookup the csums in lookup_bio_sums we can
> > check this condition on the btrfs_bio and do the commit root lookup
> > accordingly.
> > 
> > With the fix, on that same test case, commit latencies no longer exceed
> > ~400ms on my system.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >   fs/btrfs/bio.h       |  1 +
> >   fs/btrfs/extent_io.c | 21 +++++++++++++++++++++
> >   fs/btrfs/file-item.c |  7 +++++++
> >   3 files changed, 29 insertions(+)
> > 
> > diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> > index e48612340745..159f6a4425a6 100644
> > --- a/fs/btrfs/bio.h
> > +++ b/fs/btrfs/bio.h
> > @@ -48,6 +48,7 @@ struct btrfs_bio {
> >   			u8 *csum;
> >   			u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> >   			struct bvec_iter saved_iter;
> > +			bool commit_root_csum;
> >   		};
> >   		/*
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index cb0a39370d30..8544fe2302ff 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -108,6 +108,21 @@ struct btrfs_bio_ctrl {
> >   	 * This is to avoid touching ranges covered by compression/inline.
> >   	 */
> >   	unsigned long submit_bitmap;
> > +	/*
> > +	 * If this is a data read bio, we set this to true if it is safe to
> > +	 * search for csums in the commit root. Otherwise, it is set to false.
> > +	 *
> > +	 * This is an optimization to reduce the contention on the csum tree
> > +	 * root rwsem. Due to how rwsem is implemented, there is a possible
> > +	 * priority inversion where the readers holding the lock don't get
> > +	 * scheduled (say they're in a cgroup stuck in heavy reclaim) which
> > +	 * then blocks btrfs transactions. The only real help is to try to
> > +	 * reduce the contention on the lock as much as we can.
> > +	 *
> > +	 * Do this by detecting when a data read is reading data from an old
> > +	 * transaction so it's safe to look in the commit root.
> > +	 */
> > +	bool commit_root_csum;
> >   };
> >   static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
> > @@ -770,6 +785,9 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
> >   			alloc_new_bio(inode, bio_ctrl, disk_bytenr,
> >   				      folio_pos(folio) + pg_offset);
> >   		}
> > +		if (btrfs_op(&bio_ctrl->bbio->bio) == BTRFS_MAP_READ && is_data_inode(inode))
> > +			bio_ctrl->bbio->commit_root_csum = bio_ctrl->commit_root_csum;
> > +
> >   		/* Cap to the current ordered extent boundary if there is one. */
> >   		if (len > bio_ctrl->len_to_oe_boundary) {
> > @@ -1048,6 +1066,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> >   		if (prev_em_start)
> >   			*prev_em_start = em->start;
> > +		if (em->generation < btrfs_get_fs_generation(fs_info))
> > +			bio_ctrl->commit_root_csum = true;
> > +
> >   		free_extent_map(em);
> >   		em = NULL;
> > diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> > index 886749b39672..2433b169a4e6 100644
> > --- a/fs/btrfs/file-item.c
> > +++ b/fs/btrfs/file-item.c
> > @@ -401,6 +401,13 @@ blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
> >   		path->skip_locking = 1;
> >   	}
> > +	/* See the comment on btrfs_bio_ctrl->commit_root_csum. */
> > +	if (bbio->commit_root_csum) {
> > +		path->search_commit_root = 1;
> > +		path->skip_locking = 1;
> > +		path->need_commit_sem = 1;
> > +	}
> > +
> >   	while (bio_offset < orig_len) {
> >   		int count;
> >   		u64 cur_disk_bytenr = orig_disk_bytenr + bio_offset;
>
Qu Wenruo Oct. 7, 2024, 11:43 p.m. UTC | #3
On 2024/10/8 09:00, Boris Burkov wrote:
> On Sat, Oct 05, 2024 at 04:30:29PM +0930, Qu Wenruo wrote:
>>
>>
>> On 2024/10/5 08:53, Boris Burkov wrote:
>>> If you run a workload like:
>>> - a cgroup that does tons of data reading, with a harsh memory limit
>>> - a second cgroup that tries to write new files
>>> e.g.:
>>> https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh
>>>
>>> then what quickly occurs is:
>>> - a high degree of contention on the csum root node eb rwsem
>>> - memory starved cgroup doing tons of reclaim on CPU.
>>> - many reader threads in the memory starved cgroup "holding" the sem
>>>     as readers, but not scheduling promptly. i.e., task __state == 0, but
>>>     not running on a cpu.
>>> - btrfs_commit_transaction stuck trying to acquire the sem as a writer.
>>>
>>> This results in VERY long transactions. On my test system, that script
>>> produces 20-30s long transaction commits. This then results in
>>> seriously degraded performance for any cgroup using the filesystem (the
>>> victim cgroup in the script).
>>>
>>> This reproducer is a bit silly, as the villainous cgroup is using almost
>>> all of its memory.max for kernel memory (specifically pagetables) but it
>>> sort of doesn't matter, as I am most interested in the btrfs locking
>>> behavior. It isn't an academic problem, as we see this exact problem in
>>> production at Meta with one cgroup over memory.max ruining btrfs
>>> performance for the whole system.
>>>
>>> The underlying scheduling "problem" with global rwsems is sort of thorny
>>> and apparently well known. e.g.
>>> https://lpc.events/event/18/contributions/1883/
>>>
>>> As a result, our main lever in the short term is just trying to reduce
>>> contention on our various rwsems. In the case of the csum tree, we can
>>> either redesign btree locking (hard...) or try to use the commit root
>>> when we can. Luckily, it seems likely that many reads are for old extents
>>> written many transactions ago, and that for those we *can* in fact
>>> search the commit root!
>>
>> The idea looks good to me.
>>
>> The extent_map::generation is updated to the larger one during a merge, so if
>> we get an em whose generation is smaller than the current generation, it is
>> definitely older.
>>
>> And since data extents in the commit root won't be overwritten until the
>> current transaction commits, this should also be fine.
>>
>>
>> But my concern is that path->need_commit_sem only blocks a transaction
>> from happening while the path is holding something.
>> And inside search_csum_tree() we can release the path halfway; could that
>> allow two transactions to be committed during that release window?
>>
>> Shouldn't we hold the semaphore manually inside btrfs_lookup_bio_sums()
>> rather than relying on btrfs_path::need_commit_sem?
>
> Yes, I think you are right. Good catch! I will test that version and
> re-send, assuming it still works well.

The problem is, if we hold the semaphore that long, it will be no better
than the regular tree search method...

So I'm out of good ideas.

Thanks,
Qu
>
>>
>> Thanks,
>> Qu
>>>
>>> This change detects when we are trying to read an old extent (according
>>> to extent map generation) and then wires that through bio_ctrl to the
>>> btrfs_bio, which unfortunately isn't allocated yet when we have this
>>> information. When we go to lookup the csums in lookup_bio_sums we can
>>> check this condition on the btrfs_bio and do the commit root lookup
>>> accordingly.
>>>
>>> With the fix, on that same test case, commit latencies no longer exceed
>>> ~400ms on my system.
>>>
>>> Signed-off-by: Boris Burkov <boris@bur.io>
>>> ---
>>>    fs/btrfs/bio.h       |  1 +
>>>    fs/btrfs/extent_io.c | 21 +++++++++++++++++++++
>>>    fs/btrfs/file-item.c |  7 +++++++
>>>    3 files changed, 29 insertions(+)
>>>
>>> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
>>> index e48612340745..159f6a4425a6 100644
>>> --- a/fs/btrfs/bio.h
>>> +++ b/fs/btrfs/bio.h
>>> @@ -48,6 +48,7 @@ struct btrfs_bio {
>>>    			u8 *csum;
>>>    			u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
>>>    			struct bvec_iter saved_iter;
>>> +			bool commit_root_csum;
>>>    		};
>>>    		/*
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index cb0a39370d30..8544fe2302ff 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -108,6 +108,21 @@ struct btrfs_bio_ctrl {
>>>    	 * This is to avoid touching ranges covered by compression/inline.
>>>    	 */
>>>    	unsigned long submit_bitmap;
>>> +	/*
>>> +	 * If this is a data read bio, we set this to true if it is safe to
>>> +	 * search for csums in the commit root. Otherwise, it is set to false.
>>> +	 *
>>> +	 * This is an optimization to reduce the contention on the csum tree
>>> +	 * root rwsem. Due to how rwsem is implemented, there is a possible
>>> +	 * priority inversion where the readers holding the lock don't get
>>> +	 * scheduled (say they're in a cgroup stuck in heavy reclaim) which
>>> +	 * then blocks btrfs transactions. The only real help is to try to
>>> +	 * reduce the contention on the lock as much as we can.
>>> +	 *
>>> +	 * Do this by detecting when a data read is reading data from an old
>>> +	 * transaction so it's safe to look in the commit root.
>>> +	 */
>>> +	bool commit_root_csum;
>>>    };
>>>    static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
>>> @@ -770,6 +785,9 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
>>>    			alloc_new_bio(inode, bio_ctrl, disk_bytenr,
>>>    				      folio_pos(folio) + pg_offset);
>>>    		}
>>> +		if (btrfs_op(&bio_ctrl->bbio->bio) == BTRFS_MAP_READ && is_data_inode(inode))
>>> +			bio_ctrl->bbio->commit_root_csum = bio_ctrl->commit_root_csum;
>>> +
>>>    		/* Cap to the current ordered extent boundary if there is one. */
>>>    		if (len > bio_ctrl->len_to_oe_boundary) {
>>> @@ -1048,6 +1066,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>>>    		if (prev_em_start)
>>>    			*prev_em_start = em->start;
>>> +		if (em->generation < btrfs_get_fs_generation(fs_info))
>>> +			bio_ctrl->commit_root_csum = true;
>>> +
>>>    		free_extent_map(em);
>>>    		em = NULL;
>>> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
>>> index 886749b39672..2433b169a4e6 100644
>>> --- a/fs/btrfs/file-item.c
>>> +++ b/fs/btrfs/file-item.c
>>> @@ -401,6 +401,13 @@ blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
>>>    		path->skip_locking = 1;
>>>    	}
>>> +	/* See the comment on btrfs_bio_ctrl->commit_root_csum. */
>>> +	if (bbio->commit_root_csum) {
>>> +		path->search_commit_root = 1;
>>> +		path->skip_locking = 1;
>>> +		path->need_commit_sem = 1;
>>> +	}
>>> +
>>>    	while (bio_offset < orig_len) {
>>>    		int count;
>>>    		u64 cur_disk_bytenr = orig_disk_bytenr + bio_offset;
>>
>
Boris Burkov Oct. 8, 2024, 6 a.m. UTC | #4
On Tue, Oct 08, 2024 at 10:13:03AM +1030, Qu Wenruo wrote:
> 
> 
> On 2024/10/8 09:00, Boris Burkov wrote:
> > On Sat, Oct 05, 2024 at 04:30:29PM +0930, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2024/10/5 08:53, Boris Burkov wrote:
> > > > If you run a workload like:
> > > > - a cgroup that does tons of data reading, with a harsh memory limit
> > > > - a second cgroup that tries to write new files
> > > > e.g.:
> > > > https://github.com/boryas/scripts/blob/main/sh/noisy-neighbor/run.sh
> > > > 
> > > > then what quickly occurs is:
> > > > - a high degree of contention on the csum root node eb rwsem
> > > > - memory starved cgroup doing tons of reclaim on CPU.
> > > > - many reader threads in the memory starved cgroup "holding" the sem
> > > >     as readers, but not scheduling promptly. i.e., task __state == 0, but
> > > >     not running on a cpu.
> > > > - btrfs_commit_transaction stuck trying to acquire the sem as a writer.
> > > > 
> > > > This results in VERY long transactions. On my test system, that script
> > > > produces 20-30s long transaction commits. This then results in
> > > > seriously degraded performance for any cgroup using the filesystem (the
> > > > victim cgroup in the script).
> > > > 
> > > > This reproducer is a bit silly, as the villainous cgroup is using almost
> > > > all of its memory.max for kernel memory (specifically pagetables) but it
> > > > sort of doesn't matter, as I am most interested in the btrfs locking
> > > > behavior. It isn't an academic problem, as we see this exact problem in
> > > > production at Meta with one cgroup over memory.max ruining btrfs
> > > > performance for the whole system.
> > > > 
> > > > The underlying scheduling "problem" with global rwsems is sort of thorny
> > > > and apparently well known. e.g.
> > > > https://lpc.events/event/18/contributions/1883/
> > > > 
> > > > As a result, our main lever in the short term is just trying to reduce
> > > > contention on our various rwsems. In the case of the csum tree, we can
> > > > either redesign btree locking (hard...) or try to use the commit root
> > > > when we can. Luckily, it seems likely that many reads are for old extents
> > > > written many transactions ago, and that for those we *can* in fact
> > > > search the commit root!
> > > 
> > > The idea looks good to me.
> > > 
> > > The extent_map::generation is updated to the larger one during a merge, so if
> > > we get an em whose generation is smaller than the current generation, it is
> > > definitely older.
> > > 
> > > And since data extents in the commit root won't be overwritten until the
> > > current transaction commits, this should also be fine.
> > > 
> > > 
> > > But my concern is that path->need_commit_sem only blocks a transaction
> > > from happening while the path is holding something.
> > > And inside search_csum_tree() we can release the path halfway; could that
> > > allow two transactions to be committed during that release window?
> > > 
> > > Shouldn't we hold the semaphore manually inside btrfs_lookup_bio_sums()
> > > rather than relying on btrfs_path::need_commit_sem?
> > 
> > Yes, I think you are right. Good catch! I will test that version and
> > re-send, assuming it still works well.
> 
> The problem is, if we hold the semaphore that long, it will be no better
> than the regular tree search method...

Quite possible; I have a hard time seeing a difference for the reader
logic.

But! There is good news: you need a writer in the wait queue to trigger
the full worst-case version of this, where some readers are "holding" the
lock but starved of CPU while other readers pile up in the wait queue
behind the writer. Without a writer, all the other readers also acquire
the semaphore successfully and hopefully make progress. And the
commit_root_sem should have much less write contention than the csum
tree root node.

I'll test the performance of your fix on the reproducer tomorrow and
hopefully come back with happy news :)

> 
> So I'm out of good ideas.
> 
> Thanks,
> Qu
> > 
> > > 
> > > Thanks,
> > > Qu
> > > > 
> > > > This change detects when we are trying to read an old extent (according
> > > > to extent map generation) and then wires that through bio_ctrl to the
> > > > btrfs_bio, which unfortunately isn't allocated yet when we have this
> > > > information. When we go to lookup the csums in lookup_bio_sums we can
> > > > check this condition on the btrfs_bio and do the commit root lookup
> > > > accordingly.
> > > > 
> > > > With the fix, on that same test case, commit latencies no longer exceed
> > > > ~400ms on my system.
> > > > 
> > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > ---
> > > >    fs/btrfs/bio.h       |  1 +
> > > >    fs/btrfs/extent_io.c | 21 +++++++++++++++++++++
> > > >    fs/btrfs/file-item.c |  7 +++++++
> > > >    3 files changed, 29 insertions(+)
> > > > 
> > > > diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> > > > index e48612340745..159f6a4425a6 100644
> > > > --- a/fs/btrfs/bio.h
> > > > +++ b/fs/btrfs/bio.h
> > > > @@ -48,6 +48,7 @@ struct btrfs_bio {
> > > >    			u8 *csum;
> > > >    			u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
> > > >    			struct bvec_iter saved_iter;
> > > > +			bool commit_root_csum;
> > > >    		};
> > > >    		/*
> > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > > index cb0a39370d30..8544fe2302ff 100644
> > > > --- a/fs/btrfs/extent_io.c
> > > > +++ b/fs/btrfs/extent_io.c
> > > > @@ -108,6 +108,21 @@ struct btrfs_bio_ctrl {
> > > >    	 * This is to avoid touching ranges covered by compression/inline.
> > > >    	 */
> > > >    	unsigned long submit_bitmap;
> > > > +	/*
> > > > +	 * If this is a data read bio, we set this to true if it is safe to
> > > > +	 * search for csums in the commit root. Otherwise, it is set to false.
> > > > +	 *
> > > > +	 * This is an optimization to reduce the contention on the csum tree
> > > > +	 * root rwsem. Due to how rwsem is implemented, there is a possible
> > > > +	 * priority inversion where the readers holding the lock don't get
> > > > +	 * scheduled (say they're in a cgroup stuck in heavy reclaim) which
> > > > +	 * then blocks btrfs transactions. The only real help is to try to
> > > > +	 * reduce the contention on the lock as much as we can.
> > > > +	 *
> > > > +	 * Do this by detecting when a data read is reading data from an old
> > > > +	 * transaction so it's safe to look in the commit root.
> > > > +	 */
> > > > +	bool commit_root_csum;
> > > >    };
> > > >    static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
> > > > @@ -770,6 +785,9 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
> > > >    			alloc_new_bio(inode, bio_ctrl, disk_bytenr,
> > > >    				      folio_pos(folio) + pg_offset);
> > > >    		}
> > > > +		if (btrfs_op(&bio_ctrl->bbio->bio) == BTRFS_MAP_READ && is_data_inode(inode))
> > > > +			bio_ctrl->bbio->commit_root_csum = bio_ctrl->commit_root_csum;
> > > > +
> > > >    		/* Cap to the current ordered extent boundary if there is one. */
> > > >    		if (len > bio_ctrl->len_to_oe_boundary) {
> > > > @@ -1048,6 +1066,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> > > >    		if (prev_em_start)
> > > >    			*prev_em_start = em->start;
> > > > +		if (em->generation < btrfs_get_fs_generation(fs_info))
> > > > +			bio_ctrl->commit_root_csum = true;
> > > > +
> > > >    		free_extent_map(em);
> > > >    		em = NULL;
> > > > diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> > > > index 886749b39672..2433b169a4e6 100644
> > > > --- a/fs/btrfs/file-item.c
> > > > +++ b/fs/btrfs/file-item.c
> > > > @@ -401,6 +401,13 @@ blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
> > > >    		path->skip_locking = 1;
> > > >    	}
> > > > +	/* See the comment on btrfs_bio_ctrl->commit_root_csum. */
> > > > +	if (bbio->commit_root_csum) {
> > > > +		path->search_commit_root = 1;
> > > > +		path->skip_locking = 1;
> > > > +		path->need_commit_sem = 1;
> > > > +	}
> > > > +
> > > >    	while (bio_offset < orig_len) {
> > > >    		int count;
> > > >    		u64 cur_disk_bytenr = orig_disk_bytenr + bio_offset;
> > > 
> > 
>

Patch

diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index e48612340745..159f6a4425a6 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -48,6 +48,7 @@  struct btrfs_bio {
 			u8 *csum;
 			u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
 			struct bvec_iter saved_iter;
+			bool commit_root_csum;
 		};
 
 		/*
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cb0a39370d30..8544fe2302ff 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -108,6 +108,21 @@  struct btrfs_bio_ctrl {
 	 * This is to avoid touching ranges covered by compression/inline.
 	 */
 	unsigned long submit_bitmap;
+	/*
+	 * If this is a data read bio, we set this to true if it is safe to
+	 * search for csums in the commit root. Otherwise, it is set to false.
+	 *
+	 * This is an optimization to reduce the contention on the csum tree
+	 * root rwsem. Due to how rwsem is implemented, there is a possible
+	 * priority inversion where the readers holding the lock don't get
+	 * scheduled (say they're in a cgroup stuck in heavy reclaim) which
+	 * then blocks btrfs transactions. The only real help is to try to
+	 * reduce the contention on the lock as much as we can.
+	 *
+	 * Do this by detecting when a data read is reading data from an old
+	 * transaction so it's safe to look in the commit root.
+	 */
+	bool commit_root_csum;
 };
 
 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
@@ -770,6 +785,9 @@  static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
 			alloc_new_bio(inode, bio_ctrl, disk_bytenr,
 				      folio_pos(folio) + pg_offset);
 		}
+		if (btrfs_op(&bio_ctrl->bbio->bio) == BTRFS_MAP_READ && is_data_inode(inode))
+			bio_ctrl->bbio->commit_root_csum = bio_ctrl->commit_root_csum;
+
 
 		/* Cap to the current ordered extent boundary if there is one. */
 		if (len > bio_ctrl->len_to_oe_boundary) {
@@ -1048,6 +1066,9 @@  static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 		if (prev_em_start)
 			*prev_em_start = em->start;
 
+		if (em->generation < btrfs_get_fs_generation(fs_info))
+			bio_ctrl->commit_root_csum = true;
+
 		free_extent_map(em);
 		em = NULL;
 
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 886749b39672..2433b169a4e6 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -401,6 +401,13 @@  blk_status_t btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
 		path->skip_locking = 1;
 	}
 
+	/* See the comment on btrfs_bio_ctrl->commit_root_csum. */
+	if (bbio->commit_root_csum) {
+		path->search_commit_root = 1;
+		path->skip_locking = 1;
+		path->need_commit_sem = 1;
+	}
+
 	while (bio_offset < orig_len) {
 		int count;
 		u64 cur_disk_bytenr = orig_disk_bytenr + bio_offset;