diff mbox series

[01/21] btrfs: fix race when refilling delayed refs block reserve

Message ID ad77c4035370a5db4e0b813da9eff73e52fd30c0.1694174371.git.fdmanana@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: updates to delayed refs accounting and space reservation | expand

Commit Message

Filipe Manana Sept. 8, 2023, 12:09 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

If we have two (or more) tasks attempting to refill the delayed refs block
reserve we can end up with the delayed block reserve being over reserved,
that is, with a reserved space greater than its size. If this happens, we
are holding to more reserved space than necessary, however it may or may
not be harmless. In case the delayed refs block reserve size later
increases to match or go beyond the reserved space, then nothing bad
happens, we end up simply avoiding doing a space reservation later at the
expense of doing it now. However if the delayed refs block reserve size
never increases to match its reserved size, then at unmount time we will
trigger the following warning from btrfs_release_global_block_rsv():

   WARN_ON(fs_info->delayed_block_rsv.reserved > 0);

The race happens like this:

1) The delayed refs block reserve has a size of 8M and a reserved space of
   6M for example;

2) Task A calls btrfs_delayed_refs_rsv_refill();

3) Task B also calls btrfs_delayed_refs_rsv_refill();

4) Task A sees there's a 2M difference between the size and the reserved
   space of the delayed refs rsv, so it will reserve 2M of space by
   calling btrfs_reserve_metadata_bytes();

5) Task B also sees that 2M difference, and like task A, it reserves
   another 2M of metadata space;

6) Both task A and task B increase the reserved space of block reserve
   by 2M, by calling btrfs_block_rsv_add_bytes(), so the block reserve
   ends up with a size of 8M and a reserved space of 10M;

7) As delayed refs are run and their space released, the delayed refs
   block reserve ends up with a size of 0 and a reserved space of 2M;

8) The fs is unmounted and the warning at btrfs_release_global_block_rsv()
   is triggered because the size the reserved space is not zero.

So fix this by checking if we still need to add space to the delayed refs
block reserve after reserving the metadata space, and if we don't, just
release that space.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/delayed-ref.c | 37 ++++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

Comments

Josef Bacik Sept. 8, 2023, 2:46 p.m. UTC | #1
On Fri, Sep 08, 2023 at 01:09:03PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> If we have two (or more) tasks attempting to refill the delayed refs block
> reserve we can end up with the delayed block reserve being over reserved,
> that is, with a reserved space greater than its size. If this happens, we
> are holding to more reserved space than necessary, however it may or may
> not be harmless. In case the delayed refs block reserve size later
> increases to match or go beyond the reserved space, then nothing bad
> happens, we end up simply avoiding doing a space reservation later at the
> expense of doing it now. However if the delayed refs block reserve size
> never increases to match its reserved size, then at unmount time we will
> trigger the following warning from btrfs_release_global_block_rsv():
> 
>    WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
> 
> The race happens like this:
> 
> 1) The delayed refs block reserve has a size of 8M and a reserved space of
>    6M for example;
> 
> 2) Task A calls btrfs_delayed_refs_rsv_refill();
> 
> 3) Task B also calls btrfs_delayed_refs_rsv_refill();
> 
> 4) Task A sees there's a 2M difference between the size and the reserved
>    space of the delayed refs rsv, so it will reserve 2M of space by
>    calling btrfs_reserve_metadata_bytes();
> 
> 5) Task B also sees that 2M difference, and like task A, it reserves
>    another 2M of metadata space;
> 
> 6) Both task A and task B increase the reserved space of block reserve
>    by 2M, by calling btrfs_block_rsv_add_bytes(), so the block reserve
>    ends up with a size of 8M and a reserved space of 10M;
> 
> 7) As delayed refs are run and their space released, the delayed refs
>    block reserve ends up with a size of 0 and a reserved space of 2M;
> 

This part shouldn't happen, delayed refs space only manipulates
delayed_refs_rsv->size, so when we drop their space, we do

btrfs_delayed_refs_rsv_release
 -> btrfs_block_rsv_release
  -> block_rsv_release_bytes

which does

        spin_lock(&block_rsv->lock);
        if (num_bytes == (u64)-1) {
                num_bytes = block_rsv->size;
                qgroup_to_release = block_rsv->qgroup_rsv_size;
        }
        block_rsv->size -= num_bytes;
        if (block_rsv->reserved >= block_rsv->size) {
                num_bytes = block_rsv->reserved - block_rsv->size;
                block_rsv->reserved = block_rsv->size;
                block_rsv->full = true;
        }

so if A and B race and the reservation is now much larger than size, the extra
will be cut off at the end when we call btrfs_delayed_refs_rsv_release.

Now I'm all for not holding the over-reserve for that long, but this analysis is
incorrect, we would have to somehow do the btrfs_delayed_refs_rsv_refill() and
then not call btrfs_delayed_refs_rsv_release() at some point.

This *could* happen I suppose by starting a trans handle and not modifying
anything while the delayed refs are run and finish before we do our reserve, and
in that case we didn't generate a delayed ref that could later on call
btrfs_delayed_refs_rsv_release() to clear the excess reservation, but that's a
decidedly different problem with a different solution.  Thanks,

Josef
Filipe Manana Sept. 8, 2023, 5:21 p.m. UTC | #2
On Fri, Sep 8, 2023 at 3:46 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Fri, Sep 08, 2023 at 01:09:03PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > If we have two (or more) tasks attempting to refill the delayed refs block
> > reserve we can end up with the delayed block reserve being over reserved,
> > that is, with a reserved space greater than its size. If this happens, we
> > are holding to more reserved space than necessary, however it may or may
> > not be harmless. In case the delayed refs block reserve size later
> > increases to match or go beyond the reserved space, then nothing bad
> > happens, we end up simply avoiding doing a space reservation later at the
> > expense of doing it now. However if the delayed refs block reserve size
> > never increases to match its reserved size, then at unmount time we will
> > trigger the following warning from btrfs_release_global_block_rsv():
> >
> >    WARN_ON(fs_info->delayed_block_rsv.reserved > 0);
> >
> > The race happens like this:
> >
> > 1) The delayed refs block reserve has a size of 8M and a reserved space of
> >    6M for example;
> >
> > 2) Task A calls btrfs_delayed_refs_rsv_refill();
> >
> > 3) Task B also calls btrfs_delayed_refs_rsv_refill();
> >
> > 4) Task A sees there's a 2M difference between the size and the reserved
> >    space of the delayed refs rsv, so it will reserve 2M of space by
> >    calling btrfs_reserve_metadata_bytes();
> >
> > 5) Task B also sees that 2M difference, and like task A, it reserves
> >    another 2M of metadata space;
> >
> > 6) Both task A and task B increase the reserved space of block reserve
> >    by 2M, by calling btrfs_block_rsv_add_bytes(), so the block reserve
> >    ends up with a size of 8M and a reserved space of 10M;
> >
> > 7) As delayed refs are run and their space released, the delayed refs
> >    block reserve ends up with a size of 0 and a reserved space of 2M;
> >
>
> This part shouldn't happen, delayed refs space only manipulates
> delayed_refs_rsv->size, so when we drop their space, we do
>
> btrfs_delayed_refs_rsv_release
>  -> btrfs_block_rsv_release
>   -> block_rsv_release_bytes
>
> which does
>
>         spin_lock(&block_rsv->lock);
>         if (num_bytes == (u64)-1) {
>                 num_bytes = block_rsv->size;
>                 qgroup_to_release = block_rsv->qgroup_rsv_size;
>         }
>         block_rsv->size -= num_bytes;
>         if (block_rsv->reserved >= block_rsv->size) {
>                 num_bytes = block_rsv->reserved - block_rsv->size;
>                 block_rsv->reserved = block_rsv->size;
>                 block_rsv->full = true;
>         }
>
> so if A and B race and the reservation is now much larger than size, the extra
> will be cut off at the end when we call btrfs_delayed_refs_rsv_release.
>
> Now I'm all for not holding the over-reserve for that long, but this analysis is
> incorrect, we would have to somehow do the btrfs_delayed_refs_rsv_refill() and
> then not call btrfs_delayed_refs_rsv_release() at some point.
>
> This *could* happen I suppose by starting a trans handle and not modifying
> anything while the delayed refs are run and finish before we do our reserve, and
> in that case we didn't generate a delayed ref that could later on call
> btrfs_delayed_refs_rsv_release() to clear the excess reservation, but that's a
> decidedly different problem with a different solution.  Thanks,

Oh yes, let me fix that. Holding the extra space for longer than
necessary was actually the thing I noticed first.

Fixed in v2 thanks.
>
> Josef
diff mbox series

Patch

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 6a13cf00218b..1043f66cc130 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -163,6 +163,8 @@  int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
 	struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
 	u64 limit = btrfs_calc_delayed_ref_bytes(fs_info, 1);
 	u64 num_bytes = 0;
+	u64 refilled_bytes;
+	u64 to_free;
 	int ret = -ENOSPC;
 
 	spin_lock(&block_rsv->lock);
@@ -178,9 +180,38 @@  int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
 	ret = btrfs_reserve_metadata_bytes(fs_info, block_rsv, num_bytes, flush);
 	if (ret)
 		return ret;
-	btrfs_block_rsv_add_bytes(block_rsv, num_bytes, false);
-	trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
-				      0, num_bytes, 1);
+
+	/*
+	 * We may have raced with someone else, so check again if we the block
+	 * reserve is still not full and release any excess space.
+	 */
+	spin_lock(&block_rsv->lock);
+	if (block_rsv->reserved < block_rsv->size) {
+		u64 needed = block_rsv->size - block_rsv->reserved;
+
+		if (num_bytes >= needed) {
+			block_rsv->reserved += needed;
+			block_rsv->full = true;
+			to_free = num_bytes - needed;
+			refilled_bytes = needed;
+		} else {
+			block_rsv->reserved += num_bytes;
+			to_free = 0;
+			refilled_bytes = num_bytes;
+		}
+	} else {
+		to_free = num_bytes;
+		refilled_bytes = 0;
+	}
+	spin_unlock(&block_rsv->lock);
+
+	if (to_free > 0)
+		btrfs_space_info_free_bytes_may_use(fs_info, block_rsv->space_info,
+						    to_free);
+
+	if (refilled_bytes > 0)
+		trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv", 0,
+					      refilled_bytes, 1);
 	return 0;
 }