diff mbox series

btrfs: fix block group item corruption after inserting new block group

Message ID 257397e78b93bcb888daec01c1e02be87cd4dee7.1678097398.git.fdmanana@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: fix block group item corruption after inserting new block group | expand

Commit Message

Filipe Manana March 6, 2023, 10:13 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

We can often end up inserting a block group item, for a new block group,
with a wrong value for the used bytes field.

This happens if for the new allocated block group, in the same transaction
that created the block group, we have tasks allocating extents from it as
well as tasks removing extents from it.

For example:

1) Task A creates a metadata block group X;

2) Two extents are allocated from block group X, so its "used" field is
   updated to 32K, and its "commit_used" field remains as 0;

3) Transaction commit starts, by some task B, and it enters
   btrfs_start_dirty_block_groups(). There it tries to update the block
   group item for block group X, which currently has its "used" field with
   a value of 32K. But that fails since the block group item was not yet
   inserted, and so on failure update_block_group_item() sets the
   "commit_used" field of the block group back to 0;

4) The block group item is inserted by task A, when for example
   btrfs_create_pending_block_groups() is called when releasing its
   transaction handle. This results in insert_block_group_item() inserting
   the block group item in the extent tree (or block group tree), with a
   "used" field having a value of 32K, but without updating the
   "commit_used" field in the block group, which remains with value of 0;

5) The two extents are freed from block X, so its "used" field changes
   from 32K to 0;

6) The transaction commit by task B continues, it enters
   btrfs_write_dirty_block_groups() which calls update_block_group_item()
   for block group X, and there it decides to skip the block group item
   update, because "used" has a value of 0 and "commit_used" has a value
   of 0 too.

   As a result, we end up with a block item having a 32K "used" field but
   no extents allocated from it.

When this issue happens, a btrfs check reports an error like this:

   [1/7] checking root items
   [2/7] checking extents
   block group [1104150528 1073741824] used 39796736 but extent items used 0
   ERROR: errors found in extent allocation tree or chunk allocation
   (...)

Fix this by making insert_block_group_item() update the block group's
"commit_used" field.

Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same")
CC: stable@vger.kernel.org # 6.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-group.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Comments

Qu Wenruo March 6, 2023, 10:23 a.m. UTC | #1
On 2023/3/6 18:13, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We can often end up inserting a block group item, for a new block group,
> with a wrong value for the used bytes field.
> 
> This happens if for the new allocated block group, in the same transaction
> that created the block group, we have tasks allocating extents from it as
> well as tasks removing extents from it.
> 
> For example:
> 
> 1) Task A creates a metadata block group X;
> 
> 2) Two extents are allocated from block group X, so its "used" field is
>     updated to 32K, and its "commit_used" field remains as 0;
> 
> 3) Transaction commit starts, by some task B, and it enters
>     btrfs_start_dirty_block_groups(). There it tries to update the block
>     group item for block group X, which currently has its "used" field with
>     a value of 32K. But that fails since the block group item was not yet
>     inserted, and so on failure update_block_group_item() sets the
>     "commit_used" field of the block group back to 0;
> 
> 4) The block group item is inserted by task A, when for example
>     btrfs_create_pending_block_groups() is called when releasing its
>     transaction handle. This results in insert_block_group_item() inserting
>     the block group item in the extent tree (or block group tree), with a
>     "used" field having a value of 32K, but without updating the
>     "commit_used" field in the block group, which remains with value of 0;
> 
> 5) The two extents are freed from block X, so its "used" field changes
>     from 32K to 0;
> 
> 6) The transaction commit by task B continues, it enters
>     btrfs_write_dirty_block_groups() which calls update_block_group_item()
>     for block group X, and there it decides to skip the block group item
>     update, because "used" has a value of 0 and "commit_used" has a value
>     of 0 too.
> 
>     As a result, we end up with a block item having a 32K "used" field but
>     no extents allocated from it.
> 
> When this issue happens, a btrfs check reports an error like this:
> 
>     [1/7] checking root items
>     [2/7] checking extents
>     block group [1104150528 1073741824] used 39796736 but extent items used 0
>     ERROR: errors found in extent allocation tree or chunk allocation
>     (...)
> 
> Fix this by making insert_block_group_item() update the block group's
> "commit_used" field.
> 
> Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same")
> CC: stable@vger.kernel.org # 6.2+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks for pinning down the bug, I hit it several times randomly but 
didn't even consider it's related to that commit.

The last several runs hitting can be found here in btrfs/073:

https://h.anonymoususers.xyz:8443/results/details/btrfs-rock5b-global/2023-03-05T18:37:09/index.html

Thanks,
Qu
> ---
>   fs/btrfs/block-group.c | 13 ++++++++++++-
>   1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index fd11d03a50ff..953e9547ca07 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -2493,18 +2493,29 @@ static int insert_block_group_item(struct btrfs_trans_handle *trans,
>   	struct btrfs_block_group_item bgi;
>   	struct btrfs_root *root = btrfs_block_group_root(fs_info);
>   	struct btrfs_key key;
> +	u64 old_commit_used;
> +	int ret;
>   
>   	spin_lock(&block_group->lock);
>   	btrfs_set_stack_block_group_used(&bgi, block_group->used);
>   	btrfs_set_stack_block_group_chunk_objectid(&bgi,
>   						   block_group->global_root_id);
>   	btrfs_set_stack_block_group_flags(&bgi, block_group->flags);
> +	old_commit_used = block_group->commit_used;
> +	block_group->commit_used = block_group->used;
>   	key.objectid = block_group->start;
>   	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
>   	key.offset = block_group->length;
>   	spin_unlock(&block_group->lock);
>   
> -	return btrfs_insert_item(trans, root, &key, &bgi, sizeof(bgi));
> +	ret = btrfs_insert_item(trans, root, &key, &bgi, sizeof(bgi));
> +	if (ret < 0) {
> +		spin_lock(&block_group->lock);
> +		block_group->commit_used = old_commit_used;
> +		spin_unlock(&block_group->lock);
> +	}
> +
> +	return ret;
>   }
>   
>   static int insert_dev_extent(struct btrfs_trans_handle *trans,
David Sterba March 6, 2023, 6:41 p.m. UTC | #2
On Mon, Mar 06, 2023 at 10:13:34AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We can often end up inserting a block group item, for a new block group,
> with a wrong value for the used bytes field.
> 
> This happens if for the new allocated block group, in the same transaction
> that created the block group, we have tasks allocating extents from it as
> well as tasks removing extents from it.
> 
> For example:
> 
> 1) Task A creates a metadata block group X;
> 
> 2) Two extents are allocated from block group X, so its "used" field is
>    updated to 32K, and its "commit_used" field remains as 0;
> 
> 3) Transaction commit starts, by some task B, and it enters
>    btrfs_start_dirty_block_groups(). There it tries to update the block
>    group item for block group X, which currently has its "used" field with
>    a value of 32K. But that fails since the block group item was not yet
>    inserted, and so on failure update_block_group_item() sets the
>    "commit_used" field of the block group back to 0;
> 
> 4) The block group item is inserted by task A, when for example
>    btrfs_create_pending_block_groups() is called when releasing its
>    transaction handle. This results in insert_block_group_item() inserting
>    the block group item in the extent tree (or block group tree), with a
>    "used" field having a value of 32K, but without updating the
>    "commit_used" field in the block group, which remains with value of 0;
> 
> 5) The two extents are freed from block X, so its "used" field changes
>    from 32K to 0;
> 
> 6) The transaction commit by task B continues, it enters
>    btrfs_write_dirty_block_groups() which calls update_block_group_item()
>    for block group X, and there it decides to skip the block group item
>    update, because "used" has a value of 0 and "commit_used" has a value
>    of 0 too.
> 
>    As a result, we end up with a block item having a 32K "used" field but
>    no extents allocated from it.
> 
> When this issue happens, a btrfs check reports an error like this:
> 
>    [1/7] checking root items
>    [2/7] checking extents
>    block group [1104150528 1073741824] used 39796736 but extent items used 0
>    ERROR: errors found in extent allocation tree or chunk allocation
>    (...)
> 
> Fix this by making insert_block_group_item() update the block group's
> "commit_used" field.
> 
> Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same")
> CC: stable@vger.kernel.org # 6.2+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Added to misc-next, thanks.
diff mbox series

Patch

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index fd11d03a50ff..953e9547ca07 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2493,18 +2493,29 @@  static int insert_block_group_item(struct btrfs_trans_handle *trans,
 	struct btrfs_block_group_item bgi;
 	struct btrfs_root *root = btrfs_block_group_root(fs_info);
 	struct btrfs_key key;
+	u64 old_commit_used;
+	int ret;
 
 	spin_lock(&block_group->lock);
 	btrfs_set_stack_block_group_used(&bgi, block_group->used);
 	btrfs_set_stack_block_group_chunk_objectid(&bgi,
 						   block_group->global_root_id);
 	btrfs_set_stack_block_group_flags(&bgi, block_group->flags);
+	old_commit_used = block_group->commit_used;
+	block_group->commit_used = block_group->used;
 	key.objectid = block_group->start;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 	key.offset = block_group->length;
 	spin_unlock(&block_group->lock);
 
-	return btrfs_insert_item(trans, root, &key, &bgi, sizeof(bgi));
+	ret = btrfs_insert_item(trans, root, &key, &bgi, sizeof(bgi));
+	if (ret < 0) {
+		spin_lock(&block_group->lock);
+		block_group->commit_used = old_commit_used;
+		spin_unlock(&block_group->lock);
+	}
+
+	return ret;
 }
 
 static int insert_dev_extent(struct btrfs_trans_handle *trans,