Message ID | f5eda1ba8b7a776d3407d30939078b63d02aaff4.1693825574.git.fdmanana@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: fix race between finishing block group creation and its item update | expand |
On 2023/9/4 19:10, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Commit 675dfe1223a6 ("btrfs: fix block group item corruption after > inserting new block group") fixed one race that resulted in not persisting > a block group's item when its "used" bytes field decreases to zero. > However there's another race that can happen in a much shorter time window > that results in the same problem. The following sequence of steps explains > how it can happen: > > 1) Task A creates a metadata block group X, its "used" and "commit_used" > fields are initialized to 0; > > 2) Two extents are allocated from block group X, so its "used" field is > updated to 32K, and its "commit_used" field remains as 0; > > 3) Transaction commit starts, by some task B, and it enters > btrfs_start_dirty_block_groups(). There it tries to update the block > group item for block group X, which currently has its "used" field with > a value of 32K and its "commited_used" field with a value of 0. However > that fails since the block group item was not yet inserted, so at > update_block_group_item(), the btrfs_search_slot() call returns 1, and > then we set 'ret' to -ENOENT. Before jumping to the label 'fail'... > > 4) The block group item is inserted by task A, when for example > btrfs_create_pending_block_groups() is called when releasing its > transaction handle. This results in insert_block_group_item() inserting > the block group item in the extent tree (or block group tree), with a > "used" field having a value of 32K and setting "commit_used", in struct > btrfs_block_group, to the same value (32K); > > 5) Task B jumps to the 'fail' label and then resets the "commit_used" > field to 0. At btrfs_start_dirty_block_groups(), because -ENOENT was > returned from update_block_group_item(), we add the block group again > to the list of dirty block groups, so that we will try again in the > critical section of the transaction commit when calling > btrfs_write_dirty_block_groups(); > > 6) Later the two extents from block group X are freed, so its "used" field > becomes 0; > > 7) If no more extents are allocated from block group X before we get into > btrfs_write_dirty_block_groups(), then when we call > update_block_group_item() again for block group X, we will not update > the block group item to reflect that it has 0 bytes used, because the > "used" and "commit_used" fields in struct btrfs_block_group have the > same value, a value of 0. > > As a result after committing the transaction we have an empty block > group with its block group item having a 32K value for its "used" field. > This will trigger errors from fsck ("btrfs check" command) and after > mounting again the fs, the cleaner kthread will not automatically delete > the empty block group, since its "used" field is not 0. Possibly there > are other issues due to this incosistency. > > When this issue happens, the error reported by fsck is like this: > > [1/7] checking root items > [2/7] checking extents > block group [1104150528 1073741824] used 39796736 but extent items used 0 > ERROR: errors found in extent allocation tree or chunk allocation > (...) > > So fix this by not resetting the "commit_used" field of a block group when > we don't find the block group item at update_block_group_item(). > > Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same") > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Although considering we have hit at least two bugs around "commit_used", can we have a more generic way like setting "commit_used" to some impossible values (e.g, U64_MAX) so that the bg is ensured to be updated? Thanks, Qu > --- > fs/btrfs/block-group.c | 12 ++++++++++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c > index 0cb1dee965a0..b2e5107b7cec 100644 > --- a/fs/btrfs/block-group.c > +++ b/fs/btrfs/block-group.c > @@ -3028,8 +3028,16 @@ static int update_block_group_item(struct btrfs_trans_handle *trans, > btrfs_mark_buffer_dirty(leaf); > fail: > btrfs_release_path(path); > - /* We didn't update the block group item, need to revert @commit_used. */ > - if (ret < 0) { > + /* > + * We didn't update the block group item, need to revert commit_used > + * unless the block group item didn't exist yet - this is to prevent a > + * race with a concurrent insertion of the block group item, with > + * insert_block_group_item(), that happened just after we attempted to > + * update. In that case we would reset commit_used to 0 just after the > + * insertion set it to a value greater than 0 - if the block group later > + * becomes with 0 used bytes, we would incorrectly skip its update. > + */ > + if (ret < 0 && ret != -ENOENT) { > spin_lock(&cache->lock); > cache->commit_used = old_commit_used; > spin_unlock(&cache->lock);
On Mon, Sep 4, 2023 at 1:22 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > On 2023/9/4 19:10, fdmanana@kernel.org wrote: > > From: Filipe Manana <fdmanana@suse.com> > > > > Commit 675dfe1223a6 ("btrfs: fix block group item corruption after > > inserting new block group") fixed one race that resulted in not persisting > > a block group's item when its "used" bytes field decreases to zero. > > However there's another race that can happen in a much shorter time window > > that results in the same problem. The following sequence of steps explains > > how it can happen: > > > > 1) Task A creates a metadata block group X, its "used" and "commit_used" > > fields are initialized to 0; > > > > 2) Two extents are allocated from block group X, so its "used" field is > > updated to 32K, and its "commit_used" field remains as 0; > > > > 3) Transaction commit starts, by some task B, and it enters > > btrfs_start_dirty_block_groups(). There it tries to update the block > > group item for block group X, which currently has its "used" field with > > a value of 32K and its "commited_used" field with a value of 0. However > > that fails since the block group item was not yet inserted, so at > > update_block_group_item(), the btrfs_search_slot() call returns 1, and > > then we set 'ret' to -ENOENT. Before jumping to the label 'fail'... > > > > 4) The block group item is inserted by task A, when for example > > btrfs_create_pending_block_groups() is called when releasing its > > transaction handle. This results in insert_block_group_item() inserting > > the block group item in the extent tree (or block group tree), with a > > "used" field having a value of 32K and setting "commit_used", in struct > > btrfs_block_group, to the same value (32K); > > > > 5) Task B jumps to the 'fail' label and then resets the "commit_used" > > field to 0. At btrfs_start_dirty_block_groups(), because -ENOENT was > > returned from update_block_group_item(), we add the block group again > > to the list of dirty block groups, so that we will try again in the > > critical section of the transaction commit when calling > > btrfs_write_dirty_block_groups(); > > > > 6) Later the two extents from block group X are freed, so its "used" field > > becomes 0; > > > > 7) If no more extents are allocated from block group X before we get into > > btrfs_write_dirty_block_groups(), then when we call > > update_block_group_item() again for block group X, we will not update > > the block group item to reflect that it has 0 bytes used, because the > > "used" and "commit_used" fields in struct btrfs_block_group have the > > same value, a value of 0. > > > > As a result after committing the transaction we have an empty block > > group with its block group item having a 32K value for its "used" field. > > This will trigger errors from fsck ("btrfs check" command) and after > > mounting again the fs, the cleaner kthread will not automatically delete > > the empty block group, since its "used" field is not 0. Possibly there > > are other issues due to this incosistency. > > > > When this issue happens, the error reported by fsck is like this: > > > > [1/7] checking root items > > [2/7] checking extents > > block group [1104150528 1073741824] used 39796736 but extent items used 0 > > ERROR: errors found in extent allocation tree or chunk allocation > > (...) > > > > So fix this by not resetting the "commit_used" field of a block group when > > we don't find the block group item at update_block_group_item(). > > > > Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same") > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > Reviewed-by: Qu Wenruo <wqu@suse.com> > > Although considering we have hit at least two bugs around "commit_used", > can we have a more generic way like setting "commit_used" to some > impossible values (e.g, U64_MAX) so that the bg is ensured to be updated? I don't see how initializing commit_used to U64_MAX, or anything else, would have prevented any of these two bugs... > > Thanks, > Qu > > --- > > fs/btrfs/block-group.c | 12 ++++++++++-- > > 1 file changed, 10 insertions(+), 2 deletions(-) > > > > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c > > index 0cb1dee965a0..b2e5107b7cec 100644 > > --- a/fs/btrfs/block-group.c > > +++ b/fs/btrfs/block-group.c > > @@ -3028,8 +3028,16 @@ static int update_block_group_item(struct btrfs_trans_handle *trans, > > btrfs_mark_buffer_dirty(leaf); > > fail: > > btrfs_release_path(path); > > - /* We didn't update the block group item, need to revert @commit_used. */ > > - if (ret < 0) { > > + /* > > + * We didn't update the block group item, need to revert commit_used > > + * unless the block group item didn't exist yet - this is to prevent a > > + * race with a concurrent insertion of the block group item, with > > + * insert_block_group_item(), that happened just after we attempted to > > + * update. In that case we would reset commit_used to 0 just after the > > + * insertion set it to a value greater than 0 - if the block group later > > + * becomes with 0 used bytes, we would incorrectly skip its update. > > + */ > > + if (ret < 0 && ret != -ENOENT) { > > spin_lock(&cache->lock); > > cache->commit_used = old_commit_used; > > spin_unlock(&cache->lock);
On Mon, Sep 04, 2023 at 12:10:31PM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > Commit 675dfe1223a6 ("btrfs: fix block group item corruption after > inserting new block group") fixed one race that resulted in not persisting > a block group's item when its "used" bytes field decreases to zero. > However there's another race that can happen in a much shorter time window > that results in the same problem. The following sequence of steps explains > how it can happen: > > 1) Task A creates a metadata block group X, its "used" and "commit_used" > fields are initialized to 0; > > 2) Two extents are allocated from block group X, so its "used" field is > updated to 32K, and its "commit_used" field remains as 0; > > 3) Transaction commit starts, by some task B, and it enters > btrfs_start_dirty_block_groups(). There it tries to update the block > group item for block group X, which currently has its "used" field with > a value of 32K and its "commited_used" field with a value of 0. However > that fails since the block group item was not yet inserted, so at > update_block_group_item(), the btrfs_search_slot() call returns 1, and > then we set 'ret' to -ENOENT. Before jumping to the label 'fail'... > > 4) The block group item is inserted by task A, when for example > btrfs_create_pending_block_groups() is called when releasing its > transaction handle. This results in insert_block_group_item() inserting > the block group item in the extent tree (or block group tree), with a > "used" field having a value of 32K and setting "commit_used", in struct > btrfs_block_group, to the same value (32K); > > 5) Task B jumps to the 'fail' label and then resets the "commit_used" > field to 0. At btrfs_start_dirty_block_groups(), because -ENOENT was > returned from update_block_group_item(), we add the block group again > to the list of dirty block groups, so that we will try again in the > critical section of the transaction commit when calling > btrfs_write_dirty_block_groups(); > > 6) Later the two extents from block group X are freed, so its "used" field > becomes 0; > > 7) If no more extents are allocated from block group X before we get into > btrfs_write_dirty_block_groups(), then when we call > update_block_group_item() again for block group X, we will not update > the block group item to reflect that it has 0 bytes used, because the > "used" and "commit_used" fields in struct btrfs_block_group have the > same value, a value of 0. > > As a result after committing the transaction we have an empty block > group with its block group item having a 32K value for its "used" field. > This will trigger errors from fsck ("btrfs check" command) and after > mounting again the fs, the cleaner kthread will not automatically delete > the empty block group, since its "used" field is not 0. Possibly there > are other issues due to this incosistency. > > When this issue happens, the error reported by fsck is like this: > > [1/7] checking root items > [2/7] checking extents > block group [1104150528 1073741824] used 39796736 but extent items used 0 > ERROR: errors found in extent allocation tree or chunk allocation > (...) > > So fix this by not resetting the "commit_used" field of a block group when > we don't find the block group item at update_block_group_item(). > > Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same") > Signed-off-by: Filipe Manana <fdmanana@suse.com> Added to misc-next, thanks.
On 2023/9/4 21:26, Filipe Manana wrote: > On Mon, Sep 4, 2023 at 1:22 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> >> >> >> On 2023/9/4 19:10, fdmanana@kernel.org wrote: >>> From: Filipe Manana <fdmanana@suse.com> >>> >>> Commit 675dfe1223a6 ("btrfs: fix block group item corruption after >>> inserting new block group") fixed one race that resulted in not persisting >>> a block group's item when its "used" bytes field decreases to zero. >>> However there's another race that can happen in a much shorter time window >>> that results in the same problem. The following sequence of steps explains >>> how it can happen: >>> >>> 1) Task A creates a metadata block group X, its "used" and "commit_used" >>> fields are initialized to 0; >>> >>> 2) Two extents are allocated from block group X, so its "used" field is >>> updated to 32K, and its "commit_used" field remains as 0; >>> >>> 3) Transaction commit starts, by some task B, and it enters >>> btrfs_start_dirty_block_groups(). There it tries to update the block >>> group item for block group X, which currently has its "used" field with >>> a value of 32K and its "commited_used" field with a value of 0. However >>> that fails since the block group item was not yet inserted, so at >>> update_block_group_item(), the btrfs_search_slot() call returns 1, and >>> then we set 'ret' to -ENOENT. Before jumping to the label 'fail'... >>> >>> 4) The block group item is inserted by task A, when for example >>> btrfs_create_pending_block_groups() is called when releasing its >>> transaction handle. This results in insert_block_group_item() inserting >>> the block group item in the extent tree (or block group tree), with a >>> "used" field having a value of 32K and setting "commit_used", in struct >>> btrfs_block_group, to the same value (32K); >>> >>> 5) Task B jumps to the 'fail' label and then resets the "commit_used" >>> field to 0. At btrfs_start_dirty_block_groups(), because -ENOENT was >>> returned from update_block_group_item(), we add the block group again >>> to the list of dirty block groups, so that we will try again in the >>> critical section of the transaction commit when calling >>> btrfs_write_dirty_block_groups(); >>> >>> 6) Later the two extents from block group X are freed, so its "used" field >>> becomes 0; >>> >>> 7) If no more extents are allocated from block group X before we get into >>> btrfs_write_dirty_block_groups(), then when we call >>> update_block_group_item() again for block group X, we will not update >>> the block group item to reflect that it has 0 bytes used, because the >>> "used" and "commit_used" fields in struct btrfs_block_group have the >>> same value, a value of 0. >>> >>> As a result after committing the transaction we have an empty block >>> group with its block group item having a 32K value for its "used" field. >>> This will trigger errors from fsck ("btrfs check" command) and after >>> mounting again the fs, the cleaner kthread will not automatically delete >>> the empty block group, since its "used" field is not 0. Possibly there >>> are other issues due to this incosistency. >>> >>> When this issue happens, the error reported by fsck is like this: >>> >>> [1/7] checking root items >>> [2/7] checking extents >>> block group [1104150528 1073741824] used 39796736 but extent items used 0 >>> ERROR: errors found in extent allocation tree or chunk allocation >>> (...) >>> >>> So fix this by not resetting the "commit_used" field of a block group when >>> we don't find the block group item at update_block_group_item(). >>> >>> Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same") >>> Signed-off-by: Filipe Manana <fdmanana@suse.com> >> >> Reviewed-by: Qu Wenruo <wqu@suse.com> >> >> Although considering we have hit at least two bugs around "commit_used", >> can we have a more generic way like setting "commit_used" to some >> impossible values (e.g, U64_MAX) so that the bg is ensured to be updated? > > I don't see how initializing commit_used to U64_MAX, or anything else, > would have prevented any of these two bugs... I'm talking about something like this diff: diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 0cb1dee965a0..c9b582bb7112 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -3031,7 +3031,7 @@ static int update_block_group_item(struct btrfs_trans_handle *trans, /* We didn't update the block group item, need to revert @commit_used. */ if (ret < 0) { spin_lock(&cache->lock); - cache->commit_used = old_commit_used; + cache->commit_used = U64_MAX; spin_unlock(&cache->lock); } return ret; The analyze shows that at step 7), the original code would skip the update as @commit_used is the same as @used, all 0. But with above diff, we would be forced to update the block group item no matter what. And with this change, if there is some unexpected return value we didn't take into consideration, it would still act as a safenet, as the next time the block group would still be updated. Thanks, Qu > >> >> Thanks, >> Qu >>> --- >>> fs/btrfs/block-group.c | 12 ++++++++++-- >>> 1 file changed, 10 insertions(+), 2 deletions(-) >>> >>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c >>> index 0cb1dee965a0..b2e5107b7cec 100644 >>> --- a/fs/btrfs/block-group.c >>> +++ b/fs/btrfs/block-group.c >>> @@ -3028,8 +3028,16 @@ static int update_block_group_item(struct btrfs_trans_handle *trans, >>> btrfs_mark_buffer_dirty(leaf); >>> fail: >>> btrfs_release_path(path); >>> - /* We didn't update the block group item, need to revert @commit_used. */ >>> - if (ret < 0) { >>> + /* >>> + * We didn't update the block group item, need to revert commit_used >>> + * unless the block group item didn't exist yet - this is to prevent a >>> + * race with a concurrent insertion of the block group item, with >>> + * insert_block_group_item(), that happened just after we attempted to >>> + * update. In that case we would reset commit_used to 0 just after the >>> + * insertion set it to a value greater than 0 - if the block group later >>> + * becomes with 0 used bytes, we would incorrectly skip its update. >>> + */ >>> + if (ret < 0 && ret != -ENOENT) { >>> spin_lock(&cache->lock); >>> cache->commit_used = old_commit_used; >>> spin_unlock(&cache->lock);
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 0cb1dee965a0..b2e5107b7cec 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -3028,8 +3028,16 @@ static int update_block_group_item(struct btrfs_trans_handle *trans, btrfs_mark_buffer_dirty(leaf); fail: btrfs_release_path(path); - /* We didn't update the block group item, need to revert @commit_used. */ - if (ret < 0) { + /* + * We didn't update the block group item, need to revert commit_used + * unless the block group item didn't exist yet - this is to prevent a + * race with a concurrent insertion of the block group item, with + * insert_block_group_item(), that happened just after we attempted to + * update. In that case we would reset commit_used to 0 just after the + * insertion set it to a value greater than 0 - if the block group later + * becomes with 0 used bytes, we would incorrectly skip its update. + */ + if (ret < 0 && ret != -ENOENT) { spin_lock(&cache->lock); cache->commit_used = old_commit_used; spin_unlock(&cache->lock);