[1/6] btrfs: fix double accounting of ordered extents during errors

Message ID 2faab8a96c6dd2a414a96e4cebae97ecbddf021d.1730269807.git.wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: sector size < page size enhancement

Commit Message

Qu Wenruo Oct. 30, 2024, 6:33 a.m. UTC
[BUG]
Btrfs will fail generic/750 randomly if its sector size is smaller than
page size.

One of the warnings looks like this:

 ------------[ cut here ]------------
 WARNING: CPU: 1 PID: 90263 at fs/btrfs/ordered-data.c:360 can_finish_ordered_extent+0x33c/0x390 [btrfs]
 CPU: 1 UID: 0 PID: 90263 Comm: kworker/u18:1 Tainted: G           OE      6.12.0-rc3-custom+ #79
 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
 pc : can_finish_ordered_extent+0x33c/0x390 [btrfs]
 lr : can_finish_ordered_extent+0xdc/0x390 [btrfs]
 Call trace:
  can_finish_ordered_extent+0x33c/0x390 [btrfs]
  btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
  extent_writepage+0xfc/0x338 [btrfs]
  extent_write_cache_pages+0x1d4/0x4b8 [btrfs]
  btrfs_writepages+0x94/0x158 [btrfs]
  do_writepages+0x74/0x190
  filemap_fdatawrite_wbc+0x88/0xc8
  start_delalloc_inodes+0x180/0x3b0 [btrfs]
  btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
  shrink_delalloc+0x11c/0x280 [btrfs]
  flush_space+0x27c/0x310 [btrfs]
  btrfs_async_reclaim_metadata_space+0xcc/0x208 [btrfs]
  process_one_work+0x228/0x670
  worker_thread+0x1bc/0x360
  kthread+0x100/0x118
  ret_from_fork+0x10/0x20
 irq event stamp: 9784200
 hardirqs last  enabled at (9784199): [<ffffd21ec54dc01c>] _raw_spin_unlock_irqrestore+0x74/0x80
 hardirqs last disabled at (9784200): [<ffffd21ec54db374>] _raw_spin_lock_irqsave+0x8c/0xa0
 softirqs last  enabled at (9784148): [<ffffd21ec472ff44>] handle_softirqs+0x45c/0x4b0
 softirqs last disabled at (9784141): [<ffffd21ec46d01e4>] __do_softirq+0x1c/0x28
 ---[ end trace 0000000000000000 ]---
 BTRFS critical (device dm-2): bad ordered extent accounting, root=5 ino=1492 OE offset=1654784 OE len=57344 to_dec=49152 left=0

[CAUSE]
The function btrfs_mark_ordered_io_finished() is called for marking all
ordered extents in the page range as finished, for error handling.

But for sector size < page size cases, we can have multiple ordered
extents in one page.

If extent_writepage_io() failed (the only possible case is
submit_one_sector() failed to grab an extent map), then the call site
inside extent_writepage() will call btrfs_mark_ordered_io_finished() to
finish the created ordered extents.

However, some ranges of the ordered extent may already have been
submitted. When btrfs_mark_ordered_io_finished() is then called on the
same range, those ranges get accounted twice.
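
To make the double accounting concrete, here is a minimal userspace
sketch. It is not the kernel code: struct toy_ordered_extent,
toy_finish_range() and toy_replay() are hypothetical names, and only the
remaining-bytes counter of an ordered extent is modelled, assuming 4K
sectors and one ordered extent of 4 sectors.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for an ordered extent; only the byte counter is modelled. */
struct toy_ordered_extent {
	uint64_t file_offset;
	uint64_t num_bytes;
	uint64_t bytes_left;
};

/*
 * Account @len finished bytes against @oe.
 * Returns false when more bytes are finished than remain, which is the
 * kind of condition the "bad ordered extent accounting" message reports.
 */
static bool toy_finish_range(struct toy_ordered_extent *oe, uint64_t len)
{
	if (len > oe->bytes_left) {
		oe->bytes_left = 0;
		return false;
	}
	oe->bytes_left -= len;
	return true;
}

/* Replay the failure pattern; returns true if the accounting stayed sane. */
static bool toy_replay(bool skip_submitted)
{
	const uint64_t sectorsize = 4096;
	struct toy_ordered_extent oe = { 0, 4 * sectorsize, 4 * sectorsize };
	bool ok = true;

	/* Sectors 0 and 1 were submitted and accounted on completion. */
	ok &= toy_finish_range(&oe, 2 * sectorsize);

	/* Sector 2 fails to submit, so the error path cleans up. */
	if (skip_submitted)
		ok &= toy_finish_range(&oe, 2 * sectorsize); /* sectors 2-3 only */
	else
		ok &= toy_finish_range(&oe, 4 * sectorsize); /* whole range again */

	return ok;
}
```

In this model toy_replay(false) finishes 6 sectors against a 4-sector
ordered extent and reports bad accounting, while toy_replay(true) only
finishes the unsubmitted tail and stays consistent.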

[FIX]
- Introduce a new member btrfs_bio_ctrl::last_submitted
  This will track the end of the last sector submitted through
  extent_writepage_io().

  So for the above extent_writepage() case, we will know exactly which
  sectors are submitted and should not do the ordered extent accounting.

- Introduce a helper cleanup_ordered_extents()
  This will do a sector-by-sector cleanup, taking
  btrfs_bio_ctrl::last_submitted and btrfs_bio_ctrl::submit_bitmap into
  consideration.

  @last_submitted is used to avoid double accounting on the already
  submitted ranges, while @submit_bitmap is used to avoid touching
  ranges going through compression.
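
As a rough illustration of which sectors the cleanup walk touches, the
following userspace sketch mirrors the loop shape of the helper. The
names toy_cleanup(), TOY_SECTORSIZE_BITS and TOY_SECTORS_PER_PAGE are
hypothetical, and a 64K page with 4K sectors is assumed; instead of
finishing ordered extents it just records the visited sector indexes.

```c
#include <stdint.h>

#define TOY_SECTORSIZE_BITS	12u	/* assumed 4K sectors */
#define TOY_SECTORS_PER_PAGE	16u	/* assumed 64K page */

/*
 * Record the sectors the error path would finish: those starting at
 * @file_pos, ending before @file_pos + @num_bytes, and with their bit
 * set in @bitmap. @out must hold TOY_SECTORS_PER_PAGE entries.
 * Returns the number of sectors recorded.
 */
static unsigned int toy_cleanup(uint64_t folio_pos, uint64_t file_pos,
				uint64_t num_bytes, unsigned long bitmap,
				unsigned int *out)
{
	unsigned int bit = (unsigned int)((file_pos - folio_pos) >>
					  TOY_SECTORSIZE_BITS);
	unsigned int n = 0;

	for (; bit < TOY_SECTORS_PER_PAGE; bit++) {
		uint64_t cur_pos = folio_pos +
				   ((uint64_t)bit << TOY_SECTORSIZE_BITS);

		/* Bit cleared: range is handled elsewhere (e.g. compression). */
		if (!(bitmap & (1UL << bit)))
			continue;
		if (cur_pos >= file_pos + num_bytes)
			break;
		out[n++] = bit;
	}
	return n;
}
```

For example, with sectors 0-2 already submitted and bit 5 cleared in the
bitmap, the walk starts at sector 3 and skips sector 5, so neither the
submitted range nor the compressed range is accounted again.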

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 41 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)

Comments

Qu Wenruo Nov. 24, 2024, 7:31 a.m. UTC | #1
I know this is part of the subpage patches, but this is really a bug fix
for the existing subpage handling.

I would appreciate it if anyone could give this a review.

Thanks,
Qu

David Sterba Nov. 26, 2024, 4:08 p.m. UTC | #2
On Sun, Nov 24, 2024 at 06:01:27PM +1030, Qu Wenruo wrote:
> I know this is part of the subpage patches, but this is really a bug fix 
> for the existing subpage handling.
> 
> Appreciate if anyone can give this a review.

Looks correct to me. One suggestion: clean up the parameters by passing
bio_ctrl and reading last_submitted and the bitmap directly, something
like this:

cleanup_ordered_extents(BTRFS_I(inode), folio, &bio_ctrl, cur_end + 1);

replacing the parameters with the values in the function. On second
thought, though, the explicit expressions like
"page_start + PAGE_SIZE - bio_ctrl->last_submitted"
and "cur_end + 1 - bio_ctrl.last_submitted" make it a bit more readable.
Up to you.
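
For illustration only, the suggested interface could derive the range
inside the helper roughly as below. This is a userspace sketch with
hypothetical names (struct toy_bio_ctrl, struct toy_range,
toy_cleanup_range()), not a proposed kernel change; it only shows that
the start and length follow from last_submitted and an exclusive end.

```c
#include <stdint.h>

/* Hypothetical stand-in carrying only the two fields the cleanup needs. */
struct toy_bio_ctrl {
	unsigned long submit_bitmap;
	uint64_t last_submitted;
};

struct toy_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Derive the cleanup range inside the helper: it starts at the first
 * unsubmitted byte and runs up to @end (exclusive).
 */
static struct toy_range toy_cleanup_range(const struct toy_bio_ctrl *ctrl,
					  uint64_t end)
{
	struct toy_range r = { ctrl->last_submitted,
			       end - ctrl->last_submitted };

	return r;
}
```

The caller then passes only the control structure and the end of the
range, at the cost of hiding the length expression from the call site.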
Qu Wenruo Nov. 26, 2024, 8:19 p.m. UTC | #3
On 2024/11/27 02:38, David Sterba wrote:
> On Sun, Nov 24, 2024 at 06:01:27PM +1030, Qu Wenruo wrote:
>> I know this is part of the subpage patches, but this is really a bug fix
>> for the existing subpage handling.
>>
>> Appreciate if anyone can give this a review.
> 
> Looks correct to me. One suggestion: clean up the parameters by passing
> bio_ctrl and reading last_submitted and the bitmap directly, something
> like this:
> 
> cleanup_ordered_extents(BTRFS_I(inode), folio, &bio_ctrl, cur_end + 1);
> 
> replacing the parameters with the values in the function. On second
> thought, though, the explicit expressions like
> "page_start + PAGE_SIZE - bio_ctrl->last_submitted"
> and "cur_end + 1 - bio_ctrl.last_submitted" make it a bit more readable.
> Up to you.

This patch is superseded by this series:

https://lore.kernel.org/linux-btrfs/cover.1732596971.git.wqu@suse.com/

However, I'm still hitting hangs where some ordered extent never finishes.
(At least that is better than a crash, but still not ideal.)

Thanks,
Qu
Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e629d2ee152a..427bfbe737f2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -108,6 +108,14 @@  struct btrfs_bio_ctrl {
 	 * This is to avoid touching ranges covered by compression/inline.
 	 */
 	unsigned long submit_bitmap;
+
+	/*
+	 * The end (exclusive) of the last submitted range in the folio.
+	 *
+	 * This is for sector size < page size case where we may hit error
+	 * half way.
+	 */
+	u64 last_submitted;
 };
 
 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
@@ -1435,6 +1443,7 @@  static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
 		if (ret < 0)
 			goto out;
+		bio_ctrl->last_submitted = cur + fs_info->sectorsize;
 		submitted_io = true;
 	}
 out:
@@ -1453,6 +1462,24 @@  static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	return ret;
 }
 
+static void cleanup_ordered_extents(struct btrfs_inode *inode,
+				    struct folio *folio, u64 file_pos,
+				    u64 num_bytes, unsigned long *bitmap)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	unsigned int cur_bit = (file_pos - folio_pos(folio)) >> fs_info->sectorsize_bits;
+
+	for_each_set_bit_from(cur_bit, bitmap, fs_info->sectors_per_page) {
+		u64 cur_pos = folio_pos(folio) + (cur_bit << fs_info->sectorsize_bits);
+
+		if (cur_pos >= file_pos + num_bytes)
+			break;
+
+		btrfs_mark_ordered_io_finished(inode, folio, cur_pos,
+					       fs_info->sectorsize, false);
+	}
+}
+
 /*
  * the writepage semantics are similar to regular writepage.  extent
  * records are inserted to lock ranges in the tree, and as dirty areas
@@ -1492,6 +1519,7 @@  static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 	 * The proper bitmap can only be initialized until writepage_delalloc().
 	 */
 	bio_ctrl->submit_bitmap = (unsigned long)-1;
+	bio_ctrl->last_submitted = page_start;
 	ret = set_folio_extent_mapped(folio);
 	if (ret < 0)
 		goto done;
@@ -1511,8 +1539,10 @@  static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 
 done:
 	if (ret) {
-		btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
-					       page_start, PAGE_SIZE, !ret);
+		cleanup_ordered_extents(BTRFS_I(inode), folio,
+				bio_ctrl->last_submitted,
+				page_start + PAGE_SIZE - bio_ctrl->last_submitted,
+				&bio_ctrl->submit_bitmap);
 		mapping_set_error(folio->mapping, ret);
 	}
 
@@ -2288,14 +2318,17 @@  void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
 		 * extent_writepage_io() will do the truncation correctly.
 		 */
 		bio_ctrl.submit_bitmap = (unsigned long)-1;
+		bio_ctrl.last_submitted = cur;
 		ret = extent_writepage_io(BTRFS_I(inode), folio, cur, cur_len,
 					  &bio_ctrl, i_size);
 		if (ret == 1)
 			goto next_page;
 
 		if (ret) {
-			btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
-						       cur, cur_len, !ret);
+			cleanup_ordered_extents(BTRFS_I(inode), folio,
+					bio_ctrl.last_submitted,
+					cur_end + 1 - bio_ctrl.last_submitted,
+					&bio_ctrl.submit_bitmap);
 			mapping_set_error(mapping, ret);
 		}
 		btrfs_folio_end_lock(fs_info, folio, cur, cur_len);