
[v2] btrfs: avoid deadlock when reading a partial uptodate folio

Message ID 62bf73ada7be2888d45a787c2b6fd252103a5d25.1729725088.git.wqu@suse.com (mailing list archive)
State New, archived
Series [v2] btrfs: avoid deadlock when reading a partial uptodate folio

Commit Message

Qu Wenruo Oct. 23, 2024, 11:13 p.m. UTC
[BUG]
This fixes a deadlock that only becomes possible after the out-of-tree
patch "btrfs: allow buffered write to skip full page if it's sector aligned".

For now it is impossible to hit the deadlock; the reason is explained in
the [CAUSE] section.

If the sector size is smaller than the page size, and we allow btrfs to
avoid reading the full page because the buffered write range is
sector aligned, we can hit a hang during generic/095 runs:

  __switch_to+0xf8/0x168
  __schedule+0x328/0x8a8
  schedule+0x54/0x140
  io_schedule+0x44/0x68
  folio_wait_bit_common+0x198/0x3f8
  __folio_lock+0x24/0x40
  extent_write_cache_pages+0x2e0/0x4c0 [btrfs]
  btrfs_writepages+0x94/0x158 [btrfs]
  do_writepages+0x74/0x190
  filemap_fdatawrite_wbc+0x88/0xc8
  __filemap_fdatawrite_range+0x6c/0xa8
  filemap_fdatawrite_range+0x1c/0x30
  btrfs_start_ordered_extent+0x264/0x2e0 [btrfs]
  btrfs_lock_and_flush_ordered_range+0x8c/0x160 [btrfs]
  __get_extent_map+0xa0/0x220 [btrfs]
  btrfs_do_readpage+0x1bc/0x5d8 [btrfs]
  btrfs_read_folio+0x50/0xa0 [btrfs]
  filemap_read_folio+0x54/0x110
  filemap_update_page+0x2e0/0x3b8
  filemap_get_pages+0x228/0x4d8
  filemap_read+0x11c/0x3b8
  btrfs_file_read_iter+0x74/0x90 [btrfs]
  new_sync_read+0xd0/0x1d0
  vfs_read+0x1a0/0x1f0

There is also the minimal fio reproducer extracted from that test case
to reproduce the deadlock:

  [global]
  bs=8k
  iodepth=1
  randrepeat=1
  size=256k
  directory=$mnt
  numjobs=1
  [job1]
  ioengine=sync
  bs=512
  direct=1
  rw=randread
  filename=file1
  [job2]
  ioengine=libaio
  rw=randwrite
  direct=1
  filename=file1
  [job3]
  ioengine=posixaio
  rw=randwrite
  filename=file1

[CAUSE]
The above call trace shows that a writeback is triggered on the same
folio during the folio read.
Since the folio is locked during btrfs_do_readpage(), the writeback will
never be able to lock the folio; it is effectively waiting on itself,
causing the deadlock.

The root cause is a little complex. Consider a system with a 64K page
size and a 4K sector size:

1) The folio has its range [48K, 64K) marked dirty by a buffered write

   0          16K         32K          48K         64K
   |                                   |///////////|
                                             \- sector Uptodate|Dirty

2) Writeback finished for [48K, 64K), but the ordered extent (OE) has not yet finished

   0          16K         32K          48K         64K
   |                                   |///////////|
                                             \- sector Uptodate
                                                extent map PINNED
                                                OE still here

3) The folio is released from page cache
   This can be triggered by direct IO through the following call chain:

   iomap_dio_rw()
   \- kiocb_invalidate_pages()
    \- filemap_invalidate_pages()
     \- invalidate_inode_pages2_range()
      \- invalidate_complete_folio2()
       \- filemap_release_folio()
        \- btrfs_release_folio()
         \- __btrfs_release_folio()
          \- try_release_extent_mapping()

   Since there is no extent state with the EXTENT_LOCKED flag in the folio
   range, btrfs allows the folio to be released.
   Now there is no folio->private to record which blocks are uptodate,
   but the extent map and the OE are still there.

   0          16K         32K          48K         64K
   |                                   |///////////|
                                             \- extent map PINNED
                                                OE still here

4) A buffered write dirtied the range [0, 16K)
   Since it is sector aligned, btrfs did not read the full folio from disk.

   0          16K         32K          48K         64K
   |//////////|                        |///////////|
       \- sector Uptodate|Dirty              \- extent map PINNED
                                                OE still here

5) A read on the folio is triggered
   For the range [0, 16K), since it is already uptodate, btrfs skips this
   range.
   For the range [16K, 48K), btrfs submits the read from disk.

   The problem is the range [48K, 64K), where the following call chain
   happens:

   btrfs_do_readpage()
   \- __get_extent_map()
    \- btrfs_lock_and_flush_ordered_range()
     \- btrfs_start_ordered_extent()
      \- filemap_fdatawrite_range()
       \- btrfs_writepages()
        \- extent_write_cache_pages()
         \- folio_lock()

   Since the folio indeed has dirty sectors in the range [0, 16K), that
   range will be written back.

   But the folio is already locked by the folio read, so the writeback
   will never be able to lock the folio, leading to the deadlock.

This sequence can only happen if all the following conditions are met:

- The sector size is smaller than the page size.
  Otherwise we won't have mixed dirty blocks in the same folio we're reading.

- We allow the buffered write to skip the folio read if it's sector
  aligned (see the sketch after this list).
  This is done by the incoming patch
  "btrfs: allow buffered write to skip full page if it's sector aligned".

  The ultimate goal of that patch is to reduce unnecessary reads for the
  sector size < page size cases, and to pass generic/563.

  Otherwise the folio would be read from the disk during the buffered
  write, before being marked dirty, and the deadlock would not trigger.
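
To make "sector aligned" concrete, here is a minimal standalone sketch
(plain userspace C, not btrfs code; the helper name is made up for
illustration) of the alignment check that lets a buffered write skip
reading the rest of the folio:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /*
   * Illustration only: a write to [pos, pos + len) needs no read of the
   * surrounding folio when both ends fall on sector boundaries, because
   * no partially covered sector needs its old contents preserved.
   */
  static bool write_is_sector_aligned(uint64_t pos, uint64_t len, uint32_t sectorsize)
  {
          return (pos % sectorsize == 0) && (len % sectorsize == 0);
  }

  int main(void)
  {
          /* 4K sectors, as in the 64K page / 4K sector example above. */
          printf("[0, 16K) aligned: %d\n", write_is_sector_aligned(0, 16 * 1024, 4096));
          printf("[0, 15K) aligned: %d\n", write_is_sector_aligned(0, 15 * 1024, 4096));
          return 0;
  }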

[FIX]
Break step 5) of the above case by passing an optional @locked_folio into
btrfs_start_ordered_extent() and btrfs_lock_and_flush_ordered_range().
If such a locked folio is given, skip starting writeback for the ranges
covered by that folio.

Here we also add extra asserts to make sure the target range is not
dirty; otherwise the ordered extent we wait for would never be able to
finish, since part of it would never be submitted.

So far only the call site inside __get_extent_map() passes the new
parameter.
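
To make the skip-range math concrete, here is a standalone sketch (plain
userspace C using the values from the diagram above, not the btrfs
helpers themselves) of how the skipped range is clamped to the
intersection of the locked folio and the OE, and how any part of the OE
outside the folio is still flushed:

  #include <inttypes.h>
  #include <stdio.h>

  int main(void)
  {
          /* 64K folio at offset 0, ordered extent covering [48K, 64K). */
          const uint64_t folio_start = 0, folio_len = 64 * 1024;
          const uint64_t start = 48 * 1024, end = 64 * 1024 - 1;

          /* Intersection of the locked folio and the OE range (inclusive). */
          const uint64_t skip_start = folio_start > start ? folio_start : start;
          const uint64_t skip_end = (folio_start + folio_len < end + 1 ?
                                     folio_start + folio_len : end + 1) - 1;

          printf("skip writeback for [%" PRIu64 ", %" PRIu64 "]\n",
                 skip_start, skip_end);

          /* Only the parts of the OE outside the locked folio get flushed. */
          if (start < folio_start)
                  printf("flush [%" PRIu64 ", %" PRIu64 "]\n",
                         start, folio_start - 1);
          if (end + 1 > folio_start + folio_len)
                  printf("flush [%" PRIu64 ", %" PRIu64 "]\n",
                         folio_start + folio_len, end);
          /* Here the OE lies entirely inside the folio, so nothing is flushed. */
          return 0;
  }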

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- Update the commit message to make each call chain clearer

- Update the commit message to fix grammar errors

- Remove the unnecessary change to the range in __get_extent_map()
  The real fix is inside btrfs_start_ordered_extent(); the change to the
  __get_extent_map() range had no effect at all.

RFC->v1:
- Go with an extra @locked_folio parameter for btrfs_start_ordered_extent()
  This is more straightforward than skipping the folio release, and it
  also avoids a painful slowdown in some other test cases.
---
 fs/btrfs/defrag.c       |  2 +-
 fs/btrfs/direct-io.c    |  2 +-
 fs/btrfs/extent_io.c    |  3 +-
 fs/btrfs/file.c         |  8 ++---
 fs/btrfs/inode.c        |  6 ++--
 fs/btrfs/ordered-data.c | 67 ++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/ordered-data.h |  8 +++--
 7 files changed, 75 insertions(+), 21 deletions(-)

Comments

Christoph Hellwig Oct. 24, 2024, 5:14 a.m. UTC | #1
On Thu, Oct 24, 2024 at 09:43:47AM +1030, Qu Wenruo wrote:
> There is also the minimal fio reproducer extracted from that test case
> to reproduce the deadlock:

Please add this to xfstests.
Qu Wenruo Oct. 24, 2024, 5:24 a.m. UTC | #2
On 2024/10/24 15:44, Christoph Hellwig wrote:
> On Thu, Oct 24, 2024 at 09:43:47AM +1030, Qu Wenruo wrote:
>> There is also the minimal fio reproducer extracted from that test case
>> to reproduce the deadlock:
> 
> Please add this to xfstests.
> 

This is just a subset of generic/095.

And generic/095 has a much higher chance to trigger it than the minimal one.
The minimal one is only easier to debug and faster to finish one run.

Do I still need to add the minimal one to fstests?

Thanks,
Qu
Christoph Hellwig Oct. 24, 2024, 5:27 a.m. UTC | #3
On Thu, Oct 24, 2024 at 03:54:54PM +1030, Qu Wenruo wrote:
> And generic/095 has a much higher chance to trigger it than the minimal one.
> The minimal one is only easier to debug and faster to finish one run.
> 
> Do I still need to add the minimal one to fstests?

If I have a bug with a well-known isolated reproducer I love
having it in xfstests.  I don't know if everyone agrees, though.
Zorro Lang Oct. 24, 2024, 1:20 p.m. UTC | #4
On Wed, Oct 23, 2024 at 10:27:06PM -0700, Christoph Hellwig wrote:
> On Thu, Oct 24, 2024 at 03:54:54PM +1030, Qu Wenruo wrote:
> > And generic/095 has a much higher chance to trigger it than the minimal one.
> > The minimal one is only easier to debug and faster to finish one run.
> > 
> > Do I still need to add the minimal one to fstests?
> 
> If I have a bug with a well-known isolated reproducer I love
> having it in xfstests.  I don't know if everyone agrees, though.

generic/095 isn't a specific test case for a known issue, it's a common test case.
If you want to have a reproducer for a particular bug, feel free to add it to xfstests,
mark _fixed_by_xxxx and describe more details about that bug in the test case.
If one day we update g/095, it might not cover the bug you need. Then a specific
regression test case will help.

Thanks,
Zorro

Qu Wenruo Oct. 24, 2024, 8:24 p.m. UTC | #5
On 2024/10/24 23:50, Zorro Lang wrote:
> On Wed, Oct 23, 2024 at 10:27:06PM -0700, Christoph Hellwig wrote:
>> On Thu, Oct 24, 2024 at 03:54:54PM +1030, Qu Wenruo wrote:
>>> And generic/095 has a much higher chance to trigger it than the minimal one.
>>> The minimal one is only easier to debug and faster to finish one run.
>>>
>>> Do I still need to add the minimal one to fstests?
>>
>> If I have a bug with a well-known isolated reproducer I love
>> having it in xfstests.  I don't know if everyone agrees, though.
>
> generic/095 isn't a specific test case for a known issue, it's a common test case.
> If you want to have a reproducer for a particular bug, feel free to add it to xfstests,
> mark _fixed_by_xxxx and describe more details about that bug in the test case.
> If one day we update g/095, it might not cover the bug you need. Then a specific
> regression test case will help.

Thanks for the advice; this answers the uncertainty pretty well.

I'll add a dedicated generic test case for it.

Thanks,
Qu

Patch

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 1644470b9df7..2467990d6ac7 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -902,7 +902,7 @@  static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
 			break;
 
 		folio_unlock(folio);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		btrfs_put_ordered_extent(ordered);
 		folio_lock(folio);
 		/*
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index a7c3e221378d..2fb02aa19be0 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -103,7 +103,7 @@  static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 			 */
 			if (writing ||
 			    test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
-				btrfs_start_ordered_extent(ordered);
+				btrfs_start_ordered_extent(ordered, NULL);
 			else
 				ret = nowait ? -EAGAIN : -ENOTBLK;
 			btrfs_put_ordered_extent(ordered);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7680dd94fddf..d83ee70707cf 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -922,7 +922,8 @@  static struct extent_map *__get_extent_map(struct inode *inode,
 		*em_cached = NULL;
 	}
 
-	btrfs_lock_and_flush_ordered_range(BTRFS_I(inode), start, start + len - 1, &cached_state);
+	btrfs_lock_and_flush_ordered_range(BTRFS_I(inode), folio, start,
+					   start + len - 1, &cached_state);
 	em = btrfs_get_extent(BTRFS_I(inode), folio, start, len);
 	if (!IS_ERR(em)) {
 		BUG_ON(*em_cached);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 676eddc9daaf..7521fbefa9fd 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -987,7 +987,7 @@  lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
 				      cached_state);
 			folio_unlock(folio);
 			folio_put(folio);
-			btrfs_start_ordered_extent(ordered);
+			btrfs_start_ordered_extent(ordered, NULL);
 			btrfs_put_ordered_extent(ordered);
 			return -EAGAIN;
 		}
@@ -1055,8 +1055,8 @@  int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 			return -EAGAIN;
 		}
 	} else {
-		btrfs_lock_and_flush_ordered_range(inode, lockstart, lockend,
-						   &cached_state);
+		btrfs_lock_and_flush_ordered_range(inode, NULL, lockstart,
+						   lockend, &cached_state);
 	}
 	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
 			       NULL, nowait, false);
@@ -1895,7 +1895,7 @@  static vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 		unlock_extent(io_tree, page_start, page_end, &cached_state);
 		folio_unlock(folio);
 		up_read(&BTRFS_I(inode)->i_mmap_lock);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
 	}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a21701571cbb..56bd33cf864b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2773,7 +2773,7 @@  static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 		unlock_extent(&inode->io_tree, page_start, page_end,
 			      &cached_state);
 		folio_unlock(folio);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
 	}
@@ -4783,7 +4783,7 @@  int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 		unlock_extent(io_tree, block_start, block_end, &cached_state);
 		folio_unlock(folio);
 		folio_put(folio);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
 	}
@@ -4918,7 +4918,7 @@  int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
 	if (size <= hole_start)
 		return 0;
 
-	btrfs_lock_and_flush_ordered_range(inode, hole_start, block_end - 1,
+	btrfs_lock_and_flush_ordered_range(inode, NULL, hole_start, block_end - 1,
 					   &cached_state);
 	cur_offset = hole_start;
 	while (1) {
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 2104d60c2161..35b2c441ee6f 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -729,7 +729,7 @@  static void btrfs_run_ordered_extent_work(struct btrfs_work *work)
 	struct btrfs_ordered_extent *ordered;
 
 	ordered = container_of(work, struct btrfs_ordered_extent, flush_work);
-	btrfs_start_ordered_extent(ordered);
+	btrfs_start_ordered_extent(ordered, NULL);
 	complete(&ordered->completion);
 }
 
@@ -845,12 +845,14 @@  void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
  * Wait on page writeback for all the pages in the extent and the IO completion
  * code to insert metadata into the btree corresponding to the extent.
  */
-void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
+void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
+				struct folio *locked_folio)
 {
 	u64 start = entry->file_offset;
 	u64 end = start + entry->num_bytes - 1;
 	struct btrfs_inode *inode = entry->inode;
 	bool freespace_inode;
+	bool skip_writeback = false;
 
 	trace_btrfs_ordered_extent_start(inode, entry);
 
@@ -860,13 +862,59 @@  void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
 	 */
 	freespace_inode = btrfs_is_free_space_inode(inode);
 
+	/*
+	 * The locked folio covers the ordered extent range and the full
+	 * folio is dirty.
+	 * We can not trigger writeback on it, as we will try to lock
+	 * the same folio we already hold.
+	 *
+	 * This only happens for sector size < page size case, and even
+	 * that happens we're still safe because this can only happen
+	 * when the range is submitted and finished, but OE is not yet
+	 * finished.
+	 */
+	if (locked_folio) {
+		const u64 skip_start = max_t(u64, folio_pos(locked_folio), start);
+		const u64 skip_end = min_t(u64,
+				folio_pos(locked_folio) + folio_size(locked_folio),
+				end + 1) - 1;
+
+		ASSERT(folio_test_locked(locked_folio));
+
+		/* The folio should intersect with the OE range. */
+		ASSERT(folio_pos(locked_folio) <= end ||
+		       folio_pos(locked_folio) + folio_size(locked_folio) > start);
+
+		/*
+		 * The range must not be dirty.
+		 *
+		 * Since we will skip writeback for the folio, if the involved range
+		 * is dirty the range will never be submitted, thus the ordered
+		 * extent we are going to wait will never finish, cause another deadlock.
+		 */
+		btrfs_folio_assert_not_dirty(inode->root->fs_info, locked_folio,
+					     skip_start, skip_end  + 1 - skip_start);
+		skip_writeback = true;
+	}
 	/*
 	 * pages in the range can be dirty, clean or writeback.  We
 	 * start IO on any dirty ones so the wait doesn't stall waiting
 	 * for the flusher thread to find them
 	 */
-	if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags))
-		filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
+	if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags)) {
+		if (!skip_writeback) {
+			filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
+		} else {
+			/* Need to skip the locked folio range. */
+			if (start < folio_pos(locked_folio))
+				filemap_fdatawrite_range(inode->vfs_inode.i_mapping,
+						start, folio_pos(locked_folio) - 1);
+			if (end + 1 > folio_pos(locked_folio) + folio_size(locked_folio))
+				filemap_fdatawrite_range(inode->vfs_inode.i_mapping,
+						folio_pos(locked_folio) + folio_size(locked_folio),
+						end);
+		}
+	}
 
 	if (!freespace_inode)
 		btrfs_might_wait_for_event(inode->root->fs_info, btrfs_ordered_extent);
@@ -921,7 +969,7 @@  int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len)
 			btrfs_put_ordered_extent(ordered);
 			break;
 		}
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		end = ordered->file_offset;
 		/*
 		 * If the ordered extent had an error save the error but don't
@@ -1141,6 +1189,8 @@  struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
  * @inode:        Inode whose ordered tree is to be searched
  * @start:        Beginning of range to flush
  * @end:          Last byte of range to lock
+ * @locked_folio: If passed, will not start writeback of this folio, to avoid
+ *		  locking the same folio already locked by the caller.
  * @cached_state: If passed, will return the extent state responsible for the
  *                locked range. It's the caller's responsibility to free the
  *                cached state.
@@ -1148,8 +1198,9 @@  struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
  * Always return with the given range locked, ensuring after it's called no
  * order extent can be pending.
  */
-void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
-					u64 end,
+void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode,
+					struct folio *locked_folio,
+					u64 start, u64 end,
 					struct extent_state **cached_state)
 {
 	struct btrfs_ordered_extent *ordered;
@@ -1174,7 +1225,7 @@  void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 			break;
 		}
 		unlock_extent(&inode->io_tree, start, end, cachedp);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, locked_folio);
 		btrfs_put_ordered_extent(ordered);
 	}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 4e152736d06c..a4bb24572c73 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -191,7 +191,8 @@  void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 			   struct btrfs_ordered_sum *sum);
 struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
 							 u64 file_offset);
-void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry);
+void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
+				struct folio *locked_folio);
 int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len);
 struct btrfs_ordered_extent *
 btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset);
@@ -207,8 +208,9 @@  u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
 			       const struct btrfs_block_group *bg);
 void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
 			      const struct btrfs_block_group *bg);
-void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
-					u64 end,
+void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode,
+					struct folio *locked_folio,
+					u64 start, u64 end,
 					struct extent_state **cached_state);
 bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct extent_state **cached_state);