
[v2] btrfs: avoid deadlock when reading a partial uptodate folio

Message ID 62bf73ada7be2888d45a787c2b6fd252103a5d25.1729725088.git.wqu@suse.com (mailing list archive)
State New, archived
Series [v2] btrfs: avoid deadlock when reading a partial uptodate folio

Commit Message

Qu Wenruo Oct. 23, 2024, 11:13 p.m. UTC
[BUG]
This fixes a deadlock that only becomes possible after the out-of-tree
patch "btrfs: allow buffered write to skip full page if it's sector aligned".

For now it is impossible to hit the deadlock; the reason is explained in
the [CAUSE] section.

If the sector size is smaller than the page size, and we allow btrfs to
avoid reading the full page because the buffered write range is
sector aligned, we can hit a hang during generic/095 runs:

  __switch_to+0xf8/0x168
  __schedule+0x328/0x8a8
  schedule+0x54/0x140
  io_schedule+0x44/0x68
  folio_wait_bit_common+0x198/0x3f8
  __folio_lock+0x24/0x40
  extent_write_cache_pages+0x2e0/0x4c0 [btrfs]
  btrfs_writepages+0x94/0x158 [btrfs]
  do_writepages+0x74/0x190
  filemap_fdatawrite_wbc+0x88/0xc8
  __filemap_fdatawrite_range+0x6c/0xa8
  filemap_fdatawrite_range+0x1c/0x30
  btrfs_start_ordered_extent+0x264/0x2e0 [btrfs]
  btrfs_lock_and_flush_ordered_range+0x8c/0x160 [btrfs]
  __get_extent_map+0xa0/0x220 [btrfs]
  btrfs_do_readpage+0x1bc/0x5d8 [btrfs]
  btrfs_read_folio+0x50/0xa0 [btrfs]
  filemap_read_folio+0x54/0x110
  filemap_update_page+0x2e0/0x3b8
  filemap_get_pages+0x228/0x4d8
  filemap_read+0x11c/0x3b8
  btrfs_file_read_iter+0x74/0x90 [btrfs]
  new_sync_read+0xd0/0x1d0
  vfs_read+0x1a0/0x1f0

There is also the minimal fio reproducer extracted from that test case
to reproduce the deadlock:

  [global]
  bs=8k
  iodepth=1
  randrepeat=1
  size=256k
  directory=$mnt
  numjobs=1
  [job1]
  ioengine=sync
  bs=512
  direct=1
  rw=randread
  filename=file1
  [job2]
  ioengine=libaio
  rw=randwrite
  direct=1
  filename=file1
  [job3]
  ioengine=posixaio
  rw=randwrite
  filename=file1

[CAUSE]
The above call trace shows that a writeback is triggered on the same
folio during the folio read.
Since the folio is locked during btrfs_do_readpage(), the writeback will
never be able to lock the folio; it is effectively waiting on itself,
causing the deadlock.

The root cause is a little complex. Consider a system with a 64K page
size and a 4K sector size:

1) The folio has its range [48K, 64K) marked dirty by a buffered write

   0          16K         32K          48K         64K
   |                                   |///////////|
                                             \- sector Uptodate|Dirty

2) Writeback finished for [48K, 64K), but the ordered extent (OE) has not yet finished

   0          16K         32K          48K         64K
   |                                   |///////////|
                                             \- sector Uptodate
                                                extent map PINNED
                                                OE still here

3) The folio is released from page cache
   This can be triggered by direct IO through the following call chain:

   iomap_dio_rw()
   \- kiocb_invalidate_pages()
    \- filemap_invalidate_pages()
     \- invalidate_inode_pages2_range()
      \- invalidate_complete_folio2()
       \- filemap_release_folio()
        \- btrfs_release_folio()
         \- __btrfs_release_folio()
          \- try_release_extent_mapping()

   Since there is no extent state with the EXTENT_LOCKED flag in the folio
   range, btrfs allows the folio to be released.
   Now there is no folio->private to record which blocks are uptodate,
   but the extent map and the OE are still there.

   0          16K         32K          48K         64K
   |                                   |///////////|
                                             \- extent map PINNED
                                                OE still here

4) A buffered write dirtied the range [0, 16K)
   Since it is sector aligned, btrfs did not read the full folio from disk.

   0          16K         32K          48K         64K
   |//////////|                        |///////////|
       \- sector Uptodate|Dirty              \- extent map PINNED
                                                OE still here

5) A read on the folio is triggered
   For the range [0, 16K), since it is already uptodate, btrfs skips this
   range.
   For the range [16K, 48K), btrfs submits the read from disk.

   The problem is the range [48K, 64K), where the following call chain
   happens:

   btrfs_do_readpage()
   \- __get_extent_map()
    \- btrfs_lock_and_flush_ordered_range()
     \- btrfs_start_ordered_extent()
      \- filemap_fdatawrite_range()
       \- btrfs_writepages()
        \- extent_write_cache_pages()
         \- folio_lock()

   Since the folio indeed has dirty sectors in the range [0, 16K), that
   range will be written back.

   But the folio is already locked by the folio read, so the writeback
   will never be able to lock the folio, leading to the deadlock.

This sequence can only happen if all the following conditions are met:

- The sector size is smaller than the page size.
  Otherwise we won't have mixed dirty blocks in the same folio we're reading.

- We allow the buffered write to skip the folio read if it's sector
  aligned (see the sketch after this list).
  This is done by the incoming patch
  "btrfs: allow buffered write to skip full page if it's sector aligned".

  The ultimate goal of that patch is to reduce unnecessary reads for the
  sector size < page size cases, and to pass generic/563.

  Otherwise the folio would be read from the disk during the buffered
  write, before being marked dirty, and the deadlock would not trigger.
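
To make "sector aligned" concrete, here is a minimal standalone sketch
(plain userspace C, not btrfs code; the helper name is made up for
illustration) of the alignment check that lets a buffered write skip
reading the rest of the folio:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /*
   * Illustration only: a write to [pos, pos + len) needs no read of the
   * surrounding folio when both ends fall on sector boundaries, because
   * no partially covered sector needs its old contents preserved.
   */
  static bool write_is_sector_aligned(uint64_t pos, uint64_t len, uint32_t sectorsize)
  {
          return (pos % sectorsize == 0) && (len % sectorsize == 0);
  }

  int main(void)
  {
          /* 4K sectors, as in the 64K page / 4K sector example above. */
          printf("[0, 16K) aligned: %d\n", write_is_sector_aligned(0, 16 * 1024, 4096));
          printf("[0, 15K) aligned: %d\n", write_is_sector_aligned(0, 15 * 1024, 4096));
          return 0;
  }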

[FIX]
Break step 5) of the above case by passing an optional @locked_folio into
btrfs_start_ordered_extent() and btrfs_lock_and_flush_ordered_range().
If such a locked folio is given, skip starting writeback for the ranges
covered by that folio.

Here we also add extra asserts to make sure the target range is not
dirty; otherwise the ordered extent we wait for would never be able to
finish, since part of it would never be submitted.

So far only the call site inside __get_extent_map() passes the new
parameter.
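
To make the skip-range math concrete, here is a standalone sketch (plain
userspace C using the values from the diagram above, not the btrfs
helpers themselves) of how the skipped range is clamped to the
intersection of the locked folio and the OE, and how any part of the OE
outside the folio is still flushed:

  #include <inttypes.h>
  #include <stdio.h>

  int main(void)
  {
          /* 64K folio at offset 0, ordered extent covering [48K, 64K). */
          const uint64_t folio_start = 0, folio_len = 64 * 1024;
          const uint64_t start = 48 * 1024, end = 64 * 1024 - 1;

          /* Intersection of the locked folio and the OE range (inclusive). */
          const uint64_t skip_start = folio_start > start ? folio_start : start;
          const uint64_t skip_end = (folio_start + folio_len < end + 1 ?
                                     folio_start + folio_len : end + 1) - 1;

          printf("skip writeback for [%" PRIu64 ", %" PRIu64 "]\n",
                 skip_start, skip_end);

          /* Only the parts of the OE outside the locked folio get flushed. */
          if (start < folio_start)
                  printf("flush [%" PRIu64 ", %" PRIu64 "]\n",
                         start, folio_start - 1);
          if (end + 1 > folio_start + folio_len)
                  printf("flush [%" PRIu64 ", %" PRIu64 "]\n",
                         folio_start + folio_len, end);
          /* Here the OE lies entirely inside the folio, so nothing is flushed. */
          return 0;
  }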

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- Update the commit message to make each call chain clearer

- Update the commit message to fix grammar errors

- Remove the unnecessary change to the range in __get_extent_map()
  The real fix is inside btrfs_start_ordered_extent(); the change to the
  __get_extent_map() range had no effect at all.

RFC->v1:
- Go with an extra @locked_folio parameter for btrfs_start_ordered_extent()
  This is more straightforward than skipping the folio release, and it
  also avoids a painful slowdown in some other test cases.
---
 fs/btrfs/defrag.c       |  2 +-
 fs/btrfs/direct-io.c    |  2 +-
 fs/btrfs/extent_io.c    |  3 +-
 fs/btrfs/file.c         |  8 ++---
 fs/btrfs/inode.c        |  6 ++--
 fs/btrfs/ordered-data.c | 67 ++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/ordered-data.h |  8 +++--
 7 files changed, 75 insertions(+), 21 deletions(-)

Comments

Christoph Hellwig Oct. 24, 2024, 5:14 a.m. UTC | #1
On Thu, Oct 24, 2024 at 09:43:47AM +1030, Qu Wenruo wrote:
> There is also the minimal fio reproducer extracted from that test case
> to reproduce the deadlock:

Please add this to xfstests.
Qu Wenruo Oct. 24, 2024, 5:24 a.m. UTC | #2
On 2024/10/24 15:44, Christoph Hellwig wrote:
> On Thu, Oct 24, 2024 at 09:43:47AM +1030, Qu Wenruo wrote:
>> There is also the minimal fio reproducer extracted from that test case
>> to reproduce the deadlock:
> 
> Please add this to xfstests.
> 

This is just a subset of generic/095.

And generic/095 has a much higher chance to trigger it than the minimal one.
The minimal one is only easier to debug and faster to finish one run.

Do I still need to add the minimal one to fstests?

Thanks,
Qu
Christoph Hellwig Oct. 24, 2024, 5:27 a.m. UTC | #3
On Thu, Oct 24, 2024 at 03:54:54PM +1030, Qu Wenruo wrote:
> And generic/095 has a much higher chance to trigger it than the minimal one.
> The minimal one is only easier to debug and faster to finish one run.
> 
> Do I still need to add the minimal one to fstests?

If I have a bug with a well-known isolated reproducer I love
having it in xfstests.  I don't know if everyone agrees, though.
Zorro Lang Oct. 24, 2024, 1:20 p.m. UTC | #4
On Wed, Oct 23, 2024 at 10:27:06PM -0700, Christoph Hellwig wrote:
> On Thu, Oct 24, 2024 at 03:54:54PM +1030, Qu Wenruo wrote:
> > And generic/095 has a much higher chance to trigger it than the minimal one.
> > The minimal one is only easier to debug and faster to finish one run.
> > 
> > Do I still need to add the minimal one to fstests?
> 
> If I have a bug with a well-known isolated reproducer I love
> having it in xfstests.  I don't know if everyone agrees, though.

generic/095 isn't a specific test case for a known issue, it's a common test case.
If you want to have a reproducer for a particular bug, feel free to add it to xfstests,
mark _fixed_by_xxxx and describe more details about that bug in the test case.
If one day we update g/095, it might not cover the bug you need. Then a specific
regression test case will help.

Thanks,
Zorro

Qu Wenruo Oct. 24, 2024, 8:24 p.m. UTC | #5
On 2024/10/24 23:50, Zorro Lang wrote:
> On Wed, Oct 23, 2024 at 10:27:06PM -0700, Christoph Hellwig wrote:
>> On Thu, Oct 24, 2024 at 03:54:54PM +1030, Qu Wenruo wrote:
>>> And generic/095 has a much higher chance to trigger it than the minimal one.
>>> The minimal one is only easier to debug and faster to finish one run.
>>>
>>> Do I still need to add the minimal one to fstests?
>>
>> If I have a bug with a well-known isolated reproducer I love
>> having it in xfstests.  I don't know if everyone agrees, though.
>
> generic/095 isn't a specific test case for a known issue, it's a common test case.
> If you want to have a reproducer for a particular bug, feel free to add it to xfstests,
> mark _fixed_by_xxxx and describe more details about that bug in the test case.
> If one day we update g/095, it might not cover the bug you need. Then a specific
> regression test case will help.

Thanks for the advice; this answers the uncertainty pretty well.

I'll add a dedicated generic test case for it.

Thanks,
Qu

Patch

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 1644470b9df7..2467990d6ac7 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -902,7 +902,7 @@  static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
 			break;
 
 		folio_unlock(folio);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		btrfs_put_ordered_extent(ordered);
 		folio_lock(folio);
 		/*
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index a7c3e221378d..2fb02aa19be0 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -103,7 +103,7 @@  static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 			 */
 			if (writing ||
 			    test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
-				btrfs_start_ordered_extent(ordered);
+				btrfs_start_ordered_extent(ordered, NULL);
 			else
 				ret = nowait ? -EAGAIN : -ENOTBLK;
 			btrfs_put_ordered_extent(ordered);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7680dd94fddf..d83ee70707cf 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -922,7 +922,8 @@  static struct extent_map *__get_extent_map(struct inode *inode,
 		*em_cached = NULL;
 	}
 
-	btrfs_lock_and_flush_ordered_range(BTRFS_I(inode), start, start + len - 1, &cached_state);
+	btrfs_lock_and_flush_ordered_range(BTRFS_I(inode), folio, start,
+					   start + len - 1, &cached_state);
 	em = btrfs_get_extent(BTRFS_I(inode), folio, start, len);
 	if (!IS_ERR(em)) {
 		BUG_ON(*em_cached);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 676eddc9daaf..7521fbefa9fd 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -987,7 +987,7 @@  lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
 				      cached_state);
 			folio_unlock(folio);
 			folio_put(folio);
-			btrfs_start_ordered_extent(ordered);
+			btrfs_start_ordered_extent(ordered, NULL);
 			btrfs_put_ordered_extent(ordered);
 			return -EAGAIN;
 		}
@@ -1055,8 +1055,8 @@  int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 			return -EAGAIN;
 		}
 	} else {
-		btrfs_lock_and_flush_ordered_range(inode, lockstart, lockend,
-						   &cached_state);
+		btrfs_lock_and_flush_ordered_range(inode, NULL, lockstart,
+						   lockend, &cached_state);
 	}
 	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
 			       NULL, nowait, false);
@@ -1895,7 +1895,7 @@  static vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 		unlock_extent(io_tree, page_start, page_end, &cached_state);
 		folio_unlock(folio);
 		up_read(&BTRFS_I(inode)->i_mmap_lock);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
 	}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a21701571cbb..56bd33cf864b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2773,7 +2773,7 @@  static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 		unlock_extent(&inode->io_tree, page_start, page_end,
 			      &cached_state);
 		folio_unlock(folio);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
 	}
@@ -4783,7 +4783,7 @@  int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 		unlock_extent(io_tree, block_start, block_end, &cached_state);
 		folio_unlock(folio);
 		folio_put(folio);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
 	}
@@ -4918,7 +4918,7 @@  int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
 	if (size <= hole_start)
 		return 0;
 
-	btrfs_lock_and_flush_ordered_range(inode, hole_start, block_end - 1,
+	btrfs_lock_and_flush_ordered_range(inode, NULL, hole_start, block_end - 1,
 					   &cached_state);
 	cur_offset = hole_start;
 	while (1) {
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 2104d60c2161..35b2c441ee6f 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -729,7 +729,7 @@  static void btrfs_run_ordered_extent_work(struct btrfs_work *work)
 	struct btrfs_ordered_extent *ordered;
 
 	ordered = container_of(work, struct btrfs_ordered_extent, flush_work);
-	btrfs_start_ordered_extent(ordered);
+	btrfs_start_ordered_extent(ordered, NULL);
 	complete(&ordered->completion);
 }
 
@@ -845,12 +845,14 @@  void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
  * Wait on page writeback for all the pages in the extent and the IO completion
  * code to insert metadata into the btree corresponding to the extent.
  */
-void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
+void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
+				struct folio *locked_folio)
 {
 	u64 start = entry->file_offset;
 	u64 end = start + entry->num_bytes - 1;
 	struct btrfs_inode *inode = entry->inode;
 	bool freespace_inode;
+	bool skip_writeback = false;
 
 	trace_btrfs_ordered_extent_start(inode, entry);
 
@@ -860,13 +862,59 @@  void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
 	 */
 	freespace_inode = btrfs_is_free_space_inode(inode);
 
+	/*
+	 * The locked folio covers the ordered extent range and the full
+	 * folio is dirty.
+	 * We can not trigger writeback on it, as we will try to lock
+	 * the same folio we already hold.
+	 *
+	 * This only happens for sector size < page size case, and even
+	 * that happens we're still safe because this can only happen
+	 * when the range is submitted and finished, but OE is not yet
+	 * finished.
+	 */
+	if (locked_folio) {
+		const u64 skip_start = max_t(u64, folio_pos(locked_folio), start);
+		const u64 skip_end = min_t(u64,
+				folio_pos(locked_folio) + folio_size(locked_folio),
+				end + 1) - 1;
+
+		ASSERT(folio_test_locked(locked_folio));
+
+		/* The folio should intersect with the OE range. */
+		ASSERT(folio_pos(locked_folio) <= end ||
+		       folio_pos(locked_folio) + folio_size(locked_folio) > start);
+
+		/*
+		 * The range must not be dirty.
+		 *
+		 * Since we will skip writeback for the folio, if the involved range
+		 * is dirty the range will never be submitted, thus the ordered
+		 * extent we are going to wait will never finish, cause another deadlock.
+		 */
+		btrfs_folio_assert_not_dirty(inode->root->fs_info, locked_folio,
+					     skip_start, skip_end  + 1 - skip_start);
+		skip_writeback = true;
+	}
 	/*
 	 * pages in the range can be dirty, clean or writeback.  We
 	 * start IO on any dirty ones so the wait doesn't stall waiting
 	 * for the flusher thread to find them
 	 */
-	if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags))
-		filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
+	if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags)) {
+		if (!skip_writeback) {
+			filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
+		} else {
+			/* Need to skip the locked folio range. */
+			if (start < folio_pos(locked_folio))
+				filemap_fdatawrite_range(inode->vfs_inode.i_mapping,
+						start, folio_pos(locked_folio) - 1);
+			if (end + 1 > folio_pos(locked_folio) + folio_size(locked_folio))
+				filemap_fdatawrite_range(inode->vfs_inode.i_mapping,
+						folio_pos(locked_folio) + folio_size(locked_folio),
+						end);
+		}
+	}
 
 	if (!freespace_inode)
 		btrfs_might_wait_for_event(inode->root->fs_info, btrfs_ordered_extent);
@@ -921,7 +969,7 @@  int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len)
 			btrfs_put_ordered_extent(ordered);
 			break;
 		}
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, NULL);
 		end = ordered->file_offset;
 		/*
 		 * If the ordered extent had an error save the error but don't
@@ -1141,6 +1189,8 @@  struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
  * @inode:        Inode whose ordered tree is to be searched
  * @start:        Beginning of range to flush
  * @end:          Last byte of range to lock
+ * @locked_folio: If passed, will not start writeback of this folio, to avoid
+ *		  locking the same folio already locked by the caller.
  * @cached_state: If passed, will return the extent state responsible for the
  *                locked range. It's the caller's responsibility to free the
  *                cached state.
@@ -1148,8 +1198,9 @@  struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
  * Always return with the given range locked, ensuring after it's called no
  * order extent can be pending.
  */
-void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
-					u64 end,
+void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode,
+					struct folio *locked_folio,
+					u64 start, u64 end,
 					struct extent_state **cached_state)
 {
 	struct btrfs_ordered_extent *ordered;
@@ -1174,7 +1225,7 @@  void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 			break;
 		}
 		unlock_extent(&inode->io_tree, start, end, cachedp);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, locked_folio);
 		btrfs_put_ordered_extent(ordered);
 	}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 4e152736d06c..a4bb24572c73 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -191,7 +191,8 @@  void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 			   struct btrfs_ordered_sum *sum);
 struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
 							 u64 file_offset);
-void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry);
+void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
+				struct folio *locked_folio);
 int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len);
 struct btrfs_ordered_extent *
 btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset);
@@ -207,8 +208,9 @@  u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
 			       const struct btrfs_block_group *bg);
 void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
 			      const struct btrfs_block_group *bg);
-void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
-					u64 end,
+void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode,
+					struct folio *locked_folio,
+					u64 start, u64 end,
 					struct extent_state **cached_state);
 bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct extent_state **cached_state);