Message ID | 25f0ab13b113ff37ae66cab26be7e458321db74f.1740990125.git.wqu@suse.com (mailing list archive)
---|---
State | New
Series | btrfs: make subpage handling be feature full
On Mon, Mar 03, 2025 at 07:05:09PM +1030, Qu Wenruo wrote:
> Currently btrfs calls folio_end_writeback() inside a spinlock, to
> prevent races of async extents for the same folio.
>
> The async extent mechanism is utilized for compressed writes, which
> allows a folio (or part of a folio) to be kept locked, queueing the
> range into a workqueue to do the compression.
> After the compression is done, the compressed data is submitted, the
> involved blocks are set writeback and the range is unlocked.
>
> Such an async extent mechanism disrupts the regular writeback behavior,
> where normally we submit all the involved blocks inside the same folio
> in one go.
> Now with async extents, parts of the same folio can be marked
> writeback at any time, by different threads.
>
> Thus for fs block size < page size cases, btrfs always holds a spinlock
> when setting/clearing the folio writeback flag, to prevent async
> extents from racing on the same folio.
>
> But now with the dropbehind folio flag, folio_end_writeback() is no
> longer safe to be called inside a spinlock:
>
> folio_end_writeback()
> |- folio_end_dropbehind_write()
>    |- if (in_task() && folio_trylock())
>    |  Since all btrfs endio functions happen inside a workqueue,
>    |  it will always pass the in_task() check.
>    |
>    |- folio_unmap_invalidate()
>       |- folio_launder()
>          !! MAY SLEEP !!
>
> And there is no gfp flag to let the fs avoid sleeping.
>
> Furthermore, for fs block size < page size cases, we can even deadlock
> on the same subpage spinlock:
>
> btrfs_subpage_clear_writeback()
> |- spin_lock_irqsave(&subpage->lock)
> |- folio_end_writeback()
>    |- folio_end_dropbehind_write()
>       |- folio_unmap_invalidate()
>          |- filemap_release_folio()
>             |- __btrfs_release_folio()
>                |- wait_subpage_spinlock()
>                   |- spin_lock_irq(&subpage->lock);
>                      !! DEADLOCK !!
>
> So for now let's disable uncached writes for btrfs, until we solve
> all the problems with the planned solutions:
>
> - Use atomics to track writeback status
>   Need to remove the COW fixup (handling of out-of-band dirty folios)
>   routine first, and align the member to the iomap_folio_state
>   structure.
>
> - Better async extent handling
>   Instead of leaving the folios locked and setting the writeback flag
>   after compression, set the writeback flags first and then start
>   compression.
>
> Fixes: fcc7e3306e11 ("btrfs: add support for uncached writes (RWF_DONTCACHE)")

The commit id will change so it's better to refer to it just by the
subject and not put it in Fixes.

> Cc: Jens Axboe <axboe@kernel.dk>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Thankfully the btrfs uncached writes patch is not yet pushed to
> upstream, I can remove it from the for-next branch if this patch gets
> enough reviews.
>
> But I prefer not to, as that commit still contains some good cleanups
> of the FGP flags.

Thanks for catching the problems early. This is not the cleanest way but
in the long run we want the uncached writes, so with the preparatory
patch merged and disabled by one line it should be easier to test it.
Also I want to keep your analysis of the deadlock in a visible place
like the changelog so we can refer to it.
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e87d4a37c929..fe9e98f916f4 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1099,8 +1099,17 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
 		fgp_flags |= FGP_NOWAIT;
 	}
 
+	/*
+	 * DONTCACHE will make folio reclaim happen immediately at
+	 * folio_end_writeback(), for fs block size < page size cases it will
+	 * happen inside a spinlock (due to possible async extents races),
+	 * and such folio_end_writeback() may cause sleep inside a spinlock.
+	 *
+	 * So disable DONTCACHE until we either reworked async extent, or find
+	 * a better way to handle per-block writeback tracking.
+	 */
 	if (iocb->ki_flags & IOCB_DONTCACHE)
-		fgp_flags |= FGP_DONTCACHE;
+		return -EOPNOTSUPP;
 
 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)
@@ -3679,7 +3688,7 @@ const struct file_operations btrfs_file_operations = {
 #endif
 	.remap_file_range = btrfs_remap_file_range,
 	.uring_cmd = btrfs_uring_cmd,
-	.fop_flags = FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC | FOP_DONTCACHE,
+	.fop_flags = FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC,
 };
 
 int btrfs_fdatawrite_range(struct btrfs_inode *inode, loff_t start, loff_t end)