Message ID | 25f0ab13b113ff37ae66cab26be7e458321db74f.1740990125.git.wqu@suse.com (mailing list archive)
---|---
State | New
Series | btrfs: make subpage handling be feature full
On Mon, Mar 03, 2025 at 07:05:09PM +1030, Qu Wenruo wrote:
> Currently btrfs calls folio_end_writeback() inside a spinlock, to
> prevent races of async extents for the same folio.
>
> The async extent mechanism is utilized for compressed writes, which
> allows a folio (or part of a folio) to be kept locked, queueing the
> range into a workqueue to do the compression.
> After the compression is done, the compressed data is submitted, the
> involved blocks are set writeback and the range is unlocked.
>
> Such an async extent mechanism disrupts the regular writeback behavior,
> where normally we submit all the involved blocks inside the same folio
> in one go.
> Now with async extents, parts of the same folio can be marked
> writeback at any time, by different threads.
>
> Thus for fs block size < page size cases, btrfs always holds a spinlock
> when setting/clearing the folio writeback flag, to prevent async
> extents from racing on the same folio.
>
> But now with the dropbehind folio flag, folio_end_writeback() is no
> longer safe to be called inside a spinlock:
>
> folio_end_writeback()
> |- folio_end_dropbehind_write()
>    |- if (in_task() && folio_trylock())
>    |  Since all btrfs endio functions happen inside a workqueue,
>    |  it will always pass the in_task() check.
>    |
>    |- folio_unmap_invalidate()
>       |- folio_launder()
>          !! MAY SLEEP !!
>
> And there is no gfp flag to let the fs avoid sleeping.
>
> Furthermore, for fs block size < page size cases, we can even deadlock
> on the same subpage spinlock:
>
> btrfs_subpage_clear_writeback()
> |- spin_lock_irqsave(&subpage->lock)
> |- folio_end_writeback()
>    |- folio_end_dropbehind_write()
>       |- folio_unmap_invalidate()
>          |- filemap_release_folio()
>             |- __btrfs_release_folio()
>                |- wait_subpage_spinlock()
>                   |- spin_lock_irq(&subpage->lock);
>                      !! DEADLOCK !!
>
> So for now let's disable uncached writes for btrfs, until we solve
> all the problems with the planned solutions:
>
> - Use atomics to track writeback status
>   Need to remove the COW fixup (handling of out-of-band dirty folios)
>   routine first, and align the member to the iomap_folio_state
>   structure.
>
> - Better async extent handling
>   Instead of leaving the folios locked and setting the writeback flag
>   after compression, set the writeback flags first and then start
>   compression.
>
> Fixes: fcc7e3306e11 ("btrfs: add support for uncached writes (RWF_DONTCACHE)")

The commit id will change so it's better to refer to it just by the
subject and not put it in Fixes.

> Cc: Jens Axboe <axboe@kernel.dk>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Thankfully the btrfs uncached writes patch is not yet pushed to
> upstream, I can remove it from the for-next branch if this patch gets
> enough reviews.
>
> But I prefer not to, as that commit still contains some good cleanups
> of the FGP flags.

Thanks for catching the problems early. This is not the cleanest way but
in the long run we want the uncached writes, so with the preparatory
patch merged and disabled by one line it should be easier to test it.
Also I want to keep your analysis of the deadlock in a visible place
like the changelog so we can refer to it.
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e87d4a37c929..fe9e98f916f4 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1099,8 +1099,17 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
 		fgp_flags |= FGP_NOWAIT;
 	}
 
+	/*
+	 * DONTCACHE will make folio reclaim happen immediately at
+	 * folio_end_writeback(), for fs block size < page size cases it will
+	 * happen inside a spinlock (due to possible async extents races),
+	 * and such folio_end_writeback() may cause sleep inside a spinlock.
+	 *
+	 * So disable DONTCACHE until we either reworked async extent, or find
+	 * a better way to handle per-block writeback tracking.
+	 */
 	if (iocb->ki_flags & IOCB_DONTCACHE)
-		fgp_flags |= FGP_DONTCACHE;
+		return -EOPNOTSUPP;
 
 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)
@@ -3679,7 +3688,7 @@ const struct file_operations btrfs_file_operations = {
 #endif
 	.remap_file_range = btrfs_remap_file_range,
 	.uring_cmd = btrfs_uring_cmd,
-	.fop_flags = FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC | FOP_DONTCACHE,
+	.fop_flags = FOP_BUFFER_RASYNC | FOP_BUFFER_WASYNC,
 };
 
 int btrfs_fdatawrite_range(struct btrfs_inode *inode, loff_t start, loff_t end)