[v2,0/8] btrfs: make subpage handling be feature full

Message ID: cover.1740635497.git.wqu@suse.com

Message

Qu Wenruo Feb. 27, 2025, 5:54 a.m. UTC
[CHANGELOG]
v2:
- Add a new bug fix for a problem exposed by recent 2K block size
  tests on x86_64.
  It's a self-deadlock where folio_end_writeback() is called with the
  subpage spin lock held; folio_end_writeback() can eventually call
  btrfs_release_folio(), which tries to acquire the same spin lock.
  (A minimal model of this pattern is sketched after the changelog.)

- Various grammar fixes and commit message/comments updates

- Address comments from Filipe and add his tags

v1:
- Merge previous two patches into one
- Re-order the patches so preparation/fixes are always before feature
  enablement
- Update the commit message of the first patch so that it does not
  focus on the out-of-tree part, but explains why the old behavior is
  not ideal in the first place (double page zeroing), and only
  mentions the future behavior change.
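
For reference, a minimal user-space model of the self-deadlock fixed
in v2 (the structure and helper names below are illustrative
stand-ins, not the actual btrfs code):

#include <pthread.h>

/* Stand-in for the btrfs subpage structure and its spin lock. */
struct subpage {
        pthread_spinlock_t lock;
        unsigned long writeback_bitmap;
};

/* Stand-in for btrfs_release_folio(): it takes the subpage lock. */
static void release_folio(struct subpage *sp)
{
        pthread_spin_lock(&sp->lock);   /* deadlocks if caller holds it */
        sp->writeback_bitmap = 0;
        pthread_spin_unlock(&sp->lock);
}

/* Buggy: the release path (reached via folio_end_writeback()) is
 * entered while the subpage spin lock is still held. */
static void end_writeback_buggy(struct subpage *sp)
{
        pthread_spin_lock(&sp->lock);
        sp->writeback_bitmap = 0;
        release_folio(sp);              /* self-deadlock on sp->lock */
        pthread_spin_unlock(&sp->lock);
}

/* Fixed: drop the spin lock first, then end the folio writeback. */
static void end_writeback_fixed(struct subpage *sp)
{
        pthread_spin_lock(&sp->lock);
        sp->writeback_bitmap = 0;
        pthread_spin_unlock(&sp->lock);
        release_folio(sp);
}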

Since the introduction of btrfs subpage support in v5.15, there have
been quite a few limitations:

- No RAID56 support
  Added in v5.19.

- No memory efficient scrub support
  Added in v6.4.

- No block-perfect compressed write support
  Previously btrfs required the dirty range to be fully page aligned,
  or it would skip compression completely.

  Added in v6.13.

- Various subpage related error handling fixes
  Added in v6.14.

- No support for creating inline data extents
- No partial uptodate page support
  This is a long-standing optimization supported by EXT4/XFS and is
  required to pass generic/563 with subpage block size.

The last two are addressed in this patchset.
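
To illustrate what partial uptodate page support buys us, here is a
minimal model (illustrative names, not the btrfs implementation) of
per-block uptodate tracking; with it, a block-aligned buffered write
can skip reading the rest of the page beforehand:

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SZ         65536u  /* e.g. 64K page */
#define BLOCK_SZ         4096u  /* e.g. 4K fs block */

struct page_state {
        uint32_t uptodate_bitmap;       /* one bit per block in the page */
};

/* Mark every block covered by [start, start + len) uptodate. */
static void mark_range_uptodate(struct page_state *ps,
                                uint32_t start, uint32_t len)
{
        uint32_t first = start / BLOCK_SZ;
        uint32_t last = (start + len - 1) / BLOCK_SZ;

        for (uint32_t i = first; i <= last; i++)
                ps->uptodate_bitmap |= 1u << i;
}

/*
 * A buffered write needs a prior read only when it is not block
 * aligned; a block-aligned write fully overwrites its blocks and can
 * simply mark them uptodate afterwards.
 */
static bool write_needs_read(uint32_t start, uint32_t len)
{
        return (start % BLOCK_SZ) != 0 || (len % BLOCK_SZ) != 0;
}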

And since all features are now implemented for the subpage case, the
long-standing experimental warning message can be dropped, as it has
been causing a lot of concern for users who check dmesg carefully.

Qu Wenruo (8):
  btrfs: prevent inline data extents read from touching blocks beyond
    its range
  btrfs: subpage: do not hold subpage spin lock when clearing folio
    writeback
  btrfs: fix the qgroup data free range for inline data extents
  btrfs: introduce a read path dedicated extent lock helper
  btrfs: make btrfs_do_readpage() to do block-by-block read
  btrfs: allow buffered write to avoid full page read if it's block
    aligned
  btrfs: allow inline data extents creation if block size < page size
  btrfs: remove the subpage related warning message

 fs/btrfs/disk-io.c      |   5 -
 fs/btrfs/extent_io.c    | 229 +++++++++++++++++++++++++++++++++++-----
 fs/btrfs/file.c         |   5 +-
 fs/btrfs/inode.c        |  29 ++---
 fs/btrfs/ordered-data.c |  23 +++-
 fs/btrfs/ordered-data.h |   8 +-
 fs/btrfs/subpage.c      |  10 +-
 7 files changed, 246 insertions(+), 63 deletions(-)

Comments

David Sterba Feb. 27, 2025, 11:16 a.m. UTC | #1
On Thu, Feb 27, 2025 at 04:24:38PM +1030, Qu Wenruo wrote:
> [...]
> 
> The last two are addressed in this patchset.

That's great, thank you very much. I think not all patches have a
Reviewed-by tag yet, but I'd like to get this to for-next very soon.
The next release is rc5 and we should have all features in. This also
means the 2K nodesize support, so we can give it more testing.
Qu Wenruo Feb. 28, 2025, 3:14 a.m. UTC | #2
On 2025/2/27 21:46, David Sterba wrote:
> On Thu, Feb 27, 2025 at 04:24:38PM +1030, Qu Wenruo wrote:
>> [...]
>>
>> The last two are addressed in this patchset.
> 
> That's great, thank you very much. I think not all patches have a
> Reviewed-by tag yet, but I'd like to get this to for-next very soon.
> The next release is rc5 and we should have all features in. This also
> means the 2K nodesize support, so we can give it more testing.

Now pushed to for-next, with all patches reviewed (or acked) by Filipe.

Many thanks to Filipe for the review!

For the 2K one, since it's just two small patches, I'm also fine with
pushing them now.
Just do not forget that we still need the progs patches, and there are
a dozen known failures from fstests, which I'm not yet able to address
any time soon.

Thanks,
Qu
David Sterba Feb. 28, 2025, 12:42 p.m. UTC | #3
On Fri, Feb 28, 2025 at 01:44:04PM +1030, Qu Wenruo wrote:
> For the 2K one, since it's just two small patches, I'm also fine with
> pushing them now.
> Just do not forget that we still need the progs patches, and there are
> a dozen known failures from fstests, which I'm not yet able to address
> any time soon.

Yeah, the mkfs support can go into the next minor progs release. As for
the status, we can print a warning and document it. No need to focus on
fixing the fstests failures; I think stress testing will be sufficient
for now.
Qu Wenruo March 2, 2025, 10:48 p.m. UTC | #4
On 2025/2/28 23:12, David Sterba wrote:
> On Fri, Feb 28, 2025 at 01:44:04PM +1030, Qu Wenruo wrote:
>> For the 2K one, since it's just two small patches, I'm also fine with
>> pushing them now.
>> Just do not forget that we still need the progs patches, and there are
>> a dozen known failures from fstests, which I'm not yet able to address
>> any time soon.
>
> Yeah, the mkfs support can go into the next minor progs release. As for
> the status, we can print a warning and document it. No need to focus on
> fixing the fstests failures; I think stress testing will be sufficient
> for now.
>

It turns out this will not be smooth sailing.

There is a fundamental conflict between our async extent mechanism and
the subpage writeback handling.

Our async extent mechanism can mark a folio writeback at any time.
(At submission time we keep the range locked and let the compression
work happen in another thread.)

If we have two async extents inside the same folio, we can hit the
following race (64K page size, 4K fs block size):

    0          32K         64K
    |<- AE 1 ->|<- AE 2 ->|

             Thread A (AE 1)           |            Thread B (AE 2)
--------------------------------------+------------------------------
submit_one_async_extent()             |
|- process_one_folio()                |
    |- subpage_set_writeback()         |
                                       |
/* IO finished */                     |
end_compressed_writeback()            |
|- btrfs_folio_clear_writeback()      |
    |  /* this is the last writeback   |
    |     holder, should end the folio |
    |     writeback flag */            |
    |- last = true                     |
    |                                  | submit_one_async_extent()
    |                                  | |- process_one_folio()
    |                                  |    |- subpage_set_writeback()
    |                                  |
    |                                  | /* IO finished */
    |                                  | end_compressed_writeback()
    |                                  | |- btrfs_folio_clear_writeback()
    |                                  |    | /* Again the last holder */
    |                                  |    |- last = true
    |- folio_end_writeback()           |    |- folio_end_writeback()

This leads to two threads calling folio_end_writeback() on the same
folio, which will eventually trigger VM_BUG_ON_FOLIO() or other
problems.
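
As a minimal user-space model of the race (illustrative names, not the
btrfs code): "last" is derived from a per-folio writeback bitmap under
the subpage spin lock, so when AE 2 only sets its bits after AE 1 has
already emptied the bitmap, both threads observe the bitmap going
empty:

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

struct folio_state {
        pthread_spinlock_t lock;
        uint32_t writeback_bitmap;      /* one bit per block under writeback */
};

static void set_writeback(struct folio_state *fs, uint32_t bits)
{
        pthread_spin_lock(&fs->lock);
        fs->writeback_bitmap |= bits;
        pthread_spin_unlock(&fs->lock);
}

/* Returns true when this call cleared the last writeback bit; the
 * caller then calls folio_end_writeback(). */
static bool clear_writeback(struct folio_state *fs, uint32_t bits)
{
        bool last;

        pthread_spin_lock(&fs->lock);
        fs->writeback_bitmap &= ~bits;
        last = (fs->writeback_bitmap == 0);
        pthread_spin_unlock(&fs->lock);
        return last;
}

/*
 * Interleaving from the diagram above:
 *   Thread A: set_writeback(fs, AE1_BITS); ... clear_writeback() -> true
 *   Thread B: set_writeback(fs, AE2_BITS); ... clear_writeback() -> true
 * Both see the bitmap become empty, so both end the folio writeback.
 */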

Furthermore, we cannot rely on folio->private for anything after the
folio_end_writeback() call, because that call may unmap/invalidate
the folio.

What's worse, iomap's extra writeback accounting won't help either.
Iomap holds one extra writeback count before submitting the blocks
inside the folio, then drops that extra count after all blocks have
been marked writeback (submitted).

That solution requires all the blocks inside the folio to be submitted
and marked writeback at the same time, but our async extent mechanism
breaks that requirement completely.
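
For comparison, a sketch of the iomap-style guard count (illustrative
names, an assumed simplification of what iomap does): one extra count
is held across the whole submission loop, so no bio completion can
drop the count to zero before every block has been submitted. This
only closes the race when all blocks of a folio are submitted in a
single pass:

#include <stdatomic.h>

struct folio_wb {
        atomic_int wb_count;    /* reaches 0 exactly once per writeback */
};

static void end_folio_writeback(struct folio_wb *f)
{
        (void)f;        /* stands in for the real folio_end_writeback() */
}

static void wb_get(struct folio_wb *f)
{
        atomic_fetch_add(&f->wb_count, 1);
}

static void wb_put(struct folio_wb *f)
{
        /* atomic_fetch_sub() returns the old value; 1 means we just
         * dropped the last count, so end the folio writeback once. */
        if (atomic_fetch_sub(&f->wb_count, 1) == 1)
                end_folio_writeback(f);
}

static void submit_block(struct folio_wb *f, int block)
{
        (void)f;
        (void)block;    /* queue the bio; its completion calls wb_put(f) */
}

/* Hold one guard count across the loop, so completions that finish
 * early cannot see the count hit zero until all blocks are submitted. */
static void submit_folio_blocks(struct folio_wb *f, int nr_blocks)
{
        wb_get(f);                      /* guard count */
        for (int i = 0; i < nr_blocks; i++) {
                wb_get(f);              /* one count per submitted block */
                submit_block(f, i);
        }
        wb_put(f);                      /* drop the guard */
}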


So far I have no better solution than to disable the block-perfect
compression first, then introduce the same extra writeback count
solution that iomap uses.

The proper solution is not only the iomap-style counting, but also
making the async extent submission itself mark the folios writeback.
That will be quite some work (part of the iomap migration plan).

Thanks,
Qu