[5/7] btrfs: allow buffered write to avoid full page read if it's block aligned

Message ID c496e3bdc3be2d828684c5536800d6a6554afa5a.1740354271.git.wqu@suse.com (mailing list archive)
State New
Series btrfs: make subpage handling be feature full

Commit Message

Qu Wenruo Feb. 23, 2025, 11:46 p.m. UTC
[BUG]
Since the introduction of block size (sector size) < page size support
for btrfs, test case generic/563 fails with 4K block size and 64K page
size:

    --- tests/generic/563.out	2024-04-25 18:13:45.178550333 +0930
    +++ /home/adam/xfstests-dev/results//generic/563.out.bad	2024-09-30 09:09:16.155312379 +0930
    @@ -3,7 +3,8 @@
     read is in range
     write is in range
     write -> read/write
    -read is in range
    +read has value of 8388608
    +read is NOT in range -33792 .. 33792
     write is in range
    ...

[CAUSE]
The test case creates an 8MiB file, then does buffered writes into the
whole 8MiB range using a 4K block size, overwriting the whole file.

On 4K page sized systems, since each write range covers a full block and
a full page, btrfs will not bother reading the page, just like what XFS
and EXT4 do.

But on 64K page sized systems, although the 4K sized write is still
block aligned, it's not page aligned any more, thus btrfs will read the
full page, causing more reads than expected and failing the test case.
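
To make the write pattern concrete, below is a minimal userspace sketch
(an illustration only, not the actual generic/563 code; the mount path
is a placeholder) of the block aligned but not page aligned overwrite:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];
            off_t off;
            int fd = open("/mnt/btrfs/file", O_WRONLY);

            if (fd < 0)
                    return 1;
            memset(buf, 0xaa, sizeof(buf));
            /* Each 4K write is block aligned; on a 64K page system it
             * is not page aligned, so the old code read the full folio
             * for every single write. */
            for (off = 0; off < 8 * 1024 * 1024; off += sizeof(buf))
                    pwrite(fd, buf, sizeof(buf), off);
            close(fd);
            return 0;
    }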

[FIX]
To skip the full page read, we need the following modifications:

- Do not trigger a full page read as long as the buffered write is block
  aligned.
  This is pretty simple, done by modifying the check inside
  prepare_uptodate_folio().

- Skip already uptodate blocks during the full page read.
  Otherwise we can end up with the following data corruption (a
  reproducer sketch follows this list):

  0       32K        64K
  |///////|          |

  Where the file range [0, 32K) is dirtied by a buffered write, and the
  remaining range [32K, 64K) is not.

  When reading the full page, since [0, 32K) is only dirtied but not
  yet written back, there is no data extent map for it, only a hole
  covering [0, 64K).

  If we continue reading the full page range [0, 64K), the dirtied range
  will be filled with 0 (since there is only a hole covering the whole
  range).
  This causes the dirtied range to get lost.
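
A hedged reproducer sketch of that scenario (illustrative only; the
mount path is a placeholder, and it assumes a 4K block size and 64K
page size):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char wbuf[32768], rbuf[4096];
            int fd = open("/mnt/btrfs/file",
                          O_RDWR | O_CREAT | O_TRUNC, 0644);

            if (fd < 0)
                    return 1;
            /* Sparse 64K file: on disk it is one hole covering [0, 64K). */
            ftruncate(fd, 65536);
            memset(wbuf, 0xaa, sizeof(wbuf));
            /* Dirty [0, 32K) in the page cache; the write is block
             * aligned, so with the first modification no folio read
             * happens, and nothing has been written back yet. */
            pwrite(fd, wbuf, sizeof(wbuf), 0);
            /* Reading [32K, 36K) triggers a full folio read.  Without
             * the uptodate-block skip, the hole would zero out the
             * dirty but not yet written back [0, 32K), losing the
             * write. */
            pread(fd, rbuf, sizeof(rbuf), 32768);
            close(fd);
            return 0;
    }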

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 4 ++++
 fs/btrfs/file.c      | 5 +++--
 2 files changed, 7 insertions(+), 2 deletions(-)

Comments

Filipe Manana Feb. 25, 2025, 5:55 p.m. UTC | #1
On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> Since the introduction of block size (sector size) < page size support
> for btrfs, test case generic/563 fails with 4K block size and 64K page
> size:
>
>     --- tests/generic/563.out   2024-04-25 18:13:45.178550333 +0930
>     +++ /home/adam/xfstests-dev/results//generic/563.out.bad    2024-09-30 09:09:16.155312379 +0930
>     @@ -3,7 +3,8 @@
>      read is in range
>      write is in range
>      write -> read/write
>     -read is in range
>     +read has value of 8388608
>     +read is NOT in range -33792 .. 33792
>      write is in range
>     ...
>
> [CAUSE]
> The test case creates an 8MiB file, then does buffered writes into the
> whole 8MiB range using a 4K block size, overwriting the whole file.
>
> On 4K page sized systems, since each write range covers a full block and
> a full page, btrfs will not bother reading the page, just like what XFS
> and EXT4 do.
>
> But on 64K page sized systems, although the 4K sized write is still
> block aligned, it's not page aligned any more, thus btrfs will read the
> full page, causing more reads than expected and failing the test case.

This part "causing more read than expected" got me puzzled, but it's explained
in the "Fix" section below. It's more like "causing previously dirty
blocks within the page to be zeroed out".

>
> [FIX]
> To skip the full page read, we need the following modifications:
>
> - Do not trigger a full page read as long as the buffered write is block
>   aligned.
>   This is pretty simple, done by modifying the check inside
>   prepare_uptodate_folio().
>
> - Skip already uptodate blocks during the full page read.
>   Otherwise we can end up with the following data corruption:
>
>   0       32K        64K
>   |///////|          |
>
>   Where the file range [0, 32K) is dirtied by a buffered write, and the
>   remaining range [32K, 64K) is not.
>
>   When reading the full page, since [0, 32K) is only dirtied but not
>   yet written back, there is no data extent map for it, only a hole
>   covering [0, 64K).
>
>   If we continue reading the full page range [0, 64K), the dirtied range
>   will be filled with 0 (since there is only a hole covering the whole
>   range).
>   This causes the dirtied range to get lost.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/extent_io.c | 4 ++++
>  fs/btrfs/file.c      | 5 +++--
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index b3a4a94212c9..d7240e295bfc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -977,6 +977,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>                         end_folio_read(folio, true, cur, end - cur + 1);
>                         break;
>                 }
> +               if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
> +                       end_folio_read(folio, true, cur, blocksize);
> +                       continue;
> +               }
>                 em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
>                 if (IS_ERR(em)) {
>                         end_folio_read(folio, false, cur, end + 1 - cur);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 00c68b7b2206..e3d63192281d 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -804,14 +804,15 @@ static int prepare_uptodate_folio(struct inode *inode, struct folio *folio, u64
>  {
>         u64 clamp_start = max_t(u64, pos, folio_pos(folio));
>         u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
> +       const u32 sectorsize = inode_to_fs_info(inode)->sectorsize;
>         int ret = 0;
>
>         if (folio_test_uptodate(folio))
>                 return 0;
>
>         if (!force_uptodate &&
> -           IS_ALIGNED(clamp_start, PAGE_SIZE) &&
> -           IS_ALIGNED(clamp_end, PAGE_SIZE))
> +           IS_ALIGNED(clamp_start, sectorsize) &&
> +           IS_ALIGNED(clamp_end, sectorsize))
>                 return 0;
>
>         ret = btrfs_read_folio(NULL, folio);
> --
> 2.48.1
>
>
Qu Wenruo Feb. 25, 2025, 9:16 p.m. UTC | #2
On 2025/2/26 04:25, Filipe Manana wrote:
> On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> Since the introduction of block size (sector size) < page size support
>> for btrfs, test case generic/563 fails with 4K block size and 64K page
>> size:
>>
>>      --- tests/generic/563.out   2024-04-25 18:13:45.178550333 +0930
>>      +++ /home/adam/xfstests-dev/results//generic/563.out.bad    2024-09-30 09:09:16.155312379 +0930
>>      @@ -3,7 +3,8 @@
>>       read is in range
>>       write is in range
>>       write -> read/write
>>      -read is in range
>>      +read has value of 8388608
>>      +read is NOT in range -33792 .. 33792
>>       write is in range
>>      ...
>>
>> [CAUSE]
>> The test case creates an 8MiB file, then does buffered writes into the
>> whole 8MiB range using a 4K block size, overwriting the whole file.
>>
>> On 4K page sized systems, since each write range covers a full block and
>> a full page, btrfs will not bother reading the page, just like what XFS
>> and EXT4 do.
>>
>> But on 64K page sized systems, although the 4K sized write is still
>> block aligned, it's not page aligned any more, thus btrfs will read the
>> full page, causing more reads than expected and failing the test case.
> 
> This part "causing more reads than expected" got me puzzled, but it's
> explained in the [FIX] section below. It's more like "causing previously
> dirty blocks within the page to be zeroed out".

It's not exactly the same.

Before this patch, we never allowed a folio to be dirtied if it's not
uptodate (unless we're dirtying the full folio).

So there are no "previously dirty blocks to be zeroed out".

I have no better way to explain the situation here; it's really about
whether we need to read the folio before a buffered write.

I'd be super happy if you have a better expression here so that no one
would be confused.

Thanks,
Qu

> 
>>
>> [FIX]
>> To skip the full page read, we need the following modifications:
>>
>> - Do not trigger a full page read as long as the buffered write is block
>>    aligned.
>>    This is pretty simple, done by modifying the check inside
>>    prepare_uptodate_folio().
>>
>> - Skip already uptodate blocks during the full page read.
>>    Otherwise we can end up with the following data corruption:
>>
>>    0       32K        64K
>>    |///////|          |
>>
>>    Where the file range [0, 32K) is dirtied by a buffered write, and the
>>    remaining range [32K, 64K) is not.
>>
>>    When reading the full page, since [0, 32K) is only dirtied but not
>>    yet written back, there is no data extent map for it, only a hole
>>    covering [0, 64K).
>>
>>    If we continue reading the full page range [0, 64K), the dirtied range
>>    will be filled with 0 (since there is only a hole covering the whole
>>    range).
>>    This causes the dirtied range to get lost.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> Looks good, thanks.
> 
>> ---
>>   fs/btrfs/extent_io.c | 4 ++++
>>   fs/btrfs/file.c      | 5 +++--
>>   2 files changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index b3a4a94212c9..d7240e295bfc 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -977,6 +977,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>>                          end_folio_read(folio, true, cur, end - cur + 1);
>>                          break;
>>                  }
>> +               if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
>> +                       end_folio_read(folio, true, cur, blocksize);
>> +                       continue;
>> +               }
>>                  em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
>>                  if (IS_ERR(em)) {
>>                          end_folio_read(folio, false, cur, end + 1 - cur);
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 00c68b7b2206..e3d63192281d 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -804,14 +804,15 @@ static int prepare_uptodate_folio(struct inode *inode, struct folio *folio, u64
>>   {
>>          u64 clamp_start = max_t(u64, pos, folio_pos(folio));
>>          u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
>> +       const u32 sectorsize = inode_to_fs_info(inode)->sectorsize;
>>          int ret = 0;
>>
>>          if (folio_test_uptodate(folio))
>>                  return 0;
>>
>>          if (!force_uptodate &&
>> -           IS_ALIGNED(clamp_start, PAGE_SIZE) &&
>> -           IS_ALIGNED(clamp_end, PAGE_SIZE))
>> +           IS_ALIGNED(clamp_start, sectorsize) &&
>> +           IS_ALIGNED(clamp_end, sectorsize))
>>                  return 0;
>>
>>          ret = btrfs_read_folio(NULL, folio);
>> --
>> 2.48.1
>>
>>
Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b3a4a94212c9..d7240e295bfc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -977,6 +977,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 			end_folio_read(folio, true, cur, end - cur + 1);
 			break;
 		}
+		if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
+			end_folio_read(folio, true, cur, blocksize);
+			continue;
+		}
 		em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
 		if (IS_ERR(em)) {
 			end_folio_read(folio, false, cur, end + 1 - cur);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 00c68b7b2206..e3d63192281d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -804,14 +804,15 @@ static int prepare_uptodate_folio(struct inode *inode, struct folio *folio, u64
 {
 	u64 clamp_start = max_t(u64, pos, folio_pos(folio));
 	u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
+	const u32 sectorsize = inode_to_fs_info(inode)->sectorsize;
 	int ret = 0;
 
 	if (folio_test_uptodate(folio))
 		return 0;
 
 	if (!force_uptodate &&
-	    IS_ALIGNED(clamp_start, PAGE_SIZE) &&
-	    IS_ALIGNED(clamp_end, PAGE_SIZE))
+	    IS_ALIGNED(clamp_start, sectorsize) &&
+	    IS_ALIGNED(clamp_end, sectorsize))
 		return 0;
 
 	ret = btrfs_read_folio(NULL, folio);
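
For illustration only (not part of the patch): a tiny userspace demo of
what the relaxed check changes. IS_ALIGNED is redefined locally to
mirror the kernel macro, and the 4K/64K sizes are assumptions matching
the commit message's setup:

    #include <stdio.h>

    /* Local stand-in for the kernel's IS_ALIGNED() macro. */
    #define IS_ALIGNED(x, a)  (((x) & ((a) - 1)) == 0)

    int main(void)
    {
            /* A 4K write at offset 8K inside a 64K folio, as in generic/563. */
            unsigned long clamp_start = 8192, clamp_end = 12288;

            /* Old check: the read is skipped only for a full folio overwrite. */
            printf("page aligned:  %d\n",
                   IS_ALIGNED(clamp_start, 65536UL) &&
                   IS_ALIGNED(clamp_end, 65536UL));
            /* New check: any block aligned range skips the read. */
            printf("block aligned: %d\n",
                   IS_ALIGNED(clamp_start, 4096UL) &&
                   IS_ALIGNED(clamp_end, 4096UL));
            return 0;
    }

The demo prints 0 for the old page-size check and 1 for the new
block-size check, which is exactly why the 4K writes now avoid the
full folio read.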