
[2/2] btrfs: allow buffered write to skip full page if it's sector aligned

Message ID ac2639ec4e9ac176d33e95ef7ecf008fa6be5461.1727833878.git.wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: fix generic/563 failures with sectorsize < PAGE_SIZE

Commit Message

Qu Wenruo Oct. 2, 2024, 1:52 a.m. UTC
[BUG]
Since btrfs gained support for sector size < page size, test case
generic/563 fails with a 4K sector size and a 64K page size:

    --- tests/generic/563.out	2024-04-25 18:13:45.178550333 +0930
    +++ /home/adam/xfstests-dev/results//generic/563.out.bad	2024-09-30 09:09:16.155312379 +0930
    @@ -3,7 +3,8 @@
     read is in range
     write is in range
     write -> read/write
    -read is in range
    +read has value of 8388608
    +read is NOT in range -33792 .. 33792
     write is in range
    ...

[CAUSE]
The test case creates an 8MiB file, then overwrites the whole file with
buffered writes using a 4K block size.

On systems with a 4K page size, each write covers a full sector and a
full page, so btrfs does not bother reading the page first, just like
XFS and EXT4.

But on systems with a 64K page size, although the write is sector
aligned, it is not page aligned, so btrfs still applies the full page
alignment check and reads the whole page out.

This causes extra data reads (the reported value of 8388608 is exactly
the 8MiB file size, i.e. the whole file got read back) and makes the
test case fail.
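
For reference, this is the pre-patch alignment check in
prepare_uptodate_page(), quoted from the fs/btrfs/file.c hunk below:
with a 64K PAGE_SIZE, a 4K sector-aligned write passes neither
IS_ALIGNED() check, so it always falls through to the full page read.

    if (!force_uptodate &&
        IS_ALIGNED(clamp_start, PAGE_SIZE) &&  /* fails for a 4K write */
        IS_ALIGNED(clamp_end, PAGE_SIZE))      /* on a 64K page system */
            return 0;

    /* Not page aligned, so the whole folio gets read from disk. */
    ret = btrfs_read_folio(NULL, folio);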

[FIX]
To skip the full page read, we need the following modifications:

- Do not trigger a full page read as long as the buffered write is
  sector aligned.
  This only requires relaxing the check inside
  prepare_uptodate_page().

- Skip already uptodate sectors during the full page read, as sketched
  after this list.
  Otherwise we can end up with the following data corruption:

  0       32K        64K
  |///////|          |

  Here the file range [0, 32K) is dirtied by a buffered write, while
  the remaining range [32K, 64K) is not.

  When reading the full page, since [0, 32K) is only dirtied but not
  yet written back, there is no data extent map for it, only a hole
  covering [0, 64K).

  If we read the full page range [0, 64K) anyway, the dirtied range
  will be filled with zeros (since there is only a hole covering the
  whole range), and the dirtied data is lost.
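
The sketch below shows the check added to the per-sector loop of
btrfs_do_readpage(), matching the fs/btrfs/extent_io.c hunk further
down: a sector that the buffered write already marked uptodate in the
subpage bitmap is finished immediately instead of being mapped and read
from disk.

    if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
            /*
             * This sector holds dirty data that has not been written
             * back yet; reading the (hole) extent would zero it out.
             */
            end_folio_read(folio, true, cur, blocksize);
            continue;
    }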

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 6 ++++++
 fs/btrfs/file.c      | 5 +++--
 2 files changed, 9 insertions(+), 2 deletions(-)

Comments

Qu Wenruo Oct. 9, 2024, 2:57 a.m. UTC | #1
On 2024/10/2 11:22, Qu Wenruo wrote:
> [...]

After more testing, this patch seems to cause generic/095 to hang, due
to some folio not being properly unlocked.

I'll dig deeper to find out why this is happening.

Thanks,
Qu

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 09eb8a204375..ea118c89e365 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -981,6 +981,12 @@  static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 			end_folio_read(folio, true, cur, end - cur + 1);
 			break;
 		}
+
+		if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
+			end_folio_read(folio, true, cur, blocksize);
+			continue;
+		}
+
 		em = __get_extent_map(inode, folio, cur, end - cur + 1,
 				      em_cached);
 		if (IS_ERR(em)) {
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fe4c3b31447a..64e28ebd2d0b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -860,6 +860,7 @@  static int prepare_uptodate_page(struct inode *inode,
 				 struct page *page, u64 pos,
 				 u64 len, bool force_uptodate)
 {
+	const u32 sectorsize = inode_to_fs_info(inode)->sectorsize;
 	struct folio *folio = page_folio(page);
 	u64 clamp_start = max_t(u64, pos, folio_pos(folio));
 	u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
@@ -869,8 +870,8 @@  static int prepare_uptodate_page(struct inode *inode,
 		return 0;
 
 	if (!force_uptodate &&
-	    IS_ALIGNED(clamp_start, PAGE_SIZE) &&
-	    IS_ALIGNED(clamp_end, PAGE_SIZE))
+	    IS_ALIGNED(clamp_start, sectorsize) &&
+	    IS_ALIGNED(clamp_end, sectorsize))
 		return 0;
 
 	ret = btrfs_read_folio(NULL, folio);