diff mbox series

[v5,1/2] iomap: fix zero padding data issue in concurrent append writes

Message ID 20241127063503.2200005-1-leo.lilong@huawei.com (mailing list archive)
State New
Headers show
Series [v5,1/2] iomap: fix zero padding data issue in concurrent append writes | expand

Commit Message

Long Li Nov. 27, 2024, 6:35 a.m. UTC
During concurrent append writes to XFS filesystem, zero padding data
may appear in the file after power failure. This happens due to imprecise
disk size updates when handling write completion.

Consider this scenario with concurrent append writes same file:

  Thread 1:                  Thread 2:
  ------------               -----------
  write [A, A+B]
  update inode size to A+B
  submit I/O [A, A+BS]
                             write [A+B, A+B+C]
                             update inode size to A+B+C
  <I/O completes, updates disk size to min(A+B+C, A+BS)>
  <power failure>

After reboot:
  1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C]

  |<         Block Size (BS)      >|
  |DDDDDDDDDDDDDDDD0000000000000000|
  ^               ^        ^
  A              A+B     A+B+C
                         (EOF)

  2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS]

  |<         Block Size (BS)      >|<           Block Size (BS)    >|
  |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000|
  ^               ^                ^               ^
  A              A+B              A+BS           A+B+C
                                  (EOF)

  D = Valid Data
  0 = Zero Padding

The issue stems from disk size being set to min(io_offset + io_size,
inode->i_size) at I/O completion. Since io_offset+io_size is block
size granularity, it may exceed the actual valid file data size. In
the case of concurrent append writes, inode->i_size may be larger
than the actual range of valid file data written to disk, leading to
inaccurate disk size updates.

This patch modifies the meaning of io_size to represent the size of
valid data within EOF in an ioend. If the ioend spans beyond i_size,
io_size will be trimmed to provide the file with more accurate size
information. This is particularly useful for on-disk size updates
at completion time.

After this change, ioends that span i_size will not grow or merge with
other ioends in concurrent scenarios. However, these cases that need
growth/merging rarely occur and it seems no noticeable performance impact.
Although rounding up io_size could enable ioend growth/merging in these
scenarios, we decided to keep the code simple after discussion [1].

Another benefit is that it makes the xfs_ioend_is_append() check more
accurate, which can reduce unnecessary end bio callbacks of xfs_end_bio()
in certain scenarios, such as repeated writes at the file tail without
extending the file size.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link[1]: https://patchwork.kernel.org/project/xfs/patch/20241113091907.56937-1-leo.lilong@huawei.com
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
v4->v5: remove iomap_ioend_size_aligned() and don't round up io_size for
	ioend growth/merging to keep the code simple. 
 fs/iomap/buffered-io.c | 10 ++++++++++
 include/linux/iomap.h  |  2 +-
 2 files changed, 11 insertions(+), 1 deletion(-)

Comments

Darrick J. Wong Nov. 27, 2024, 4:28 p.m. UTC | #1
On Wed, Nov 27, 2024 at 02:35:02PM +0800, Long Li wrote:
> During concurrent append writes to XFS filesystem, zero padding data
> may appear in the file after power failure. This happens due to imprecise
> disk size updates when handling write completion.
> 
> Consider this scenario with concurrent append writes same file:
> 
>   Thread 1:                  Thread 2:
>   ------------               -----------
>   write [A, A+B]
>   update inode size to A+B
>   submit I/O [A, A+BS]
>                              write [A+B, A+B+C]
>                              update inode size to A+B+C
>   <I/O completes, updates disk size to min(A+B+C, A+BS)>
>   <power failure>
> 
> After reboot:
>   1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C]
> 
>   |<         Block Size (BS)      >|
>   |DDDDDDDDDDDDDDDD0000000000000000|
>   ^               ^        ^
>   A              A+B     A+B+C
>                          (EOF)
> 
>   2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS]
> 
>   |<         Block Size (BS)      >|<           Block Size (BS)    >|
>   |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000|
>   ^               ^                ^               ^
>   A              A+B              A+BS           A+B+C
>                                   (EOF)
> 
>   D = Valid Data
>   0 = Zero Padding
> 
> The issue stems from disk size being set to min(io_offset + io_size,
> inode->i_size) at I/O completion. Since io_offset+io_size is block
> size granularity, it may exceed the actual valid file data size. In
> the case of concurrent append writes, inode->i_size may be larger
> than the actual range of valid file data written to disk, leading to
> inaccurate disk size updates.
> 
> This patch modifies the meaning of io_size to represent the size of
> valid data within EOF in an ioend. If the ioend spans beyond i_size,
> io_size will be trimmed to provide the file with more accurate size
> information. This is particularly useful for on-disk size updates
> at completion time.
> 
> After this change, ioends that span i_size will not grow or merge with
> other ioends in concurrent scenarios. However, these cases that need
> growth/merging rarely occur and it seems no noticeable performance impact.
> Although rounding up io_size could enable ioend growth/merging in these
> scenarios, we decided to keep the code simple after discussion [1].
> 
> Another benefit is that it makes the xfs_ioend_is_append() check more
> accurate, which can reduce unnecessary end bio callbacks of xfs_end_bio()
> in certain scenarios, such as repeated writes at the file tail without
> extending the file size.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

Er... filesystem iomap wasn't in 2.6.12, how did you come up with this?
At most this is a fix against a roughly 6.1 era kernel, right?

> Link[1]: https://patchwork.kernel.org/project/xfs/patch/20241113091907.56937-1-leo.lilong@huawei.com
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> ---
> v4->v5: remove iomap_ioend_size_aligned() and don't round up io_size for
> 	ioend growth/merging to keep the code simple. 
>  fs/iomap/buffered-io.c | 10 ++++++++++
>  include/linux/iomap.h  |  2 +-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d42f01e0fc1c..dc360c8e5641 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1774,6 +1774,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  {
>  	struct iomap_folio_state *ifs = folio->private;
>  	size_t poff = offset_in_folio(folio, pos);
> +	loff_t isize = i_size_read(inode);
>  	int error;
>  
>  	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
> @@ -1789,7 +1790,16 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  
>  	if (ifs)
>  		atomic_add(len, &ifs->write_bytes_pending);
> +
> +	/*
> +	 * If the ioend spans i_size, trim io_size to the former to provide
> +	 * the fs with more accurate size information. This is useful for
> +	 * completion time on-disk size updates.

I think it's useful to preserve the diagram showing exactly what problem
you're solving:

	/*
	 * Clamp io_offset and io_size to the incore EOF so that ondisk
	 * file size updates in the ioend completion are byte-accurate.
	 * This avoids recovering files with zeroed tail regions when
	 * writeback races with appending writes:
	 *
	 *    Thread 1:                  Thread 2:
	 *    ------------               -----------
	 *    write [A, A+B]
	 *    update inode size to A+B
	 *    submit I/O [A, A+BS]
	 *                               write [A+B, A+B+C]
	 *                               update inode size to A+B+C
	 *    <I/O completes, updates disk size to min(A+B+C, A+BS)>
	 *    <power failure>
	 *
	 *  After reboot:
	 *    1) with A+B+C < A+BS, the file has zero padding in range
	 *       [A+B, A+B+C]
	 *
	 *    |<     Block Size (BS)    >|
	 *    |DDDDDDDDDDDD00000000000000|
	 *    ^           ^        ^
	 *    A          A+B     A+B+C
	 *                       (EOF)
	 *
	 *    2) with A+B+C > A+BS, the file has zero padding in range
	 *       [A+B, A+BS]
	 *
	 *    |<     Block Size (BS)    >|<      Block Size (BS)    >|
	 *    |DDDDDDDDDDDD00000000000000|000000000000000000000000000|
	 *    ^           ^              ^           ^
	 *    A          A+B            A+BS       A+B+C
	 *                              (EOF)
	 *
	 *    D = Valid Data
	 *    0 = Zero Padding
	 *
	 * Note that this defeats the ability to chain the ioends of
	 * appending writes.
	 */

(I reduced the blocksize a bit for wrapping purposes)

The logic looks ok, but I'm curious about how you landed at 2.6.12-rc
for the fixes tag.

--D

> +	 */
>  	wpc->ioend->io_size += len;
> +	if (wpc->ioend->io_offset + wpc->ioend->io_size > isize)
> +		wpc->ioend->io_size = isize - wpc->ioend->io_offset;
> +
>  	wbc_account_cgroup_owner(wbc, folio, len);
>  	return 0;
>  }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5675af6b740c..75bf54e76f3b 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -335,7 +335,7 @@ struct iomap_ioend {
>  	u16			io_type;
>  	u16			io_flags;	/* IOMAP_F_* */
>  	struct inode		*io_inode;	/* file being written to */
> -	size_t			io_size;	/* size of the extent */
> +	size_t			io_size;	/* size of data within eof */
>  	loff_t			io_offset;	/* offset in the file */
>  	sector_t		io_sector;	/* start sector of ioend */
>  	struct bio		io_bio;		/* MUST BE LAST! */
> -- 
> 2.39.2
>
diff mbox series

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d42f01e0fc1c..dc360c8e5641 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1774,6 +1774,7 @@  static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 {
 	struct iomap_folio_state *ifs = folio->private;
 	size_t poff = offset_in_folio(folio, pos);
+	loff_t isize = i_size_read(inode);
 	int error;
 
 	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
@@ -1789,7 +1790,16 @@  static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 
 	if (ifs)
 		atomic_add(len, &ifs->write_bytes_pending);
+
+	/*
+	 * If the ioend spans i_size, trim io_size to the former to provide
+	 * the fs with more accurate size information. This is useful for
+	 * completion time on-disk size updates.
+	 */
 	wpc->ioend->io_size += len;
+	if (wpc->ioend->io_offset + wpc->ioend->io_size > isize)
+		wpc->ioend->io_size = isize - wpc->ioend->io_offset;
+
 	wbc_account_cgroup_owner(wbc, folio, len);
 	return 0;
 }
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 5675af6b740c..75bf54e76f3b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -335,7 +335,7 @@  struct iomap_ioend {
 	u16			io_type;
 	u16			io_flags;	/* IOMAP_F_* */
 	struct inode		*io_inode;	/* file being written to */
-	size_t			io_size;	/* size of the extent */
+	size_t			io_size;	/* size of data within eof */
 	loff_t			io_offset;	/* offset in the file */
 	sector_t		io_sector;	/* start sector of ioend */
 	struct bio		io_bio;		/* MUST BE LAST! */