iomap: fix zero padding data issue in concurrent append writes

Message ID 20241108122738.2617669-1-leo.lilong@huawei.com (mailing list archive)
State Deferred, archived
Series iomap: fix zero padding data issue in concurrent append writes

Commit Message

Long Li Nov. 8, 2024, 12:27 p.m. UTC
During concurrent append writes to an XFS filesystem, zero padding data
may appear in the file after a power failure. This happens due to
imprecise disk size updates when handling write completion.

Consider this scenario with concurrent append writes to the same file:

  Thread 1:                  Thread 2:
  ------------               -----------
  write [A, A+B]
  update inode size to A+B
  submit I/O [A, A+BS]
                             write [A+B, A+B+C]
                             update inode size to A+B+C
  <I/O completes, updates disk size to A+B+C>
  <power failure>

After reboot, file has zero padding in range [A+B, A+B+C]:

  |<         Block Size (BS)      >|
  |DDDDDDDDDDDDDDDD0000000000000000|
  ^               ^        ^
  A              A+B      A+B+C (EOF)

  D = Valid Data
  0 = Zero Padding

The issue stems from the disk size being set to min(io_offset + io_size,
inode->i_size) at I/O completion. Since io_offset + io_size is rounded
up to block size granularity, it may exceed the end of the valid file
data. With concurrent append writes, inode->i_size may be larger than
the range of valid file data actually written to disk, leading to an
inaccurate disk size update.
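
To make this concrete, take BS = 4096, A = 0, B = 2048, C = 1024 from
the diagram above (a simplified model of the completion-time logic,
not the exact kernel code):

	loff_t io_offset = 0;		/* A: block-aligned ioend start */
	loff_t io_size	 = 4096;	/* A + BS: rounded up to a block */
	loff_t isize	 = 3072;	/* A + B + C: already covers thread
					   2's not-yet-written data */
	loff_t valid_end = 2048;	/* A + B: all this ioend wrote */

	/* old: min(io_offset + io_size, isize) == 3072, so a power
	 * failure exposes zeroes in [2048, 3072) */
	/* new: min(valid_end, isize) == 2048, matching the on-disk data */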

This patch introduces ioend->io_end to track the end position of the
valid data in the ioend, rather than relying solely on ioend->io_size.
This allows more precise disk size updates and avoids the zero padding
issue. As a side benefit, it also makes the xfs_ioend_is_append() check
more accurate, which can avoid unnecessary xfs_end_bio() completion
callbacks in certain scenarios, such as repeated writes at the file
tail that do not extend the file size.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/iomap/buffered-io.c | 4 ++++
 fs/xfs/xfs_aops.c      | 7 ++++---
 include/linux/iomap.h  | 1 +
 3 files changed, 9 insertions(+), 3 deletions(-)

Comments

Christoph Hellwig Nov. 8, 2024, 2:55 p.m. UTC | #1
On Fri, Nov 08, 2024 at 08:27:38PM +0800, Long Li wrote:
> After reboot, file has zero padding in range [A+B, A+B+C]:
> 
>   |<         Block Size (BS)      >|
>   |DDDDDDDDDDDDDDDD0000000000000000|
>   ^               ^        ^
>   A              A+B      A+B+C (EOF)
> 
>   D = Valid Data
>   0 = Zero Padding
> 
> The issue stems from the disk size being set to min(io_offset + io_size,
> inode->i_size) at I/O completion. Since io_offset + io_size is rounded
> up to block size granularity, it may exceed the end of the valid file
> data. With concurrent append writes, inode->i_size may be larger than
> the range of valid file data actually written to disk, leading to an
> inaccurate disk size update.

Oh, interesting one.  Do you have a reproducer we could wire up
to xfstests?

> This patch introduces ioend->io_end to track the end position of the
> valid data in the ioend, rather than relying solely on ioend->io_size.
> This allows more precise disk size updates and avoids the zero padding
> issue. As a side benefit, it also makes the xfs_ioend_is_append() check
> more accurate, which can avoid unnecessary xfs_end_bio() completion
> callbacks in certain scenarios, such as repeated writes at the file
> tail that do not extend the file size.

Hmm.  Can we do away with having two members for the size by just
rounding up to the block size for the block-based operations?
Long Li Nov. 9, 2024, 7:13 a.m. UTC | #2
On Fri, Nov 08, 2024 at 06:55:23AM -0800, Christoph Hellwig wrote:
> On Fri, Nov 08, 2024 at 08:27:38PM +0800, Long Li wrote:
> > After reboot, file has zero padding in range [A+B, A+B+C]:
> > 
> >   |<         Block Size (BS)      >|
> >   |DDDDDDDDDDDDDDDD0000000000000000|
> >   ^               ^        ^
> >   A              A+B      A+B+C (EOF)
> > 
> >   D = Valid Data
> >   0 = Zero Padding
> > 
> > The issue stems from the disk size being set to min(io_offset + io_size,
> > inode->i_size) at I/O completion. Since io_offset + io_size is rounded
> > up to block size granularity, it may exceed the end of the valid file
> > data. With concurrent append writes, inode->i_size may be larger than
> > the range of valid file data actually written to disk, leading to an
> > inaccurate disk size update.
> 
> Oh, interesting one.  Do you have a reproducer we could wire up
> to xfstests?
> 

Yes, I have a simple reproducer, but it would require significant
work to incorporate it into xfstests.

> > This patch introduces ioend->io_end to track the end position of the
> > valid data in the ioend, rather than relying solely on ioend->io_size.
> > This allows more precise disk size updates and avoids the zero padding
> > issue. As a side benefit, it also makes the xfs_ioend_is_append() check
> > more accurate, which can avoid unnecessary xfs_end_bio() completion
> > callbacks in certain scenarios, such as repeated writes at the file
> > tail that do not extend the file size.
> 
> Hmm.  Can we do away with having two members for the size by just
> rounding up to the block size for the block-based operations?
> 

If we only use one size record, we can remove io_size and keep only
io_end to record the tail end of the valid file data in the ioend.
Meanwhile, we can add a wrapper function iomap_ioend_iosize() to get
the extent size of the ioend, replacing the existing ioend->io_size.
Would this work?
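
Something like the following, as a rough sketch only (assuming
io_offset stays block-aligned; the round_up()/i_blocksize() usage
would need to be double checked):

static inline size_t iomap_ioend_iosize(const struct iomap_ioend *ioend)
{
	/* recover the block-granular extent size from the valid-data end */
	return round_up(ioend->io_end, i_blocksize(ioend->io_inode)) -
			ioend->io_offset;
}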

Thanks,
Long Li
Christoph Hellwig Nov. 11, 2024, 5:45 a.m. UTC | #3
On Sat, Nov 09, 2024 at 03:13:53PM +0800, Long Li wrote:
> > Oh, interesting one.  Do you have a reproducer we could wire up
> > to xfstests?
> > 
> 
> Yes, I have a simple reproducer, but it would require significant
> work to incorporate it into xfstests.

Can you at least share it?  We might be able to help turn it into
a test.

> If we only use one size record, we can remove io_size and keep only
> io_end to record the tail end of the valid file data in the ioend.
> Meanwhile, we can add a wrapper function iomap_ioend_iosize() to get
> the extent size of the ioend, replacing the existing ioend->io_size.
> Would this work?

I'd probably still use offset + size because it feels more natural
and causes less churn, but otherwise this sounds good to me.
Long Li Nov. 11, 2024, 2:16 p.m. UTC | #4
On Sun, Nov 10, 2024 at 09:45:19PM -0800, Christoph Hellwig wrote:
> On Sat, Nov 09, 2024 at 03:13:53PM +0800, Long Li wrote:
> > > Oh, interesting one.  Do you have a reproducer we could wire up
> > > to xfstests?
> > > 
> > 
> > Yes, I have a simple reproducer, but it would require significant
> > work to incorporate it into xfstests.
> 
> Can you at least share it?  We might be able to help turn it into
> a test.
> 

At first, we used the following scripts to find the problem, but it
was difficult to reproduce this way; run test.sh after system startup.

--------------------filesystem.sh---------------------
#!/bin/bash
index=$1
value=$2

while true; do
    echo "$value" >> /mnt/fs_"$index"/file1
    echo "$value" >> /mnt/fs_"$index"/file2
    cp /mnt/fs_"$index"/file1 /mnt/fs_"$index"/file3
    cat /mnt/fs_"$index"/file1 /mnt/fs_"$index"/file2
    mv /mnt/fs_"$index"/file3 /mnt/fs_"$index"/file1
done

--------------------test.sh--------------------------
#!/bin/bash
mount /dev/sda /mnt
cat -v /mnt/* | grep @
if [ $? -eq 0 ]; then
        echo "found padding data"
        exit 1
fi

sh -x filesystem.sh 1 1111 &>/dev/null &
sh -x filesystem.sh 1 2222 &>/dev/null &
sh -x filesystem.sh 1 3333 &>/dev/null &
sleep $(($RANDOM%30))
echo "reboot..."
echo b > /proc/sysrq-trigger
------------------------------------------------------

I later reproduced it by adding a delay to the kernel code and used
that to verify the fix.

1) add a sleep in xfs_end_ioend()

--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -130,8 +130,10 @@ xfs_end_ioend(
        else if (ioend->io_type == IOMAP_UNWRITTEN)
                error = xfs_iomap_write_unwritten(ip, offset, size, false);
 
-       if (!error && xfs_ioend_is_append(ioend))
+       if (!error && xfs_ioend_is_append(ioend)) {
+               msleep(30000);
                error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
+       }
 done:
        iomap_finish_ioends(ioend, error);
        memalloc_nofs_restore(nofs_flag);


2) run rep.sh and let it reboot the system
-----------------------rep.sh-------------------------
#!/bin/bash

mkfs.xfs -f /dev/sda
mount /dev/sda /mnt/test
touch /mnt/test/file
xfs_io -c "pwrite 0 20 -S 0x31" /mnt/test/file
sync &
sleep 5

echo 100000 > /proc/sys/vm/dirty_writeback_centisecs
echo 100000 > /proc/sys/vm/dirty_expire_centisecs
xfs_io -c "pwrite 20 20 -S 0x31" /mnt/test/file 
sleep 40

echo b > /proc/sysrq-trigger
------------------------------------------------------

3) after reboot, check the file:

mount /dev/sda /mnt/test
cat -v /mnt/test/file | grep @

> > If we only use one size record, we can remove io_size and keep only
> > io_end to record the tail end of the valid file data in the ioend.
> > Meanwhile, we can add a wrapper function iomap_ioend_iosize() to get
> > the extent size of the ioend, replacing the existing ioend->io_size.
> > Would this work?
> 
> I'd probably still use offset + size because it feels more natural
> and causes less churn, but otherwise this sounds good to me.
> 

OK, got it. However, that changes the meaning of "io_size" to the
size of the valid file data in the ioend.
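
Block-granular users such as the unwritten conversion would then need
to round back up, e.g. via a small helper (name and placement are just
a sketch, assuming io_offset stays block-aligned):

static inline size_t iomap_ioend_blocks_size(const struct iomap_ioend *ioend)
{
	/* io_size now ends at the valid data; round up for block ops */
	return round_up(ioend->io_offset + ioend->io_size,
			i_blocksize(ioend->io_inode)) - ioend->io_offset;
}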

Thanks,
Long Li

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ce73d2a48c1e..43c183476870 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1637,6 +1637,7 @@ iomap_ioend_try_merge(struct iomap_ioend *ioend, struct list_head *more_ioends)
 			break;
 		list_move_tail(&next->io_list, &ioend->io_list);
 		ioend->io_size += next->io_size;
+		ioend->io_end = next->io_end;
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_ioend_try_merge);
@@ -1723,6 +1724,7 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
 	ioend->io_inode = inode;
 	ioend->io_size = 0;
 	ioend->io_offset = pos;
+	ioend->io_end = pos;
 	ioend->io_sector = bio->bi_iter.bi_sector;
 
 	wpc->nr_folios = 0;
@@ -1768,6 +1770,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 {
 	struct iomap_folio_state *ifs = folio->private;
 	size_t poff = offset_in_folio(folio, pos);
+	loff_t isize = i_size_read(inode);
 	int error;
 
 	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
@@ -1784,6 +1787,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 	if (ifs)
 		atomic_add(len, &ifs->write_bytes_pending);
 	wpc->ioend->io_size += len;
+	wpc->ioend->io_end = min_t(loff_t, pos + len, isize);
 	wbc_account_cgroup_owner(wbc, folio, len);
 	return 0;
 }
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 559a3a577097..e98e1b597156 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -37,8 +37,8 @@ XFS_WPC(struct iomap_writepage_ctx *ctx)
  */
 static inline bool xfs_ioend_is_append(struct iomap_ioend *ioend)
 {
-	return ioend->io_offset + ioend->io_size >
-		XFS_I(ioend->io_inode)->i_disk_size;
+	WARN_ON_ONCE(ioend->io_end > ioend->io_offset + ioend->io_size);
+	return ioend->io_end > XFS_I(ioend->io_inode)->i_disk_size;
 }
 
 /*
@@ -86,6 +86,7 @@ xfs_end_ioend(
 	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	xfs_off_t		offset = ioend->io_offset;
+	xfs_off_t		end = ioend->io_end;
 	size_t			size = ioend->io_size;
 	unsigned int		nofs_flag;
 	int			error;
@@ -131,7 +132,7 @@ xfs_end_ioend(
 		error = xfs_iomap_write_unwritten(ip, offset, size, false);
 
 	if (!error && xfs_ioend_is_append(ioend))
-		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
+		error = xfs_setfilesize(ip, offset, end - offset);
 done:
 	iomap_finish_ioends(ioend, error);
 	memalloc_nofs_restore(nofs_flag);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f61407e3b121..faed1f498b3b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -332,6 +332,7 @@ struct iomap_ioend {
 	struct inode		*io_inode;	/* file being written to */
 	size_t			io_size;	/* size of the extent */
 	loff_t			io_offset;	/* offset in the file */
+	loff_t			io_end;		/* end of valid data */
 	sector_t		io_sector;	/* start sector of ioend */
 	struct bio		io_bio;		/* MUST BE LAST! */
 };