Message ID | 20241113091907.56937-1-leo.lilong@huawei.com (mailing list archive)
---|---
State | New
Series | [v2,1/2] iomap: fix zero padding data issue in concurrent append writes
On Wed, Nov 13, 2024 at 05:19:06PM +0800, Long Li wrote:
> During concurrent append writes to an XFS filesystem, zero padding data
> may appear in the file after power failure. This happens due to imprecise
> disk size updates when handling write completion.
>
> Consider this scenario with concurrent append writes to the same file:
>
> Thread 1:                  Thread 2:
> ------------               -----------
> write [A, A+B]
> update inode size to A+B
> submit I/O [A, A+BS]
>                            write [A+B, A+B+C]
>                            update inode size to A+B+C
> <I/O completes, updates disk size to A+B+C>
> <power failure>
>
> After reboot, file has zero padding in range [A+B, A+B+C]:
>
> |<       Block Size (BS)        >|
> |DDDDDDDDDDDDDDDD0000000000000000|
> ^               ^                ^
> A              A+B            A+B+C (EOF)
>
> D = Valid Data
> 0 = Zero Padding
>
> The issue stems from disk size being set to min(io_offset + io_size,
> inode->i_size) at I/O completion. Since io_offset+io_size is block
> size granularity, it may exceed the actual valid file data size. In
> the case of concurrent append writes, inode->i_size may be larger
> than the actual range of valid file data written to disk, leading to
> inaccurate disk size updates.
>
> This patch changes the meaning of io_size to represent the size of
> valid data in ioend, while the extent size of ioend can be obtained
> by rounding up based on block size. It ensures more precise disk
> size updates and avoids the zero padding issue. Another benefit is
> that it makes the xfs_ioend_is_append() check more accurate, which
> can reduce unnecessary end bio callbacks of xfs_end_bio() in certain
> scenarios, such as repeated writes at the file tail without extending
> the file size.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Long Li <leo.lilong@huawei.com>

How does this differ from V1? Please, if you are sending a new version, add
the changes you've made since the previous one, so nobody needs to keep
comparing both.

Carlos

> ---
>  fs/iomap/buffered-io.c | 21 +++++++++++++++------
>  include/linux/iomap.h  |  7 ++++++-
>  2 files changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ce73d2a48c1e..a2a75876cda6 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1599,6 +1599,8 @@ EXPORT_SYMBOL_GPL(iomap_finish_ioends);
>  static bool
>  iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
>  {
> +	size_t size = iomap_ioend_extent_size(ioend);
> +
>  	if (ioend->io_bio.bi_status != next->io_bio.bi_status)
>  		return false;
>  	if ((ioend->io_flags & IOMAP_F_SHARED) ^
> @@ -1607,7 +1609,7 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
>  	if ((ioend->io_type == IOMAP_UNWRITTEN) ^
>  	    (next->io_type == IOMAP_UNWRITTEN))
>  		return false;
> -	if (ioend->io_offset + ioend->io_size != next->io_offset)
> +	if (ioend->io_offset + size != next->io_offset)
>  		return false;
>  	/*
>  	 * Do not merge physically discontiguous ioends. The filesystem
> @@ -1619,7 +1621,7 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
>  	 * submission so does not point to the start sector of the bio at
>  	 * completion.
>  	 */
> -	if (ioend->io_sector + (ioend->io_size >> 9) != next->io_sector)
> +	if (ioend->io_sector + (size >> 9) != next->io_sector)
>  		return false;
>  	return true;
>  }
> @@ -1636,7 +1638,7 @@ iomap_ioend_try_merge(struct iomap_ioend *ioend, struct list_head *more_ioends)
>  		if (!iomap_ioend_can_merge(ioend, next))
>  			break;
>  		list_move_tail(&next->io_list, &ioend->io_list);
> -		ioend->io_size += next->io_size;
> +		ioend->io_size = iomap_ioend_extent_size(ioend) + next->io_size;
>  	}
>  }
>  EXPORT_SYMBOL_GPL(iomap_ioend_try_merge);
> @@ -1736,7 +1738,7 @@ static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos)
>  		return false;
>  	if (wpc->iomap.type != wpc->ioend->io_type)
>  		return false;
> -	if (pos != wpc->ioend->io_offset + wpc->ioend->io_size)
> +	if (pos != wpc->ioend->io_offset + iomap_ioend_extent_size(wpc->ioend))
>  		return false;
>  	if (iomap_sector(&wpc->iomap, pos) !=
>  	    bio_end_sector(&wpc->ioend->io_bio))
> @@ -1768,6 +1770,8 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  {
>  	struct iomap_folio_state *ifs = folio->private;
>  	size_t poff = offset_in_folio(folio, pos);
> +	loff_t isize = i_size_read(inode);
> +	struct iomap_ioend *ioend;
>  	int error;
>
>  	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
> @@ -1778,12 +1782,17 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  		wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos);
>  	}
>
> -	if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff))
> +	ioend = wpc->ioend;
> +	if (!bio_add_folio(&ioend->io_bio, folio, len, poff))
>  		goto new_ioend;
>
>  	if (ifs)
>  		atomic_add(len, &ifs->write_bytes_pending);
> -	wpc->ioend->io_size += len;
> +
> +	ioend->io_size = iomap_ioend_extent_size(ioend) + len;
> +	if (ioend->io_offset + ioend->io_size > isize)
> +		ioend->io_size = isize - ioend->io_offset;
> +
>  	wbc_account_cgroup_owner(wbc, folio, len);
>  	return 0;
>  }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index f61407e3b121..2984eccfa213 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -330,7 +330,7 @@ struct iomap_ioend {
>  	u16			io_type;
>  	u16			io_flags;	/* IOMAP_F_* */
>  	struct inode		*io_inode;	/* file being written to */
> -	size_t			io_size;	/* size of the extent */
> +	size_t			io_size;	/* size of valid data */
>  	loff_t			io_offset;	/* offset in the file */
>  	sector_t		io_sector;	/* start sector of ioend */
>  	struct bio		io_bio;		/* MUST BE LAST! */
> @@ -341,6 +341,11 @@ static inline struct iomap_ioend *iomap_ioend_from_bio(struct bio *bio)
>  	return container_of(bio, struct iomap_ioend, io_bio);
>  }
>
> +static inline size_t iomap_ioend_extent_size(struct iomap_ioend *ioend)
> +{
> +	return round_up(ioend->io_size, i_blocksize(ioend->io_inode));
> +}
> +
>  struct iomap_writeback_ops {
>  	/*
>  	 * Required, maps the blocks so that writeback can be performed on
> --
> 2.39.2
>
On Wed, Nov 13, 2024 at 10:44:08AM +0100, Carlos Maiolino wrote:
> On Wed, Nov 13, 2024 at 05:19:06PM +0800, Long Li wrote:
> > [...]
>
> How does this differ from V1? Please, if you are sending a new version, add
> the changes you've made since the previous one, so nobody needs to keep
> comparing both.
>
> Carlos

Thank you for pointing these out. I missed the change information from
v1 to v2. I will be more careful to avoid such omissions in the future.
Let me add it:

V1->V2: Changed the meaning of io_size to record the length of valid data,
        instead of introducing an additional member io_end. This results
        in fewer code changes.
On Wed, Nov 13, 2024 at 07:38:35PM +0800, Long Li wrote:
> On Wed, Nov 13, 2024 at 10:44:08AM +0100, Carlos Maiolino wrote:
> > [...]
> >
> > How does this differ from V1? Please, if you are sending a new version, add
> > the changes you've made since the previous one, so nobody needs to keep
> > comparing both.
> >
> > Carlos
>
> Thank you for pointing these out. I missed the change information from
> v1 to v2. I will be more careful to avoid such omissions in the future.
> Let me add it:
>
> V1->V2: Changed the meaning of io_size to record the length of valid data,
>         instead of introducing an additional member io_end. This results
>         in fewer code changes.

Thanks!

Carlos
FYI, you probably want to include linux-fsdevel on iomap patches.

On Wed, Nov 13, 2024 at 05:19:06PM +0800, Long Li wrote:
> During concurrent append writes to an XFS filesystem, zero padding data
> may appear in the file after power failure. This happens due to imprecise
> disk size updates when handling write completion.
>
> Consider this scenario with concurrent append writes to the same file:
>
> Thread 1:                  Thread 2:
> ------------               -----------
> write [A, A+B]
> update inode size to A+B
> submit I/O [A, A+BS]
>                            write [A+B, A+B+C]
>                            update inode size to A+B+C
> <I/O completes, updates disk size to A+B+C>
> <power failure>
>
> After reboot, file has zero padding in range [A+B, A+B+C]:
>
> |<       Block Size (BS)        >|
> |DDDDDDDDDDDDDDDD0000000000000000|
> ^               ^                ^
> A              A+B            A+B+C (EOF)
>

Thanks for the diagram. FWIW, I found the description a little confusing
because A+B+C to me implies that we'd update i_size to the end of the
write from thread 2, but it seems that is only true up to the end of the
block.

I.e., with 4k FSB and if thread 1 writes [0, 2k], then thread 2 writes
from [2k, 16k], the write completion from the thread 1 write will set
i_size to 4k, not 16k, right?

> D = Valid Data
> 0 = Zero Padding
>
> The issue stems from disk size being set to min(io_offset + io_size,
> inode->i_size) at I/O completion. Since io_offset+io_size is block
> size granularity, it may exceed the actual valid file data size. In
> the case of concurrent append writes, inode->i_size may be larger
> than the actual range of valid file data written to disk, leading to
> inaccurate disk size updates.
>
> This patch changes the meaning of io_size to represent the size of
> valid data in ioend, while the extent size of ioend can be obtained
> by rounding up based on block size. It ensures more precise disk
> size updates and avoids the zero padding issue. Another benefit is
> that it makes the xfs_ioend_is_append() check more accurate, which
> can reduce unnecessary end bio callbacks of xfs_end_bio() in certain
> scenarios, such as repeated writes at the file tail without extending
> the file size.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/iomap/buffered-io.c | 21 +++++++++++++++------
>  include/linux/iomap.h  |  7 ++++++-
>  2 files changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ce73d2a48c1e..a2a75876cda6 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1599,6 +1599,8 @@ EXPORT_SYMBOL_GPL(iomap_finish_ioends);
>  static bool
>  iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
>  {
> +	size_t size = iomap_ioend_extent_size(ioend);
> +

The function name is kind of misleading IMO because this may not
necessarily reflect "extent size." Maybe something like
_ioend_size_aligned() would be more accurate..?

I also find it moderately annoying that we have to change pretty much
every usage of this field to use the wrapper just so the setfilesize
path can do the right thing. Though I see that was an explicit request
from v1 to avoid a new field, so it's not the biggest deal.

What irks me a bit are:

1. It kind of feels like a landmine in an area where block alignment is
typically expected. I wonder if a rename to something like io_bytes
would help at all with that.

2. Some of the rounding sites below sort of feel gratuitous. For
example, if we run through the _add_to_ioend() path where we actually
trim off bytes from the EOF block due to i_size, would we ever expect to
tack more onto that ioend such that the iomap_ioend_extent_size() calls
are actually effective? It kind of seems like something is wrong in that
case where the wrapper call actually matters, but maybe I'm missing
something.

Another randomish idea might be to define a flag like
IOMAP_F_EOF_TRIMMED for ioends that are trimmed to EOF. Then perhaps we
could make an explicit decision not to grow or merge such ioends, and
let the associated code use io_size as is.

But I dunno.. just thinking out loud. I'm ambivalent on all of the above
so I'm just sharing thoughts in the event that it triggers more
thoughts/ideas/useful discussion. I'd probably not change anything
until/unless others chime in on any of this...

Brian
On Wed, Nov 13, 2024 at 11:13:49AM -0500, Brian Foster wrote:
> FYI, you probably want to include linux-fsdevel on iomap patches.
>
> On Wed, Nov 13, 2024 at 05:19:06PM +0800, Long Li wrote:
> > [...]
>
> Thanks for the diagram. FWIW, I found the description a little confusing
> because A+B+C to me implies that we'd update i_size to the end of the
> write from thread 2, but it seems that is only true up to the end of the
> block.
>
> I.e., with 4k FSB and if thread 1 writes [0, 2k], then thread 2 writes
> from [2k, 16k], the write completion from the thread 1 write will set
> i_size to 4k, not 16k, right?
>

Not right, the problem I'm trying to describe is:
1) thread 1 writes [0, 2k]
2) thread 2 writes [2k, 3k]
3) write completion from the thread 1 write sets i_size to 3k
4) power failure
5) after reboot, [2k, 3k] of the file is filled with zeros and the file
   size is 3k

> The function name is kind of misleading IMO because this may not
> necessarily reflect "extent size." Maybe something like
> _ioend_size_aligned() would be more accurate..?
>

Previously, io_size represented the extent size in ioend, so I wanted to
maintain that description. Indeed, _ioend_size_aligned() does seem more
accurate.

> I also find it moderately annoying that we have to change pretty much
> every usage of this field to use the wrapper just so the setfilesize
> path can do the right thing. Though I see that was an explicit request
> from v1 to avoid a new field, so it's not the biggest deal.
>
> What irks me a bit are:
>
> 1. It kind of feels like a landmine in an area where block alignment is
> typically expected. I wonder if a rename to something like io_bytes
> would help at all with that.
>

I think that clearly documenting the meaning of io_size is more
important, as changing the name doesn't fundamentally address the
underlying concerns.

> 2. Some of the rounding sites below sort of feel gratuitous. For
> example, if we run through the _add_to_ioend() path where we actually
> trim off bytes from the EOF block due to i_size, would we ever expect to
> tack more onto that ioend such that the iomap_ioend_extent_size() calls
> are actually effective? It kind of seems like something is wrong in that
> case where the wrapper call actually matters, but maybe I'm missing
> something.
>

I believe using iomap_ioend_extent_size() at merge decision points is
valuable. Consider this scenario (with 4k FSB):

1) thread 1 writes [0, 2k]	// io_size set to 2k
2) thread 2 writes [4k, 8k]

If these IOs complete simultaneously, the ioends can be merged, resulting
in an io_size of 8k. Similarly, we can merge as many ioends as possible
in _add_to_ioend().

> Another randomish idea might be to define a flag like
> IOMAP_F_EOF_TRIMMED for ioends that are trimmed to EOF. Then perhaps we
> could make an explicit decision not to grow or merge such ioends, and
> let the associated code use io_size as is.
>
> But I dunno.. just thinking out loud. I'm ambivalent on all of the above
> so I'm just sharing thoughts in the event that it triggers more
> thoughts/ideas/useful discussion. I'd probably not change anything
> until/unless others chime in on any of this...
>
> Brian
>

Thank you for your reply and thoughtful considerations. :)

Thanks,
Long Li
On Thu, Nov 14, 2024 at 10:34:26AM +0800, Long Li wrote: > On Wed, Nov 13, 2024 at 11:13:49AM -0500, Brian Foster wrote: > > FYI, you probably want to include linux-fsdevel on iomap patches. > > > > On Wed, Nov 13, 2024 at 05:19:06PM +0800, Long Li wrote: > > > During concurrent append writes to XFS filesystem, zero padding data > > > may appear in the file after power failure. This happens due to imprecise > > > disk size updates when handling write completion. > > > > > > Consider this scenario with concurrent append writes same file: > > > > > > Thread 1: Thread 2: > > > ------------ ----------- > > > write [A, A+B] > > > update inode size to A+B > > > submit I/O [A, A+BS] > > > write [A+B, A+B+C] > > > update inode size to A+B+C > > > <I/O completes, updates disk size to A+B+C> > > > <power failure> > > > > > > After reboot, file has zero padding in range [A+B, A+B+C]: > > > > > > |< Block Size (BS) >| > > > |DDDDDDDDDDDDDDDD0000000000000000| > > > ^ ^ ^ > > > A A+B A+B+C (EOF) > > > > > > > Thanks for the diagram. FWIW, I found the description a little confusing > > because A+B+C to me implies that we'd update i_size to the end of the > > write from thread 2, but it seems that is only true up to the end of the > > block. > > > > I.e., with 4k FSB and if thread 1 writes [0, 2k], then thread 2 writes > > from [2, 16k], the write completion from the thread 1 write will set > > i_size to 4k, not 16k, right? > > > > Not right, the problem I'm trying to describe is: > > 1) thread 1 writes [0, 2k] > 2) thread 2 writes [2k, 3k] > 3) write completion from the thread 1 write set i_size to 3K > 4) power failure > 5) after reboot, [2k, 3K] of the file filled with zero and the file size is 3k > Yeah, I get the subblock case. What I am saying above is it seems like "update inode size to A+B+C" is only true for certain, select values that describe the subblock case. 
I.e., what is the resulting i_size if we replace C=1k in the example above with something >= FSB size, like C=4k? Note this isn't all that important. I was just trying to say that the overly general description made this a little more confusing to grok at first than it needed to be, because to me it subtly implies there is logic around somewhere that explicitly writes in-core i_size to disk, when that is not actually what is happening. > > > > D = Valid Data > > > 0 = Zero Padding > > > > > > The issue stems from disk size being set to min(io_offset + io_size, > > > inode->i_size) at I/O completion. Since io_offset+io_size is block > > > size granularity, it may exceed the actual valid file data size. In > > > the case of concurrent append writes, inode->i_size may be larger > > > than the actual range of valid file data written to disk, leading to > > > inaccurate disk size updates. > > > > > > This patch changes the meaning of io_size to represent the size of > > > valid data in ioend, while the extent size of ioend can be obtained > > > by rounding up based on block size. It ensures more precise disk > > > size updates and avoids the zero padding issue. Another benefit is > > > that it makes the xfs_ioend_is_append() check more accurate, which > > > can reduce unnecessary end bio callbacks of xfs_end_bio() in certain > > > scenarios, such as repeated writes at the file tail without extending > > > the file size. 
> > > > > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") > > > Signed-off-by: Long Li <leo.lilong@huawei.com> > > > --- > > > fs/iomap/buffered-io.c | 21 +++++++++++++++------ > > > include/linux/iomap.h | 7 ++++++- > > > 2 files changed, 21 insertions(+), 7 deletions(-) > > > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > > index ce73d2a48c1e..a2a75876cda6 100644 > > > --- a/fs/iomap/buffered-io.c > > > +++ b/fs/iomap/buffered-io.c > > > @@ -1599,6 +1599,8 @@ EXPORT_SYMBOL_GPL(iomap_finish_ioends); > > > static bool > > > iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next) > > > { > > > + size_t size = iomap_ioend_extent_size(ioend); > > > + > > > > The function name is kind of misleading IMO because this may not > > necessarily reflect "extent size." Maybe something like > > _ioend_size_aligned() would be more accurate..? > > > > Previously, io_size represented the extent size in ioend, so I wanted > to maintain that description. Indeed, _ioend_size_aligned() does seem > more accurate. > > > I also find it moderately annoying that we have to change pretty much > > every usage of this field to use the wrapper just so the setfilesize > > path can do the right thing. Though I see that was an explicit request > > from v1 to avoid a new field, so it's not the biggest deal. > > > > What urks me a bit are: > > > > 1. It kind of feels like a landmine in an area where block alignment is > > typically expected. I wonder if a rename to something like io_bytes > > would help at all with that. > > > > I think that clearly documenting the meaning of io_size is more important, > as changing the name doesn't fundamentally address the underlying concerns. > True. > > 2. Some of the rounding sites below sort of feel gratuitous. 
For > > example, if we run through the _add_to_ioend() path where we actually > > trim off bytes from the EOF block due to i_size, would we ever expect to > > tack more onto that ioend such that the iomap_ioend_extent_size() calls > > are actually effective? It kind of seems like something is wrong in that > > case where the wrapper call actually matters, but maybe I'm missing > > something. > > > > I believe using iomap_ioend_extent_size() at merge decision points is > valuable. Consider this scenario (with 4k FSB): > > 1) thread 1 writes [0, 2k] //io_size set to 2K > 2) thread 2 writes [4k, 8k] > > If these IOs complete simultaneously, the ioends can be merged, resulting > in an io_size of 8k. Similarly, we can merge as many as possible ioend in > _add_to_ioend(). > That's not the _add_to_ioend() case I was referring to above. Is there any practical use case where the rounding is effective there? I suppose you could use it for the ioend merging case, but I'm skeptical of the value. Isn't the common case that those two writes end up as a single ioend anyways? ISTM that for the above merge scenario to happen we'd either need writeback of the thread 1 write to race just right with the thread 2 write, or have two writeback cycles where the completion of the first is still pending by the time the second completes. Either of those seem far less likely than either writeback seeing i_size == 8k from the start, or the thread 2 write completing sometime after the thread 1 ioend has already been completed. Hm? Brian > > Another randomish idea might be to define a flag like > > IOMAP_F_EOF_TRIMMED for ioends that are trimmed to EOF. Then perhaps we > > could make an explicit decision not to grow or merge such ioends, and > > let the associated code use io_size as is. > > > > But I dunno.. just thinking out loud. I'm ambivalent on all of the above > > so I'm just sharing thoughts in the event that it triggers more > > thoughts/ideas/useful discussion. 
I'd probably not change anything > > until/unless others chime in on any of this... > > > > Brian > > > > Thank you for your reply and thoughtful considerations. :) > > Thanks, > Long Li > > > > > if (ioend->io_bio.bi_status != next->io_bio.bi_status) > > > return false; > > > if ((ioend->io_flags & IOMAP_F_SHARED) ^ > > > @@ -1607,7 +1609,7 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next) > > > if ((ioend->io_type == IOMAP_UNWRITTEN) ^ > > > (next->io_type == IOMAP_UNWRITTEN)) > > > return false; > > > - if (ioend->io_offset + ioend->io_size != next->io_offset) > > > + if (ioend->io_offset + size != next->io_offset) > > > return false; > > > /* > > > * Do not merge physically discontiguous ioends. The filesystem > > > @@ -1619,7 +1621,7 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next) > > > * submission so does not point to the start sector of the bio at > > > * completion. > > > */ > > > - if (ioend->io_sector + (ioend->io_size >> 9) != next->io_sector) > > > + if (ioend->io_sector + (size >> 9) != next->io_sector) > > > return false; > > > return true; > > > } > > > @@ -1636,7 +1638,7 @@ iomap_ioend_try_merge(struct iomap_ioend *ioend, struct list_head *more_ioends) > > > if (!iomap_ioend_can_merge(ioend, next)) > > > break; > > > list_move_tail(&next->io_list, &ioend->io_list); > > > - ioend->io_size += next->io_size; > > > + ioend->io_size = iomap_ioend_extent_size(ioend) + next->io_size; > > > } > > > } > > > EXPORT_SYMBOL_GPL(iomap_ioend_try_merge); > > > @@ -1736,7 +1738,7 @@ static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos) > > > return false; > > > if (wpc->iomap.type != wpc->ioend->io_type) > > > return false; > > > - if (pos != wpc->ioend->io_offset + wpc->ioend->io_size) > > > + if (pos != wpc->ioend->io_offset + iomap_ioend_extent_size(wpc->ioend)) > > > return false; > > > if (iomap_sector(&wpc->iomap, pos) != > > > bio_end_sector(&wpc->ioend->io_bio)) > > > @@ 
-1768,6 +1770,8 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, > > > { > > > struct iomap_folio_state *ifs = folio->private; > > > size_t poff = offset_in_folio(folio, pos); > > > + loff_t isize = i_size_read(inode); > > > + struct iomap_ioend *ioend; > > > int error; > > > > > > if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) { > > > @@ -1778,12 +1782,17 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, > > > wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos); > > > } > > > > > > - if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff)) > > > + ioend = wpc->ioend; > > > + if (!bio_add_folio(&ioend->io_bio, folio, len, poff)) > > > goto new_ioend; > > > > > > if (ifs) > > > atomic_add(len, &ifs->write_bytes_pending); > > > - wpc->ioend->io_size += len; > > > + > > > + ioend->io_size = iomap_ioend_extent_size(ioend) + len; > > > + if (ioend->io_offset + ioend->io_size > isize) > > > + ioend->io_size = isize - ioend->io_offset; > > > + > > > wbc_account_cgroup_owner(wbc, folio, len); > > > return 0; > > > } > > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > > > index f61407e3b121..2984eccfa213 100644 > > > --- a/include/linux/iomap.h > > > +++ b/include/linux/iomap.h > > > @@ -330,7 +330,7 @@ struct iomap_ioend { > > > u16 io_type; > > > u16 io_flags; /* IOMAP_F_* */ > > > struct inode *io_inode; /* file being written to */ > > > - size_t io_size; /* size of the extent */ > > > + size_t io_size; /* size of valid data */ > > > loff_t io_offset; /* offset in the file */ > > > sector_t io_sector; /* start sector of ioend */ > > > struct bio io_bio; /* MUST BE LAST! 
*/ > > > @@ -341,6 +341,11 @@ static inline struct iomap_ioend *iomap_ioend_from_bio(struct bio *bio) > > > return container_of(bio, struct iomap_ioend, io_bio); > > > } > > > > > > +static inline size_t iomap_ioend_extent_size(struct iomap_ioend *ioend) > > > +{ > > > + return round_up(ioend->io_size, i_blocksize(ioend->io_inode)); > > > +} > > > + > > > struct iomap_writeback_ops { > > > /* > > > * Required, maps the blocks so that writeback can be performed on > > > -- > > > 2.39.2 > > > > > > > > > > >
On Thu, Nov 14, 2024 at 01:04:31PM -0500, Brian Foster wrote: > On Thu, Nov 14, 2024 at 10:34:26AM +0800, Long Li wrote: > > On Wed, Nov 13, 2024 at 11:13:49AM -0500, Brian Foster wrote: > ISTM that for the above merge scenario to happen we'd either need > writeback of the thread 1 write to race just right with the thread 2 > write, or have two writeback cycles where the completion of the first is > still pending by the time the second completes. Either of those seem far > less likely than either writeback seeing i_size == 8k from the start, or > the thread 2 write completing sometime after the thread 1 ioend has > already been completed. Hm? I think that this should be fairly easy to trigger with concurrent sub-block buffered writes to O_DSYNC|O_APPEND opened files. The fact we drop the IOLOCK before calling generic_write_sync() to flush the data pretty much guarantees that there will be IO in flight whilst other write() calls have extended the file in memory and are then waiting for the current writeback on the folio to complete before submitting their own writeback IO. Cheers, Dave.
On Thu, Nov 14, 2024 at 01:04:31PM -0500, Brian Foster wrote: > On Thu, Nov 14, 2024 at 10:34:26AM +0800, Long Li wrote: > > On Wed, Nov 13, 2024 at 11:13:49AM -0500, Brian Foster wrote: > > > FYI, you probably want to include linux-fsdevel on iomap patches. > > > > > > On Wed, Nov 13, 2024 at 05:19:06PM +0800, Long Li wrote: > > > > During concurrent append writes to XFS filesystem, zero padding data > > > > may appear in the file after power failure. This happens due to imprecise > > > > disk size updates when handling write completion. > > > > > > > > Consider this scenario with concurrent append writes same file: > > > > > > > > Thread 1: Thread 2: > > > > ------------ ----------- > > > > write [A, A+B] > > > > update inode size to A+B > > > > submit I/O [A, A+BS] > > > > write [A+B, A+B+C] > > > > update inode size to A+B+C > > > > <I/O completes, updates disk size to A+B+C> > > > > <power failure> > > > > > > > > After reboot, file has zero padding in range [A+B, A+B+C]: > > > > > > > > |< Block Size (BS) >| > > > > |DDDDDDDDDDDDDDDD0000000000000000| > > > > ^ ^ ^ > > > > A A+B A+B+C (EOF) > > > > > > > > > > Thanks for the diagram. FWIW, I found the description a little confusing > > > because A+B+C to me implies that we'd update i_size to the end of the > > > write from thread 2, but it seems that is only true up to the end of the > > > block. > > > > > > I.e., with 4k FSB and if thread 1 writes [0, 2k], then thread 2 writes > > > from [2, 16k], the write completion from the thread 1 write will set > > > i_size to 4k, not 16k, right? > > > > > > > Not right, the problem I'm trying to describe is: > > > > 1) thread 1 writes [0, 2k] > > 2) thread 2 writes [2k, 3k] > > 3) write completion from the thread 1 write set i_size to 3K > > 4) power failure > > 5) after reboot, [2k, 3K] of the file filled with zero and the file size is 3k > > > > Yeah, I get the subblock case. 
What I am saying above is it seems like > "update inode size to A+B+C" is only true for certain, select values > that describe the subblock case. I.e., what is the resulting i_size if > we replace C=1k in the example above with something >= FSB size, like > C=4k? > > Note this isn't all that important. I was just trying to say that the > overly general description made this a little more confusing to grok at > first than it needed to be, because to me it subtly implies there is > logic around somewhere that explicitly writes in-core i_size to disk, > when that is not actually what is happening. > >

Sorry for my previous misunderstanding. You are correct - my commit
message description didn't cover the case where A+B+C > block size.
In such scenarios, the final file size might end up being 4K, which
is not what we would expect. Initially, I incorrectly thought this
wasn't a significant issue and thus overlooked this case. Let me
update the diagram to address this.

Thread 1:                        Thread 2:
------------                     -----------
write [A, A+B]
update inode size to A+B
submit I/O [A, A+BS]
                                 write [A+B, A+B+C]
                                 update inode size to A+B+C
<I/O completes, updates disk size to A+B+C>
<power failure>

After reboot:
1) The file has zero padding in the range [A+B, A+BS]
2) The file size is unexpectedly set to A+BS

|<        Block Size (BS)       >|<        Block Size (BS)       >|
|DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000|
^               ^                ^                                ^
A              A+B          A+BS (EOF)                          A+B+C

It will be updated in the next version.

Thanks,
Long Li
On Fri, Nov 15, 2024 at 07:53:17PM +0800, Long Li wrote: > On Thu, Nov 14, 2024 at 01:04:31PM -0500, Brian Foster wrote: > > On Thu, Nov 14, 2024 at 10:34:26AM +0800, Long Li wrote: > > > On Wed, Nov 13, 2024 at 11:13:49AM -0500, Brian Foster wrote: > > > > FYI, you probably want to include linux-fsdevel on iomap patches. > > > > > > > > On Wed, Nov 13, 2024 at 05:19:06PM +0800, Long Li wrote: > > > > > During concurrent append writes to XFS filesystem, zero padding data > > > > > may appear in the file after power failure. This happens due to imprecise > > > > > disk size updates when handling write completion. > > > > > > > > > > Consider this scenario with concurrent append writes same file: > > > > > > > > > > Thread 1: Thread 2: > > > > > ------------ ----------- > > > > > write [A, A+B] > > > > > update inode size to A+B > > > > > submit I/O [A, A+BS] > > > > > write [A+B, A+B+C] > > > > > update inode size to A+B+C > > > > > <I/O completes, updates disk size to A+B+C> > > > > > <power failure> > > > > > > > > > > After reboot, file has zero padding in range [A+B, A+B+C]: > > > > > > > > > > |< Block Size (BS) >| > > > > > |DDDDDDDDDDDDDDDD0000000000000000| > > > > > ^ ^ ^ > > > > > A A+B A+B+C (EOF) > > > > > > > > > > > > > Thanks for the diagram. FWIW, I found the description a little confusing > > > > because A+B+C to me implies that we'd update i_size to the end of the > > > > write from thread 2, but it seems that is only true up to the end of the > > > > block. > > > > > > > > I.e., with 4k FSB and if thread 1 writes [0, 2k], then thread 2 writes > > > > from [2, 16k], the write completion from the thread 1 write will set > > > > i_size to 4k, not 16k, right? 
> > > > > > > > > > Not right, the problem I'm trying to describe is: > > > > > > 1) thread 1 writes [0, 2k] > > > 2) thread 2 writes [2k, 3k] > > > 3) write completion from the thread 1 write set i_size to 3K > > > 4) power failure > > > 5) after reboot, [2k, 3K] of the file filled with zero and the file size is 3k > > > > > > > Yeah, I get the subblock case. What I am saying above is it seems like > > "update inode size to A+B+C" is only true for certain, select values > > that describe the subblock case. I.e., what is the resulting i_size if > > we replace C=1k in the example above with something >= FSB size, like > > C=4k? > > > > Note this isn't all that important. I was just trying to say that the > > overly general description made this a little more confusing to grok at > > first than it needed to be, because to me it subtly implies there is > > logic around somewhere that explicitly writes in-core i_size to disk, > > when that is not actually what is happening. > > > > > > > Sorry for my previous misunderstanding. You are correct - my commit > message description didn't cover the case where A+B+C > block size. > In such scenarios, the final file size might end up being 4K, which > is not what we would expect. Initially, I incorrectly thought this > wasn't a significant issue and thus overlooked this case. Let me > update the diagram to address this. > Ok no problem.. like I said, just a minor nit. ;) > Thread 1: Thread 2: > ------------ ----------- > write [A, A+B] > update inode size to A+B > submit I/O [A, A+BS] > write [A+B, A+B+C] > update inode size to A+B+C > <I/O completes, updates disk size to A+B+C> > <power failure> > > After reboot: > 1) The file has zero padding in the range [A+B, A+BS] > 2) The file size is unexpectedly set to A+BS > > |< Block Size (BS) >|< Block Size (BS) >| > |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000| > ^ ^ ^ ^ > A A+B A+BS (EOF) A+B+C > > > It will be update in the next version. 
> The text above still says "updates disk size to A+B+C." I'm not sure if you intended to change that to A+BS as well, but regardless LGTM. Thanks. Brian > > Thanks, > Long Li >
On Fri, Nov 15, 2024 at 07:01:58AM +1100, Dave Chinner wrote: > On Thu, Nov 14, 2024 at 01:04:31PM -0500, Brian Foster wrote: > > On Thu, Nov 14, 2024 at 10:34:26AM +0800, Long Li wrote: > > > On Wed, Nov 13, 2024 at 11:13:49AM -0500, Brian Foster wrote: > > ISTM that for the above merge scenario to happen we'd either need > > writeback of the thread 1 write to race just right with the thread 2 > > write, or have two writeback cycles where the completion of the first is > > still pending by the time the second completes. Either of those seem far > > less likely than either writeback seeing i_size == 8k from the start, or > > the thread 2 write completing sometime after the thread 1 ioend has > > already been completed. Hm? > > I think that this should be fairly easy to trigger with concurrent > sub-block buffered writes to O_DSYNC|O_APPEND opened files. The fact > we drop the IOLOCK before calling generic_write_sync() to flush the > data pretty much guarantees that there will be IO in flight whilst > other write() calls have extended the file in memory and are then > waiting for the current writeback on the folio to complete before > submitting their own writeback IO. > Hmmm.. that kind of sounds like opposite to me. Note that the example given was distinctly not append only, as that allows multiple technically "merge-worthy" ioends to be in completion at the same time. If you're doing O_DSYNC|O_APPEND sub-block buffered writes, then by definition there is going to be folio overlap across writes, and we don't submit a dirty&&writeback folio for writeback until preexisting writeback state clears. So ISTM that creates a situation where even if the append I/O is multithreaded, you'll just end up more likely to lockstep across writebacks and won't ever submit the second ioend before the first completes. Hm? That said, we're kind of getting in the weeds here.. I don't think it matters that much if these ioends merge.. 
I'm basically throwing out the proposition that maybe it's not worth the additional code to ensure that they always do. I.e., suppose we opted to trim the io_size when appropriate based on i_size so the completion side size update is accurate, but otherwise just left off the rounding helper thing and let the existing code work as it is. Would that lack of guaranteed block alignment have any practical impact on functionality or performance? ISTM the add_to_ioend() part is superfluous so it probably wouldn't have much effect on I/O sizes, if any. The skipped merge example Long Li gives seems technically possible, but I'm not aware of a workload where that would occur frequently enough that failing to merge such an ioend would have a noticeable impact.. hm? Again.. not something I care tremendously about, just trying to keep things simple if possible. If it were me and there's not something we can put to that obviously breaks, I'd start with that and then if that does prove wrong, it's obviously simple to fix by tacking on the rounding thing later. Just my .02. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Wed, Nov 13, 2024 at 11:13:49AM -0500, Brian Foster wrote: > > static bool > > iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next) > > { > > + size_t size = iomap_ioend_extent_size(ioend); > > + > > The function name is kind of misleading IMO because this may not > necessarily reflect "extent size." Maybe something like > _ioend_size_aligned() would be more accurate..? Agreed. What also would be useful is a comment describing the function and why io_size is not aligned. > 1. It kind of feels like a landmine in an area where block alignment is > typically expected. I wonder if a rename to something like io_bytes > would help at all with that. Fine with me. > Another randomish idea might be to define a flag like > IOMAP_F_EOF_TRIMMED for ioends that are trimmed to EOF. Then perhaps we > could make an explicit decision not to grow or merge such ioends, and > let the associated code use io_size as is. I don't think such a branch is any cheaper than the rounding..
On Sun, Nov 17, 2024 at 10:56:59PM -0800, Christoph Hellwig wrote: > On Wed, Nov 13, 2024 at 11:13:49AM -0500, Brian Foster wrote: > > > static bool > > > iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next) > > > { > > > + size_t size = iomap_ioend_extent_size(ioend); > > > + > > > > The function name is kind of misleading IMO because this may not > > necessarily reflect "extent size." Maybe something like > > _ioend_size_aligned() would be more accurate..? > > Agreed. What also would be useful is a comment describing the > function and why io_size is not aligned. > Ack. > > 1. It kind of feels like a landmine in an area where block alignment is > > typically expected. I wonder if a rename to something like io_bytes > > would help at all with that. > > Fine with me. > > > Another randomish idea might be to define a flag like > > IOMAP_F_EOF_TRIMMED for ioends that are trimmed to EOF. Then perhaps we > > could make an explicit decision not to grow or merge such ioends, and > > let the associated code use io_size as is. > > I don't think such a branch is any cheaper than the rounding.. > True, but I figured it to be more informational/usable than performance oriented. IOW following the train of thought in the other subthread, would any practical workload be affected if we just trimmed io_size when needed by i_size and left it at that? If not, then you could use a flag to be more deliberate/informative that the ioend was slightly special in that it was modified on submission, and then adjacent ioends would simply fail to merge on a flag comparison rather than fail to merge on a contiguity check. Of course if folks would rather just do the rounding helper thing and leave it up to the fs to use it, then I don't see any fundamental problem with that. I just find it kind of a subtle/quirky interface. Brian
On Fri, Nov 15, 2024 at 08:46:13AM -0500, Brian Foster wrote:
> On Fri, Nov 15, 2024 at 07:53:17PM +0800, Long Li wrote:
> > On Thu, Nov 14, 2024 at 01:04:31PM -0500, Brian Foster wrote:
> > > On Thu, Nov 14, 2024 at 10:34:26AM +0800, Long Li wrote:
> > > > On Wed, Nov 13, 2024 at 11:13:49AM -0500, Brian Foster wrote:
> > > > > FYI, you probably want to include linux-fsdevel on iomap patches.
> > > > >
> > > > > On Wed, Nov 13, 2024 at 05:19:06PM +0800, Long Li wrote:
> > > > > > During concurrent append writes to XFS filesystem, zero padding data
> > > > > > may appear in the file after power failure. This happens due to imprecise
> > > > > > disk size updates when handling write completion.
> > > > > >
> > > > > > Consider this scenario with concurrent append writes same file:
> > > > > >
> > > > > > Thread 1:                  Thread 2:
> > > > > > ------------               -----------
> > > > > > write [A, A+B]
> > > > > > update inode size to A+B
> > > > > > submit I/O [A, A+BS]
> > > > > >                            write [A+B, A+B+C]
> > > > > >                            update inode size to A+B+C
> > > > > > <I/O completes, updates disk size to A+B+C>
> > > > > > <power failure>
> > > > > >
> > > > > > After reboot, file has zero padding in range [A+B, A+B+C]:
> > > > > >
> > > > > > |<    Block Size (BS)    >|
> > > > > > |DDDDDDDDDDDDDDDD0000000000000000|
> > > > > > ^               ^               ^
> > > > > > A              A+B           A+B+C (EOF)
> > > > > >
> > > > >
> > > > > Thanks for the diagram. FWIW, I found the description a little confusing
> > > > > because A+B+C to me implies that we'd update i_size to the end of the
> > > > > write from thread 2, but it seems that is only true up to the end of the
> > > > > block.
> > > > >
> > > > > I.e., with 4k FSB and if thread 1 writes [0, 2k], then thread 2 writes
> > > > > from [2, 16k], the write completion from the thread 1 write will set
> > > > > i_size to 4k, not 16k, right?
> > > > >
> > > >
> > > > Not right, the problem I'm trying to describe is:
> > > >
> > > > 1) thread 1 writes [0, 2k]
> > > > 2) thread 2 writes [2k, 3k]
> > > > 3) write completion from the thread 1 write set i_size to 3K
> > > > 4) power failure
> > > > 5) after reboot, [2k, 3K] of the file filled with zero and the file size is 3k
> > > >
> > >
> > > Yeah, I get the subblock case. What I am saying above is it seems like
> > > "update inode size to A+B+C" is only true for certain, select values
> > > that describe the subblock case. I.e., what is the resulting i_size if
> > > we replace C=1k in the example above with something >= FSB size, like
> > > C=4k?
> > >
> > > Note this isn't all that important. I was just trying to say that the
> > > overly general description made this a little more confusing to grok at
> > > first than it needed to be, because to me it subtly implies there is
> > > logic around somewhere that explicitly writes in-core i_size to disk,
> > > when that is not actually what is happening.
> > >
> >
> > Sorry for my previous misunderstanding. You are correct - my commit
> > message description didn't cover the case where A+B+C > block size.
> > In such scenarios, the final file size might end up being 4K, which
> > is not what we would expect. Initially, I incorrectly thought this
> > wasn't a significant issue and thus overlooked this case. Let me
> > update the diagram to address this.
> >
>
> Ok no problem.. like I said, just a minor nit. ;)
>
> > Thread 1:                  Thread 2:
> > ------------               -----------
> > write [A, A+B]
> > update inode size to A+B
> > submit I/O [A, A+BS]
> >                            write [A+B, A+B+C]
> >                            update inode size to A+B+C
> > <I/O completes, updates disk size to A+B+C>
> > <power failure>
> >
> > After reboot:
> > 1) The file has zero padding in the range [A+B, A+BS]
> > 2) The file size is unexpectedly set to A+BS
> >
> > |<    Block Size (BS)    >|<    Block Size (BS)    >|
> > |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000|
> > ^               ^               ^                                ^
> > A              A+B          A+BS (EOF)                        A+B+C
> >
> >
> > It will be update in the next version.
> >
>
> The text above still says "updates disk size to A+B+C." I'm not sure if
> you intended to change that to A+BS as well, but regardless LGTM.
> Thanks.
>

Yes, forgot to update here.

Thanks,
Long Li
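[Editor's note] To make the subblock scenario discussed above concrete, here is a small user-space model of the completion-time disk size update. This is illustrative C only, not kernel code: `BS`, `disk_size_old()` and `disk_size_new()` are made-up names, and a 4k block size is assumed.

```c
/* Hypothetical model of the disk size computed at I/O completion. */
#define BS 4096ULL	/* assumed filesystem block size */

static unsigned long long round_up_bs(unsigned long long x)
{
	return (x + BS - 1) / BS * BS;
}

/*
 * Old behaviour: io_size covers whole blocks, so the new disk size is
 * min(io_offset + block-granular io_size, in-core i_size).
 */
static unsigned long long disk_size_old(unsigned long long io_offset,
					unsigned long long valid_bytes,
					unsigned long long i_size)
{
	unsigned long long end = io_offset + round_up_bs(valid_bytes);

	return end < i_size ? end : i_size;
}

/*
 * Patched behaviour: io_size tracks valid data only, so the disk size
 * never extends past bytes this ioend actually wrote.
 */
static unsigned long long disk_size_new(unsigned long long io_offset,
					unsigned long long valid_bytes,
					unsigned long long i_size)
{
	unsigned long long end = io_offset + valid_bytes;

	return end < i_size ? end : i_size;
}
```

With thread 1's ioend holding 2k of valid data at offset 0 and the in-core i_size already pushed to 3k by thread 2, the old computation records an on-disk size of 3k, exposing [2k, 3k) as zeroes after a power failure; the valid-data-based one stops at 2k. With i_size pushed to 16k (the C >= block size case raised above), the old computation records 4k.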
On Sun, Nov 17, 2024 at 10:56:59PM -0800, Christoph Hellwig wrote:
> On Wed, Nov 13, 2024 at 11:13:49AM -0500, Brian Foster wrote:
> > >  static bool
> > >  iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
> > >  {
> > > +	size_t size = iomap_ioend_extent_size(ioend);
> > > +
> >
> > The function name is kind of misleading IMO because this may not
> > necessarily reflect "extent size." Maybe something like
> > _ioend_size_aligned() would be more accurate..?
>
> Agreed.  What also would be useful is a comment describing the
> function and why io_size is not aligned.
>

Ok, it will be changed in the next version.

> > 1. It kind of feels like a landmine in an area where block alignment is
> > typically expected. I wonder if a rename to something like io_bytes
> > would help at all with that.
>
> Fine with me.
>

While continuing to use io_size may introduce some ambiguity, this can
be adequately addressed through proper documentation. Furthermore,
retaining io_size would minimize code changes. I would like to
confirm whether renaming io_size to io_bytes is truly necessary in
this context.

Thanks,
Long Li
On Tue, Nov 19, 2024 at 04:35:22PM +0800, Long Li wrote:
> On Sun, Nov 17, 2024 at 10:56:59PM -0800, Christoph Hellwig wrote:
> > On Wed, Nov 13, 2024 at 11:13:49AM -0500, Brian Foster wrote:
> > > >  static bool
> > > >  iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
> > > >  {
> > > > +	size_t size = iomap_ioend_extent_size(ioend);
> > > > +
> > >
> > > The function name is kind of misleading IMO because this may not
> > > necessarily reflect "extent size." Maybe something like
> > > _ioend_size_aligned() would be more accurate..?
> >
> > Agreed.  What also would be useful is a comment describing the
> > function and why io_size is not aligned.
> >
>
> Ok, it will be changed in the next version.
>
> > > 1. It kind of feels like a landmine in an area where block alignment is
> > > typically expected. I wonder if a rename to something like io_bytes
> > > would help at all with that.
> >
> > Fine with me.
> >
>
> While continuing to use io_size may introduce some ambiguity, this can
> be adequately addressed through proper documentation. Furthermore,
> retaining io_size would minimize code changes. I would like to
> confirm whether renaming io_size to io_bytes is truly necessary in
> this context.
>

I don't think a rename is a requirement. It was just an idea to
consider.

The whole rounding thing is the one lingering thing for me. In my mind
it's not worth the complexity of having a special wrapper like this if
we don't have at least one example where it provides tangible
performance benefit. It kind of sounds like we're fishing around for
examples where it would allow an ioend to merge, but so far don't have
anything that reproduces perf. value. Do you agree with that assessment?

That said, I agree we have a couple examples where it is technically
functional, it does preserve existing logic, and it's not the biggest
deal in general. So if we really want to keep it, perhaps a reasonable
compromise might be to lift it as a static into buffered-io.c (so it's
not exposed to new users via the header, at least for now) and add a
nice comment above it to explain when/why the io_size is inferred via
rounding and that it's specifically for ioend grow/merge management.
Hm?

Brian

> Thanks,
> Long Li
>
>
On Tue, Nov 19, 2024 at 07:13:59AM -0500, Brian Foster wrote:
> On Tue, Nov 19, 2024 at 04:35:22PM +0800, Long Li wrote:
> > On Sun, Nov 17, 2024 at 10:56:59PM -0800, Christoph Hellwig wrote:
> > > On Wed, Nov 13, 2024 at 11:13:49AM -0500, Brian Foster wrote:
> > > > >  static bool
> > > > >  iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
> > > > >  {
> > > > > +	size_t size = iomap_ioend_extent_size(ioend);
> > > > > +
> > > >
> > > > The function name is kind of misleading IMO because this may not
> > > > necessarily reflect "extent size." Maybe something like
> > > > _ioend_size_aligned() would be more accurate..?
> > >
> > > Agreed.  What also would be useful is a comment describing the
> > > function and why io_size is not aligned.
> > >
> >
> > Ok, it will be changed in the next version.
> >
> > > > 1. It kind of feels like a landmine in an area where block alignment is
> > > > typically expected. I wonder if a rename to something like io_bytes
> > > > would help at all with that.
> > >
> > > Fine with me.
> > >
> >
> > While continuing to use io_size may introduce some ambiguity, this can
> > be adequately addressed through proper documentation. Furthermore,
> > retaining io_size would minimize code changes. I would like to
> > confirm whether renaming io_size to io_bytes is truly necessary in
> > this context.
> >
>
> I don't think a rename is a requirement. It was just an idea to
> consider.
>

ok.

> The whole rounding thing is the one lingering thing for me. In my mind
> it's not worth the complexity of having a special wrapper like this if
> we don't have at least one example where it provides tangible
> performance benefit. It kind of sounds like we're fishing around for
> examples where it would allow an ioend to merge, but so far don't have
> anything that reproduces perf. value. Do you agree with that assessment?
>

Yes, I agree with your assessment. The merging through size rounding
actually occurs in only a small number of cases.

> That said, I agree we have a couple examples where it is technically
> functional, it does preserve existing logic, and it's not the biggest
> deal in general. So if we really want to keep it, perhaps a reasonable
> compromise might be to lift it as a static into buffered-io.c (so it's
> not exposed to new users via the header, at least for now) and add a
> nice comment above it to explain when/why the io_size is inferred via
> rounding and that it's specifically for ioend grow/merge management.
> Hm?
>

I agree with you, this approach sounds reasonable to me.

Thanks,
Long Li
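[Editor's note] The merge-adjacency point being debated here can be sketched in user-space C (hypothetical names; a 4k block size is assumed; not the kernel code). With the rounding, an i_size-trimmed ioend still counts as logically adjacent to a following ioend that starts at the next block boundary; with the exact valid-data size it does not, so the merge is refused.

```c
#define BS 4096ULL	/* assumed filesystem block size */

/* Block-rounded size, in the spirit of the patch's iomap_ioend_extent_size(). */
static unsigned long long extent_size(unsigned long long io_size)
{
	return (io_size + BS - 1) / BS * BS;
}

/* Logical adjacency check using the rounded extent size. */
static int can_merge_rounded(unsigned long long off,
			     unsigned long long io_size,
			     unsigned long long next_off)
{
	return off + extent_size(io_size) == next_off;
}

/* Logical adjacency check using the exact valid-data size. */
static int can_merge_exact(unsigned long long off,
			   unsigned long long io_size,
			   unsigned long long next_off)
{
	return off + io_size == next_off;
}
```

For example, an ioend at offset 0 trimmed to 3k of valid data still merges with a neighbour at offset 4k under the rounded check, but fails the exact check; this is the "small number of cases" where the rounding buys a merge.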
On Mon, Nov 18, 2024 at 09:26:45AM -0500, Brian Foster wrote:
> IOW following the train of thought in the other subthread, would any
> practical workload be affected if we just trimmed io_size when needed by
> i_size and left it at that?

Can you explain what you mean with that?
On Wed, Nov 20, 2024 at 01:05:16AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 18, 2024 at 09:26:45AM -0500, Brian Foster wrote:
> > IOW following the train of thought in the other subthread, would any
> > practical workload be affected if we just trimmed io_size when needed by
> > i_size and left it at that?
>
> Can you explain what you mean with that?
>

I mean what happens if we trimmed io_size just as this patch already
does, but left off the rounding stuff?

The rounding is only used for adding to or attempting to merge ioends. I
don't see when it would ever really make a difference in adding folios
to an ioend, since we don't writeback folios beyond i_size and a size
extending operation is unlikely to write back between the time an ioend
is being constructed/trimmed and submitted.

After discussion, it seems there are some scenarios where the rounding
allows i_size trimmed ioends to merge, but again I'm not seeing how that
can occur frequently enough such that just skipping the rounding and
letting trimmed ioends fail to merge would have any noticeable
performance impact.

But anyways, I think we've reached compromise that pulling the rounding
helper into buffered-io.c and documenting it well preserves logic and
addresses my concerns by making sure it doesn't proliferate for the time
being.

Brian
On Wed, Nov 20, 2024 at 08:50:45AM -0500, Brian Foster wrote:
> On Wed, Nov 20, 2024 at 01:05:16AM -0800, Christoph Hellwig wrote:
> > On Mon, Nov 18, 2024 at 09:26:45AM -0500, Brian Foster wrote:
> > > IOW following the train of thought in the other subthread, would any
> > > practical workload be affected if we just trimmed io_size when needed by
> > > i_size and left it at that?
> >
> > Can you explain what you mean with that?
> >
>
> I mean what happens if we trimmed io_size just as this patch already
> does, but left off the rounding stuff?

Ah.

> The rounding is only used for adding to or attempting to merge ioends. I
> don't see when it would ever really make a difference in adding folios
> to an ioend, since we don't writeback folios beyond i_size and a size
> extending operation is unlikely to write back between the time an ioend
> is being constructed/trimmed and submitted.

True.

> After discussion, it seems there are some scenarios where the rounding
> allows i_size trimmed ioends to merge, but again I'm not seeing how that
> can occur frequently enough such that just skipping the rounding and
> letting trimmed ioends fail to merge would have any noticeable
> performance impact.

Agreed.

> But anyways, I think we've reached compromise that pulling the rounding
> helper into buffered-io.c and documenting it well preserves logic and
> addresses my concerns by making sure it doesn't proliferate for the time
> being.

I mean if we can do away without the rounding I'm fine with it.
Looking at fs/iomap/buffered-io.c I think we should be fine with it,
and in XFS everything but the size updaste itself converts to FSBs
first, so it should be fine as well.
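[Editor's note] The trim-plus-round accounting discussed above can be modeled as follows (simplified user-space C in the spirit of the patch's iomap_add_to_ioend() hunk; `account_bytes` is a made-up name, a 4k block size is assumed, and the sketch assumes isize >= io_offset):

```c
#define BS 4096ULL	/* assumed filesystem block size */

/*
 * Account 'len' newly added bytes into an ioend whose io_size tracks
 * valid data. Previously added data always covers whole blocks on disk,
 * so the running size is block-rounded before adding, then trimmed at
 * the in-core i_size so io_size never claims data past EOF.
 */
static unsigned long long account_bytes(unsigned long long io_offset,
					unsigned long long io_size,
					unsigned long long len,
					unsigned long long isize)
{
	unsigned long long rounded = (io_size + BS - 1) / BS * BS;

	io_size = rounded + len;
	if (io_offset + io_size > isize)
		io_size = isize - io_offset;
	return io_size;
}
```

Adding a 4k folio at offset 0 with i_size at 3k trims the valid-data size to 3k; adding another 4k folio afterwards first rounds the 3k back up to the 4k already on disk, which is exactly why the rounding has to happen somewhere when growing a previously trimmed ioend.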
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ce73d2a48c1e..a2a75876cda6 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1599,6 +1599,8 @@ EXPORT_SYMBOL_GPL(iomap_finish_ioends);
 static bool
 iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
 {
+	size_t size = iomap_ioend_extent_size(ioend);
+
 	if (ioend->io_bio.bi_status != next->io_bio.bi_status)
 		return false;
 	if ((ioend->io_flags & IOMAP_F_SHARED) ^
@@ -1607,7 +1609,7 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
 	if ((ioend->io_type == IOMAP_UNWRITTEN) ^
 	    (next->io_type == IOMAP_UNWRITTEN))
 		return false;
-	if (ioend->io_offset + ioend->io_size != next->io_offset)
+	if (ioend->io_offset + size != next->io_offset)
 		return false;
 	/*
 	 * Do not merge physically discontiguous ioends. The filesystem
 	 * submission so does not point to the start sector of the bio at
 	 * completion.
 	 */
-	if (ioend->io_sector + (ioend->io_size >> 9) != next->io_sector)
+	if (ioend->io_sector + (size >> 9) != next->io_sector)
 		return false;
 	return true;
 }
@@ -1636,7 +1638,7 @@ iomap_ioend_try_merge(struct iomap_ioend *ioend, struct list_head *more_ioends)
 		if (!iomap_ioend_can_merge(ioend, next))
 			break;
 		list_move_tail(&next->io_list, &ioend->io_list);
-		ioend->io_size += next->io_size;
+		ioend->io_size = iomap_ioend_extent_size(ioend) + next->io_size;
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_ioend_try_merge);
@@ -1736,7 +1738,7 @@ static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos)
 		return false;
 	if (wpc->iomap.type != wpc->ioend->io_type)
 		return false;
-	if (pos != wpc->ioend->io_offset + wpc->ioend->io_size)
+	if (pos != wpc->ioend->io_offset + iomap_ioend_extent_size(wpc->ioend))
 		return false;
 	if (iomap_sector(&wpc->iomap, pos) !=
 	    bio_end_sector(&wpc->ioend->io_bio))
@@ -1768,6 +1770,8 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 {
 	struct iomap_folio_state *ifs = folio->private;
 	size_t poff = offset_in_folio(folio, pos);
+	loff_t isize = i_size_read(inode);
+	struct iomap_ioend *ioend;
 	int error;
 
 	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
@@ -1778,12 +1782,17 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 		wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos);
 	}
 
-	if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff))
+	ioend = wpc->ioend;
+	if (!bio_add_folio(&ioend->io_bio, folio, len, poff))
 		goto new_ioend;
 
 	if (ifs)
 		atomic_add(len, &ifs->write_bytes_pending);
-	wpc->ioend->io_size += len;
+
+	ioend->io_size = iomap_ioend_extent_size(ioend) + len;
+	if (ioend->io_offset + ioend->io_size > isize)
+		ioend->io_size = isize - ioend->io_offset;
+
 	wbc_account_cgroup_owner(wbc, folio, len);
 	return 0;
 }
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f61407e3b121..2984eccfa213 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -330,7 +330,7 @@ struct iomap_ioend {
 	u16			io_type;
 	u16			io_flags;	/* IOMAP_F_* */
 	struct inode		*io_inode;	/* file being written to */
-	size_t			io_size;	/* size of the extent */
+	size_t			io_size;	/* size of valid data */
 	loff_t			io_offset;	/* offset in the file */
 	sector_t		io_sector;	/* start sector of ioend */
 	struct bio		io_bio;		/* MUST BE LAST! */
@@ -341,6 +341,11 @@ static inline struct iomap_ioend *iomap_ioend_from_bio(struct bio *bio)
 	return container_of(bio, struct iomap_ioend, io_bio);
 }
 
+static inline size_t iomap_ioend_extent_size(struct iomap_ioend *ioend)
+{
+	return round_up(ioend->io_size, i_blocksize(ioend->io_inode));
+}
+
 struct iomap_writeback_ops {
 	/*
 	 * Required, maps the blocks so that writeback can be performed on
During concurrent append writes to XFS filesystem, zero padding data
may appear in the file after power failure. This happens due to imprecise
disk size updates when handling write completion.

Consider this scenario with concurrent append writes same file:

Thread 1:                  Thread 2:
------------               -----------
write [A, A+B]
update inode size to A+B
submit I/O [A, A+BS]
                           write [A+B, A+B+C]
                           update inode size to A+B+C
<I/O completes, updates disk size to A+B+C>
<power failure>

After reboot, file has zero padding in range [A+B, A+B+C]:

|<    Block Size (BS)    >|
|DDDDDDDDDDDDDDDD0000000000000000|
^               ^               ^
A              A+B           A+B+C (EOF)

D = Valid Data
0 = Zero Padding

The issue stems from disk size being set to min(io_offset + io_size,
inode->i_size) at I/O completion. Since io_offset+io_size is block
size granularity, it may exceed the actual valid file data size. In
the case of concurrent append writes, inode->i_size may be larger
than the actual range of valid file data written to disk, leading to
inaccurate disk size updates.

This patch changes the meaning of io_size to represent the size of
valid data in ioend, while the extent size of ioend can be obtained
by rounding up based on block size. It ensures more precise disk
size updates and avoids the zero padding issue. Another benefit is
that it makes the xfs_ioend_is_append() check more accurate, which
can reduce unnecessary end bio callbacks of xfs_end_bio() in certain
scenarios, such as repeated writes at the file tail without extending
the file size.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/iomap/buffered-io.c | 21 +++++++++++++++------
 include/linux/iomap.h  |  7 ++++++-
 2 files changed, 21 insertions(+), 7 deletions(-)