Message ID: 86987466d8d7863bd0dca81e9d6c3eff7abd4964.1683485700.git.ritesh.list@gmail.com (mailing list archive)
State: Mainlined, archived
Series: iomap: Add support for per-block dirty state to improve write performance
> @@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct inode *inode,
>  		struct folio *folio, u64 end_pos)
>  {
> -	struct iomap_page *iop = iop_alloc(inode, folio, 0);
> +	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
>  	struct iomap_ioend *ioend, *next;
>  	unsigned len = i_blocksize(inode);
>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
> @@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * invalid, grab a new one.
>  	 */
>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> -		if (iop && !iop_test_block_uptodate(folio, i))
> +		if (iop && !iop_test_block_dirty(folio, i))

Shouldn't this be if (iop && iop_test_block_dirty(folio, i))?

Before, we were skipping blocks that were not uptodate, but now we are skipping blocks that are not dirty (which means they are uptodate)?

I am new to iomap, but let me know if I am missing something here.

>  			continue;
>
>  		error = wpc->ops->map_blocks(wpc, inode, pos);
Pankaj Raghav <p.raghav@samsung.com> writes:

>> @@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  		struct writeback_control *wbc, struct inode *inode,
>>  		struct folio *folio, u64 end_pos)
>>  {
>> -	struct iomap_page *iop = iop_alloc(inode, folio, 0);
>> +	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
>>  	struct iomap_ioend *ioend, *next;
>>  	unsigned len = i_blocksize(inode);
>>  	unsigned nblocks = i_blocks_per_folio(inode, folio);
>> @@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>>  	 * invalid, grab a new one.
>>  	 */
>>  	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
>> -		if (iop && !iop_test_block_uptodate(folio, i))
>> +		if (iop && !iop_test_block_dirty(folio, i))
>
> Shouldn't this be if (iop && iop_test_block_dirty(folio, i))?
>
> Before, we were skipping blocks that were not uptodate, but now we are
> skipping blocks that are not dirty (which means they are uptodate)?
>
> I am new to iomap, but let me know if I am missing something here.

We set the per-block dirty status in ->write_begin(). The check above happens in the writeback path, when we are about to write the dirty data to disk: if a block's state is not dirty, we skip it, because the block was not written to in ->write_begin(). Also, a block with dirty status is always uptodate anyway.

Hope it helps!

-ritesh
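To make this concrete for readers who, like Pankaj, are new to iomap: the scheme can be modelled in a few lines of self-contained userspace C. This is only an illustration of the bitmap layout and the writeback skip logic described above; the names are made up and none of this is the kernel code.

/*
 * Toy model of iomap_page->state: one bitmap holding per-block uptodate
 * bits followed by per-block dirty bits (the 2 * nr_blocks layout from
 * the patch). Compiles as a plain userspace program.
 */
#include <stdio.h>
#include <stdbool.h>

#define NR_BLOCKS 16	/* e.g. a 64k folio of 4k blocks */

static unsigned long state;	/* bits 0..15: uptodate, bits 16..31: dirty */

static void set_block_dirty(int blk)
{
	state |= 1UL << blk;			/* dirty implies uptodate */
	state |= 1UL << (NR_BLOCKS + blk);	/* dirty bits sit after uptodate bits */
}

static bool test_block_dirty(int blk)
{
	return state & (1UL << (NR_BLOCKS + blk));
}

int main(void)
{
	/* A read marks block 5 uptodate, but it stays clean. */
	state |= 1UL << 5;

	/* ->write_begin() marks only the block actually written, say block 3. */
	set_block_dirty(3);

	/* Writeback, as in the loop above: clean blocks are skipped. */
	for (int i = 0; i < NR_BLOCKS; i++) {
		if (!test_block_dirty(i))
			continue;	/* uptodate-but-clean block 5 is skipped too */
		printf("writing back block %d\n", i);
	}
	return 0;
}

Only block 3 is written back here. With the old uptodate-only check, block 5 (read in but never written) would have been written out as well, which is exactly the write amplification the patch removes.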
>
> We set the per-block dirty status in ->write_begin(). The check above
> happens in the writeback path, when we are about to write the dirty data
> to disk: if a block's state is not dirty, we skip it, because the block
> was not written to in ->write_begin(). Also, a block with dirty status
> is always uptodate anyway.
>
> Hope it helps!

Definitely. I also checked your 1st RFC version to get more context. Thanks for the explanation.
On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote: > When filesystem blocksize is less than folio size (either with > mapping_large_folio_support() or with blocksize < pagesize) and when the > folio is uptodate in pagecache, then even a byte write can cause > an entire folio to be written to disk during writeback. This happens > because we currently don't have a mechanism to track per-block dirty > state within struct iomap_page. We currently only track uptodate state. > > This patch implements support for tracking per-block dirty state in > iomap_page->state bitmap. This should help improve the filesystem write > performance and help reduce write amplification. > > Performance testing of below fio workload reveals ~16x performance > improvement using nvme with XFS (4k blocksize) on Power (64K pagesize) > FIO reported write bw scores improved from around ~28 MBps to ~452 MBps. > > 1. <test_randwrite.fio> > [global] > ioengine=psync > rw=randwrite > overwrite=1 > pre_read=1 > direct=0 > bs=4k > size=1G > dir=./ > numjobs=8 > fdatasync=1 > runtime=60 > iodepth=64 > group_reporting=1 > > [fio-run] > > 2. Also our internal performance team reported that this patch improves > their database workload performance by around ~83% (with XFS on Power) > > Reported-by: Aravinda Herle <araherle@in.ibm.com> > Reported-by: Brian Foster <bfoster@redhat.com> > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> > --- > fs/gfs2/aops.c | 2 +- > fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++--- > fs/xfs/xfs_aops.c | 2 +- > fs/zonefs/file.c | 2 +- > include/linux/iomap.h | 1 + > 5 files changed, 112 insertions(+), 10 deletions(-) > ... > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 25f20f269214..c7f41b26280a 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c ... > @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio, > else > gfp = GFP_NOFS | __GFP_NOFAIL; > > - iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)), > + /* > + * iop->state tracks two sets of state flags when the > + * filesystem block size is smaller than the folio size. > + * The first state tracks per-block uptodate and the > + * second tracks per-block dirty state. > + */ > + iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)), > gfp); > if (iop) { > spin_lock_init(&iop->state_lock); > if (folio_test_uptodate(folio)) > iop_set_range(iop, 0, nr_blocks); > + if (is_dirty) > + iop_set_range(iop, nr_blocks, nr_blocks); I find the is_dirty logic here a bit confusing. AFAICT, this is primarily to handle the case where a folio is completely overwritten, so no iop is allocated at write time, and so then writeback needs to allocate the iop as fully dirty to reflect that. I.e., all iop_alloc() callers other than iomap_writepage_map() either pass the result of folio_test_dirty() or explicitly dirty the entire range of the folio anyways. iomap_dirty_folio() essentially does the latter because it needs to dirty the entire folio regardless of whether the iop already exists or not, right? If so and if I'm following all of that correctly, could this complexity be isolated to iomap_writepage_map() by simply checking for the !iop case first, then call iop_alloc() immediately followed by set_range_dirty() of the entire folio? Then presumably iop_alloc() could always just dirty based on folio state with the writeback path exception case handled explicitly. Hm? 
> folio_attach_private(folio, iop); > } > return iop; ... > @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len) > } > EXPORT_SYMBOL_GPL(iomap_invalidate_folio); > > +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio) > +{ > + struct iomap_page *iop; > + struct inode *inode = mapping->host; > + size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits; folio_size()? > + > + iop = iop_alloc(inode, folio, 0, false); > + iop_set_range_dirty(inode, folio, 0, len); > + return filemap_dirty_folio(mapping, folio); > +} > +EXPORT_SYMBOL_GPL(iomap_dirty_folio); > + > static void > iomap_write_failed(struct inode *inode, loff_t pos, unsigned len) > { ... > @@ -978,6 +1056,28 @@ static int iomap_write_delalloc_scan(struct inode *inode, > } > } > > + /* > + * When we have per-block dirty tracking, there can be > + * blocks within a folio which are marked uptodate > + * but not dirty. In that case it is necessary to punch > + * out such blocks to avoid leaking any delalloc blocks. > + */ > + iop = to_iomap_page(folio); > + if (!iop) > + goto skip_iop_punch; > + last_byte = min_t(loff_t, end_byte - 1, > + (folio_next_index(folio) << PAGE_SHIFT) - 1); > + first_blk = offset_in_folio(folio, start_byte) >> > + blkbits; > + last_blk = offset_in_folio(folio, last_byte) >> > + blkbits; > + blks_per_folio = i_blocks_per_folio(inode, folio); Unused? > + for (i = first_blk; i <= last_blk; i++) { > + if (!iop_test_block_dirty(folio, i)) > + punch(inode, i << blkbits, > + 1 << blkbits); > + } > +skip_iop_punch: Looks reasonable at first glance, though the deep indentation here kind of makes my eyes cross. Could we stuff this loop into a iomap_write_delalloc_scan_folio() helper or some such, and then maybe update the comment at the top of the whole branch to explain both punches? WRT to the !iop case.. I _think_ this is handled correctly here because that means we'd handle the folio as completely dirty at writeback time. Is that the case? If so, it might be nice to document that assumption somewhere (here or perhaps in the writeback path). Brian > /* > * Make sure the next punch start is correctly bound to > * the end of this data range, not the end of the folio. > @@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, > struct writeback_control *wbc, struct inode *inode, > struct folio *folio, u64 end_pos) > { > - struct iomap_page *iop = iop_alloc(inode, folio, 0); > + struct iomap_page *iop = iop_alloc(inode, folio, 0, true); > struct iomap_ioend *ioend, *next; > unsigned len = i_blocksize(inode); > unsigned nblocks = i_blocks_per_folio(inode, folio); > @@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, > * invalid, grab a new one. 
> */ > for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) { > - if (iop && !iop_test_block_uptodate(folio, i)) > + if (iop && !iop_test_block_dirty(folio, i)) > continue; > > error = wpc->ops->map_blocks(wpc, inode, pos); > @@ -1726,6 +1826,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, > } > } > > + iop_clear_range_dirty(folio, 0, end_pos - folio_pos(folio)); > folio_start_writeback(folio); > folio_unlock(folio); > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index 2ef78aa1d3f6..77c7332ae197 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = { > .read_folio = xfs_vm_read_folio, > .readahead = xfs_vm_readahead, > .writepages = xfs_vm_writepages, > - .dirty_folio = filemap_dirty_folio, > + .dirty_folio = iomap_dirty_folio, > .release_folio = iomap_release_folio, > .invalidate_folio = iomap_invalidate_folio, > .bmap = xfs_vm_bmap, > diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c > index 132f01d3461f..e508c8e97372 100644 > --- a/fs/zonefs/file.c > +++ b/fs/zonefs/file.c > @@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = { > .read_folio = zonefs_read_folio, > .readahead = zonefs_readahead, > .writepages = zonefs_writepages, > - .dirty_folio = filemap_dirty_folio, > + .dirty_folio = iomap_dirty_folio, > .release_folio = iomap_release_folio, > .invalidate_folio = iomap_invalidate_folio, > .migrate_folio = filemap_migrate_folio, > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > index 0f8123504e5e..0c2bee80565c 100644 > --- a/include/linux/iomap.h > +++ b/include/linux/iomap.h > @@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count); > struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos); > bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags); > void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len); > +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio); > int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len, > const struct iomap_ops *ops); > int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, > -- > 2.39.2 >
Brian Foster <bfoster@redhat.com> writes: > On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote: >> When filesystem blocksize is less than folio size (either with >> mapping_large_folio_support() or with blocksize < pagesize) and when the >> folio is uptodate in pagecache, then even a byte write can cause >> an entire folio to be written to disk during writeback. This happens >> because we currently don't have a mechanism to track per-block dirty >> state within struct iomap_page. We currently only track uptodate state. >> >> This patch implements support for tracking per-block dirty state in >> iomap_page->state bitmap. This should help improve the filesystem write >> performance and help reduce write amplification. >> >> Performance testing of below fio workload reveals ~16x performance >> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize) >> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps. >> >> 1. <test_randwrite.fio> >> [global] >> ioengine=psync >> rw=randwrite >> overwrite=1 >> pre_read=1 >> direct=0 >> bs=4k >> size=1G >> dir=./ >> numjobs=8 >> fdatasync=1 >> runtime=60 >> iodepth=64 >> group_reporting=1 >> >> [fio-run] >> >> 2. Also our internal performance team reported that this patch improves >> their database workload performance by around ~83% (with XFS on Power) >> >> Reported-by: Aravinda Herle <araherle@in.ibm.com> >> Reported-by: Brian Foster <bfoster@redhat.com> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> >> --- >> fs/gfs2/aops.c | 2 +- >> fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++--- >> fs/xfs/xfs_aops.c | 2 +- >> fs/zonefs/file.c | 2 +- >> include/linux/iomap.h | 1 + >> 5 files changed, 112 insertions(+), 10 deletions(-) >> > ... >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >> index 25f20f269214..c7f41b26280a 100644 >> --- a/fs/iomap/buffered-io.c >> +++ b/fs/iomap/buffered-io.c > ... >> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio, >> else >> gfp = GFP_NOFS | __GFP_NOFAIL; >> >> - iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)), >> + /* >> + * iop->state tracks two sets of state flags when the >> + * filesystem block size is smaller than the folio size. >> + * The first state tracks per-block uptodate and the >> + * second tracks per-block dirty state. >> + */ >> + iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)), >> gfp); >> if (iop) { >> spin_lock_init(&iop->state_lock); >> if (folio_test_uptodate(folio)) >> iop_set_range(iop, 0, nr_blocks); >> + if (is_dirty) >> + iop_set_range(iop, nr_blocks, nr_blocks); > > I find the is_dirty logic here a bit confusing. AFAICT, this is > primarily to handle the case where a folio is completely overwritten, so > no iop is allocated at write time, and so then writeback needs to > allocate the iop as fully dirty to reflect that. I.e., all iop_alloc() > callers other than iomap_writepage_map() either pass the result of > folio_test_dirty() or explicitly dirty the entire range of the folio > anyways. iomap_dirty_folio() essentially does the latter because it > needs to dirty the entire folio regardless of whether the iop already > exists or not, right? Yes. > > If so and if I'm following all of that correctly, could this complexity > be isolated to iomap_writepage_map() by simply checking for the !iop > case first, then call iop_alloc() immediately followed by > set_range_dirty() of the entire folio? 
Then presumably iop_alloc() could > always just dirty based on folio state with the writeback path exception > case handled explicitly. Hm? > Hi Brian, It was discussed here [1] to pass is_dirty flag at the time of iop allocation. We can do what you are essentially suggesting, but it's just extra work i.e. we will again do some calculations of blocks_per_folio, start, end and more importantly take and release iop->state_lock spinlock. Whereas with above approach we could get away with this at the time of iop allocation itself. Besides, isn't it easier this way? which as you also stated we will dirty all the blocks based on is_dirty flag, which is folio_test_dirty() except at the writeback time. >> folio_attach_private(folio, iop); >> } >> return iop; > ... >> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len) >> } >> EXPORT_SYMBOL_GPL(iomap_invalidate_folio); >> >> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio) >> +{ >> + struct iomap_page *iop; >> + struct inode *inode = mapping->host; >> + size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits; > > folio_size()? > sure. >> + >> + iop = iop_alloc(inode, folio, 0, false); >> + iop_set_range_dirty(inode, folio, 0, len); >> + return filemap_dirty_folio(mapping, folio); >> +} >> +EXPORT_SYMBOL_GPL(iomap_dirty_folio); >> + >> static void >> iomap_write_failed(struct inode *inode, loff_t pos, unsigned len) >> { > ... >> @@ -978,6 +1056,28 @@ static int iomap_write_delalloc_scan(struct inode *inode, >> } >> } >> >> + /* >> + * When we have per-block dirty tracking, there can be >> + * blocks within a folio which are marked uptodate >> + * but not dirty. In that case it is necessary to punch >> + * out such blocks to avoid leaking any delalloc blocks. >> + */ >> + iop = to_iomap_page(folio); >> + if (!iop) >> + goto skip_iop_punch; >> + last_byte = min_t(loff_t, end_byte - 1, >> + (folio_next_index(folio) << PAGE_SHIFT) - 1); >> + first_blk = offset_in_folio(folio, start_byte) >> >> + blkbits; >> + last_blk = offset_in_folio(folio, last_byte) >> >> + blkbits; >> + blks_per_folio = i_blocks_per_folio(inode, folio); > > Unused? > Yes. I will fix it in next rev somehow compilation didn't give me any warnings. >> + for (i = first_blk; i <= last_blk; i++) { >> + if (!iop_test_block_dirty(folio, i)) >> + punch(inode, i << blkbits, >> + 1 << blkbits); >> + } >> +skip_iop_punch: > > Looks reasonable at first glance, though the deep indentation here kind > of makes my eyes cross. Could we stuff this loop into a > iomap_write_delalloc_scan_folio() helper or some such, and then maybe > update the comment at the top of the whole branch to explain both > punches? I felt that too. Sure will give it a try in the next spin. > > WRT to the !iop case.. I _think_ this is handled correctly here because > that means we'd handle the folio as completely dirty at writeback time. > Is that the case? If so, it might be nice to document that assumption > somewhere (here or perhaps in the writeback path). > !iop case is simply when we don't have a large folio and blocksize == pagesize. In that case we don't allocate any iop and simply returns from iop_alloc(). So then we just skip the loop which is only meant when we have blocks within a folio. -ritesh > Brian > >> /* >> * Make sure the next punch start is correctly bound to >> * the end of this data range, not the end of the folio. 
>> @@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, >> struct writeback_control *wbc, struct inode *inode, >> struct folio *folio, u64 end_pos) >> { >> - struct iomap_page *iop = iop_alloc(inode, folio, 0); >> + struct iomap_page *iop = iop_alloc(inode, folio, 0, true); >> struct iomap_ioend *ioend, *next; >> unsigned len = i_blocksize(inode); >> unsigned nblocks = i_blocks_per_folio(inode, folio); >> @@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, >> * invalid, grab a new one. >> */ >> for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) { >> - if (iop && !iop_test_block_uptodate(folio, i)) >> + if (iop && !iop_test_block_dirty(folio, i)) >> continue; >> >> error = wpc->ops->map_blocks(wpc, inode, pos); >> @@ -1726,6 +1826,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, >> } >> } >> >> + iop_clear_range_dirty(folio, 0, end_pos - folio_pos(folio)); >> folio_start_writeback(folio); >> folio_unlock(folio); >> >> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c >> index 2ef78aa1d3f6..77c7332ae197 100644 >> --- a/fs/xfs/xfs_aops.c >> +++ b/fs/xfs/xfs_aops.c >> @@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = { >> .read_folio = xfs_vm_read_folio, >> .readahead = xfs_vm_readahead, >> .writepages = xfs_vm_writepages, >> - .dirty_folio = filemap_dirty_folio, >> + .dirty_folio = iomap_dirty_folio, >> .release_folio = iomap_release_folio, >> .invalidate_folio = iomap_invalidate_folio, >> .bmap = xfs_vm_bmap, >> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c >> index 132f01d3461f..e508c8e97372 100644 >> --- a/fs/zonefs/file.c >> +++ b/fs/zonefs/file.c >> @@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = { >> .read_folio = zonefs_read_folio, >> .readahead = zonefs_readahead, >> .writepages = zonefs_writepages, >> - .dirty_folio = filemap_dirty_folio, >> + .dirty_folio = iomap_dirty_folio, >> .release_folio = iomap_release_folio, >> .invalidate_folio = iomap_invalidate_folio, >> .migrate_folio = filemap_migrate_folio, >> diff --git a/include/linux/iomap.h b/include/linux/iomap.h >> index 0f8123504e5e..0c2bee80565c 100644 >> --- a/include/linux/iomap.h >> +++ b/include/linux/iomap.h >> @@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count); >> struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos); >> bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags); >> void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len); >> +bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio); >> int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len, >> const struct iomap_ops *ops); >> int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, >> -- >> 2.39.2 >>
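As an aside for readers following the thread: the factored-out helper Brian asks for might look roughly like the sketch below. The name, the exact signature, and the omission of error handling from the punch callback are illustrative assumptions on my part; this is not the version that eventually landed.

/*
 * Rough sketch (untested) of hoisting the per-block punch loop out of
 * iomap_write_delalloc_scan(). Relies on the iop helpers added earlier
 * in this series; punch() error handling is elided for brevity.
 */
static void iomap_write_delalloc_punch_blocks(struct inode *inode,
		struct folio *folio, loff_t start_byte, loff_t end_byte,
		int (*punch)(struct inode *inode, loff_t offset, loff_t length))
{
	unsigned int blkbits = inode->i_blkbits;
	unsigned int first_blk, last_blk, i;
	loff_t last_byte;

	/*
	 * No iop: either bs == ps, or the folio was fully overwritten and
	 * is treated as entirely dirty at writeback; nothing to punch.
	 */
	if (!to_iomap_page(folio))
		return;

	last_byte = min_t(loff_t, end_byte - 1,
			(folio_next_index(folio) << PAGE_SHIFT) - 1);
	first_blk = offset_in_folio(folio, start_byte) >> blkbits;
	last_blk = offset_in_folio(folio, last_byte) >> blkbits;

	/* Punch uptodate-but-clean blocks so no delalloc extents leak. */
	for (i = first_blk; i <= last_blk; i++) {
		if (!iop_test_block_dirty(folio, i))
			punch(inode, (loff_t)i << blkbits, 1 << blkbits);
	}
}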
On Tue, May 16, 2023 at 08:19:31PM +0530, Ritesh Harjani wrote: > Brian Foster <bfoster@redhat.com> writes: > > > On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote: > >> When filesystem blocksize is less than folio size (either with > >> mapping_large_folio_support() or with blocksize < pagesize) and when the > >> folio is uptodate in pagecache, then even a byte write can cause > >> an entire folio to be written to disk during writeback. This happens > >> because we currently don't have a mechanism to track per-block dirty > >> state within struct iomap_page. We currently only track uptodate state. > >> > >> This patch implements support for tracking per-block dirty state in > >> iomap_page->state bitmap. This should help improve the filesystem write > >> performance and help reduce write amplification. > >> > >> Performance testing of below fio workload reveals ~16x performance > >> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize) > >> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps. > >> > >> 1. <test_randwrite.fio> > >> [global] > >> ioengine=psync > >> rw=randwrite > >> overwrite=1 > >> pre_read=1 > >> direct=0 > >> bs=4k > >> size=1G > >> dir=./ > >> numjobs=8 > >> fdatasync=1 > >> runtime=60 > >> iodepth=64 > >> group_reporting=1 > >> > >> [fio-run] > >> > >> 2. Also our internal performance team reported that this patch improves > >> their database workload performance by around ~83% (with XFS on Power) > >> > >> Reported-by: Aravinda Herle <araherle@in.ibm.com> > >> Reported-by: Brian Foster <bfoster@redhat.com> > >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> > >> --- > >> fs/gfs2/aops.c | 2 +- > >> fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++--- > >> fs/xfs/xfs_aops.c | 2 +- > >> fs/zonefs/file.c | 2 +- > >> include/linux/iomap.h | 1 + > >> 5 files changed, 112 insertions(+), 10 deletions(-) > >> > > ... > >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > >> index 25f20f269214..c7f41b26280a 100644 > >> --- a/fs/iomap/buffered-io.c > >> +++ b/fs/iomap/buffered-io.c > > ... > >> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio, > >> else > >> gfp = GFP_NOFS | __GFP_NOFAIL; > >> > >> - iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)), > >> + /* > >> + * iop->state tracks two sets of state flags when the > >> + * filesystem block size is smaller than the folio size. > >> + * The first state tracks per-block uptodate and the > >> + * second tracks per-block dirty state. > >> + */ > >> + iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)), > >> gfp); > >> if (iop) { > >> spin_lock_init(&iop->state_lock); > >> if (folio_test_uptodate(folio)) > >> iop_set_range(iop, 0, nr_blocks); > >> + if (is_dirty) > >> + iop_set_range(iop, nr_blocks, nr_blocks); > > > > I find the is_dirty logic here a bit confusing. AFAICT, this is > > primarily to handle the case where a folio is completely overwritten, so > > no iop is allocated at write time, and so then writeback needs to > > allocate the iop as fully dirty to reflect that. I.e., all iop_alloc() > > callers other than iomap_writepage_map() either pass the result of > > folio_test_dirty() or explicitly dirty the entire range of the folio > > anyways. iomap_dirty_folio() essentially does the latter because it > > needs to dirty the entire folio regardless of whether the iop already > > exists or not, right? > > Yes. 
> > > > > If so and if I'm following all of that correctly, could this complexity > > be isolated to iomap_writepage_map() by simply checking for the !iop > > case first, then call iop_alloc() immediately followed by > > set_range_dirty() of the entire folio? Then presumably iop_alloc() could > > always just dirty based on folio state with the writeback path exception > > case handled explicitly. Hm? > > > > Hi Brian, > > It was discussed here [1] to pass is_dirty flag at the time of iop > allocation. We can do what you are essentially suggesting, but it's just > extra work i.e. we will again do some calculations of blocks_per_folio, > start, end and more importantly take and release iop->state_lock > spinlock. Whereas with above approach we could get away with this at the > time of iop allocation itself. > Hi Ritesh, Isn't that extra work already occurring in iomap_dirty_folio()? I was just thinking that maybe moving it to where it's apparently needed (i.e. writeback) might eliminate the need for the param. I suppose iomap_dirty_folio() would need to call filemap_dirty_folio() first to make sure iop_alloc() sees the dirty state, but maybe that would also allow skipping the iop alloc if the folio was already dirty (i.e. if the folio was previously dirtied by a full buffered overwite for example)? I've appended a quick diff below (compile tested only) just to explain what I mean. When doing that it also occurred to me that if we really care about the separate call, we could keep the is_dirty param but do the __iop_alloc() wrapper thing where iop_alloc() always passes folio_test_dirty(). BTW, I think you left off your [1] discussion reference.. > Besides, isn't it easier this way? which as you also stated we will > dirty all the blocks based on is_dirty flag, which is folio_test_dirty() > except at the writeback time. > My thinking was just that I kind of had to read through all of the iop_alloc() callsites to grok the purpose of the parameter, which made it seem unnecessarily confusing. But ultimately it made sense, so I don't insist on changing it or anything if this approach is intentional and/or preferred by others. That's just my .02 and I'll defer to your preference. :) > > >> folio_attach_private(folio, iop); > >> } > >> return iop; > > ... > >> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len) ... > > > > WRT to the !iop case.. I _think_ this is handled correctly here because > > that means we'd handle the folio as completely dirty at writeback time. > > Is that the case? If so, it might be nice to document that assumption > > somewhere (here or perhaps in the writeback path). > > > > !iop case is simply when we don't have a large folio and blocksize == > pagesize. In that case we don't allocate any iop and simply returns > from iop_alloc(). > So then we just skip the loop which is only meant when we have blocks > within a folio. > Isn't it also the case that iop might be NULL at this point if the fs has sub-folio blocks, but the folio was dirtied by a full overwrite of the folio? Or did I misunderstand patch 4? 
Brian --- 8< --- diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 92e1e1061225..89b3053e3f2d 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -155,7 +155,7 @@ static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len) } static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio, - unsigned int flags, bool is_dirty) + unsigned int flags) { struct iomap_page *iop = to_iomap_page(folio); unsigned int nr_blocks = i_blocks_per_folio(inode, folio); @@ -181,7 +181,7 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio, spin_lock_init(&iop->state_lock); if (folio_test_uptodate(folio)) iop_set_range(iop, 0, nr_blocks); - if (is_dirty) + if (folio_test_dirty(folio)) iop_set_range(iop, nr_blocks, nr_blocks); folio_attach_private(folio, iop); } @@ -326,8 +326,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter, if (WARN_ON_ONCE(size > iomap->length)) return -EIO; if (offset > 0) - iop = iop_alloc(iter->inode, folio, iter->flags, - folio_test_dirty(folio)); + iop = iop_alloc(iter->inode, folio, iter->flags); else iop = to_iomap_page(folio); @@ -365,8 +364,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter, return iomap_read_inline_data(iter, folio); /* zero post-eof blocks as the page may be mapped */ - iop = iop_alloc(iter->inode, folio, iter->flags, - folio_test_dirty(folio)); + iop = iop_alloc(iter->inode, folio, iter->flags); iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen); if (plen == 0) goto done; @@ -616,13 +614,10 @@ EXPORT_SYMBOL_GPL(iomap_invalidate_folio); bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio) { - struct iomap_page *iop; - struct inode *inode = mapping->host; - size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits; - - iop = iop_alloc(inode, folio, 0, false); - iop_set_range_dirty(inode, folio, 0, len); - return filemap_dirty_folio(mapping, folio); + bool dirtied = filemap_dirty_folio(mapping, folio); + if (dirtied) + iop_alloc(mapping->host, folio, 0); + return dirtied; } EXPORT_SYMBOL_GPL(iomap_dirty_folio); @@ -673,8 +668,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos, pos + len >= folio_pos(folio) + folio_size(folio)) return 0; - iop = iop_alloc(iter->inode, folio, iter->flags, - folio_test_dirty(folio)); + iop = iop_alloc(iter->inode, folio, iter->flags); if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1) return -EAGAIN; @@ -1759,7 +1753,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, struct writeback_control *wbc, struct inode *inode, struct folio *folio, u64 end_pos) { - struct iomap_page *iop = iop_alloc(inode, folio, 0, true); + struct iomap_page *iop = to_iomap_page(folio); struct iomap_ioend *ioend, *next; unsigned len = i_blocksize(inode); unsigned nblocks = i_blocks_per_folio(inode, folio); @@ -1767,6 +1761,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, int error = 0, count = 0, i; LIST_HEAD(submit_list); + if (!iop) { + iop = iop_alloc(inode, folio, 0); + iop_set_range_dirty(inode, folio, 0, folio_size(folio)); + } + WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0); /*
Brian Foster <bfoster@redhat.com> writes: > On Tue, May 16, 2023 at 08:19:31PM +0530, Ritesh Harjani wrote: >> Brian Foster <bfoster@redhat.com> writes: >> >> > On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote: >> >> When filesystem blocksize is less than folio size (either with >> >> mapping_large_folio_support() or with blocksize < pagesize) and when the >> >> folio is uptodate in pagecache, then even a byte write can cause >> >> an entire folio to be written to disk during writeback. This happens >> >> because we currently don't have a mechanism to track per-block dirty >> >> state within struct iomap_page. We currently only track uptodate state. >> >> >> >> This patch implements support for tracking per-block dirty state in >> >> iomap_page->state bitmap. This should help improve the filesystem write >> >> performance and help reduce write amplification. >> >> >> >> Performance testing of below fio workload reveals ~16x performance >> >> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize) >> >> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps. >> >> >> >> 1. <test_randwrite.fio> >> >> [global] >> >> ioengine=psync >> >> rw=randwrite >> >> overwrite=1 >> >> pre_read=1 >> >> direct=0 >> >> bs=4k >> >> size=1G >> >> dir=./ >> >> numjobs=8 >> >> fdatasync=1 >> >> runtime=60 >> >> iodepth=64 >> >> group_reporting=1 >> >> >> >> [fio-run] >> >> >> >> 2. Also our internal performance team reported that this patch improves >> >> their database workload performance by around ~83% (with XFS on Power) >> >> >> >> Reported-by: Aravinda Herle <araherle@in.ibm.com> >> >> Reported-by: Brian Foster <bfoster@redhat.com> >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> >> >> --- >> >> fs/gfs2/aops.c | 2 +- >> >> fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++--- >> >> fs/xfs/xfs_aops.c | 2 +- >> >> fs/zonefs/file.c | 2 +- >> >> include/linux/iomap.h | 1 + >> >> 5 files changed, 112 insertions(+), 10 deletions(-) >> >> >> > ... >> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >> >> index 25f20f269214..c7f41b26280a 100644 >> >> --- a/fs/iomap/buffered-io.c >> >> +++ b/fs/iomap/buffered-io.c >> > ... >> >> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio, >> >> else >> >> gfp = GFP_NOFS | __GFP_NOFAIL; >> >> >> >> - iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)), >> >> + /* >> >> + * iop->state tracks two sets of state flags when the >> >> + * filesystem block size is smaller than the folio size. >> >> + * The first state tracks per-block uptodate and the >> >> + * second tracks per-block dirty state. >> >> + */ >> >> + iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)), >> >> gfp); >> >> if (iop) { >> >> spin_lock_init(&iop->state_lock); >> >> if (folio_test_uptodate(folio)) >> >> iop_set_range(iop, 0, nr_blocks); >> >> + if (is_dirty) >> >> + iop_set_range(iop, nr_blocks, nr_blocks); >> > >> > I find the is_dirty logic here a bit confusing. AFAICT, this is >> > primarily to handle the case where a folio is completely overwritten, so >> > no iop is allocated at write time, and so then writeback needs to >> > allocate the iop as fully dirty to reflect that. I.e., all iop_alloc() >> > callers other than iomap_writepage_map() either pass the result of >> > folio_test_dirty() or explicitly dirty the entire range of the folio >> > anyways. 
iomap_dirty_folio() essentially does the latter because it >> > needs to dirty the entire folio regardless of whether the iop already >> > exists or not, right? >> >> Yes. >> >> > >> > If so and if I'm following all of that correctly, could this complexity >> > be isolated to iomap_writepage_map() by simply checking for the !iop >> > case first, then call iop_alloc() immediately followed by >> > set_range_dirty() of the entire folio? Then presumably iop_alloc() could >> > always just dirty based on folio state with the writeback path exception >> > case handled explicitly. Hm? >> > >> >> Hi Brian, >> >> It was discussed here [1] to pass is_dirty flag at the time of iop >> allocation. We can do what you are essentially suggesting, but it's just >> extra work i.e. we will again do some calculations of blocks_per_folio, >> start, end and more importantly take and release iop->state_lock >> spinlock. Whereas with above approach we could get away with this at the >> time of iop allocation itself. >> > > Hi Ritesh, > > Isn't that extra work already occurring in iomap_dirty_folio()? I was > just thinking that maybe moving it to where it's apparently needed (i.e. > writeback) might eliminate the need for the param. > > I suppose iomap_dirty_folio() would need to call filemap_dirty_folio() > first to make sure iop_alloc() sees the dirty state, but maybe that > would also allow skipping the iop alloc if the folio was already dirty > (i.e. if the folio was previously dirtied by a full buffered overwite > for example)? > > I've appended a quick diff below (compile tested only) just to explain > what I mean. When doing that it also occurred to me that if we really > care about the separate call, we could keep the is_dirty param but do > the __iop_alloc() wrapper thing where iop_alloc() always passes > folio_test_dirty(). Sure. Brian. I guess when we got the comment, it was still in the intial work & I was anyway passing a from_writeback flag. Thanks for the diff, it does make sense to me. I will try to add that change in the next revision. > > BTW, I think you left off your [1] discussion reference.. > Sorry, my bad. [1]: https://lore.kernel.org/linux-xfs/Y9f7cZxnXbL7x0p+@infradead.org/ >> Besides, isn't it easier this way? which as you also stated we will >> dirty all the blocks based on is_dirty flag, which is folio_test_dirty() >> except at the writeback time. >> > > My thinking was just that I kind of had to read through all of the > iop_alloc() callsites to grok the purpose of the parameter, which made > it seem unnecessarily confusing. But ultimately it made sense, so I > don't insist on changing it or anything if this approach is intentional > and/or preferred by others. That's just my .02 and I'll defer to your > preference. :) > Sure Thanks! >> >> >> folio_attach_private(folio, iop); >> >> } >> >> return iop; >> > ... >> >> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len) > ... >> > >> > WRT to the !iop case.. I _think_ this is handled correctly here because >> > that means we'd handle the folio as completely dirty at writeback time. >> > Is that the case? If so, it might be nice to document that assumption >> > somewhere (here or perhaps in the writeback path). >> > >> >> !iop case is simply when we don't have a large folio and blocksize == >> pagesize. In that case we don't allocate any iop and simply returns >> from iop_alloc(). >> So then we just skip the loop which is only meant when we have blocks >> within a folio. 
>> > > Isn't it also the case that iop might be NULL at this point if the fs > has sub-folio blocks, but the folio was dirtied by a full overwrite of > the folio? Or did I misunderstand patch 4? > Yes. Both of the cases. We can either have bs == ps or we can have a complete overwritten folio in which case we don't allocate any iop in iomap_writepage_map() function. Sure. I will document that when we factor out that change in a seperate function. > Brian > > --- 8< --- > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 92e1e1061225..89b3053e3f2d 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -155,7 +155,7 @@ static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len) > } > > static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio, > - unsigned int flags, bool is_dirty) > + unsigned int flags) > { > struct iomap_page *iop = to_iomap_page(folio); > unsigned int nr_blocks = i_blocks_per_folio(inode, folio); > @@ -181,7 +181,7 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio, > spin_lock_init(&iop->state_lock); > if (folio_test_uptodate(folio)) > iop_set_range(iop, 0, nr_blocks); > - if (is_dirty) > + if (folio_test_dirty(folio)) > iop_set_range(iop, nr_blocks, nr_blocks); > folio_attach_private(folio, iop); > } > @@ -326,8 +326,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter, > if (WARN_ON_ONCE(size > iomap->length)) > return -EIO; > if (offset > 0) > - iop = iop_alloc(iter->inode, folio, iter->flags, > - folio_test_dirty(folio)); > + iop = iop_alloc(iter->inode, folio, iter->flags); > else > iop = to_iomap_page(folio); > > @@ -365,8 +364,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter, > return iomap_read_inline_data(iter, folio); > > /* zero post-eof blocks as the page may be mapped */ > - iop = iop_alloc(iter->inode, folio, iter->flags, > - folio_test_dirty(folio)); > + iop = iop_alloc(iter->inode, folio, iter->flags); > iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen); > if (plen == 0) > goto done; > @@ -616,13 +614,10 @@ EXPORT_SYMBOL_GPL(iomap_invalidate_folio); > > bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio) > { > - struct iomap_page *iop; > - struct inode *inode = mapping->host; > - size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits; > - > - iop = iop_alloc(inode, folio, 0, false); > - iop_set_range_dirty(inode, folio, 0, len); > - return filemap_dirty_folio(mapping, folio); > + bool dirtied = filemap_dirty_folio(mapping, folio); > + if (dirtied) > + iop_alloc(mapping->host, folio, 0); > + return dirtied; > } > EXPORT_SYMBOL_GPL(iomap_dirty_folio); > > @@ -673,8 +668,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos, > pos + len >= folio_pos(folio) + folio_size(folio)) > return 0; > > - iop = iop_alloc(iter->inode, folio, iter->flags, > - folio_test_dirty(folio)); > + iop = iop_alloc(iter->inode, folio, iter->flags); > > if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1) > return -EAGAIN; > @@ -1759,7 +1753,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, > struct writeback_control *wbc, struct inode *inode, > struct folio *folio, u64 end_pos) > { > - struct iomap_page *iop = iop_alloc(inode, folio, 0, true); > + struct iomap_page *iop = to_iomap_page(folio); > struct iomap_ioend *ioend, *next; > unsigned len = i_blocksize(inode); > unsigned nblocks = i_blocks_per_folio(inode, folio); > @@ -1767,6 +1761,11 @@ 
iomap_writepage_map(struct iomap_writepage_ctx *wpc, > int error = 0, count = 0, i; > LIST_HEAD(submit_list); > > + if (!iop) { > + iop = iop_alloc(inode, folio, 0); > + iop_set_range_dirty(inode, folio, 0, folio_size(folio)); > + } > + > WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0); > > /* Thanks for the review! -ritesh
On Wed, May 17, 2023 at 08:50:39PM +0530, Ritesh Harjani wrote: > Brian Foster <bfoster@redhat.com> writes: > > > On Tue, May 16, 2023 at 08:19:31PM +0530, Ritesh Harjani wrote: > >> Brian Foster <bfoster@redhat.com> writes: > >> > >> > On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote: > >> >> When filesystem blocksize is less than folio size (either with > >> >> mapping_large_folio_support() or with blocksize < pagesize) and when the > >> >> folio is uptodate in pagecache, then even a byte write can cause > >> >> an entire folio to be written to disk during writeback. This happens > >> >> because we currently don't have a mechanism to track per-block dirty > >> >> state within struct iomap_page. We currently only track uptodate state. > >> >> > >> >> This patch implements support for tracking per-block dirty state in > >> >> iomap_page->state bitmap. This should help improve the filesystem write > >> >> performance and help reduce write amplification. > >> >> > >> >> Performance testing of below fio workload reveals ~16x performance > >> >> improvement using nvme with XFS (4k blocksize) on Power (64K pagesize) > >> >> FIO reported write bw scores improved from around ~28 MBps to ~452 MBps. > >> >> > >> >> 1. <test_randwrite.fio> > >> >> [global] > >> >> ioengine=psync > >> >> rw=randwrite > >> >> overwrite=1 > >> >> pre_read=1 > >> >> direct=0 > >> >> bs=4k > >> >> size=1G > >> >> dir=./ > >> >> numjobs=8 > >> >> fdatasync=1 > >> >> runtime=60 > >> >> iodepth=64 > >> >> group_reporting=1 > >> >> > >> >> [fio-run] > >> >> > >> >> 2. Also our internal performance team reported that this patch improves > >> >> their database workload performance by around ~83% (with XFS on Power) > >> >> > >> >> Reported-by: Aravinda Herle <araherle@in.ibm.com> > >> >> Reported-by: Brian Foster <bfoster@redhat.com> > >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> > >> >> --- > >> >> fs/gfs2/aops.c | 2 +- > >> >> fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++--- > >> >> fs/xfs/xfs_aops.c | 2 +- > >> >> fs/zonefs/file.c | 2 +- > >> >> include/linux/iomap.h | 1 + > >> >> 5 files changed, 112 insertions(+), 10 deletions(-) > >> >> > >> > ... > >> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > >> >> index 25f20f269214..c7f41b26280a 100644 > >> >> --- a/fs/iomap/buffered-io.c > >> >> +++ b/fs/iomap/buffered-io.c > >> > ... > >> >> @@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio, > >> >> else > >> >> gfp = GFP_NOFS | __GFP_NOFAIL; > >> >> > >> >> - iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)), > >> >> + /* > >> >> + * iop->state tracks two sets of state flags when the > >> >> + * filesystem block size is smaller than the folio size. > >> >> + * The first state tracks per-block uptodate and the > >> >> + * second tracks per-block dirty state. > >> >> + */ > >> >> + iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)), > >> >> gfp); > >> >> if (iop) { > >> >> spin_lock_init(&iop->state_lock); > >> >> if (folio_test_uptodate(folio)) > >> >> iop_set_range(iop, 0, nr_blocks); > >> >> + if (is_dirty) > >> >> + iop_set_range(iop, nr_blocks, nr_blocks); > >> > > >> > I find the is_dirty logic here a bit confusing. AFAICT, this is > >> > primarily to handle the case where a folio is completely overwritten, so > >> > no iop is allocated at write time, and so then writeback needs to > >> > allocate the iop as fully dirty to reflect that. 
I.e., all iop_alloc()
> >> > callers other than iomap_writepage_map() either pass the result of
> >> > folio_test_dirty() or explicitly dirty the entire range of the folio
> >> > anyways. iomap_dirty_folio() essentially does the latter because it
> >> > needs to dirty the entire folio regardless of whether the iop already
> >> > exists or not, right?
> >>
> >> Yes.
> >>
> >> >
> >> > If so and if I'm following all of that correctly, could this complexity
> >> > be isolated to iomap_writepage_map() by simply checking for the !iop
> >> > case first, then call iop_alloc() immediately followed by
> >> > set_range_dirty() of the entire folio? Then presumably iop_alloc() could
> >> > always just dirty based on folio state with the writeback path exception
> >> > case handled explicitly. Hm?
> >> >
> >>
> >> Hi Brian,
> >>
> >> It was discussed here [1] to pass is_dirty flag at the time of iop
> >> allocation. We can do what you are essentially suggesting, but it's just
> >> extra work i.e. we will again do some calculations of blocks_per_folio,
> >> start, end and more importantly take and release iop->state_lock
> >> spinlock. Whereas with above approach we could get away with this at the
> >> time of iop allocation itself.
> >>
> >
> > Hi Ritesh,
> >
> > Isn't that extra work already occurring in iomap_dirty_folio()? I was
> > just thinking that maybe moving it to where it's apparently needed (i.e.
> > writeback) might eliminate the need for the param.
> >
> > I suppose iomap_dirty_folio() would need to call filemap_dirty_folio()
> > first to make sure iop_alloc() sees the dirty state, but maybe that
> > would also allow skipping the iop alloc if the folio was already dirty
> > (i.e. if the folio was previously dirtied by a full buffered overwite
> > for example)?
> >
> > I've appended a quick diff below (compile tested only) just to explain
> > what I mean. When doing that it also occurred to me that if we really
> > care about the separate call, we could keep the is_dirty param but do
> > the __iop_alloc() wrapper thing where iop_alloc() always passes
> > folio_test_dirty().
>
> Sure. Brian. I guess when we got the comment, it was still in the intial
> work & I was anyway passing a from_writeback flag. Thanks for the diff,
> it does make sense to me. I will try to add that change in the next revision.
>

Ok, though I think what I did to iomap_dirty_folio() might be wrong... If we have a folio/iop that is already partially dirty with sub-folio blocks, and then that folio is mapped and write faulted, we still need to explicitly dirty the rest of the iop during the fault, right? If so, then I guess we'd still need to keep the iop_set_range_dirty() call there, at least for that case.

Hmm.. so I suppose I could see doing that a couple different ways. One is to just keep it as you already have it and drop the dirty param. I.e.:

bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
{
	iop_alloc(mapping->host, folio, 0);
	iop_set_range_dirty(mapping->host, folio, 0, folio_size(folio));
	return filemap_dirty_folio(mapping, folio);
}

But I also wonder.. if we can skip the iop alloc on full folio buffered overwrites, isn't that also true of mapped writes to folios that don't already have an iop? I.e., I think what I was trying to do in the previous diff was actually something like this:

bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
{
	iop_set_range_dirty(mapping->host, folio, 0, folio_size(folio));
	return filemap_dirty_folio(mapping, folio);
}

...
which would only dirty the full iop if it already exists. Thoughts? Brian > > > > BTW, I think you left off your [1] discussion reference.. > > > > Sorry, my bad. > > [1]: https://lore.kernel.org/linux-xfs/Y9f7cZxnXbL7x0p+@infradead.org/ > > >> Besides, isn't it easier this way? which as you also stated we will > >> dirty all the blocks based on is_dirty flag, which is folio_test_dirty() > >> except at the writeback time. > >> > > > > My thinking was just that I kind of had to read through all of the > > iop_alloc() callsites to grok the purpose of the parameter, which made > > it seem unnecessarily confusing. But ultimately it made sense, so I > > don't insist on changing it or anything if this approach is intentional > > and/or preferred by others. That's just my .02 and I'll defer to your > > preference. :) > > > > Sure Thanks! > > >> > >> >> folio_attach_private(folio, iop); > >> >> } > >> >> return iop; > >> > ... > >> >> @@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len) > > ... > >> > > >> > WRT to the !iop case.. I _think_ this is handled correctly here because > >> > that means we'd handle the folio as completely dirty at writeback time. > >> > Is that the case? If so, it might be nice to document that assumption > >> > somewhere (here or perhaps in the writeback path). > >> > > >> > >> !iop case is simply when we don't have a large folio and blocksize == > >> pagesize. In that case we don't allocate any iop and simply returns > >> from iop_alloc(). > >> So then we just skip the loop which is only meant when we have blocks > >> within a folio. > >> > > > > Isn't it also the case that iop might be NULL at this point if the fs > > has sub-folio blocks, but the folio was dirtied by a full overwrite of > > the folio? Or did I misunderstand patch 4? > > > > Yes. Both of the cases. We can either have bs == ps or we can have a > complete overwritten folio in which case we don't allocate any iop in > iomap_writepage_map() function. > > Sure. I will document that when we factor out that change in a seperate function. 
> > > Brian > > > > --- 8< --- > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > index 92e1e1061225..89b3053e3f2d 100644 > > --- a/fs/iomap/buffered-io.c > > +++ b/fs/iomap/buffered-io.c > > @@ -155,7 +155,7 @@ static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len) > > } > > > > static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio, > > - unsigned int flags, bool is_dirty) > > + unsigned int flags) > > { > > struct iomap_page *iop = to_iomap_page(folio); > > unsigned int nr_blocks = i_blocks_per_folio(inode, folio); > > @@ -181,7 +181,7 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio, > > spin_lock_init(&iop->state_lock); > > if (folio_test_uptodate(folio)) > > iop_set_range(iop, 0, nr_blocks); > > - if (is_dirty) > > + if (folio_test_dirty(folio)) > > iop_set_range(iop, nr_blocks, nr_blocks); > > folio_attach_private(folio, iop); > > } > > @@ -326,8 +326,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter, > > if (WARN_ON_ONCE(size > iomap->length)) > > return -EIO; > > if (offset > 0) > > - iop = iop_alloc(iter->inode, folio, iter->flags, > > - folio_test_dirty(folio)); > > + iop = iop_alloc(iter->inode, folio, iter->flags); > > else > > iop = to_iomap_page(folio); > > > > @@ -365,8 +364,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter, > > return iomap_read_inline_data(iter, folio); > > > > /* zero post-eof blocks as the page may be mapped */ > > - iop = iop_alloc(iter->inode, folio, iter->flags, > > - folio_test_dirty(folio)); > > + iop = iop_alloc(iter->inode, folio, iter->flags); > > iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen); > > if (plen == 0) > > goto done; > > @@ -616,13 +614,10 @@ EXPORT_SYMBOL_GPL(iomap_invalidate_folio); > > > > bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio) > > { > > - struct iomap_page *iop; > > - struct inode *inode = mapping->host; > > - size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits; > > - > > - iop = iop_alloc(inode, folio, 0, false); > > - iop_set_range_dirty(inode, folio, 0, len); > > - return filemap_dirty_folio(mapping, folio); > > + bool dirtied = filemap_dirty_folio(mapping, folio); > > + if (dirtied) > > + iop_alloc(mapping->host, folio, 0); > > + return dirtied; > > } > > EXPORT_SYMBOL_GPL(iomap_dirty_folio); > > > > @@ -673,8 +668,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos, > > pos + len >= folio_pos(folio) + folio_size(folio)) > > return 0; > > > > - iop = iop_alloc(iter->inode, folio, iter->flags, > > - folio_test_dirty(folio)); > > + iop = iop_alloc(iter->inode, folio, iter->flags); > > > > if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1) > > return -EAGAIN; > > @@ -1759,7 +1753,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, > > struct writeback_control *wbc, struct inode *inode, > > struct folio *folio, u64 end_pos) > > { > > - struct iomap_page *iop = iop_alloc(inode, folio, 0, true); > > + struct iomap_page *iop = to_iomap_page(folio); > > struct iomap_ioend *ioend, *next; > > unsigned len = i_blocksize(inode); > > unsigned nblocks = i_blocks_per_folio(inode, folio); > > @@ -1767,6 +1761,11 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc, > > int error = 0, count = 0, i; > > LIST_HEAD(submit_list); > > > > + if (!iop) { > > + iop = iop_alloc(inode, folio, 0); > > + iop_set_range_dirty(inode, folio, 0, folio_size(folio)); > > + } > > + > > WARN_ON_ONCE(iop && 
atomic_read(&iop->write_bytes_pending) != 0); > > > > /* > > Thanks for the review! > > -ritesh >
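The property that makes the second variant above safe is the one discussed earlier in the thread: a dirty folio without an iop is treated as fully dirty at writeback, so skipping the per-block marking cannot lose writes. That invariant can be sanity-checked with a small standalone model (illustrative names, not kernel code):

/* Userspace model comparing the two iomap_dirty_folio() variants. */
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_BLOCKS 4

struct folio_model {
	bool dirty;	/* folio-level dirty flag */
	bool *iop;	/* per-block dirty bits, or NULL if no iop */
};

/* Variant 1: always allocate the iop and dirty every block. */
static void dirty_folio_alloc(struct folio_model *f)
{
	if (!f->iop)
		f->iop = calloc(NR_BLOCKS, sizeof(bool));	/* the "iop_alloc" */
	for (int i = 0; f->iop && i < NR_BLOCKS; i++)
		f->iop[i] = true;
	f->dirty = true;
}

/* Variant 2: only dirty per-block bits if an iop already exists. */
static void dirty_folio_lazy(struct folio_model *f)
{
	for (int i = 0; f->iop && i < NR_BLOCKS; i++)
		f->iop[i] = true;
	f->dirty = true;
}

/* Writeback: an iop-less dirty folio is written back in full. */
static void writeback(const struct folio_model *f, const char *tag)
{
	for (int i = 0; i < NR_BLOCKS; i++)
		if (!f->iop || f->iop[i])
			printf("%s: write block %d\n", tag, i);
}

int main(void)
{
	struct folio_model a = { 0 }, b = { 0 };

	dirty_folio_alloc(&a);	/* variant 1 */
	dirty_folio_lazy(&b);	/* variant 2: no iop, stays iop-less */

	/* Both emit blocks 0..3: the variants agree on what hits disk. */
	writeback(&a, "alloc");
	writeback(&b, "lazy");
	free(a.iop);
	return 0;
}

Both variants write back all four blocks; they differ only in whether memory for the per-block bits is allocated up front.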
On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> But I also wonder.. if we can skip the iop alloc on full folio buffered
> overwrites, isn't that also true of mapped writes to folios that don't
> already have an iop?

Yes.

> I.e., I think what I was trying to do in the
> previous diff was actually something like this:
>
> bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
> {
> 	iop_set_range_dirty(mapping->host, folio, 0, folio_size(folio));
> 	return filemap_dirty_folio(mapping, folio);
> }
>
> ... which would only dirty the full iop if it already exists. Thoughts?

That does sound pretty good to me, and I can't think of any obvious pitfall.
On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote: > +static inline void iop_clear_range(struct iomap_page *iop, > + unsigned int start_blk, unsigned int nr_blks) > +{ > + bitmap_clear(iop->state, start_blk, nr_blks); > +} Similar to the other trivial bitmap wrappers added earlier in the series I don't think this one is very useful. > + struct iomap_page *iop = to_iomap_page(folio); > + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio); > + unsigned int first_blk = (off >> inode->i_blkbits); > + unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits); > + unsigned int nr_blks = last_blk - first_blk + 1; > + unsigned long flags; > + > + if (!iop) > + return; > + spin_lock_irqsave(&iop->state_lock, flags); > + iop_set_range(iop, first_blk + blks_per_folio, nr_blks); > + spin_unlock_irqrestore(&iop->state_lock, flags); All the variable initializations except for ios should really move into a branch here. Or we just use separate helpers for the case where we actually have an iops and isolate all that, which would be my preference (but goes counter to the direction of changes earlier in the series to the existing functions). > +static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len) > +{ > + struct iomap_page *iop = to_iomap_page(folio); > + struct inode *inode = folio->mapping->host; > + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio); > + unsigned int first_blk = (off >> inode->i_blkbits); > + unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits); > + unsigned int nr_blks = last_blk - first_blk + 1; > + unsigned long flags; > + > + if (!iop) > + return; > + spin_lock_irqsave(&iop->state_lock, flags); > + iop_clear_range(iop, first_blk + blks_per_folio, nr_blks); > + spin_unlock_irqrestore(&iop->state_lock, flags); > +} Same here.
On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote: > On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote: > > But I also wonder.. if we can skip the iop alloc on full folio buffered > > overwrites, isn't that also true of mapped writes to folios that don't > > already have an iop? > > Yes. Hm, well, maybe? If somebody stores to a page, we obviously set the dirty flag on the folio, but depending on the architecture, we may or may not have independent dirty bits on the PTEs (eg if it's a PMD, we have one dirty bit for the entire folio; similarly if ARM uses the contiguous PTE bit). If we do have independent dirty bits, we could dirty only the blocks corresponding to a single page at a time. This has potential for causing some nasty bugs, so I'm inclined to rule that if a folio is mmaped, then it's all dirty from any writable page fault. The fact is that applications generally do not perform writes through mmap because the error handling story is so poor. There may be a different answer for anonymous memory, but that doesn't feel like my problem and shouldn't feel like any FS developer's problem.
Christoph Hellwig <hch@infradead.org> writes: > On Mon, May 08, 2023 at 12:58:00AM +0530, Ritesh Harjani (IBM) wrote: >> +static inline void iop_clear_range(struct iomap_page *iop, >> + unsigned int start_blk, unsigned int nr_blks) >> +{ >> + bitmap_clear(iop->state, start_blk, nr_blks); >> +} > > Similar to the other trivial bitmap wrappers added earlier in the > series, I don't think this one is very useful. > >> + struct iomap_page *iop = to_iomap_page(folio); >> + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio); >> + unsigned int first_blk = (off >> inode->i_blkbits); >> + unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits); >> + unsigned int nr_blks = last_blk - first_blk + 1; >> + unsigned long flags; >> + >> + if (!iop) >> + return; >> + spin_lock_irqsave(&iop->state_lock, flags); >> + iop_set_range(iop, first_blk + blks_per_folio, nr_blks); >> + spin_unlock_irqrestore(&iop->state_lock, flags); > > All the variable initializations except for iop should really move > into a branch here. A branch won't be needed, I guess; I will just move the initializations after the early return. > Or we just use separate helpers for the case > where we actually have an iop and isolate all that, which would > be my preference (but goes counter to the direction of changes > earlier in the series to the existing functions). > >> +static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len) >> +{ >> + struct iomap_page *iop = to_iomap_page(folio); >> + struct inode *inode = folio->mapping->host; >> + unsigned int blks_per_folio = i_blocks_per_folio(inode, folio); >> + unsigned int first_blk = (off >> inode->i_blkbits); >> + unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits); >> + unsigned int nr_blks = last_blk - first_blk + 1; >> + unsigned long flags; >> + >> + if (!iop) >> + return; >> + spin_lock_irqsave(&iop->state_lock, flags); >> + iop_clear_range(iop, first_blk + blks_per_folio, nr_blks); >> + spin_unlock_irqrestore(&iop->state_lock, flags); >> +} > > Same here. Sure. -ritesh
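The agreed rework might look like the following sketch, with stub types so it compiles standalone; the state_lock and the real kernel types are intentionally omitted or simplified, so this is illustrative rather than the actual patch:

#include <stdio.h>

struct iop_stub {
	unsigned long state[4];		/* uptodate bits, then dirty bits */
};

struct folio_stub {
	struct iop_stub *iop;		/* may be NULL */
	unsigned int blkbits;		/* log2(blocksize) */
	unsigned int blks_per_folio;
};

static void set_bits(unsigned long *map, unsigned int start, unsigned int nr)
{
	for (unsigned int i = start; i < start + nr; i++)
		map[i / (8 * sizeof(long))] |= 1UL << (i % (8 * sizeof(long)));
}

static void iop_set_range_dirty(struct folio_stub *folio, unsigned long off,
				unsigned long len)
{
	struct iop_stub *iop = folio->iop;
	unsigned int first_blk, last_blk, nr_blks;

	if (!iop)
		return;

	/* the block-range math only runs once we know an iop is attached */
	first_blk = off >> folio->blkbits;
	last_blk = (off + len - 1) >> folio->blkbits;
	nr_blks = last_blk - first_blk + 1;

	/* dirty bits live after the per-block uptodate bits */
	set_bits(iop->state, first_blk + folio->blks_per_folio, nr_blks);
	/* the real function takes state_lock around the bitmap update */
}

int main(void)
{
	struct iop_stub iop = { { 0 } };
	struct folio_stub folio = { &iop, 12, 16 };	/* 4k blocks, 64k folio */

	iop_set_range_dirty(&folio, 8192, 4096);	/* dirties block 2 */
	printf("state[0] = %#lx\n", iop.state[0]);	/* bit 16+2 set */
	return 0;
}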
Matthew Wilcox <willy@infradead.org> writes: > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote: >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote: >> > But I also wonder.. if we can skip the iop alloc on full folio buffered >> > overwrites, isn't that also true of mapped writes to folios that don't >> > already have an iop? >> >> Yes. > > Hm, well, maybe? If somebody stores to a page, we obviously set the > dirty flag on the folio, but depending on the architecture, we may > or may not have independent dirty bits on the PTEs (eg if it's a PMD, > we have one dirty bit for the entire folio; similarly if ARM uses the > contiguous PTE bit). If we do have independent dirty bits, we could > dirty only the blocks corresponding to a single page at a time. > > This has potential for causing some nasty bugs, so I'm inclined to > rule that if a folio is mmaped, then it's all dirty from any writable > page fault. The fact is that applications generally do not perform > writes through mmap because the error handling story is so poor. > > There may be a different answer for anonymous memory, but that doesn't > feel like my problem and shouldn't feel like any FS developer's problem. I am also skeptical of the change Brian is suggesting here, i.e. not marking all the blocks of the folio dirty when we are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped writes). However, I am sorry but I couldn't completely follow your reasoning above. I think what Brian is suggesting here is that filemap_dirty_folio() should work like the complete buffered overwrite case, where we do not allocate the iop at ->write_begin() time. Then at writeback time we allocate an iop and mark all blocks dirty. In a way the same applies to mmapped writes too, but my only worry is the way mmapped writes work; it makes more sense to keep the folio dirty state and the per-block state within the iop in sync. For that matter, we could even make sure we always allocate an iop in the complete overwrite case as well. I didn't change that code because it was kept that way for the uptodate state as well, and based on one of your inputs for the complete overwrite case. Though I agree that we should ideally be allocating & marking all blocks in the iop dirty in the call to ->dirty_folio(), I just wanted to understand your reasoning better. Thanks! -ritesh
On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote: > Matthew Wilcox <willy@infradead.org> writes: > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote: > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote: > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered > >> > overwrites, isn't that also true of mapped writes to folios that don't > >> > already have an iop? > >> > >> Yes. > > > > Hm, well, maybe? If somebody stores to a page, we obviously set the > > dirty flag on the folio, but depending on the architecture, we may > > or may not have independent dirty bits on the PTEs (eg if it's a PMD, > > we have one dirty bit for the entire folio; similarly if ARM uses the > > contiguous PTE bit). If we do have independent dirty bits, we could > > dirty only the blocks corresponding to a single page at a time. > > > > This has potential for causing some nasty bugs, so I'm inclined to > > rule that if a folio is mmaped, then it's all dirty from any writable > > page fault. The fact is that applications generally do not perform > > writes through mmap because the error handling story is so poor. > > > > There may be a different answer for anonymous memory, but that doesn't > > feel like my problem and shouldn't feel like any FS developer's problem. > > Although I am skeptical too to do the changes which Brian is suggesting > here. i.e. not making all the blocks of the folio dirty when we are > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes). > > However, I am sorry but I coudn't completely follow your reasoning > above. I think what Brian is suggesting here is that > filemap_dirty_folio() should be similar to complete buffered overwrite > case where we do not allocate the iop at the ->write_begin() time. > Then at the writeback time we allocate an iop and mark all blocks dirty. > > In a way it is also the similar case as for mmapped writes too but my > only worry is the way mmaped writes work and it makes more > sense to keep the dirty state of folio and per-block within iop in sync. > For that matter, we can even just make sure we always allocate an iop in > the complete overwrites case as well. I didn't change that code because > it was kept that way for uptodate state as well and based on one of your > inputs for complete overwrite case. > > Though I agree that we should ideally be allocatting & marking all > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to > understand your reasoning better. Think about it at a higher level than the implementation ("do we allocate an iop or not"). If userspace dirties one page in a folio, should it dirty all pages in the folio, or just the page that was actually dirtied? I appreciate you're thinking about this from the point of view of 64kB pages on PPC and using single-page folios, but pretend we've allocated a 1MB folio, mmaped it (or a part of it?) and now userspace stores to it. How much of it do we want to write back? My argument is that this is RARE. Userspace generally does not mmap(MAP_SHARED), store to it and call msync() to do writes. Writes are almost always done using the write() syscall. Userspace gets a lot more control about when the writeback happens, and they actually get errors back from the write() syscall. If we attempt to track which pages have actually been dirtied, I worry the fs and the mm will lose track of what the other needs to know. 
eg the mm will make every page in the folio writable and then not notify the fs when subsequent pages in the folio are stored to.
On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote: > Matthew Wilcox <willy@infradead.org> writes: > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote: > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote: > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered > >> > overwrites, isn't that also true of mapped writes to folios that don't > >> > already have an iop? > >> > >> Yes. > > > > Hm, well, maybe? If somebody stores to a page, we obviously set the > > dirty flag on the folio, but depending on the architecture, we may > > or may not have independent dirty bits on the PTEs (eg if it's a PMD, > > we have one dirty bit for the entire folio; similarly if ARM uses the > > contiguous PTE bit). If we do have independent dirty bits, we could > > dirty only the blocks corresponding to a single page at a time. > > > > This has potential for causing some nasty bugs, so I'm inclined to > > rule that if a folio is mmaped, then it's all dirty from any writable > > page fault. The fact is that applications generally do not perform > > writes through mmap because the error handling story is so poor. > > > > There may be a different answer for anonymous memory, but that doesn't > > feel like my problem and shouldn't feel like any FS developer's problem. > > Although I am skeptical too to do the changes which Brian is suggesting > here. i.e. not making all the blocks of the folio dirty when we are > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes). > > However, I am sorry but I coudn't completely follow your reasoning > above. I think what Brian is suggesting here is that > filemap_dirty_folio() should be similar to complete buffered overwrite > case where we do not allocate the iop at the ->write_begin() time. > Then at the writeback time we allocate an iop and mark all blocks dirty. > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty granularity of intra-folio faults) makes sense, but I'm also not sure what it has to do with the idea of being consistent with how full folio overwrites are implemented (between buffered or mapped writes). We're not changing historical dirtying granularity either way. I think this is just a bigger picture thought for future consideration as opposed to direct feedback on this patch.. > In a way it is also the similar case as for mmapped writes too but my > only worry is the way mmaped writes work and it makes more > sense to keep the dirty state of folio and per-block within iop in sync. > For that matter, we can even just make sure we always allocate an iop in > the complete overwrites case as well. I didn't change that code because > it was kept that way for uptodate state as well and based on one of your > inputs for complete overwrite case. > Can you elaborate on your concerns, out of curiosity? Either way, IMO it also seems reasonable to drop this behavior for the basic implementation of dirty tracking (so always allocate the iop for sub-folio tracking as you suggest above) and then potentially restore it as a separate optimization patch at the end of the series. That said, I'm not totally clear why it exists in the first place, so that might warrant some investigation. Is it primarily to defer allocations out of task write/fault contexts? To optimize the case where pagecache is dirtied but truncated or something and thus never written back? 
Is there any room for further improvement where the alloc could be avoided completely for folio overwrites instead of just deferred? Was that actually the case at some point and then something later decided the iop was needed at writeback time, leading to current behavior? Brian > Though I agree that we should ideally be allocatting & marking all > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to > understand your reasoning better. > > Thanks! > -ritesh >
On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote: > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote: > > Matthew Wilcox <willy@infradead.org> writes: > > > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote: > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote: > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered > > >> > overwrites, isn't that also true of mapped writes to folios that don't > > >> > already have an iop? > > >> > > >> Yes. > > > > > > Hm, well, maybe? If somebody stores to a page, we obviously set the > > > dirty flag on the folio, but depending on the architecture, we may > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD, > > > we have one dirty bit for the entire folio; similarly if ARM uses the > > > contiguous PTE bit). If we do have independent dirty bits, we could > > > dirty only the blocks corresponding to a single page at a time. > > > > > > This has potential for causing some nasty bugs, so I'm inclined to > > > rule that if a folio is mmaped, then it's all dirty from any writable > > > page fault. The fact is that applications generally do not perform > > > writes through mmap because the error handling story is so poor. > > > > > > There may be a different answer for anonymous memory, but that doesn't > > > feel like my problem and shouldn't feel like any FS developer's problem. > > > > Although I am skeptical too to do the changes which Brian is suggesting > > here. i.e. not making all the blocks of the folio dirty when we are > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes). > > > > However, I am sorry but I coudn't completely follow your reasoning > > above. I think what Brian is suggesting here is that > > filemap_dirty_folio() should be similar to complete buffered overwrite > > case where we do not allocate the iop at the ->write_begin() time. > > Then at the writeback time we allocate an iop and mark all blocks dirty. > > > > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty > granularity of intra-folio faults) makes sense, but I'm also not sure > what it has to do with the idea of being consistent with how full folio > overwrites are implemented (between buffered or mapped writes). We're > not changing historical dirtying granularity either way. I think this is > just a bigger picture thought for future consideration as opposed to > direct feedback on this patch.. <nod> > > In a way it is also the similar case as for mmapped writes too but my > > only worry is the way mmaped writes work and it makes more > > sense to keep the dirty state of folio and per-block within iop in sync. > > For that matter, we can even just make sure we always allocate an iop in > > the complete overwrites case as well. I didn't change that code because > > it was kept that way for uptodate state as well and based on one of your > > inputs for complete overwrite case. > > > > Can you elaborate on your concerns, out of curiosity? > > Either way, IMO it also seems reasonable to drop this behavior for the > basic implementation of dirty tracking (so always allocate the iop for > sub-folio tracking as you suggest above) and then potentially restore it > as a separate optimization patch at the end of the series. Agree. > That said, I'm not totally clear why it exists in the first place, so > that might warrant some investigation. Is it primarily to defer > allocations out of task write/fault contexts? 
(Assuming by 'it' you mean the behavior where we don't unconditionally allocate iops for blocksize < foliosize...) IIRC, reducing memory usage by eliding iop allocations unless they are absolutely necessary for correctness was /my/ understanding of why we don't always allocate the iop... > To optimize the case where pagecache is dirtied but truncated or > something and thus never written back? ...because this might very well happen. Write a temporary .o file to the filesystem, then delete the whole thing before writeback ever gets its hands on the file. > Is there any room for further improvement where the alloc could be > avoided completely for folio overwrites instead of just deferred? Once writeback starts, though, we need the iop so that we can know when all the writeback for that folio is actually complete, no matter how many IOs might be in flight for that folio. I don't know how you'd get around this problem. > Was that actually the case at some point and then something later > decided the iop was needed at writeback time, leading to current > behavior? It's been in iomap since the beginning when we lifted it from xfs. --D (who is now weeks behind on reviewing things and stressed out) > Brian > > > Though I agree that we should ideally be allocating & marking all > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to > > understand your reasoning better. > > > > Thanks! > > -ritesh > > >
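A toy model of the completion counting Darrick describes (hypothetical names; the kernel keeps an atomic write_bytes_pending counter in struct iomap_page, as seen in the WARN_ON_ONCE earlier in the thread): several bios may be in flight for one folio, and the counter ensures end-of-writeback fires exactly once:

#include <stdio.h>
#include <stdatomic.h>

struct folio_wb {
	atomic_int write_bytes_pending;
};

static void submit_range(struct folio_wb *f, int len)
{
	atomic_fetch_add(&f->write_bytes_pending, len);
	printf("submitted %d bytes\n", len);
}

static void io_complete(struct folio_wb *f, int len)
{
	/* only the last completion for the folio ends writeback */
	if (atomic_fetch_sub(&f->write_bytes_pending, len) == len)
		printf("folio_end_writeback()\n");
}

int main(void)
{
	struct folio_wb f = { 0 };

	submit_range(&f, 4096);		/* two discontiguous dirty ranges ... */
	submit_range(&f, 8192);		/* ... can go out in separate bios */
	io_complete(&f, 4096);		/* 8192 bytes still pending */
	io_complete(&f, 8192);		/* counter hits zero: end writeback */
	return 0;
}

The real completion path is more involved (per-bio ranges, error handling), but the counting invariant is the part that forces the iop to exist once writeback begins.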
On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote: > On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote: > > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote: > > > Matthew Wilcox <willy@infradead.org> writes: > > > > > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote: > > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote: > > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered > > > >> > overwrites, isn't that also true of mapped writes to folios that don't > > > >> > already have an iop? > > > >> > > > >> Yes. > > > > > > > > Hm, well, maybe? If somebody stores to a page, we obviously set the > > > > dirty flag on the folio, but depending on the architecture, we may > > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD, > > > > we have one dirty bit for the entire folio; similarly if ARM uses the > > > > contiguous PTE bit). If we do have independent dirty bits, we could > > > > dirty only the blocks corresponding to a single page at a time. > > > > > > > > This has potential for causing some nasty bugs, so I'm inclined to > > > > rule that if a folio is mmaped, then it's all dirty from any writable > > > > page fault. The fact is that applications generally do not perform > > > > writes through mmap because the error handling story is so poor. > > > > > > > > There may be a different answer for anonymous memory, but that doesn't > > > > feel like my problem and shouldn't feel like any FS developer's problem. > > > > > > Although I am skeptical too to do the changes which Brian is suggesting > > > here. i.e. not making all the blocks of the folio dirty when we are > > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes). > > > > > > However, I am sorry but I coudn't completely follow your reasoning > > > above. I think what Brian is suggesting here is that > > > filemap_dirty_folio() should be similar to complete buffered overwrite > > > case where we do not allocate the iop at the ->write_begin() time. > > > Then at the writeback time we allocate an iop and mark all blocks dirty. > > > > > > > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty > > granularity of intra-folio faults) makes sense, but I'm also not sure > > what it has to do with the idea of being consistent with how full folio > > overwrites are implemented (between buffered or mapped writes). We're > > not changing historical dirtying granularity either way. I think this is > > just a bigger picture thought for future consideration as opposed to > > direct feedback on this patch.. > > <nod> > > > > In a way it is also the similar case as for mmapped writes too but my > > > only worry is the way mmaped writes work and it makes more > > > sense to keep the dirty state of folio and per-block within iop in sync. > > > For that matter, we can even just make sure we always allocate an iop in > > > the complete overwrites case as well. I didn't change that code because > > > it was kept that way for uptodate state as well and based on one of your > > > inputs for complete overwrite case. > > > > > > > Can you elaborate on your concerns, out of curiosity? > > > > Either way, IMO it also seems reasonable to drop this behavior for the > > basic implementation of dirty tracking (so always allocate the iop for > > sub-folio tracking as you suggest above) and then potentially restore it > > as a separate optimization patch at the end of the series. > > Agree. 
> > > That said, I'm not totally clear why it exists in the first place, so > > that might warrant some investigation. Is it primarily to defer > > allocations out of task write/fault contexts? > > (Assuming by 'it' you mean the behavior where we don't unconditionally > allocate iops for blocksize < foliosize...) > > IIRC the reason is to reduce memory usage by eliding iop allocations > unless it's absolutely necessary for correctness was /my/ understanding > of why we don't always allocate the iop... > > > To optimize the case where pagecache is dirtied but truncated or > > something and thus never written back? > > ...because this might very well happen. Write a temporary .o file to > the filesystem, then delete the whole thing before writeback ever gets > its hands on the file. > I don't think a simple temp write will trigger this scenario currently because the folios would have to be uptodate at the time of the write to bypass the iop alloc. I guess you'd have to read folios (even if backed by holes) first to start seeing the !iop case at writeback time (for bs != ps). That could change with these patches, but I was trying to reason about the intent of the existing code and whether there was some known reason to continue to try and defer the iop allocation as the need/complexity for deferring it grows with the addition of more (i.e. dirty) tracking. > > Is there any room for further improvement where the alloc could be > > avoided completely for folio overwrites instead of just deferred? > > Once writeback starts, though, we need the iop so that we can know when > all the writeback for that folio is actually complete, no matter how > many IOs might be in flight for that folio. I don't know how you'd get > around this problem. > Ok. I noticed some kind of counter or something being updated last time I looked through that code, so it sounds like that's the reason the iop eventually needs to exist. Thanks. > > Was that actually the case at some point and then something later > > decided the iop was needed at writeback time, leading to current > > behavior? > > It's been in iomap since the beginning when we lifted it from xfs. > Not sure exactly what you're referring to here. iomap_writepage_map() would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d ("iomap: Permit pages without an iop to enter writeback"), so I don't see how iop allocs were deferred (other than when bs == ps, obviously) prior to that. Heh, but I'm starting to get my wires crossed just trying to piece things together here. Ritesh, ISTM the (writeback && !iop && bs != ps) case is primarily a subtle side effect of the current writeback behavior being driven by uptodate status. I think it's probably wise to drop it at least initially, always alloc and dirty the appropriate iop ranges for sub-folio blocks, and then if you or others think there is value in the overwrite optimization to defer iop allocs, tack that on as a separate patch and try to be consistent between buffered and mapped writes. Darrick noted above that he also agrees with that separate patch approach. For me, I think it would also be useful to show that there is some measurable performance benefit on at least one reasonable workload to help justify it. Brian > --D (who is now weeks behind on reviewing things and stressed out) > > > Brian > > > > > Though I agree that we should ideally be allocatting & marking all > > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to > > > understand your reasoning better. 
> > > > > > Thanks! > > > -ritesh > > > > > >
Brian Foster <bfoster@redhat.com> writes: > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote: >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote: >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote: >> > > Matthew Wilcox <willy@infradead.org> writes: >> > > >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote: >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote: >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't >> > > >> > already have an iop? >> > > >> >> > > >> Yes. >> > > > >> > > > Hm, well, maybe? If somebody stores to a page, we obviously set the >> > > > dirty flag on the folio, but depending on the architecture, we may >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD, >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the >> > > > contiguous PTE bit). If we do have independent dirty bits, we could >> > > > dirty only the blocks corresponding to a single page at a time. >> > > > >> > > > This has potential for causing some nasty bugs, so I'm inclined to >> > > > rule that if a folio is mmaped, then it's all dirty from any writable >> > > > page fault. The fact is that applications generally do not perform >> > > > writes through mmap because the error handling story is so poor. >> > > > >> > > > There may be a different answer for anonymous memory, but that doesn't >> > > > feel like my problem and shouldn't feel like any FS developer's problem. >> > > >> > > Although I am skeptical too to do the changes which Brian is suggesting >> > > here. i.e. not making all the blocks of the folio dirty when we are >> > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes). >> > > >> > > However, I am sorry but I coudn't completely follow your reasoning >> > > above. I think what Brian is suggesting here is that >> > > filemap_dirty_folio() should be similar to complete buffered overwrite >> > > case where we do not allocate the iop at the ->write_begin() time. >> > > Then at the writeback time we allocate an iop and mark all blocks dirty. >> > > >> > >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty >> > granularity of intra-folio faults) makes sense, but I'm also not sure >> > what it has to do with the idea of being consistent with how full folio >> > overwrites are implemented (between buffered or mapped writes). We're >> > not changing historical dirtying granularity either way. I think this is >> > just a bigger picture thought for future consideration as opposed to >> > direct feedback on this patch.. >> >> <nod> >> >> > > In a way it is also the similar case as for mmapped writes too but my >> > > only worry is the way mmaped writes work and it makes more >> > > sense to keep the dirty state of folio and per-block within iop in sync. >> > > For that matter, we can even just make sure we always allocate an iop in >> > > the complete overwrites case as well. I didn't change that code because >> > > it was kept that way for uptodate state as well and based on one of your >> > > inputs for complete overwrite case. >> > > >> > >> > Can you elaborate on your concerns, out of curiosity? 
>> > >> > Either way, IMO it also seems reasonable to drop this behavior for the >> > basic implementation of dirty tracking (so always allocate the iop for >> > sub-folio tracking as you suggest above) and then potentially restore it >> > as a separate optimization patch at the end of the series. >> >> Agree. >> >> > That said, I'm not totally clear why it exists in the first place, so >> > that might warrant some investigation. Is it primarily to defer >> > allocations out of task write/fault contexts? >> >> (Assuming by 'it' you mean the behavior where we don't unconditionally >> allocate iops for blocksize < foliosize...) >> >> IIRC, reducing memory usage by eliding iop allocations >> unless they are absolutely necessary for correctness was /my/ understanding >> of why we don't always allocate the iop... >> >> > To optimize the case where pagecache is dirtied but truncated or >> > something and thus never written back? >> >> ...because this might very well happen. Write a temporary .o file to >> the filesystem, then delete the whole thing before writeback ever gets >> its hands on the file. >> > > I don't think a simple temp write will trigger this scenario currently > because the folios would have to be uptodate at the time of the write to > bypass the iop alloc. I guess you'd have to read folios (even if backed > by holes) first to start seeing the !iop case at writeback time (for bs > != ps). > > That could change with these patches, but I was trying to reason about > the intent of the existing code and whether there was some known reason > to continue to try and defer the iop allocation as the need/complexity > for deferring it grows with the addition of more (i.e. dirty) tracking. > Here is the 1st discussion/reasoning where the deferred iop allocation in the readpage path got discussed [1]. And here is the discussion where I first pointed out the deferred allocation in the writepage path. IMO, it slipped in with discussions that happened only on the mailing list, with nothing about it in the commit messages or comments. [2] [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/ [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/ >> > Is there any room for further improvement where the alloc could be >> > avoided completely for folio overwrites instead of just deferred? >> >> Once writeback starts, though, we need the iop so that we can know when >> all the writeback for that folio is actually complete, no matter how >> many IOs might be in flight for that folio. I don't know how you'd get >> around this problem. >> > > Ok. I noticed some kind of counter or something being updated last time > I looked through that code, so it sounds like that's the reason the iop > eventually needs to exist. Thanks. > >> > Was that actually the case at some point and then something later >> > decided the iop was needed at writeback time, leading to current >> > behavior? >> >> It's been in iomap since the beginning when we lifted it from xfs. >> > > Not sure exactly what you're referring to here.
Ritesh, ISTM the (writeback && !iop && bs != ps) > case is primarily a subtle side effect of the current writeback behavior > being driven by uptodate status. I think it's probably wise to drop it > at least initially, always alloc and dirty the appropriate iop ranges > for sub-folio blocks, and then if you or others think there is value in > the overwrite optimization to defer iop allocs, tack that on as a > separate patch and try to be consistent between buffered and mapped > writes. Based on the discussion so far, I would like to think of this as follows: We already have some sort of lazy iop allocation in the buffered I/O path (discussed above). This patch series does not change that behavior. For now I would like to keep the page mkwrite path as is, without any lazy iop allocation optimization. I am ok with picking this optimization work up as a separate series because, IIUC, Christoph has some ideas on deferring iop allocations even further [2] (from the link shared above). Does that sound good? > > Darrick noted above that he also agrees with that separate patch > approach. For me, I think it would also be useful to show that there is > some measurable performance benefit on at least one reasonable workload > to help justify it. Agree that when we work on such optimizations as a separate series, it will be worth measuring the performance benefits of that. -ritesh > > Brian > >> --D (who is now weeks behind on reviewing things and stressed out) >> >> > Brian >> > >> > > Though I agree that we should ideally be allocating & marking all >> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to >> > > understand your reasoning better. >> > > >> > > Thanks! >> > > -ritesh >> > > >> > >>
On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote: > Brian Foster <bfoster@redhat.com> writes: > > > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote: > >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote: > >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote: > >> > > Matthew Wilcox <willy@infradead.org> writes: > >> > > > >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote: > >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote: > >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered > >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't > >> > > >> > already have an iop? > >> > > >> > >> > > >> Yes. > >> > > > > >> > > > Hm, well, maybe? If somebody stores to a page, we obviously set the > >> > > > dirty flag on the folio, but depending on the architecture, we may > >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD, > >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the > >> > > > contiguous PTE bit). If we do have independent dirty bits, we could > >> > > > dirty only the blocks corresponding to a single page at a time. > >> > > > > >> > > > This has potential for causing some nasty bugs, so I'm inclined to > >> > > > rule that if a folio is mmaped, then it's all dirty from any writable > >> > > > page fault. The fact is that applications generally do not perform > >> > > > writes through mmap because the error handling story is so poor. > >> > > > > >> > > > There may be a different answer for anonymous memory, but that doesn't > >> > > > feel like my problem and shouldn't feel like any FS developer's problem. > >> > > > >> > > Although I am skeptical too to do the changes which Brian is suggesting > >> > > here. i.e. not making all the blocks of the folio dirty when we are > >> > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes). > >> > > > >> > > However, I am sorry but I coudn't completely follow your reasoning > >> > > above. I think what Brian is suggesting here is that > >> > > filemap_dirty_folio() should be similar to complete buffered overwrite > >> > > case where we do not allocate the iop at the ->write_begin() time. > >> > > Then at the writeback time we allocate an iop and mark all blocks dirty. > >> > > > >> > > >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty > >> > granularity of intra-folio faults) makes sense, but I'm also not sure > >> > what it has to do with the idea of being consistent with how full folio > >> > overwrites are implemented (between buffered or mapped writes). We're > >> > not changing historical dirtying granularity either way. I think this is > >> > just a bigger picture thought for future consideration as opposed to > >> > direct feedback on this patch.. > >> > >> <nod> > >> > >> > > In a way it is also the similar case as for mmapped writes too but my > >> > > only worry is the way mmaped writes work and it makes more > >> > > sense to keep the dirty state of folio and per-block within iop in sync. > >> > > For that matter, we can even just make sure we always allocate an iop in > >> > > the complete overwrites case as well. I didn't change that code because > >> > > it was kept that way for uptodate state as well and based on one of your > >> > > inputs for complete overwrite case. > >> > > > >> > > >> > Can you elaborate on your concerns, out of curiosity? 
> >> > > >> > Either way, IMO it also seems reasonable to drop this behavior for the > >> > basic implementation of dirty tracking (so always allocate the iop for > >> > sub-folio tracking as you suggest above) and then potentially restore it > >> > as a separate optimization patch at the end of the series. > >> > >> Agree. > >> > >> > That said, I'm not totally clear why it exists in the first place, so > >> > that might warrant some investigation. Is it primarily to defer > >> > allocations out of task write/fault contexts? > >> > >> (Assuming by 'it' you mean the behavior where we don't unconditionally > >> allocate iops for blocksize < foliosize...) > >> > >> IIRC the reason is to reduce memory usage by eliding iop allocations > >> unless it's absolutely necessary for correctness was /my/ understanding > >> of why we don't always allocate the iop... > >> > >> > To optimize the case where pagecache is dirtied but truncated or > >> > something and thus never written back? > >> > >> ...because this might very well happen. Write a temporary .o file to > >> the filesystem, then delete the whole thing before writeback ever gets > >> its hands on the file. > >> > > > > I don't think a simple temp write will trigger this scenario currently > > because the folios would have to be uptodate at the time of the write to > > bypass the iop alloc. I guess you'd have to read folios (even if backed > > by holes) first to start seeing the !iop case at writeback time (for bs > > != ps). > > > > That could change with these patches, but I was trying to reason about > > the intent of the existing code and whether there was some known reason > > to continue to try and defer the iop allocation as the need/complexity > > for deferring it grows with the addition of more (i.e. dirty) tracking. > > > > Here is the 1st discussion/reasoning where the deferred iop allocation > in the readpage path got discussed [1]. > And here is the discussion when I first pointed out the deferred > allocation in writepage path. IMO, it got slipped in with the > discussions maybe only on mailing list but nothing in the commit > messages or comments.[2] > > [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/ > [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/ > > >> > Is there any room for further improvement where the alloc could be > >> > avoided completely for folio overwrites instead of just deferred? > >> > >> Once writeback starts, though, we need the iop so that we can know when > >> all the writeback for that folio is actually complete, no matter how > >> many IOs might be in flight for that folio. I don't know how you'd get > >> around this problem. > >> > > > > Ok. I noticed some kind of counter or something being updated last time > > I looked through that code, so it sounds like that's the reason the iop > > eventually needs to exist. Thanks. > > > >> > Was that actually the case at some point and then something later > >> > decided the iop was needed at writeback time, leading to current > >> > behavior? > >> > >> It's been in iomap since the beginning when we lifted it from xfs. > >> > > > > Not sure exactly what you're referring to here. iomap_writepage_map() > > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d > > ("iomap: Permit pages without an iop to enter writeback"), so I don't > > see how iop allocs were deferred (other than when bs == ps, obviously) > > prior to that. 
> > > > Heh, but I'm starting to get my wires crossed just trying to piece > > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps) > > case is primarily a subtle side effect of the current writeback behavior > > being driven by uptodate status. I think it's probably wise to drop it > > at least initially, always alloc and dirty the appropriate iop ranges > > for sub-folio blocks, and then if you or others think there is value in > > the overwrite optimization to defer iop allocs, tack that on as a > > separate patch and try to be consistent between buffered and mapped > > writes. > > Based on the discussion so far, I would like to think of this as follow: > We already have some sort of lazy iop allocation in the buffered I/O > path (discussed above). This patch series does not changes that > behavior. For now I would like to keep the page mkwrite page as is > without any lazy iop allocation optimization. > I am ok to pick this optimization work as a seperate series > because, IIUC, Christoph has some ideas on deferring iop allocations > even further [2] (from link shared above). > > Does that sound good? > Could you do that in two steps where the buffered I/O path variant is replaced by explicit dirty tracking in the initial patch, and then is restored by a subsequent patch in the same series? That would allow keeping it around and documenting it explicitly in the commit log for the separate patch, but IMO makes this a bit easier to review (and potentially debug/bisect if needed down the road). But I don't insist if that's too troublesome for some reason... Brian > > > > Darrick noted above that he also agrees with that separate patch > > approach. For me, I think it would also be useful to show that there is > > some measurable performance benefit on at least one reasonable workload > > to help justify it. > > Agree that when we work on such optimizations as a seperate series, it > will be worth measuring the performance benefits of that. > > > -ritesh > > > > > Brian > > > >> --D (who is now weeks behind on reviewing things and stressed out) > >> > >> > Brian > >> > > >> > > Though I agree that we should ideally be allocatting & marking all > >> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to > >> > > understand your reasoning better. > >> > > > >> > > Thanks! > >> > > -ritesh > >> > > > >> > > >> >
Brian Foster <bfoster@redhat.com> writes: > On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote: >> Brian Foster <bfoster@redhat.com> writes: >> >> > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote: >> >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote: >> >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote: >> >> > > Matthew Wilcox <willy@infradead.org> writes: >> >> > > >> >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote: >> >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote: >> >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered >> >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't >> >> > > >> > already have an iop? >> >> > > >> >> >> > > >> Yes. >> >> > > > >> >> > > > Hm, well, maybe? If somebody stores to a page, we obviously set the >> >> > > > dirty flag on the folio, but depending on the architecture, we may >> >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD, >> >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the >> >> > > > contiguous PTE bit). If we do have independent dirty bits, we could >> >> > > > dirty only the blocks corresponding to a single page at a time. >> >> > > > >> >> > > > This has potential for causing some nasty bugs, so I'm inclined to >> >> > > > rule that if a folio is mmaped, then it's all dirty from any writable >> >> > > > page fault. The fact is that applications generally do not perform >> >> > > > writes through mmap because the error handling story is so poor. >> >> > > > >> >> > > > There may be a different answer for anonymous memory, but that doesn't >> >> > > > feel like my problem and shouldn't feel like any FS developer's problem. >> >> > > >> >> > > Although I am skeptical too to do the changes which Brian is suggesting >> >> > > here. i.e. not making all the blocks of the folio dirty when we are >> >> > > going to call ->dirty_folio -> filemap_dirty_folio() (mmaped writes). >> >> > > >> >> > > However, I am sorry but I coudn't completely follow your reasoning >> >> > > above. I think what Brian is suggesting here is that >> >> > > filemap_dirty_folio() should be similar to complete buffered overwrite >> >> > > case where we do not allocate the iop at the ->write_begin() time. >> >> > > Then at the writeback time we allocate an iop and mark all blocks dirty. >> >> > > >> >> > >> >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty >> >> > granularity of intra-folio faults) makes sense, but I'm also not sure >> >> > what it has to do with the idea of being consistent with how full folio >> >> > overwrites are implemented (between buffered or mapped writes). We're >> >> > not changing historical dirtying granularity either way. I think this is >> >> > just a bigger picture thought for future consideration as opposed to >> >> > direct feedback on this patch.. >> >> >> >> <nod> >> >> >> >> > > In a way it is also the similar case as for mmapped writes too but my >> >> > > only worry is the way mmaped writes work and it makes more >> >> > > sense to keep the dirty state of folio and per-block within iop in sync. >> >> > > For that matter, we can even just make sure we always allocate an iop in >> >> > > the complete overwrites case as well. I didn't change that code because >> >> > > it was kept that way for uptodate state as well and based on one of your >> >> > > inputs for complete overwrite case. 
>> >> > > >> >> > >> >> > Can you elaborate on your concerns, out of curiosity? >> >> > >> >> > Either way, IMO it also seems reasonable to drop this behavior for the >> >> > basic implementation of dirty tracking (so always allocate the iop for >> >> > sub-folio tracking as you suggest above) and then potentially restore it >> >> > as a separate optimization patch at the end of the series. >> >> >> >> Agree. >> >> >> >> > That said, I'm not totally clear why it exists in the first place, so >> >> > that might warrant some investigation. Is it primarily to defer >> >> > allocations out of task write/fault contexts? >> >> >> >> (Assuming by 'it' you mean the behavior where we don't unconditionally >> >> allocate iops for blocksize < foliosize...) >> >> >> >> IIRC the reason is to reduce memory usage by eliding iop allocations >> >> unless it's absolutely necessary for correctness was /my/ understanding >> >> of why we don't always allocate the iop... >> >> >> >> > To optimize the case where pagecache is dirtied but truncated or >> >> > something and thus never written back? >> >> >> >> ...because this might very well happen. Write a temporary .o file to >> >> the filesystem, then delete the whole thing before writeback ever gets >> >> its hands on the file. >> >> >> > >> > I don't think a simple temp write will trigger this scenario currently >> > because the folios would have to be uptodate at the time of the write to >> > bypass the iop alloc. I guess you'd have to read folios (even if backed >> > by holes) first to start seeing the !iop case at writeback time (for bs >> > != ps). >> > >> > That could change with these patches, but I was trying to reason about >> > the intent of the existing code and whether there was some known reason >> > to continue to try and defer the iop allocation as the need/complexity >> > for deferring it grows with the addition of more (i.e. dirty) tracking. >> > >> >> Here is the 1st discussion/reasoning where the deferred iop allocation >> in the readpage path got discussed [1]. >> And here is the discussion when I first pointed out the deferred >> allocation in writepage path. IMO, it got slipped in with the >> discussions maybe only on mailing list but nothing in the commit >> messages or comments.[2] >> >> [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/ >> [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/ >> >> >> > Is there any room for further improvement where the alloc could be >> >> > avoided completely for folio overwrites instead of just deferred? >> >> >> >> Once writeback starts, though, we need the iop so that we can know when >> >> all the writeback for that folio is actually complete, no matter how >> >> many IOs might be in flight for that folio. I don't know how you'd get >> >> around this problem. >> >> >> > >> > Ok. I noticed some kind of counter or something being updated last time >> > I looked through that code, so it sounds like that's the reason the iop >> > eventually needs to exist. Thanks. >> > >> >> > Was that actually the case at some point and then something later >> >> > decided the iop was needed at writeback time, leading to current >> >> > behavior? >> >> >> >> It's been in iomap since the beginning when we lifted it from xfs. >> >> >> > >> > Not sure exactly what you're referring to here. 
iomap_writepage_map() >> > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d >> > ("iomap: Permit pages without an iop to enter writeback"), so I don't >> > see how iop allocs were deferred (other than when bs == ps, obviously) >> > prior to that. >> > >> > Heh, but I'm starting to get my wires crossed just trying to piece >> > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps) >> > case is primarily a subtle side effect of the current writeback behavior >> > being driven by uptodate status. I think it's probably wise to drop it >> > at least initially, always alloc and dirty the appropriate iop ranges >> > for sub-folio blocks, and then if you or others think there is value in >> > the overwrite optimization to defer iop allocs, tack that on as a >> > separate patch and try to be consistent between buffered and mapped >> > writes. >> >> Based on the discussion so far, I would like to think of this as follows: >> We already have some sort of lazy iop allocation in the buffered I/O >> path (discussed above). This patch series does not change that >> behavior. For now I would like to keep the page mkwrite path as is >> without any lazy iop allocation optimization. >> I am ok with picking this optimization work up as a separate series >> because, IIUC, Christoph has some ideas on deferring iop allocations >> even further [2] (from the link shared above). >> >> Does that sound good? >> > > Could you do that in two steps where the buffered I/O path variant is > replaced by explicit dirty tracking in the initial patch, and then is > restored by a subsequent patch in the same series? That would allow Sorry, I couldn't follow it. Can you please elaborate? So, what I was suggesting was: for the buffered I/O path we should continue to have the lazy iop allocation scheme wherever possible. The rest of the optimizations for further deferring the iop allocation, including in the ->mkwrite path, should be dealt with separately in a later patch series. Also, we already have a separate patch in this series which defers the iop allocation if the write completely overwrites the folio [1]. Earlier the behavior was that it skipped the iop entirely if the folio was uptodate, but since we require it for per-block dirty tracking, we defer iop allocation only if it is a complete overwrite of the folio. [1]: https://lore.kernel.org/linux-xfs/ZGzRX9YVkAYJGLqV@bfoster/T/#m048d0a097f7abb09da1c12c9a02afbcc3b9d39ee -ritesh > keeping it around and documenting it explicitly in the commit log for > the separate patch, but IMO makes this a bit easier to review (and > potentially debug/bisect if needed down the road). > > But I don't insist if that's too troublesome for some reason... > > Brian > >> > >> > Darrick noted above that he also agrees with that separate patch >> > approach. For me, I think it would also be useful to show that there is >> > some measurable performance benefit on at least one reasonable workload >> > to help justify it. >> >> Agree that when we work on such optimizations as a separate series, it >> will be worth measuring the performance benefits of that. >> >> >> -ritesh >> >> > >> > Brian >> > >> >> --D (who is now weeks behind on reviewing things and stressed out) >> >> >> >> > Brian >> >> > >> >> > > Though I agree that we should ideally be allocating & marking all >> >> > > blocks in iop as dirty in the call to ->dirty_folio(), I just wanted to >> >> > > understand your reasoning better. >> >> > > >> >> > > Thanks! >> >> > > -ritesh >> >> > > >> >> > >> >> >>
On Tue, May 23, 2023 at 08:32:42PM +0530, Ritesh Harjani wrote:
> Brian Foster <bfoster@redhat.com> writes:
>
> > On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote:
> >> Brian Foster <bfoster@redhat.com> writes:
> >>
> >> > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
> >> >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
> >> >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
> >> >> > > Matthew Wilcox <willy@infradead.org> writes:
> >> >> > >
> >> >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
> >> >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
> >> >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
> >> >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
> >> >> > > >> > already have an iop?
> >> >> > > >>
> >> >> > > >> Yes.
> >> >> > > >
> >> >> > > > Hm, well, maybe? If somebody stores to a page, we obviously set the
> >> >> > > > dirty flag on the folio, but depending on the architecture, we may
> >> >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
> >> >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
> >> >> > > > contiguous PTE bit). If we do have independent dirty bits, we could
> >> >> > > > dirty only the blocks corresponding to a single page at a time.
> >> >> > > >
> >> >> > > > This has potential for causing some nasty bugs, so I'm inclined to
> >> >> > > > rule that if a folio is mmaped, then it's all dirty from any writable
> >> >> > > > page fault. The fact is that applications generally do not perform
> >> >> > > > writes through mmap because the error handling story is so poor.
> >> >> > > >
> >> >> > > > There may be a different answer for anonymous memory, but that doesn't
> >> >> > > > feel like my problem and shouldn't feel like any FS developer's problem.
> >> >> > >
> >> >> > > Although I am skeptical too about doing the changes which Brian is
> >> >> > > suggesting here, i.e. not making all the blocks of the folio dirty when
> >> >> > > we are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped writes).
> >> >> > >
> >> >> > > However, I am sorry but I couldn't completely follow your reasoning
> >> >> > > above. I think what Brian is suggesting here is that
> >> >> > > filemap_dirty_folio() should be similar to the complete buffered overwrite
> >> >> > > case where we do not allocate the iop at ->write_begin() time.
> >> >> > > Then at writeback time we allocate an iop and mark all blocks dirty.
> >> >> > >
> >> >> >
> >> >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
> >> >> > granularity of intra-folio faults) makes sense, but I'm also not sure
> >> >> > what it has to do with the idea of being consistent with how full folio
> >> >> > overwrites are implemented (between buffered or mapped writes). We're
> >> >> > not changing historical dirtying granularity either way. I think this is
> >> >> > just a bigger picture thought for future consideration as opposed to
> >> >> > direct feedback on this patch..
> >> >>
> >> >> <nod>
> >> >>
> >> >> > > In a way it is also a similar case as for mmapped writes too, but my
> >> >> > > only worry is the way mmapped writes work, and it makes more
> >> >> > > sense to keep the dirty state of the folio and per-block state within
> >> >> > > the iop in sync.
> >> >> > > For that matter, we can even just make sure we always allocate an iop in
> >> >> > > the complete overwrites case as well. I didn't change that code because
> >> >> > > it was kept that way for uptodate state as well, and based on one of your
> >> >> > > inputs for the complete overwrite case.
> >> >> > >
> >> >> >
> >> >> > Can you elaborate on your concerns, out of curiosity?
> >> >> >
> >> >> > Either way, IMO it also seems reasonable to drop this behavior for the
> >> >> > basic implementation of dirty tracking (so always allocate the iop for
> >> >> > sub-folio tracking as you suggest above) and then potentially restore it
> >> >> > as a separate optimization patch at the end of the series.
> >> >>
> >> >> Agree.
> >> >>
> >> >> > That said, I'm not totally clear why it exists in the first place, so
> >> >> > that might warrant some investigation. Is it primarily to defer
> >> >> > allocations out of task write/fault contexts?
> >> >>
> >> >> (Assuming by 'it' you mean the behavior where we don't unconditionally
> >> >> allocate iops for blocksize < foliosize...)
> >> >>
> >> >> IIRC, reducing memory usage by eliding iop allocations unless absolutely
> >> >> necessary for correctness was /my/ understanding of why we don't always
> >> >> allocate the iop...
> >> >>
> >> >> > To optimize the case where pagecache is dirtied but truncated or
> >> >> > something and thus never written back?
> >> >>
> >> >> ...because this might very well happen. Write a temporary .o file to
> >> >> the filesystem, then delete the whole thing before writeback ever gets
> >> >> its hands on the file.
> >> >>
> >> >
> >> > I don't think a simple temp write will trigger this scenario currently
> >> > because the folios would have to be uptodate at the time of the write to
> >> > bypass the iop alloc. I guess you'd have to read folios (even if backed
> >> > by holes) first to start seeing the !iop case at writeback time (for bs
> >> > != ps).
> >> >
> >> > That could change with these patches, but I was trying to reason about
> >> > the intent of the existing code and whether there was some known reason
> >> > to continue to try and defer the iop allocation as the need/complexity
> >> > for deferring it grows with the addition of more (i.e. dirty) tracking.
> >> >
> >>
> >> Here is the 1st discussion/reasoning where the deferred iop allocation
> >> in the readpage path got discussed [1].
> >> And here is the discussion when I first pointed out the deferred
> >> allocation in the writepage path. IMO, it slipped in via the
> >> discussions, maybe only on the mailing list, with nothing in the commit
> >> messages or comments. [2]
> >>
> >> [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
> >> [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/
> >>
> >> >> > Is there any room for further improvement where the alloc could be
> >> >> > avoided completely for folio overwrites instead of just deferred?
> >> >>
> >> >> Once writeback starts, though, we need the iop so that we can know when
> >> >> all the writeback for that folio is actually complete, no matter how
> >> >> many IOs might be in flight for that folio. I don't know how you'd get
> >> >> around this problem.
> >> >>
> >> >
> >> > Ok. I noticed some kind of counter or something being updated last time
> >> > I looked through that code, so it sounds like that's the reason the iop
> >> > eventually needs to exist. Thanks.
> >> >
> >> >> > Was that actually the case at some point and then something later
> >> >> > decided the iop was needed at writeback time, leading to current
> >> >> > behavior?
> >> >>
> >> >> It's been in iomap since the beginning when we lifted it from xfs.
> >> >>
> >> >
> >> > Not sure exactly what you're referring to here. iomap_writepage_map()
> >> > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
> >> > ("iomap: Permit pages without an iop to enter writeback"), so I don't
> >> > see how iop allocs were deferred (other than when bs == ps, obviously)
> >> > prior to that.
> >> >
> >> > Heh, but I'm starting to get my wires crossed just trying to piece
> >> > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
> >> > case is primarily a subtle side effect of the current writeback behavior
> >> > being driven by uptodate status. I think it's probably wise to drop it
> >> > at least initially, always alloc and dirty the appropriate iop ranges
> >> > for sub-folio blocks, and then if you or others think there is value in
> >> > the overwrite optimization to defer iop allocs, tack that on as a
> >> > separate patch and try to be consistent between buffered and mapped
> >> > writes.
> >>
> >> Based on the discussion so far, I would like to think of this as follows:
> >> We already have some sort of lazy iop allocation in the buffered I/O
> >> path (discussed above). This patch series does not change that
> >> behavior. For now I would like to keep the page mkwrite path as is,
> >> without any lazy iop allocation optimization.
> >> I am ok to pick this optimization work as a separate series
> >> because, IIUC, Christoph has some ideas on deferring iop allocations
> >> even further [2] (from the link shared above).
> >>
> >> Does that sound good?
> >>
> >
> > Could you do that in two steps where the buffered I/O path variant is
> > replaced by explicit dirty tracking in the initial patch, and then is
> > restored by a subsequent patch in the same series? That would allow
>
> Sorry, I couldn't follow it. Can you please elaborate?
>

Sorry for the confusion...

> So, what I was suggesting was - for the buffered I/O path we should continue
> to have the lazy iop allocation scheme wherever possible. The rest of the
> optimizations for further deferring the iop allocation, including in the
> ->mkwrite path, should be dealt with separately in a later patch series.
>

Yup, agree.

> Also, we already have a separate patch in this series which defers the
> iop allocation if the write completely overwrites the folio [1].
> Earlier the behavior was that it skipped the allocation entirely if the
> folio was uptodate, but since we require the iop for per-block dirty
> tracking, we now defer its allocation only if the write is a complete
> overwrite of the folio.
>

That is a preparatory patch before iop dirty tracking is enabled in patch
5, right? I was mainly just suggesting to make the overwrite checking
part of this patch come after dirty tracking is enabled (as a small
optimization patch) rather than before.

I don't want to harp on it if that's difficult or still doesn't make
sense for some reason. I'll take a closer look the next go around when I
have a bit more time and just send a diff if it seems it can be done
cleanly...

Brian

> [1]: https://lore.kernel.org/linux-xfs/ZGzRX9YVkAYJGLqV@bfoster/T/#m048d0a097f7abb09da1c12c9a02afbcc3b9d39ee
>
>
> -ritesh
>
> > keeping it around and documenting it explicitly in the commit log for
> > the separate patch, but IMO makes this a bit easier to review (and
> > potentially debug/bisect if needed down the road).
> >
> > But I don't insist if that's too troublesome for some reason...
> >
> > Brian
> >
> >> >
> >> > Darrick noted above that he also agrees with that separate patch
> >> > approach. For me, I think it would also be useful to show that there is
> >> > some measurable performance benefit on at least one reasonable workload
> >> > to help justify it.
> >>
> >> Agree that when we work on such optimizations as a separate series, it
> >> will be worth measuring the performance benefits of that.
> >>
> >>
> >> -ritesh
> >>
> >> >
> >> > Brian
> >> >
> >> >> --D (who is now weeks behind on reviewing things and stressed out)
> >> >>
> >> >> > Brian
> >> >> >
> >> >> > > Though I agree that we should ideally be allocating & marking all
> >> >> > > blocks in the iop as dirty in the call to ->dirty_folio(), I just wanted to
> >> >> > > understand your reasoning better.
> >> >> > >
> >> >> > > Thanks!
> >> >> > > -ritesh
> >> >> > >
> >> >> >
> >> >>
> >>
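For context, the "counter" mentioned above is iop->write_bytes_pending. The
writeback completion path in fs/iomap/buffered-io.c of that era looked roughly
like the following (paraphrased here for illustration, not part of the patch
under review), which is why the iop has to exist once writeback starts:

	static void iomap_finish_folio_write(struct inode *inode,
			struct folio *folio, size_t len, int error)
	{
		struct iomap_page *iop = to_iomap_page(folio);

		if (error) {
			folio_set_error(folio);
			mapping_set_error(inode->i_mapping, error);
		}

		/* Sub-folio writeback needs an iop to count in-flight bytes. */
		WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iop);
		WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);

		/*
		 * Each bio completion subtracts its byte count; only when
		 * every IO against this folio has completed is the
		 * folio-level writeback flag cleared.
		 */
		if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
			folio_end_writeback(folio);
	}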
Brian Foster <bfoster@redhat.com> writes:

> On Tue, May 23, 2023 at 08:32:42PM +0530, Ritesh Harjani wrote:
>> Brian Foster <bfoster@redhat.com> writes:
>>
>> > On Tue, May 23, 2023 at 07:13:04PM +0530, Ritesh Harjani wrote:
>> >> Brian Foster <bfoster@redhat.com> writes:
>> >>
>> >> > On Mon, May 22, 2023 at 05:56:25PM -0700, Darrick J. Wong wrote:
>> >> >> On Mon, May 22, 2023 at 07:18:07AM -0400, Brian Foster wrote:
>> >> >> > On Mon, May 22, 2023 at 10:03:05AM +0530, Ritesh Harjani wrote:
>> >> >> > > Matthew Wilcox <willy@infradead.org> writes:
>> >> >> > >
>> >> >> > > > On Thu, May 18, 2023 at 06:23:44AM -0700, Christoph Hellwig wrote:
>> >> >> > > >> On Wed, May 17, 2023 at 02:48:12PM -0400, Brian Foster wrote:
>> >> >> > > >> > But I also wonder.. if we can skip the iop alloc on full folio buffered
>> >> >> > > >> > overwrites, isn't that also true of mapped writes to folios that don't
>> >> >> > > >> > already have an iop?
>> >> >> > > >>
>> >> >> > > >> Yes.
>> >> >> > > >
>> >> >> > > > Hm, well, maybe? If somebody stores to a page, we obviously set the
>> >> >> > > > dirty flag on the folio, but depending on the architecture, we may
>> >> >> > > > or may not have independent dirty bits on the PTEs (eg if it's a PMD,
>> >> >> > > > we have one dirty bit for the entire folio; similarly if ARM uses the
>> >> >> > > > contiguous PTE bit). If we do have independent dirty bits, we could
>> >> >> > > > dirty only the blocks corresponding to a single page at a time.
>> >> >> > > >
>> >> >> > > > This has potential for causing some nasty bugs, so I'm inclined to
>> >> >> > > > rule that if a folio is mmaped, then it's all dirty from any writable
>> >> >> > > > page fault. The fact is that applications generally do not perform
>> >> >> > > > writes through mmap because the error handling story is so poor.
>> >> >> > > >
>> >> >> > > > There may be a different answer for anonymous memory, but that doesn't
>> >> >> > > > feel like my problem and shouldn't feel like any FS developer's problem.
>> >> >> > >
>> >> >> > > Although I am skeptical too about doing the changes which Brian is
>> >> >> > > suggesting here, i.e. not making all the blocks of the folio dirty when
>> >> >> > > we are going to call ->dirty_folio -> filemap_dirty_folio() (mmapped writes).
>> >> >> > >
>> >> >> > > However, I am sorry but I couldn't completely follow your reasoning
>> >> >> > > above. I think what Brian is suggesting here is that
>> >> >> > > filemap_dirty_folio() should be similar to the complete buffered overwrite
>> >> >> > > case where we do not allocate the iop at ->write_begin() time.
>> >> >> > > Then at writeback time we allocate an iop and mark all blocks dirty.
>> >> >> > >
>> >> >> >
>> >> >> > Yeah... I think what Willy is saying (i.e. to not track sub-page dirty
>> >> >> > granularity of intra-folio faults) makes sense, but I'm also not sure
>> >> >> > what it has to do with the idea of being consistent with how full folio
>> >> >> > overwrites are implemented (between buffered or mapped writes). We're
>> >> >> > not changing historical dirtying granularity either way. I think this is
>> >> >> > just a bigger picture thought for future consideration as opposed to
>> >> >> > direct feedback on this patch..
>> >> >>
>> >> >> <nod>
>> >> >>
>> >> >> > > In a way it is also a similar case as for mmapped writes too, but my
>> >> >> > > only worry is the way mmapped writes work, and it makes more
>> >> >> > > sense to keep the dirty state of the folio and per-block state within
>> >> >> > > the iop in sync.
>> >> >> > > For that matter, we can even just make sure we always allocate an iop in
>> >> >> > > the complete overwrites case as well. I didn't change that code because
>> >> >> > > it was kept that way for uptodate state as well, and based on one of your
>> >> >> > > inputs for the complete overwrite case.
>> >> >> > >
>> >> >> >
>> >> >> > Can you elaborate on your concerns, out of curiosity?
>> >> >> >
>> >> >> > Either way, IMO it also seems reasonable to drop this behavior for the
>> >> >> > basic implementation of dirty tracking (so always allocate the iop for
>> >> >> > sub-folio tracking as you suggest above) and then potentially restore it
>> >> >> > as a separate optimization patch at the end of the series.
>> >> >>
>> >> >> Agree.
>> >> >>
>> >> >> > That said, I'm not totally clear why it exists in the first place, so
>> >> >> > that might warrant some investigation. Is it primarily to defer
>> >> >> > allocations out of task write/fault contexts?
>> >> >>
>> >> >> (Assuming by 'it' you mean the behavior where we don't unconditionally
>> >> >> allocate iops for blocksize < foliosize...)
>> >> >>
>> >> >> IIRC, reducing memory usage by eliding iop allocations unless absolutely
>> >> >> necessary for correctness was /my/ understanding of why we don't always
>> >> >> allocate the iop...
>> >> >>
>> >> >> > To optimize the case where pagecache is dirtied but truncated or
>> >> >> > something and thus never written back?
>> >> >>
>> >> >> ...because this might very well happen. Write a temporary .o file to
>> >> >> the filesystem, then delete the whole thing before writeback ever gets
>> >> >> its hands on the file.
>> >> >>
>> >> >
>> >> > I don't think a simple temp write will trigger this scenario currently
>> >> > because the folios would have to be uptodate at the time of the write to
>> >> > bypass the iop alloc. I guess you'd have to read folios (even if backed
>> >> > by holes) first to start seeing the !iop case at writeback time (for bs
>> >> > != ps).
>> >> >
>> >> > That could change with these patches, but I was trying to reason about
>> >> > the intent of the existing code and whether there was some known reason
>> >> > to continue to try and defer the iop allocation as the need/complexity
>> >> > for deferring it grows with the addition of more (i.e. dirty) tracking.
>> >> >
>> >>
>> >> Here is the 1st discussion/reasoning where the deferred iop allocation
>> >> in the readpage path got discussed [1].
>> >> And here is the discussion when I first pointed out the deferred
>> >> allocation in the writepage path. IMO, it slipped in via the
>> >> discussions, maybe only on the mailing list, with nothing in the commit
>> >> messages or comments. [2]
>> >>
>> >> [1]: https://lore.kernel.org/linux-xfs/20210628172727.1894503-1-agruenba@redhat.com/
>> >> [2]: https://lore.kernel.org/linux-xfs/20230130202150.pfohy5yg6dtu64ce@rh-tp/
>> >>
>> >> >> > Is there any room for further improvement where the alloc could be
>> >> >> > avoided completely for folio overwrites instead of just deferred?
>> >> >>
>> >> >> Once writeback starts, though, we need the iop so that we can know when
>> >> >> all the writeback for that folio is actually complete, no matter how
>> >> >> many IOs might be in flight for that folio. I don't know how you'd get
>> >> >> around this problem.
>> >> >>
>> >> >
>> >> > Ok. I noticed some kind of counter or something being updated last time
>> >> > I looked through that code, so it sounds like that's the reason the iop
>> >> > eventually needs to exist. Thanks.
>> >> >
>> >> >> > Was that actually the case at some point and then something later
>> >> >> > decided the iop was needed at writeback time, leading to current
>> >> >> > behavior?
>> >> >>
>> >> >> It's been in iomap since the beginning when we lifted it from xfs.
>> >> >>
>> >> >
>> >> > Not sure exactly what you're referring to here. iomap_writepage_map()
>> >> > would warn on the (bs != ps && !iop) case up until commit 8e1bcef8e18d
>> >> > ("iomap: Permit pages without an iop to enter writeback"), so I don't
>> >> > see how iop allocs were deferred (other than when bs == ps, obviously)
>> >> > prior to that.
>> >> >
>> >> > Heh, but I'm starting to get my wires crossed just trying to piece
>> >> > things together here. Ritesh, ISTM the (writeback && !iop && bs != ps)
>> >> > case is primarily a subtle side effect of the current writeback behavior
>> >> > being driven by uptodate status. I think it's probably wise to drop it
>> >> > at least initially, always alloc and dirty the appropriate iop ranges
>> >> > for sub-folio blocks, and then if you or others think there is value in
>> >> > the overwrite optimization to defer iop allocs, tack that on as a
>> >> > separate patch and try to be consistent between buffered and mapped
>> >> > writes.
>> >>
>> >> Based on the discussion so far, I would like to think of this as follows:
>> >> We already have some sort of lazy iop allocation in the buffered I/O
>> >> path (discussed above). This patch series does not change that
>> >> behavior. For now I would like to keep the page mkwrite path as is,
>> >> without any lazy iop allocation optimization.
>> >> I am ok to pick this optimization work as a separate series
>> >> because, IIUC, Christoph has some ideas on deferring iop allocations
>> >> even further [2] (from the link shared above).
>> >>
>> >> Does that sound good?
>> >>
>> >
>> > Could you do that in two steps where the buffered I/O path variant is
>> > replaced by explicit dirty tracking in the initial patch, and then is
>> > restored by a subsequent patch in the same series? That would allow
>>
>> Sorry, I couldn't follow it. Can you please elaborate?
>>
>
> Sorry for the confusion...
>
>> So, what I was suggesting was - for the buffered I/O path we should continue
>> to have the lazy iop allocation scheme wherever possible. The rest of the
>> optimizations for further deferring the iop allocation, including in the
>> ->mkwrite path, should be dealt with separately in a later patch series.
>>
>
> Yup, agree.
>
>> Also, we already have a separate patch in this series which defers the
>> iop allocation if the write completely overwrites the folio [1].
>> Earlier the behavior was that it skipped the allocation entirely if the
>> folio was uptodate, but since we require the iop for per-block dirty
>> tracking, we now defer its allocation only if the write is a complete
>> overwrite of the folio.
>>
>
> That is a preparatory patch before iop dirty tracking is enabled in patch
> 5, right?

Yes, right.

> I was mainly just suggesting to make the overwrite checking
> part of this patch come after dirty tracking is enabled (as a small
> optimization patch) rather than before.
>
> I don't want to harp on it if that's difficult or still doesn't make
> sense for some reason. I'll take a closer look the next go around when I
> have a bit more time and just send a diff if it seems it can be done
> cleanly...

It should not be difficult. Will make the change in the next rev.
Thanks for the help. Seems a lot of things have been sorted out now :)

I will be working on the review comments, finishing off a few of the
remaining todos, and getting more testing done, and will then spin off the
next revision.

-ritesh

>
> Brian
>
>> [1]: https://lore.kernel.org/linux-xfs/ZGzRX9YVkAYJGLqV@bfoster/T/#m048d0a097f7abb09da1c12c9a02afbcc3b9d39ee
>>
>>
>> -ritesh
>>
>> > keeping it around and documenting it explicitly in the commit log for
>> > the separate patch, but IMO makes this a bit easier to review (and
>> > potentially debug/bisect if needed down the road).
>> >
>> > But I don't insist if that's too troublesome for some reason...
>> >
>> > Brian
>> >
>> >> >
>> >> > Darrick noted above that he also agrees with that separate patch
>> >> > approach. For me, I think it would also be useful to show that there is
>> >> > some measurable performance benefit on at least one reasonable workload
>> >> > to help justify it.
>> >>
>> >> Agree that when we work on such optimizations as a separate series, it
>> >> will be worth measuring the performance benefits of that.
>> >>
>> >>
>> >> -ritesh
>> >>
>> >> >
>> >> > Brian
>> >> >
>> >> >> --D (who is now weeks behind on reviewing things and stressed out)
>> >> >>
>> >> >> > Brian
>> >> >> >
>> >> >> > > Though I agree that we should ideally be allocating & marking all
>> >> >> > > blocks in the iop as dirty in the call to ->dirty_folio(), I just wanted to
>> >> >> > > understand your reasoning better.
>> >> >> > >
>> >> >> > > Thanks!
>> >> >> > > -ritesh
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
On Tue, May 23, 2023 at 11:22:33AM -0400, Brian Foster wrote:
> That is a preparatory patch before iop dirty tracking is enabled in patch
> 5, right? I was mainly just suggesting to make the overwrite checking
> part of this patch come after dirty tracking is enabled (as a small
> optimization patch) rather than before.
>
> I don't want to harp on it if that's difficult or still doesn't make
> sense for some reason. I'll take a closer look the next go around when I
> have a bit more time and just send a diff if it seems it can be done
> cleanly...

Honestly, I think the way the patchset is structured now is fine. There
are always multiple ways to structure a set of patches, and I don't
think we need be too concerned about what order the things are done in,
as long as they get done.
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index a5f4be6b9213..75efec3c3b71 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -746,7 +746,7 @@ static const struct address_space_operations gfs2_aops = {
 	.writepages = gfs2_writepages,
 	.read_folio = gfs2_read_folio,
 	.readahead = gfs2_readahead,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.bmap = gfs2_bmap,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 25f20f269214..c7f41b26280a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -52,6 +52,12 @@ static inline void iop_set_range(struct iomap_page *iop, unsigned int start_blk,
 	bitmap_set(iop->state, start_blk, nr_blks);
 }
 
+static inline void iop_clear_range(struct iomap_page *iop,
+		unsigned int start_blk, unsigned int nr_blks)
+{
+	bitmap_clear(iop->state, start_blk, nr_blks);
+}
+
 static inline bool iop_test_block(struct iomap_page *iop, unsigned int block)
 {
 	return test_bit(block, iop->state);
@@ -84,6 +90,16 @@ static bool iop_test_block_uptodate(struct folio *folio, unsigned int block)
 	return iop_test_block(iop, block);
 }
 
+static bool iop_test_block_dirty(struct folio *folio, int block)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	struct inode *inode = folio->mapping->host;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+
+	WARN_ON(!iop);
+	return iop_test_block(iop, block + blks_per_folio);
+}
+
 static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
 		size_t off, size_t len)
 {
@@ -104,8 +120,42 @@ static void iop_set_range_uptodate(struct inode *inode, struct folio *folio,
 	}
 }
 
+static void iop_set_range_dirty(struct inode *inode, struct folio *folio,
+		size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = (off >> inode->i_blkbits);
+	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	if (!iop)
+		return;
+	spin_lock_irqsave(&iop->state_lock, flags);
+	iop_set_range(iop, first_blk + blks_per_folio, nr_blks);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
+static void iop_clear_range_dirty(struct folio *folio, size_t off, size_t len)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	struct inode *inode = folio->mapping->host;
+	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
+	unsigned int first_blk = (off >> inode->i_blkbits);
+	unsigned int last_blk = ((off + len - 1) >> inode->i_blkbits);
+	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned long flags;
+
+	if (!iop)
+		return;
+	spin_lock_irqsave(&iop->state_lock, flags);
+	iop_clear_range(iop, first_blk + blks_per_folio, nr_blks);
+	spin_unlock_irqrestore(&iop->state_lock, flags);
+}
+
 static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
-		unsigned int flags)
+		unsigned int flags, bool is_dirty)
 {
 	struct iomap_page *iop = to_iomap_page(folio);
 	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
@@ -119,12 +169,20 @@ static struct iomap_page *iop_alloc(struct inode *inode, struct folio *folio,
 	else
 		gfp = GFP_NOFS | __GFP_NOFAIL;
 
-	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(nr_blocks)),
+	/*
+	 * iop->state tracks two sets of state flags when the
+	 * filesystem block size is smaller than the folio size.
+	 * The first state tracks per-block uptodate and the
+	 * second tracks per-block dirty state.
+	 */
+	iop = kzalloc(struct_size(iop, state, BITS_TO_LONGS(2 * nr_blocks)),
 		      gfp);
 	if (iop) {
 		spin_lock_init(&iop->state_lock);
 		if (folio_test_uptodate(folio))
 			iop_set_range(iop, 0, nr_blocks);
+		if (is_dirty)
+			iop_set_range(iop, nr_blocks, nr_blocks);
 		folio_attach_private(folio, iop);
 	}
 	return iop;
@@ -268,7 +326,8 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (WARN_ON_ONCE(size > iomap->length))
 		return -EIO;
 	if (offset > 0)
-		iop = iop_alloc(iter->inode, folio, iter->flags);
+		iop = iop_alloc(iter->inode, folio, iter->flags,
+				folio_test_dirty(folio));
 	else
 		iop = to_iomap_page(folio);
@@ -306,7 +365,8 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
 		return iomap_read_inline_data(iter, folio);
 
 	/* zero post-eof blocks as the page may be mapped */
-	iop = iop_alloc(iter->inode, folio, iter->flags);
+	iop = iop_alloc(iter->inode, folio, iter->flags,
+			folio_test_dirty(folio));
 	iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
 	if (plen == 0)
 		goto done;
@@ -561,6 +621,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
 
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio)
+{
+	struct iomap_page *iop;
+	struct inode *inode = mapping->host;
+	size_t len = i_blocks_per_folio(inode, folio) << inode->i_blkbits;
+
+	iop = iop_alloc(inode, folio, 0, false);
+	iop_set_range_dirty(inode, folio, 0, len);
+	return filemap_dirty_folio(mapping, folio);
+}
+EXPORT_SYMBOL_GPL(iomap_dirty_folio);
+
 static void
 iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
 {
@@ -608,7 +680,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;
 
-	iop = iop_alloc(iter->inode, folio, iter->flags);
+	iop = iop_alloc(iter->inode, folio, iter->flags,
+			folio_test_dirty(folio));
 	if ((iter->flags & IOMAP_NOWAIT) && !iop && nr_blocks > 1)
 		return -EAGAIN;
@@ -767,6 +840,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
 	if (unlikely(copied < len && !folio_test_uptodate(folio)))
 		return 0;
 	iop_set_range_uptodate(inode, folio, offset_in_folio(folio, pos), len);
+	iop_set_range_dirty(inode, folio, offset_in_folio(folio, pos), copied);
 	filemap_dirty_folio(inode->i_mapping, folio);
 	return copied;
 }
@@ -954,6 +1028,10 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 {
 	while (start_byte < end_byte) {
 		struct folio *folio;
+		struct iomap_page *iop;
+		unsigned int first_blk, last_blk, blks_per_folio, i;
+		loff_t last_byte;
+		u8 blkbits = inode->i_blkbits;
 
 		/* grab locked page */
 		folio = filemap_lock_folio(inode->i_mapping,
@@ -978,6 +1056,28 @@ static int iomap_write_delalloc_scan(struct inode *inode,
 			}
 		}
 
+		/*
+		 * When we have per-block dirty tracking, there can be
+		 * blocks within a folio which are marked uptodate
+		 * but not dirty. In that case it is necessary to punch
+		 * out such blocks to avoid leaking any delalloc blocks.
+		 */
+		iop = to_iomap_page(folio);
+		if (!iop)
+			goto skip_iop_punch;
+		last_byte = min_t(loff_t, end_byte - 1,
+				(folio_next_index(folio) << PAGE_SHIFT) - 1);
+		first_blk = offset_in_folio(folio, start_byte) >>
+								blkbits;
+		last_blk = offset_in_folio(folio, last_byte) >>
+								blkbits;
+		blks_per_folio = i_blocks_per_folio(inode, folio);
+		for (i = first_blk; i <= last_blk; i++) {
+			if (!iop_test_block_dirty(folio, i))
+				punch(inode, i << blkbits,
+					     1 << blkbits);
+		}
+skip_iop_punch:
 		/*
 		 * Make sure the next punch start is correctly bound to
 		 * the end of this data range, not the end of the folio.
@@ -1666,7 +1766,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct inode *inode,
 		struct folio *folio, u64 end_pos)
 {
-	struct iomap_page *iop = iop_alloc(inode, folio, 0);
+	struct iomap_page *iop = iop_alloc(inode, folio, 0, true);
 	struct iomap_ioend *ioend, *next;
 	unsigned len = i_blocksize(inode);
 	unsigned nblocks = i_blocks_per_folio(inode, folio);
@@ -1682,7 +1782,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * invalid, grab a new one.
 	 */
 	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (iop && !iop_test_block_uptodate(folio, i))
+		if (iop && !iop_test_block_dirty(folio, i))
 			continue;
 
 		error = wpc->ops->map_blocks(wpc, inode, pos);
@@ -1726,6 +1826,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 		}
 	}
 
+	iop_clear_range_dirty(folio, 0, end_pos - folio_pos(folio));
 	folio_start_writeback(folio);
 	folio_unlock(folio);
 
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2ef78aa1d3f6..77c7332ae197 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -578,7 +578,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.read_folio = xfs_vm_read_folio,
 	.readahead = xfs_vm_readahead,
 	.writepages = xfs_vm_writepages,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.bmap = xfs_vm_bmap,
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 132f01d3461f..e508c8e97372 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -175,7 +175,7 @@ const struct address_space_operations zonefs_file_aops = {
 	.read_folio = zonefs_read_folio,
 	.readahead = zonefs_readahead,
 	.writepages = zonefs_writepages,
-	.dirty_folio = filemap_dirty_folio,
+	.dirty_folio = iomap_dirty_folio,
 	.release_folio = iomap_release_folio,
 	.invalidate_folio = iomap_invalidate_folio,
 	.migrate_folio = filemap_migrate_folio,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0f8123504e5e..0c2bee80565c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -264,6 +264,7 @@ bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
+bool iomap_dirty_folio(struct address_space *mapping, struct folio *folio);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
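To make the split-bitmap layout in iop_alloc() above concrete: the single
state[] array carries 2 * nr_blocks bits, where bits [0, nr_blocks) track
per-block uptodate state and bits [nr_blocks, 2 * nr_blocks) track per-block
dirty state, which is why iop_set_range_dirty() and iop_test_block_dirty()
offset by blks_per_folio. Below is a small standalone userspace model of that
addressing (illustrative helper names, not the kernel's; assumes a 64-bit
unsigned long):

	#include <stdbool.h>
	#include <stdio.h>

	/*
	 * Model of the two-region state bitmap: bits [0, NR_BLOCKS) stand in
	 * for per-block uptodate state, bits [NR_BLOCKS, 2 * NR_BLOCKS) for
	 * per-block dirty state.  E.g. a 64k folio with 4k blocks: 16 blocks.
	 */
	#define NR_BLOCKS 16

	static unsigned long state[(2 * NR_BLOCKS + 63) / 64];

	static void set_state_bit(unsigned int nr)
	{
		state[nr / 64] |= 1UL << (nr % 64);
	}

	static bool test_state_bit(unsigned int nr)
	{
		return state[nr / 64] & (1UL << (nr % 64));
	}

	/* Uptodate bits live in the first region, dirty bits in the second. */
	static void set_block_uptodate(unsigned int blk) { set_state_bit(blk); }
	static void set_block_dirty(unsigned int blk) { set_state_bit(NR_BLOCKS + blk); }
	static bool block_dirty(unsigned int blk) { return test_state_bit(NR_BLOCKS + blk); }

	int main(void)
	{
		/* A 4k write landing in block 3 marks it uptodate and dirty. */
		set_block_uptodate(3);
		set_block_dirty(3);

		/* Writeback then skips every block whose dirty bit is clear. */
		for (unsigned int i = 0; i < NR_BLOCKS; i++)
			printf("block %2u: %s\n", i,
			       block_dirty(i) ? "write back" : "skip");
		return 0;
	}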
When filesystem blocksize is less than folio size (either with
mapping_large_folio_support() or with blocksize < pagesize) and when the
folio is uptodate in pagecache, then even a byte write can cause
an entire folio to be written to disk during writeback. This happens
because we currently don't have a mechanism to track per-block dirty
state within struct iomap_page. We currently only track uptodate state.

This patch implements support for tracking per-block dirty state in
iomap_page->state bitmap. This should help improve the filesystem write
performance and help reduce write amplification.

Performance testing of below fio workload reveals ~16x performance
improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.

1. <test_randwrite.fio>
[global]
ioengine=psync
rw=randwrite
overwrite=1
pre_read=1
direct=0
bs=4k
size=1G
dir=./
numjobs=8
fdatasync=1
runtime=60
iodepth=64
group_reporting=1

[fio-run]

2. Also our internal performance team reported that this patch improves
their database workload performance by around ~83% (with XFS on Power)

Reported-by: Aravinda Herle <araherle@in.ibm.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/gfs2/aops.c         |   2 +-
 fs/iomap/buffered-io.c | 115 ++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_aops.c      |   2 +-
 fs/zonefs/file.c       |   2 +-
 include/linux/iomap.h  |   1 +
 5 files changed, 112 insertions(+), 10 deletions(-)

--
2.39.2
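For anyone wanting to reproduce the numbers above: save the job file as
test_randwrite.fio in a directory on the filesystem under test and point fio
at it. A possible invocation (the device and mount point below are
placeholders, not from the original posting):

	mkfs.xfs -f -b size=4096 /dev/nvme0n1	# 4k blocksize XFS, as tested
	mount /dev/nvme0n1 /mnt/scratch
	cd /mnt/scratch
	fio /path/to/test_randwrite.fio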