Message ID | f5e84d3a63de30def2f3800f534d14389f6ba137.1700506526.git.ritesh.list@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ext2: Use iomap in buffered-io for regular files and enable large folio support | expand |
On Tue, Nov 21, 2023 at 12:35:20AM +0530, Ritesh Harjani (IBM) wrote:
> +static void ext2_file_readahead(struct readahead_control *rac)
> +{
> +	return iomap_readahead(rac, &ext2_iomap_ops);

We generally prefer to omit the 'return' when the function returns void (it's a GCC extension to not warn about it, so not actually a bug).

This all looks marvellously simple. Good job!
On Tue, Nov 21, 2023 at 12:35:20AM +0530, Ritesh Harjani (IBM) wrote:
> - mmap path of ext2 continues to use generic_file_vm_ops

So this means there are no space reservations taken at write fault time. While iomap was written with the assumption those exist, I can't instantly spot anything that relies on them - you are just a lot more likely to hit an -ENOSPC from ->map_blocks now. Maybe we should add an xfstest that covers the case where we use up more than the existing free space through writable mmaps to ensure it fully works (where works means at least as good as the old behavior)?

> +static ssize_t ext2_buffered_write_iter(struct kiocb *iocb,
> +					struct iov_iter *from)
> +{
> +	ssize_t ret = 0;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	inode_lock(inode);
> +	ret = generic_write_checks(iocb, from);
> +	if (ret > 0)
> +		ret = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
> +	inode_unlock(inode);
> +	if (ret > 0)
> +		ret = generic_write_sync(iocb, ret);
> +	return ret;
> +}

Not for now, but if we end up doing a bunch of conversions of trivial file systems, it might be worth eventually adding a wrapper for the above code with just the iomap_ops passed in. Only once we see a few duplicates, though.

> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
> +				 struct inode *inode, loff_t offset)
> +{
> +	if (offset >= wpc->iomap.offset &&
> +	    offset < wpc->iomap.offset + wpc->iomap.length)
> +		return 0;
> +
> +	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> +				IOMAP_WRITE, &wpc->iomap, NULL);
> +}

Looking at gfs2 this also might become a pattern. But I'm fine with waiting for now.
> @@ -1372,7 +1428,7 @@ void ext2_set_file_ops(struct inode *inode)
>  	if (IS_DAX(inode))
>  		inode->i_mapping->a_ops = &ext2_dax_aops;
>  	else
> -		inode->i_mapping->a_ops = &ext2_aops;
> +		inode->i_mapping->a_ops = &ext2_file_aops;
>  }

Did you run into issues in using the iomap based aops for the other uses of ext2_aops, or are you just trying to address the users one at a time?
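Christoph's "wrapper for the above code with just the iomap_ops passed in" could look roughly like the following. This is a compile-untested sketch, effectively pseudocode against the kernel APIs quoted above; the helper name iomap_buffered_write_iter is invented here and is not an existing iomap API — it is simply the quoted ext2_buffered_write_iter() with the ops made a parameter:

```c
/*
 * Hypothetical shared helper (name invented for illustration): the
 * lock / generic_write_checks / buffered write / sync sequence from the
 * quoted ext2 code, with the per-filesystem iomap_ops passed in.
 */
static ssize_t iomap_buffered_write_iter(struct kiocb *iocb,
					 struct iov_iter *from,
					 const struct iomap_ops *ops)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	inode_lock(inode);
	ret = generic_write_checks(iocb, from);
	if (ret > 0)
		ret = iomap_file_buffered_write(iocb, from, ops);
	inode_unlock(inode);
	if (ret > 0)
		ret = generic_write_sync(iocb, ret);
	return ret;
}
```

Each converted filesystem's buffered ->write_iter path would then reduce to a one-line call passing its own iomap_ops, which is presumably why the thread defers this until a few duplicates exist.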
Christoph Hellwig <hch@infradead.org> writes:

> On Tue, Nov 21, 2023 at 12:35:20AM +0530, Ritesh Harjani (IBM) wrote:
>> - mmap path of ext2 continues to use generic_file_vm_ops
>
> So this means there are no space reservations taken at write fault
> time.

Yes.

> While iomap was written with the assumption those exist, I can't
> instantly spot anything that relies on them - you are just a lot more
> likely to hit an -ENOSPC from ->map_blocks now.

Which is also true with the existing code, no? If the block reservation is not done at the write fault, writeback is likely to fail due to ENOSPC?

> Maybe we should add
> an xfstest that covers the case where we use up more than the existing
> free space through writable mmaps to ensure it fully works (where works
> means at least as good as the old behavior)?

Sure, I can try to write an fstest for the same.

Also I did find an old thread which tried implementing ->page_mkwrite in ext2 [1]. However, it was rejected with the following reason -

"""
Allocating blocks on write fault has the unwelcome impact that the file layout is likely going to be much worse (much more fragmented) - I remember getting some reports about performance regressions from users back when I did a similar change for ext3. And so I'm reluctant to change behavior of such an old legacy filesystem as ext2...
""" https://lore.kernel.org/linux-ext4/20210105175348.GE15080@quack2.suse.cz/ >> +static ssize_t ext2_buffered_write_iter(struct kiocb *iocb, >> + struct iov_iter *from) >> +{ >> + ssize_t ret = 0; >> + struct inode *inode = file_inode(iocb->ki_filp); >> + >> + inode_lock(inode); >> + ret = generic_write_checks(iocb, from); >> + if (ret > 0) >> + ret = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops); >> + inode_unlock(inode); >> + if (ret > 0) >> + ret = generic_write_sync(iocb, ret); >> + return ret; >> +} > > Not for now, but if we end up doing a bunch of conversation of trivial > file systems, it might be worth to eventually add a wrapper for the > above code with just the iomap_ops passed in. Only once we see a few > duplicates, though. > Sure, make sense. Thanks! I can try and check if the the wrapper helps. >> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc, >> + struct inode *inode, loff_t offset) >> +{ >> + if (offset >= wpc->iomap.offset && >> + offset < wpc->iomap.offset + wpc->iomap.length) >> + return 0; >> + >> + return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize, >> + IOMAP_WRITE, &wpc->iomap, NULL); >> +} > > Looking at gfs2 this also might become a pattern. But I'm fine with > waiting for now. > ok. >> @@ -1372,7 +1428,7 @@ void ext2_set_file_ops(struct inode *inode) >> if (IS_DAX(inode)) >> inode->i_mapping->a_ops = &ext2_dax_aops; >> else >> - inode->i_mapping->a_ops = &ext2_aops; >> + inode->i_mapping->a_ops = &ext2_file_aops; >> } > > Did yo run into issues in using the iomap based aops for the other uses > of ext2_aops, or are just trying to address the users one at a time? There are problems for e.g. for dir type in ext2. It uses the pagecache for dir. It uses buffer_heads and attaches them to folio->private. ...it uses block_write_begin/block_write_end() calls. Look for ext4_make_empty() -> ext4_prepare_chunk -> block_write_begin(). 
Now during sync/writeback of the dirty pages (ext2_handle_dirsync()), we might take an iomap writeback path (if using ext2_file_aops for dirs) which sees folio->private assuming it is "struct iomap_folio_state". And bad things will happen...

Now we don't have equivalent APIs in iomap for block_write_begin()/block_write_end() which the users can call. Hence Jan suggested that we first convert the ext2 regular file path to iomap as an RFC. I shall now check the dir type to see what changes we might require in the ext2/iomap code.

Thanks again for your review!

-ritesh
On Tue, Nov 21, 2023 at 11:26:15AM +0530, Ritesh Harjani wrote:
>> instantly spot anything that relies on them - you are just a lot more
>> likely to hit an -ENOSPC from ->map_blocks now.
>
> Which is also true with the existing code, no? If the block reservation is
> not done at the write fault, writeback is likely to fail due to ENOSPC?

Yes. Not saying you should change this, I just want to make sure the iomap code handles this fine. I think it does, but I'd rather be sure.

> Sure, makes sense. Thanks!
> I can try and check if the wrapper helps.

Let's wait until we have a few more conversions.

>> Did you run into issues in using the iomap based aops for the other uses
>> of ext2_aops, or are you just trying to address the users one at a time?
>
> There are problems e.g. for the dir type in ext2. It uses the pagecache
> for dirs. It uses buffer_heads and attaches them to folio->private.
> ...it uses block_write_begin()/block_write_end() calls.
> Look at ext2_make_empty() -> ext2_prepare_chunk ->
> block_write_begin().
> Now during sync/writeback of the dirty pages (ext2_handle_dirsync()), we
> might take an iomap writeback path (if using ext2_file_aops for dirs)
> which sees folio->private assuming it is "struct iomap_folio_state".
> And bad things will happen...

Oh, indeed, bufferheads again.

> Now we don't have equivalent APIs in iomap for
> block_write_begin()/end() which the users can call. Hence Jan
> suggested that we first convert the ext2 regular file path to iomap as an RFC.

Yes, no problem. But maybe worth documenting in the commit log.
Christoph Hellwig <hch@infradead.org> writes:

> On Tue, Nov 21, 2023 at 11:26:15AM +0530, Ritesh Harjani wrote:
>>> instantly spot anything that relies on them - you are just a lot more
>>> likely to hit an -ENOSPC from ->map_blocks now.
>>
>> Which is also true with the existing code, no? If the block reservation is
>> not done at the write fault, writeback is likely to fail due to ENOSPC?
>
> Yes. Not saying you should change this, I just want to make sure the
> iomap code handles this fine. I think it does, but I'd rather be sure.

Sure. I can write an fstest to test the behavior.

>> Sure, makes sense. Thanks!
>> I can try and check if the wrapper helps.
>
> Let's wait until we have a few more conversions.

Sure.

>>> Did you run into issues in using the iomap based aops for the other uses
>>> of ext2_aops, or are you just trying to address the users one at a time?
>>
>> There are problems e.g. for the dir type in ext2. It uses the pagecache
>> for dirs. It uses buffer_heads and attaches them to folio->private.
>> ...it uses block_write_begin()/block_write_end() calls.
>> Look at ext2_make_empty() -> ext2_prepare_chunk ->
>> block_write_begin().
>> Now during sync/writeback of the dirty pages (ext2_handle_dirsync()), we
>> might take an iomap writeback path (if using ext2_file_aops for dirs)
>> which sees folio->private assuming it is "struct iomap_folio_state".
>> And bad things will happen...
>
> Oh, indeed, bufferheads again.
>
>> Now we don't have equivalent APIs in iomap for
>> block_write_begin()/end() which the users can call. Hence Jan
>> suggested that we first convert the ext2 regular file path to iomap as an RFC.
>
> Yes, no problem. But maybe worth documenting in the commit log.

Sure, I will update the commit log.

-ritesh
On Tue 21-11-23 00:35:20, Ritesh Harjani (IBM) wrote: > This patch converts ext2 regular file's buffered-io path to use iomap. > - buffered-io path using iomap_file_buffered_write > - DIO fallback to buffered-io now uses iomap_file_buffered_write > - writeback path now uses a new aops - ext2_file_aops > - truncate now uses iomap_truncate_page > - mmap path of ext2 continues to use generic_file_vm_ops > > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Looks nice and simple. Just one comment below: > +static void ext2_file_readahead(struct readahead_control *rac) > +{ > + return iomap_readahead(rac, &ext2_iomap_ops); > +} As Matthew noted, the return of void here looks strange. > +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc, > + struct inode *inode, loff_t offset) > +{ > + if (offset >= wpc->iomap.offset && > + offset < wpc->iomap.offset + wpc->iomap.length) > + return 0; > + > + return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize, > + IOMAP_WRITE, &wpc->iomap, NULL); > +} So this will end up mapping blocks for writeback one block at a time if I'm understanding things right because ext2_iomap_begin() never sets extent larger than 'length' in the iomap result. Hence the wpc checking looks pointless (and actually dangerous - see below). So this will probably get more expensive than necessary by calling into ext2_get_blocks() for each block? Perhaps we could first call ext2_iomap_begin() for reading with high length to get as many mapped blocks as we can and if we find unmapped block (should happen only for writes through mmap), we resort to what you have here... Hum, but this will not work because of the races with truncate which could remove blocks whose mapping we have cached in wpc. We can safely provide a mapping under a folio only once it is locked or has writeback bit set. XFS plays the revalidation sequence counter games because of this so we'd have to do something similar for ext2. 
Not that I'd care as much about ext2 writeback performance but it should not be that hard and we'll definitely need some similar solution for ext4 anyway. Can you give that a try (as a followup "performance improvement" patch)?

								Honza
On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> writeback bit set. XFS plays the revalidation sequence counter games
> because of this so we'd have to do something similar for ext2. Not that I'd
> care as much about ext2 writeback performance but it should not be that
> hard and we'll definitely need some similar solution for ext4 anyway. Can
> you give that a try (as a followup "performance improvement" patch).

Darrick has mentioned that he is looking into lifting more of the sequence counter validation into iomap.

In the meantime I have a series here that at least maps multiple blocks inside a folio in a single go, which might be worth trying with ext2 as well:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks
Jan Kara <jack@suse.cz> writes: > On Tue 21-11-23 00:35:20, Ritesh Harjani (IBM) wrote: >> This patch converts ext2 regular file's buffered-io path to use iomap. >> - buffered-io path using iomap_file_buffered_write >> - DIO fallback to buffered-io now uses iomap_file_buffered_write >> - writeback path now uses a new aops - ext2_file_aops >> - truncate now uses iomap_truncate_page >> - mmap path of ext2 continues to use generic_file_vm_ops >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> > > Looks nice and simple. Just one comment below: > >> +static void ext2_file_readahead(struct readahead_control *rac) >> +{ >> + return iomap_readahead(rac, &ext2_iomap_ops); >> +} > > As Matthew noted, the return of void here looks strange. > yes, I will fix it. >> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc, >> + struct inode *inode, loff_t offset) >> +{ >> + if (offset >= wpc->iomap.offset && >> + offset < wpc->iomap.offset + wpc->iomap.length) >> + return 0; >> + >> + return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize, >> + IOMAP_WRITE, &wpc->iomap, NULL); >> +} > > So this will end up mapping blocks for writeback one block at a time if I'm > understanding things right because ext2_iomap_begin() never sets extent > larger than 'length' in the iomap result. Hence the wpc checking looks > pointless (and actually dangerous - see below). So this will probably get > more expensive than necessary by calling into ext2_get_blocks() for each > block? Perhaps we could first call ext2_iomap_begin() for reading with high > length to get as many mapped blocks as we can and if we find unmapped block > (should happen only for writes through mmap), we resort to what you have > here... Hum, but this will not work because of the races with truncate > which could remove blocks whose mapping we have cached in wpc. We can > safely provide a mapping under a folio only once it is locked or has > writeback bit set. 
> XFS plays the revalidation sequence counter games
> because of this so we'd have to do something similar for ext2. Not that I'd
> care as much about ext2 writeback performance but it should not be that
> hard and we'll definitely need some similar solution for ext4 anyway. Can
> you give that a try (as a followup "performance improvement" patch).

Yes, looking into ->map_blocks was on my todo list, since this RFC only maps 1 block at a time, which can be expensive. I missed adding that as a comment in the cover letter. But also thanks for providing many details on the same. I am taking a look at it and will get back with more details on how we can get this working, as it will be anyway required for ext4 too.

-ritesh
Christoph Hellwig <hch@infradead.org> writes: > On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote: >> writeback bit set. XFS plays the revalidation sequence counter games >> because of this so we'd have to do something similar for ext2. Not that I'd >> care as much about ext2 writeback performance but it should not be that >> hard and we'll definitely need some similar solution for ext4 anyway. Can >> you give that a try (as a followup "performance improvement" patch). > > Darrick has mentioned that he is looking into lifting more of the > validation sequence counter validation into iomap. > > In the meantime I have a series here that at least maps multiple blocks > inside a folio in a single go, which might be worth trying with ext2 as > well: > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks Sure, thanks for providing details. I will check this. -ritesh
On Wed, Nov 22, 2023 at 05:11:18AM -0800, Christoph Hellwig wrote: > On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote: > > writeback bit set. XFS plays the revalidation sequence counter games > > because of this so we'd have to do something similar for ext2. Not that I'd > > care as much about ext2 writeback performance but it should not be that > > hard and we'll definitely need some similar solution for ext4 anyway. Can > > you give that a try (as a followup "performance improvement" patch). > > Darrick has mentioned that he is looking into lifting more of the > validation sequence counter validation into iomap. I think that was me, as part of aligning the writeback path with the ->iomap_valid() checks in the write path after we lock the folio we instantiated for the write. It's basically the same thing - once we have a locked folio, we have to check that the cached iomap is still valid before we use it for anything. I need to find the time to get back to that, though. -Dave.
On Thu, Nov 23, 2023 at 09:26:44AM +1100, Dave Chinner wrote: > On Wed, Nov 22, 2023 at 05:11:18AM -0800, Christoph Hellwig wrote: > > On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote: > > > writeback bit set. XFS plays the revalidation sequence counter games > > > because of this so we'd have to do something similar for ext2. Not that I'd > > > care as much about ext2 writeback performance but it should not be that > > > hard and we'll definitely need some similar solution for ext4 anyway. Can > > > you give that a try (as a followup "performance improvement" patch). > > > > Darrick has mentioned that he is looking into lifting more of the > > validation sequence counter validation into iomap. > > I think that was me, as part of aligning the writeback path with > the ->iomap_valid() checks in the write path after we lock the folio > we instantiated for the write. > > It's basically the same thing - once we have a locked folio, we have > to check that the cached iomap is still valid before we use it for > anything. > > I need to find the time to get back to that, though. Heh, we probably both have been chatting with willy on and off about iomap. The particular idea I had is to add a u64 counter to address_space that we can bump in the same places where we bump xfs_inode_fork::if_seq right now.. ->iomap_begin would sample this address_space::i_mappingseq counter (with locks held), and now buffered writes and writeback can check iomap::mappingseq == address_space::i_mappingseq to decide if it's time to revalidate. Anyway, I'll have time to go play with that (and further purging of function pointers) next week or whenever is "after I put out v28 of online repair". ATM I have a rewrite of log intent recovery cooking too, but that's going to need at least a week or two of recoveryloop testing before I put that on the list. --D > -Dave. > -- > Dave Chinner > david@fromorbit.com >
On Thu, Nov 23, 2023 at 09:26:44AM +1100, Dave Chinner wrote:
> I think that was me

No, it was Darrick. We talked about a lot of things, but not this :)

> , as part of aligning the writeback path with
> the ->iomap_valid() checks in the write path after we lock the folio
> we instantiated for the write.
>
> It's basically the same thing - once we have a locked folio, we have
> to check that the cached iomap is still valid before we use it for
> anything.

Yes. Currently we do that in the wb ops ->map_blocks. This can get called multiple times per folio, which is a bit silly. With the series I just posted the link to we at least stop doing that if the folio is mapped contiguously, which solves all practical cases, as the sequence check is almost free compared to the actual block mapping.

For steps beyond that I'll reply to Darrick's mail.
On Wed, Nov 22, 2023 at 08:09:44PM -0800, Darrick J. Wong wrote:
> The particular idea I had is to add a u64 counter to address_space that
> we can bump in the same places where we bump xfs_inode_fork::if_seq
> right now.. ->iomap_begin would sample this address_space::i_mappingseq
> counter (with locks held), and now buffered writes and writeback can
> check iomap::mappingseq == address_space::i_mappingseq to decide if it's
> time to revalidate.

So I think moving this to the VFS is probably a good idea, and I actually argued for that when the sequence checking was first proposed. We just have to be careful to be able to map things like the two separate data and cow seq counts in XFS (or anything else complicated in other file systems) to it.

> Anyway, I'll have time to go play with that (and further purging of
> function pointers)

Do we have anything where the function pointer overhead is actually hurting us right now?

One thing I'd like to move to is to merge the iomap_begin and iomap_end callbacks into one similar to willy's series from 2020. The big benefit of that would be that (together with switching write_cache_pages to an iterator model) we could actually use this single iterator callback also for writeback instead of ->map_blocks, which doesn't really work with the current begin/end based iomap_iter as the folios actually written through write_cache_pages might not be contiguous. Using the same mapping callback would not only save some code duplication, but should also allow us to nicely implement Dave's old idea to not dirty pages for O_SYNC writes, but directly write them out. I did start prototyping that in the last days, and iomap_begin vs map_blocks is currently the biggest stumbling block.
On Wed, Nov 22, 2023 at 11:09:17PM -0800, Christoph Hellwig wrote: > On Wed, Nov 22, 2023 at 08:09:44PM -0800, Darrick J. Wong wrote: > > The particular idea I had is to add a u64 counter to address_space that > > we can bump in the same places where we bump xfs_inode_fork::if_seq > > right now.. ->iomap_begin would sample this address_space::i_mappingseq > > counter (with locks held), and now buffered writes and writeback can > > check iomap::mappingseq == address_space::i_mappingseq to decide if it's > > time to revalidate. > > So I think moving this to the VFS is probably a good idea, and I > actually argued for that when the sequence checking was first proposed. > We just have to be careful to be able to map things like the two > separate data and cow seq counts in XFS (or anything else complicated > in other file systems) to it. TBH I've been wondering what would happen if we bumped i_mappingseq on updates of either data or cow fork instead of the shift+or'd thing that we use now for writeback and/or pagecache write. I suppose the nice thing about the current encodings is that we elide revalidations when the cow fork changes but mapping isn't shared. > > Anyway, I'll have time to go play with that (and further purging of > > function pointers) > > Do we have anything where the function pointer overhead is actually > hurting us right now? Not that I know of, but moving to a direct call model means that the fs would know based on the iomap_XXX_iter function signature whether or not iomap needs a srcmap; and then it can modify its iomap_begin function accordingly. Right now all those rules aren't especially obvious or well documented. Maybe I can convince myself that improved documentation will suffice to eliminate Ted's confusion. :) Also I haven't checked how much the indirect calls hurt. > One thing I'd like to move to is to merge the iomap_begin and iomap_end > callbacks into one similar to willy's series from 2020. The big Got a link to that? 
I need my memory refreshed, having DROP TABLE MEM2020; pretty please.

> benefit of that would be that (together with switching
> write_cache_pages to an iterator model) that we could actually use
> this single iterator callback also for writeback instead of
> ->map_blocks, which doesn't really work with the current begin/end
> based iomap_iter as the folios actually written through
> write_cache_pages might not be contiguous.

Ooh it'd be nice to get rid of that parallel callbacks thing finally.

> Using the same mapping
> callback would not only save some code duplication, but should also
> allow us to nicely implement Dave's old idea to not dirty pages for
> O_SYNC writes, but directly write them out. I did start prototyping
> that in the last days, and iomap_begin vs map_blocks is currently
> the biggest stumbling block.

Neat! willy's been pushing me for that too.

--D
On Tue, Nov 28, 2023 at 09:37:21PM -0800, Darrick J. Wong wrote:
> TBH I've been wondering what would happen if we bumped i_mappingseq on
> updates of either data or cow fork instead of the shift+or'd thing that
> we use now for writeback and/or pagecache write.
>
> I suppose the nice thing about the current encodings is that we elide
> revalidations when the cow fork changes but mapping isn't shared.

changed? But yes.

> > > Anyway, I'll have time to go play with that (and further purging of
> > > function pointers)
> >
> > Do we have anything where the function pointer overhead is actually
> > hurting us right now?
>
> Not that I know of, but moving to a direct call model means that the fs
> would know based on the iomap_XXX_iter function signature whether or not
> iomap needs a srcmap; and then it can modify its iomap_begin function
> accordingly.
>
> Right now all those rules aren't especially obvious or well documented.
> Maybe I can convince myself that improved documentation will suffice to
> eliminate Ted's confusion. :)

Well, with an iter model I think we can actually kill the srcmap entirely, as we could be doing two separate overlapping iterations at the same time. Which would probably be nice for large unaligned writes as the write size won't be limited by the existing mapping only used to read in the unaligned blocks at the beginning and end.

> > One thing I'd like to move to is to merge the iomap_begin and iomap_end
> > callbacks into one similar to willy's series from 2020.
>
> Got a link to that? I need my memory refreshed, having DROP TABLE MEM2020;
> pretty please.

https://lore.kernel.org/all/20200811205314.GF6107@magnolia/T/
On Wed, Nov 22, 2023 at 08:09:44PM -0800, Darrick J. Wong wrote: > On Thu, Nov 23, 2023 at 09:26:44AM +1100, Dave Chinner wrote: > > On Wed, Nov 22, 2023 at 05:11:18AM -0800, Christoph Hellwig wrote: > > > On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote: > > > > writeback bit set. XFS plays the revalidation sequence counter games > > > > because of this so we'd have to do something similar for ext2. Not that I'd > > > > care as much about ext2 writeback performance but it should not be that > > > > hard and we'll definitely need some similar solution for ext4 anyway. Can > > > > you give that a try (as a followup "performance improvement" patch). > > > > > > Darrick has mentioned that he is looking into lifting more of the > > > validation sequence counter validation into iomap. > > > > I think that was me, as part of aligning the writeback path with > > the ->iomap_valid() checks in the write path after we lock the folio > > we instantiated for the write. > > > > It's basically the same thing - once we have a locked folio, we have > > to check that the cached iomap is still valid before we use it for > > anything. > > > > I need to find the time to get back to that, though. > > Heh, we probably both have been chatting with willy on and off about > iomap. > > The particular idea I had is to add a u64 counter to address_space that > we can bump in the same places where we bump xfs_inode_fork::if_seq > right now.. ->iomap_begin would sample this address_space::i_mappingseq > counter (with locks held), and now buffered writes and writeback can > check iomap::mappingseq == address_space::i_mappingseq to decide if it's > time to revalidate. Can't say I'm a great fan of putting filesystem physical extent map state cookies in the page cache address space. 
One of the primary architectural drivers for iomap was to completely separate the filesystem extent mapping information from the page cache internals and granularities, so this kinda steps over an architectural boundary in my mind.

Also, filesystem mapping operations move further away from the VFS structures into deep filesystem internals - they do not interact with the page cache structures at all. Hence requiring physical extent mapping operations to poke values in the page cache address space structure just seems like unnecessarily long pointer chasing to me.

That said, I have no problems with extent sequence counters in the VFS inode, but I just don't think it belongs in the page cache address space....

-Dave.
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:

> Christoph Hellwig <hch@infradead.org> writes:
>
>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>>> writeback bit set. XFS plays the revalidation sequence counter games
>>> because of this so we'd have to do something similar for ext2. Not that I'd
>>> care as much about ext2 writeback performance but it should not be that
>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>>> you give that a try (as a followup "performance improvement" patch).
>>
>> Darrick has mentioned that he is looking into lifting more of the
>> sequence counter validation into iomap.
>>
>> In the meantime I have a series here that at least maps multiple blocks
>> inside a folio in a single go, which might be worth trying with ext2 as
>> well:
>>
>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks
>
> Sure, thanks for providing details. I will check this.

So I took a look at this. Here is what I think -

This is useful, of course, when we have a large folio, because otherwise it's just one block at a time for each folio. That is not a problem once a filesystem's buffered-io handling code moves to iomap (because we can then enable large folio support for it). However, this would still require us to pass a folio to the ->map_blocks call to determine the size of the folio (which I am not saying can't be done, just stating my observations here).

Now coming to implementing the validation seq check: I am hoping it should be straightforward, at least for ext2, because it mostly just has to protect against truncate operations (which can change the block mapping)...

...ok, so here is the PoC for the seq counter check for ext2.
Next, let me try to see if we can lift this up from the FS side into
iomap.

From 86b7bdf79107c3d0a4f37583121d7c136db1bc5c Mon Sep 17 00:00:00 2001
From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Subject: [PATCH] ext2: Implement seq check for mapping multiblocks in ->map_blocks

This implements an inode block seq counter (ib_seq) which helps in
validating whether the cached wpc (struct iomap_writepage_ctx) still has
valid entries or not. The block mapping can get changed, say, due to a
racing truncate.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/ext2/balloc.c |  1 +
 fs/ext2/ext2.h   |  6 ++++++
 fs/ext2/inode.c  | 51 +++++++++++++++++++++++++++++++++++++++++++-----
 fs/ext2/super.c  |  2 +-
 4 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
index e124f3d709b2..e97040194da4 100644
--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block,
 	}

 	ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);
+	ext2_inc_ib_seq(EXT2_I(inode));

 do_more:
 	overflow = 0;
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 677a9ad45dcb..882c14d20183 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -663,6 +663,7 @@ struct ext2_inode_info {
 	struct rw_semaphore xattr_sem;
 #endif
 	rwlock_t i_meta_lock;
+	unsigned int ib_seq;

 	/*
 	 * truncate_mutex is for serialising ext2_truncate() against
@@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
 	return container_of(inode, struct ext2_inode_info, vfs_inode);
 }

+static inline void ext2_inc_ib_seq(struct ext2_inode_info *ei)
+{
+	WRITE_ONCE(ei->ib_seq, READ_ONCE(ei->ib_seq) + 1);
+}
+
 /* balloc.c */
 extern int ext2_bg_has_super(struct super_block *sb, int group);
 extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f4e0b9a93095..a5490d641c26 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode,
 	ext2_fsblk_t current_block = 0;
 	int ret = 0;

+	ext2_inc_ib_seq(EXT2_I(inode));
+
 	/*
 	 * Here we try to allocate the requested multiple blocks at once,
 	 * on a best-effort basis.
@@ -966,14 +968,53 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 	return mpage_writepages(mapping, wbc, ext2_get_block);
 }

+struct ext2_writepage_ctx {
+	struct iomap_writepage_ctx ctx;
+	unsigned int ib_seq;
+};
+
+static inline
+struct ext2_writepage_ctx *EXT2_WPC(struct iomap_writepage_ctx *ctx)
+{
+	return container_of(ctx, struct ext2_writepage_ctx, ctx);
+}
+
+static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode,
+			    loff_t offset)
+{
+	if (offset < wpc->iomap.offset ||
+	    offset >= wpc->iomap.offset + wpc->iomap.length)
+		return false;
+
+	if (EXT2_WPC(wpc)->ib_seq != READ_ONCE(EXT2_I(inode)->ib_seq))
+		return false;
+
+	return true;
+}
+
 static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
 				 struct inode *inode, loff_t offset)
 {
-	if (offset >= wpc->iomap.offset &&
-	    offset < wpc->iomap.offset + wpc->iomap.length)
+	loff_t maxblocks = (loff_t)INT_MAX;
+	u8 blkbits = inode->i_blkbits;
+	u32 bno;
+	bool new, boundary;
+	int ret;
+
+	if (ext2_imap_valid(wpc, inode, offset))
 		return 0;

-	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
+	EXT2_WPC(wpc)->ib_seq = READ_ONCE(EXT2_I(inode)->ib_seq);
+
+	ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
+			      &bno, &new, &boundary, 0);
+	if (ret < 0)
+		return ret;
+	/*
+	 * ret can be 0 in case of a hole which is possible for mmaped writes.
+	 */
+	ret = ret ? ret : 1;
+	return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
 				IOMAP_WRITE, &wpc->iomap, NULL);
 }

@@ -984,9 +1025,9 @@ static const struct iomap_writeback_ops ext2_writeback_ops = {
 static int ext2_file_writepages(struct address_space *mapping,
 				struct writeback_control *wbc)
 {
-	struct iomap_writepage_ctx wpc = { };
+	struct ext2_writepage_ctx wpc = { };

-	return iomap_writepages(mapping, wbc, &wpc, &ext2_writeback_ops);
+	return iomap_writepages(mapping, wbc, &wpc.ctx, &ext2_writeback_ops);
 }

 static int
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 645ee6142f69..cd1d1678e35b 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
 #ifdef CONFIG_QUOTA
 	memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
 #endif
-
+	WRITE_ONCE(ei->ib_seq, 0);
 	return &ei->vfs_inode;
 }

--
2.30.2
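[Editorial aside: the validity check in the patch above can be modeled in a few lines of user-space C. This is only an illustrative sketch, not kernel code; the struct names and fields below are invented stand-ins for `struct ext2_inode_info`, `struct ext2_writepage_ctx` and `ext2_imap_valid()`.]

```c
/* User-space model of the ib_seq scheme: the inode counter is bumped on
 * every block alloc/free, and a cached mapping is only reusable if it
 * covers the offset AND the counter has not moved since map time. */
#include <assert.h>
#include <stdbool.h>

struct model_inode {
	unsigned int ib_seq;	/* bumped on every block alloc/free */
};

struct model_wpc {
	long long map_offset;	/* start of cached mapping */
	long long map_length;	/* length of cached mapping */
	unsigned int ib_seq;	/* inode ib_seq sampled at map time */
};

static void inc_ib_seq(struct model_inode *inode)
{
	inode->ib_seq++;	/* the kernel patch uses WRITE_ONCE/READ_ONCE */
}

/* Cache a (fake) mapping and remember the current seq, like the
 * ->map_blocks slow path in the patch. */
static void map_blocks(struct model_wpc *wpc, struct model_inode *inode,
		       long long offset, long long length)
{
	wpc->map_offset = offset;
	wpc->map_length = length;
	wpc->ib_seq = inode->ib_seq;
}

/* The ext2_imap_valid() check from the patch. */
static bool imap_valid(struct model_wpc *wpc, struct model_inode *inode,
		       long long offset)
{
	if (offset < wpc->map_offset ||
	    offset >= wpc->map_offset + wpc->map_length)
		return false;
	if (wpc->ib_seq != inode->ib_seq)
		return false;
	return true;
}
```

A cached 0..1MB extent stays reusable only until any alloc/free bumps the counter; after that, even an in-range offset must go back to the block mapper.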
On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index 677a9ad45dcb..882c14d20183 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -663,6 +663,7 @@ struct ext2_inode_info {
> 	struct rw_semaphore xattr_sem;
> #endif
> 	rwlock_t i_meta_lock;
> +	unsigned int ib_seq;

Surely i_blkseq? Almost everything in this struct is prefixed i_.
On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
> So I took a look at this. Here is what I think -
> So this is useful of course when we have a large folio. Because
> otherwise it's just one block at a time for each folio. This is not a
> problem once FS buffered-io handling code moves to iomap (because we
> can then enable large folio support to it).

Yes.

> However, this would still require us to pass a folio to ->map_blocks
> call to determine the size of the folio (which I am not saying can't be
> done but just stating my observations here).

XFS currently maps based on the underlying reservation (delalloc extent)
and not the actual map size. This works because on-disk extents are
allocated as unwritten extents, and only the actually written part is
then converted. But if you only want to allocate blocks for the part
actually written, you need to pass in the dirty range and not just use
the whole folio. This would be the incremental patch to do that:

http://git.infradead.org/users/hch/xfs.git/commitdiff/0007893015796ef2ba16bb8b98c4c9fb6e9e6752

But unless your block allocator is very cheap, doing what XFS does is
probably going to work much better.

> ...ok so here is the PoC for seq counter check for ext2. Next let me
> try to see if we can lift this up from the FS side to iomap -

This looks good to me from a very superficial view. Dave is the expert
on this, though.
Matthew Wilcox <willy@infradead.org> writes: > On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote: >> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h >> index 677a9ad45dcb..882c14d20183 100644 >> --- a/fs/ext2/ext2.h >> +++ b/fs/ext2/ext2.h >> @@ -663,6 +663,7 @@ struct ext2_inode_info { >> struct rw_semaphore xattr_sem; >> #endif >> rwlock_t i_meta_lock; >> + unsigned int ib_seq; > > Surely i_blkseq? Almost everything in this struct is prefixed i_. Certainly! -ritesh
Christoph Hellwig <hch@infradead.org> writes: > On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote: >> So I took a look at this. Here is what I think - >> So this is useful of-course when we have a large folio. Because >> otherwise it's just one block at a time for each folio. This is not a >> problem once FS buffered-io handling code moves to iomap (because we >> can then enable large folio support to it). > > Yes. > >> However, this would still require us to pass a folio to ->map_blocks >> call to determine the size of the folio (which I am not saying can't be >> done but just stating my observations here). > > XFS currently maps based on the underlyig reservation (delalloc extent) > and not the actual map size. This works because on-disk extents are > allocated as unwritten extents, and only the actual written part is > the converted. But if you only want to allocate blocks for the part > actually written you actually need to pass in the dirty range and not > just use the whole folio. This would be the incremental patch to do > that: yes. dirty range indeed. > > http://git.infradead.org/users/hch/xfs.git/commitdiff/0007893015796ef2ba16bb8b98c4c9fb6e9e6752 > > But unless your block allocator is very cheap doing what XFS does is > probably going to work much better. > >> ...ok so here is the PoC for seq counter check for ext2. Next let me >> try to see if we can lift this up from the FS side to iomap - > > This looks good to me from a very superficial view. Dave is the expert > on this, though. Sure.
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:

> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
>
>> Christoph Hellwig <hch@infradead.org> writes:
>>
>>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
>>>> writeback bit set. XFS plays the revalidation sequence counter games
>>>> because of this so we'd have to do something similar for ext2. Not that I'd
>>>> care as much about ext2 writeback performance but it should not be that
>>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
>>>> you give that a try (as a followup "performance improvement" patch).

ok. So I am re-thinking why a filesystem like ext2 would require the
sequence counter check at all. We don't have collapse range or COW style
operations; it is only truncate which can race, and that should be taken
care of by folio_lock. And even if a partial truncate happens on a
folio, since the logical-to-physical block mapping never changes, it
should not matter if the writeback wrote data through a cached entry,
right?

-ritesh
On 2023/11/30 12:30, Christoph Hellwig wrote:
> On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote:
>> So I took a look at this. Here is what I think -
>> So this is useful of course when we have a large folio. Because
>> otherwise it's just one block at a time for each folio. This is not a
>> problem once FS buffered-io handling code moves to iomap (because we
>> can then enable large folio support to it).
>
> Yes.
>
>> However, this would still require us to pass a folio to ->map_blocks
>> call to determine the size of the folio (which I am not saying can't be
>> done but just stating my observations here).
>
> XFS currently maps based on the underlying reservation (delalloc extent)
> and not the actual map size. This works because on-disk extents are
> allocated as unwritten extents, and only the actually written part is
> then converted. But if you only want to allocate blocks for the part

IIUC, I noticed a side effect of this mapping method. Consider a special
case: if we sync a partial range of a delalloc extent and the system
then crashes before the remaining data is written back, we can end up
with a file that has allocated (unwritten) blocks beyond EOF. I can
reproduce it on xfs:

  # write 10 blocks, but only sync 3 blocks; i_size becomes 12K after IO completes
  xfs_io -f -c "pwrite 0 40k" -c "sync_range 0 12k" /mnt/foo

  # postpone the remaining data writeback
  echo 20000 > /proc/sys/vm/dirty_writeback_centisecs
  echo 20000 > /proc/sys/vm/dirty_expire_centisecs

  # wait 35s to make sure xfs's CIL log has been written back
  sleep 35

  # trigger a system crash
  echo c > /proc/sysrq-trigger

After the system reboots, the 'foo' file's size is 12K but it has 10
allocated blocks:

  xfs_db> inode 131
  core.size = 12288
  core.nblocks = 10
  ...
  0:[0,24,3,0] 1:[3,27,7,1]

I'm no expert on xfs, but I don't think this is the right result, is it?

Thanks,
Yi.

> actually written you actually need to pass in the dirty range and not
> just use the whole folio. This would be the incremental patch to do
> that:
>
> http://git.infradead.org/users/hch/xfs.git/commitdiff/0007893015796ef2ba16bb8b98c4c9fb6e9e6752
>
> But unless your block allocator is very cheap doing what XFS does is
> probably going to work much better.
>
>> ...ok so here is the PoC for seq counter check for ext2. Next let me
>> try to see if we can lift this up from the FS side to iomap -
>
> This looks good to me from a very superficial view. Dave is the expert
> on this, though.
On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> >> Christoph Hellwig <hch@infradead.org> writes:
> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> >>>> writeback bit set. XFS plays the revalidation sequence counter games
> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
> >>>> care as much about ext2 writeback performance but it should not be that
> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
> >>>> you give that a try (as a followup "performance improvement" patch).
>
> ok. So I am re-thinking why a filesystem like ext2 would require the
> sequence counter check at all. We don't have collapse range or COW
> style operations; it is only truncate which can race, and that should
> be taken care of by folio_lock. And even if a partial truncate happens
> on a folio, since the logical to physical block mapping never changes,
> it should not matter if the writeback wrote data to a cached entry,
> right?

Yes, so this is what I think I've already mentioned. As long as we map just
the block at the current offset (or a block under the currently locked
folio), we are fine and we don't need any kind of sequence counter. But as
soon as we start caching any kind of mapping in iomap_writepage_ctx we need
a way to protect from races with truncate. So something like the sequence
counter.

								Honza
Jan Kara <jack@suse.cz> writes: > On Thu 30-11-23 13:15:58, Ritesh Harjani wrote: >> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes: >> >> > Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes: >> > >> >> Christoph Hellwig <hch@infradead.org> writes: >> >> >> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote: >> >>>> writeback bit set. XFS plays the revalidation sequence counter games >> >>>> because of this so we'd have to do something similar for ext2. Not that I'd >> >>>> care as much about ext2 writeback performance but it should not be that >> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can >> >>>> you give that a try (as a followup "performance improvement" patch). >> >> ok. So I am re-thinknig over this on why will a filesystem like ext2 >> would require sequence counter check. We don't have collapse range >> or COW sort of operations, it is only the truncate which can race, >> but that should be taken care by folio_lock. And even if the partial >> truncate happens on a folio, since the logical to physical block mapping >> never changes, it should not matter if the writeback wrote data to a >> cached entry, right? > > Yes, so this is what I think I've already mentioned. As long as we map just > the block at the current offset (or a block under currently locked folio), > we are fine and we don't need any kind of sequence counter. But as soon as > we start caching any kind of mapping in iomap_writepage_ctx we need a way > to protect from races with truncate. So something like the sequence counter. > Why do we need to protect from the race with truncate, is my question here. So, IMO, truncate will truncate the folio cache first before releasing the FS blocks. 
Truncation of the folio cache and the writeback path are protected by
folio_lock(). Truncate will clear the dirty flag of the folio before
releasing the folio_lock(), so writeback will not even proceed for
folios which are not marked dirty (even if we have a cached wpc entry
for a folio that has been released from the folio cache).

Now coming to a stale cached wpc entry for which truncate did a partial
truncation, say truncate ended up calling
truncate_inode_partial_folio(). Such a folio remains dirty after the
partial truncation, so when writeback writes it through the cached
(stale) block mapping, there is no harm in that, right?

Also, this will "only" happen for a folio which was partially truncated.
So why do we need a sequence counter to protect against this race, is my
question. Could it be that one is only needed when an existing
logical-to-physical block mapping can change, e.g. with COW or maybe
collapse range?

-ritesh
On Thu 30-11-23 16:29:59, Ritesh Harjani wrote: > Jan Kara <jack@suse.cz> writes: > > > On Thu 30-11-23 13:15:58, Ritesh Harjani wrote: > >> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes: > >> > >> > Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes: > >> > > >> >> Christoph Hellwig <hch@infradead.org> writes: > >> >> > >> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote: > >> >>>> writeback bit set. XFS plays the revalidation sequence counter games > >> >>>> because of this so we'd have to do something similar for ext2. Not that I'd > >> >>>> care as much about ext2 writeback performance but it should not be that > >> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can > >> >>>> you give that a try (as a followup "performance improvement" patch). > >> > >> ok. So I am re-thinknig over this on why will a filesystem like ext2 > >> would require sequence counter check. We don't have collapse range > >> or COW sort of operations, it is only the truncate which can race, > >> but that should be taken care by folio_lock. And even if the partial > >> truncate happens on a folio, since the logical to physical block mapping > >> never changes, it should not matter if the writeback wrote data to a > >> cached entry, right? > > > > Yes, so this is what I think I've already mentioned. As long as we map just > > the block at the current offset (or a block under currently locked folio), > > we are fine and we don't need any kind of sequence counter. But as soon as > > we start caching any kind of mapping in iomap_writepage_ctx we need a way > > to protect from races with truncate. So something like the sequence counter. > > > > Why do we need to protect from the race with truncate, is my question here. > So, IMO, truncate will truncate the folio cache first before releasing the FS > blocks. 
> Truncation of the folio cache and the writeback path are
> protected using folio_lock()
> Truncate will clear the dirty flag of the folio before
> releasing the folio_lock() right, so writeback will not even proceed for
> folios which are not marked dirty (even if we have a cached wpc entry for
> which folio is released from folio cache).
>
> Now coming to the stale cached wpc entry for which truncate is doing a
> partial truncation. Say, truncate ended up calling
> truncate_inode_partial_folio(). Now for such folio (it remains dirty
> after partial truncation) (for which there is a stale cached wpc entry),
> when writeback writes to the underlying stale block, there is no harm
> with that right?
>
> Also this will "only" happen for folio which was partially truncated.
> So why do we need to have sequence counter for protecting against this
> race is my question.

That's a very good question and it took me a while to formulate my "this
sounds problematic" feeling into a particular example :) We can still have
a race like:

write_cache_pages()
  cache extent covering 0..1MB range
  write page at offset 0k
                        truncate(file, 4k)
                          drops all relevant pages
                          frees fs blocks
                        pwrite(file, 4k, 4k)
                          creates dirty page in the page cache
  writes page at offset 4k to a stale block

								Honza
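[Editorial aside: Jan's race can be played out in a few lines of user-space C. This is a sketch only; "physical blocks" are plain integers and all names below are invented. It shows why trusting the cached extent writes the re-dirtied page at offset 4k to the freed (stale) block, while a seq-counter check routes it to the newly allocated one.]

```c
/* Model: the file's logical offset 4k maps to one "physical block".
 * write_cache_pages() caches the 0..1MB mapping; a racing truncate+pwrite
 * then frees the old block and allocates a new one, bumping a seq counter
 * on each block alloc/free as the PoC patch does. */
#include <assert.h>

struct race_inode {
	unsigned int seq;	/* bumped on every block alloc/free */
	int phys_of_4k;		/* current mapping for logical offset 4k */
};

struct race_wpc {
	int cached_phys;	/* cached mapping for offset 4k */
	unsigned int seq;	/* inode seq sampled at map time */
};

static void cache_mapping(struct race_wpc *w, struct race_inode *i)
{
	w->cached_phys = i->phys_of_4k;
	w->seq = i->seq;
}

/* truncate(file, 4k) frees the old block; pwrite(file, 4k, 4k) allocates
 * a new one. Both bump the seq counter. */
static void truncate_and_rewrite(struct race_inode *i, int new_phys)
{
	i->seq++;		/* block freed by truncate */
	i->phys_of_4k = new_phys;
	i->seq++;		/* block allocated by pwrite */
}

/* Writeback of the (re-dirtied) page at 4k: returns the physical block
 * it would write to, optionally revalidating the cached mapping. */
static int writeback_4k(struct race_wpc *w, struct race_inode *i,
			int check_seq)
{
	if (!check_seq || w->seq == i->seq)
		return w->cached_phys;	/* trust the cached extent */
	cache_mapping(w, i);		/* stale: call ->map_blocks again */
	return w->cached_phys;
}
```

Without the check the page lands on the freed block (data corruption once that block is reallocated elsewhere); with it, writeback re-maps and hits the right block.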
Jan Kara <jack@suse.cz> writes: > On Thu 30-11-23 16:29:59, Ritesh Harjani wrote: >> Jan Kara <jack@suse.cz> writes: >> >> > On Thu 30-11-23 13:15:58, Ritesh Harjani wrote: >> >> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes: >> >> >> >> > Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes: >> >> > >> >> >> Christoph Hellwig <hch@infradead.org> writes: >> >> >> >> >> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote: >> >> >>>> writeback bit set. XFS plays the revalidation sequence counter games >> >> >>>> because of this so we'd have to do something similar for ext2. Not that I'd >> >> >>>> care as much about ext2 writeback performance but it should not be that >> >> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can >> >> >>>> you give that a try (as a followup "performance improvement" patch). >> >> >> >> ok. So I am re-thinknig over this on why will a filesystem like ext2 >> >> would require sequence counter check. We don't have collapse range >> >> or COW sort of operations, it is only the truncate which can race, >> >> but that should be taken care by folio_lock. And even if the partial >> >> truncate happens on a folio, since the logical to physical block mapping >> >> never changes, it should not matter if the writeback wrote data to a >> >> cached entry, right? >> > >> > Yes, so this is what I think I've already mentioned. As long as we map just >> > the block at the current offset (or a block under currently locked folio), >> > we are fine and we don't need any kind of sequence counter. But as soon as >> > we start caching any kind of mapping in iomap_writepage_ctx we need a way >> > to protect from races with truncate. So something like the sequence counter. >> > >> >> Why do we need to protect from the race with truncate, is my question here. >> So, IMO, truncate will truncate the folio cache first before releasing the FS >> blocks. 
>> Truncation of the folio cache and the writeback path are
>> protected using folio_lock()
>> Truncate will clear the dirty flag of the folio before
>> releasing the folio_lock() right, so writeback will not even proceed for
>> folios which are not marked dirty (even if we have a cached wpc entry for
>> which folio is released from folio cache).
>>
>> Now coming to the stale cached wpc entry for which truncate is doing a
>> partial truncation. Say, truncate ended up calling
>> truncate_inode_partial_folio(). Now for such folio (it remains dirty
>> after partial truncation) (for which there is a stale cached wpc entry),
>> when writeback writes to the underlying stale block, there is no harm
>> with that right?
>>
>> Also this will "only" happen for folio which was partially truncated.
>> So why do we need to have sequence counter for protecting against this
>> race is my question.
>
> That's a very good question and it took me a while to formulate my "this
> sounds problematic" feeling into a particular example :) We can still have
> a race like:
>
> write_cache_pages()
>   cache extent covering 0..1MB range
>   write page at offset 0k
>                         truncate(file, 4k)
>                           drops all relevant pages
>                           frees fs blocks
>                         pwrite(file, 4k, 4k)
>                           creates dirty page in the page cache
>   writes page at offset 4k to a stale block

:) Nice!
A truncate followed by an append write could cause this race with
writeback happening in parallel. And this might cause data corruption.

Thanks again for clearly explaining the race :)

So I think just having a seq counter (no other locks required) should be
enough to prevent this race, since what we are mostly interested in is
whether the stale cached block mapping got changed for the folio which
writeback is going to continue writing to.

i.e. the race only happens when both of these operations happen while we
have a cached block mapping for a folio:
- a new physical block got allocated for the same logical offset to
  which the previous folio belonged, and
- the folio got re-instantiated in the page cache as dirty, with the new
  physical block mapping at the same logical offset of the file.

Now when the writeback continues, it will iterate over the same dirty
folio in question, lock it, and call ->map_blocks, which will notice the
changed seq counter, do ->get_blocks again, and then we submit_bio() to
the right physical block.

So it should be ok if we simply update the seq counter every time a
block is allocated or freed for a given inode... because when we check
the seq counter we know the folio is locked, and the other operation of
re-instantiating a new physical block mapping for a given folio can also
only happen while under the folio lock.

OR, one other approach to this can be...

IIUC, for a new block mapping for the same folio, the folio has to get
invalidated once in between, right? So could this be handled in the
folio cache somehow, to know that for a given folio the underlying
mapping might have changed, for which iomap should again query
->map_blocks()? ... can that also help avoid unnecessary re-validations
of cached block mappings?

Thoughts?

-ritesh

> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
On Thu 30-11-23 21:20:41, Ritesh Harjani wrote: > Jan Kara <jack@suse.cz> writes: > > > On Thu 30-11-23 16:29:59, Ritesh Harjani wrote: > >> Jan Kara <jack@suse.cz> writes: > >> > >> > On Thu 30-11-23 13:15:58, Ritesh Harjani wrote: > >> >> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes: > >> >> > >> >> > Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes: > >> >> > > >> >> >> Christoph Hellwig <hch@infradead.org> writes: > >> >> >> > >> >> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote: > >> >> >>>> writeback bit set. XFS plays the revalidation sequence counter games > >> >> >>>> because of this so we'd have to do something similar for ext2. Not that I'd > >> >> >>>> care as much about ext2 writeback performance but it should not be that > >> >> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can > >> >> >>>> you give that a try (as a followup "performance improvement" patch). > >> >> > >> >> ok. So I am re-thinknig over this on why will a filesystem like ext2 > >> >> would require sequence counter check. We don't have collapse range > >> >> or COW sort of operations, it is only the truncate which can race, > >> >> but that should be taken care by folio_lock. And even if the partial > >> >> truncate happens on a folio, since the logical to physical block mapping > >> >> never changes, it should not matter if the writeback wrote data to a > >> >> cached entry, right? > >> > > >> > Yes, so this is what I think I've already mentioned. As long as we map just > >> > the block at the current offset (or a block under currently locked folio), > >> > we are fine and we don't need any kind of sequence counter. But as soon as > >> > we start caching any kind of mapping in iomap_writepage_ctx we need a way > >> > to protect from races with truncate. So something like the sequence counter. > >> > > >> > >> Why do we need to protect from the race with truncate, is my question here. 
> >> So, IMO, truncate will truncate the folio cache first before releasing the FS > >> blocks. Truncation of the folio cache and the writeback path are > >> protected using folio_lock() > >> Truncate will clear the dirty flag of the folio before > >> releasing the folio_lock() right, so writeback will not even proceed for > >> folios which are not marked dirty (even if we have a cached wpc entry for > >> which folio is released from folio cache). > >> > >> Now coming to the stale cached wpc entry for which truncate is doing a > >> partial truncation. Say, truncate ended up calling > >> truncate_inode_partial_folio(). Now for such folio (it remains dirty > >> after partial truncation) (for which there is a stale cached wpc entry), > >> when writeback writes to the underlying stale block, there is no harm > >> with that right? > >> > >> Also this will "only" happen for folio which was partially truncated. > >> So why do we need to have sequence counter for protecting against this > >> race is my question. > > > > That's a very good question and it took me a while to formulate my "this > > sounds problematic" feeling into a particular example :) We can still have > > a race like: > > > > write_cache_pages() > > cache extent covering 0..1MB range > > write page at offset 0k > > truncate(file, 4k) > > drops all relevant pages > > frees fs blocks > > pwrite(file, 4k, 4k) > > creates dirty page in the page cache > > writes page at offset 4k to a stale block > > :) Nice! > Truncate followed by an append write could cause this race with writeback > happening in parallel. And this might cause data corruption. > > Thanks again for clearly explaining the race :) > > So I think just having a seq. counter (no other locks required), should > be ok to prevent this race. Since mostly what we are interested in is > whether the stale cached block mapping got changed for the folio which > writeback is going to continue writing to. > > i.e. 
the race only happens when 2 of these operations happen while we > have a cached block mapping for a folio - > - a new physical block got allocated for the same logical offset to > which the previous folio belongs > - the folio got re-instantiated in the page cache as dirty with the new > physical block mapping at the same logical offset of the file. > > Now when the writeback continues, it will iterate over the same > dirty folio in question, lock it, check ->map_blocks, which will > notice a changed seq counter and do ->get_blocks again, and then > we do submit_bio() to the right physical block. > > So, it should be ok if we simply update the seq counter every time a > block is allocated or freed for a given inode... because when we > check the seq counter, we know the folio is locked and the other > operation of reinstating a new physical block mapping for a given folio > can also only happen while under a folio lock. Yes. > OR, one other approach to this can be... > > IIUC, for a new block mapping for the same folio, the folio can be made to > get invalidated once in between, right? So should this be handled in the folio > cache somehow, to know if for a given folio the underlying mapping might > have changed, for which iomap should again query ->map_blocks()? > ... can that also help against unnecessary re-validations of cached > block mappings? This would be difficult because the page cache handling code does not know someone has cached a block mapping somewhere... Honza
On Thu, Nov 30, 2023 at 09:20:41PM +0530, Ritesh Harjani wrote: > OR, one other approach to this can be... > > IIUC, for a new block mapping for the same folio, the folio can be made to > get invalidated once in between right? So should this be handled in folio > cache somehow to know if for a given folio the underlying mapping might > have changed for which iomap should again query ->map_blocks() ? > ... can that also help against unnecessary re-validations against cached > block mappings? That's pretty much exactly what buffer heads do -- we'd see the mapped bit was now clear. Maybe we could have a bit in the iomap_folio_state that is set once a map has been returned for this folio. On the one hand, this is exactly against the direction of iomap; to divorce the filesystem state from the page cache state. On the other hand, maybe that wasn't a great idea and deserves to be reexamined.
On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote: > Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes: > > > Christoph Hellwig <hch@infradead.org> writes: > > > >> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote: > >>> writeback bit set. XFS plays the revalidation sequence counter games > >>> because of this so we'd have to do something similar for ext2. Not that I'd > >>> care as much about ext2 writeback performance but it should not be that > >>> hard and we'll definitely need some similar solution for ext4 anyway. Can > >>> you give that a try (as a followup "performance improvement" patch). > >> > >> Darrick has mentioned that he is looking into lifting more of the > >> validation sequence counter validation into iomap. > >> > >> In the meantime I have a series here that at least maps multiple blocks > >> inside a folio in a single go, which might be worth trying with ext2 as > >> well: > >> > >> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks > > > > Sure, thanks for providing details. I will check this. > > > > So I took a look at this. Here is what I think - > So this is useful of course when we have a large folio, because > otherwise it's just one block at a time for each folio. This is not a > problem once the FS buffered-io handling code moves to iomap (because we > can then enable large folio support for it). > > However, this would still require us to pass a folio to the ->map_blocks > call to determine the size of the folio (which I am not saying can't be > done, just stating my observations here). > > Now coming to implementing the validation seq check. I am hoping it should > be straightforward at least for ext2, because it mostly just has to > protect against the truncate operation (which can change the block mapping)... > > ...ok so here is the PoC for the seq counter check for ext2.
Next let me > try to see if we can lift this up from the FS side to iomap - > > > From 86b7bdf79107c3d0a4f37583121d7c136db1bc5c Mon Sep 17 00:00:00 2001 > From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> > Subject: [PATCH] ext2: Implement seq check for mapping multiblocks in ->map_blocks > > This implements inode block seq counter (ib_seq) which helps in validating > whether the cached wpc (struct iomap_writepage_ctx) still has the valid > entries or not. Block mapping can get changed say due to a racing truncate. > > Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> > --- > fs/ext2/balloc.c | 1 + > fs/ext2/ext2.h | 6 ++++++ > fs/ext2/inode.c | 51 +++++++++++++++++++++++++++++++++++++++++++----- > fs/ext2/super.c | 2 +- > 4 files changed, 54 insertions(+), 6 deletions(-) > > diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c > index e124f3d709b2..e97040194da4 100644 > --- a/fs/ext2/balloc.c > +++ b/fs/ext2/balloc.c > @@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block, > } > > ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1); > + ext2_inc_ib_seq(EXT2_I(inode)); > > do_more: > overflow = 0; > diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h > index 677a9ad45dcb..882c14d20183 100644 > --- a/fs/ext2/ext2.h > +++ b/fs/ext2/ext2.h > @@ -663,6 +663,7 @@ struct ext2_inode_info { > struct rw_semaphore xattr_sem; > #endif > rwlock_t i_meta_lock; > + unsigned int ib_seq; > > /* > * truncate_mutex is for serialising ext2_truncate() against > @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode) > return container_of(inode, struct ext2_inode_info, vfs_inode); > } > > +static inline void ext2_inc_ib_seq(struct ext2_inode_info *ei) > +{ > + WRITE_ONCE(ei->ib_seq, READ_ONCE(ei->ib_seq) + 1); > +} > + > /* balloc.c */ > extern int ext2_bg_has_super(struct super_block *sb, int group); > extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group); > diff --git a/fs/ext2/inode.c 
b/fs/ext2/inode.c > index f4e0b9a93095..a5490d641c26 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode, > ext2_fsblk_t current_block = 0; > int ret = 0; > > + ext2_inc_ib_seq(EXT2_I(inode)); > + > /* > * Here we try to allocate the requested multiple blocks at once, > * on a best-effort basis. > @@ -966,14 +968,53 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) > return mpage_writepages(mapping, wbc, ext2_get_block); > } > > +struct ext2_writepage_ctx { > + struct iomap_writepage_ctx ctx; > + unsigned int ib_seq; > +}; Why copy this structure from XFS? The iomap held in ctx->iomap now has a 'u64 validity_cookie;' which is expressly intended to be used for holding this information. And then this becomes: > +static inline > +struct ext2_writepage_ctx *EXT2_WPC(struct iomap_writepage_ctx *ctx) > +{ > + return container_of(ctx, struct ext2_writepage_ctx, ctx); > +} > + > +static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode, > + loff_t offset) > +{ > + if (offset < wpc->iomap.offset || > + offset >= wpc->iomap.offset + wpc->iomap.length) > + return false; > + > + if (EXT2_WPC(wpc)->ib_seq != READ_ONCE(EXT2_I(inode)->ib_seq)) > + return false; if (wpc->iomap->validity_cookie != EXT2_I(inode)->ib_seq) return false; > + > + return true; > +} > + > static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc, > struct inode *inode, loff_t offset) > { > - if (offset >= wpc->iomap.offset && > - offset < wpc->iomap.offset + wpc->iomap.length) > + loff_t maxblocks = (loff_t)INT_MAX; > + u8 blkbits = inode->i_blkbits; > + u32 bno; > + bool new, boundary; > + int ret; > + > + if (ext2_imap_valid(wpc, inode, offset)) > return 0; > > - return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize, > + EXT2_WPC(wpc)->ib_seq = READ_ONCE(EXT2_I(inode)->ib_seq); > + > + ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits, > + 
&bno, &new, &boundary, 0); This is incorrectly ordered. ext2_get_blocks() bumps ib_seq when it does allocation, so the newly stored EXT2_WPC(wpc)->ib_seq is immediately stale and the very next call to ext2_imap_valid() will fail, causing a new iomap to be mapped even when not necessary. Indeed, we also don't hold the ei->truncate_mutex here, so the sampling of ib_seq could race with other allocation or truncation calls on this inode. That's a landmine that will eventually bite hard, because the sequence number returned with the iomap does not reflect the state of the extent tree when the iomap is created. If you look at the XFS code, we set iomap->validity_cookie whenever we call xfs_bmbt_to_iomap() to format the found/allocated extent into an iomap to return to the caller. The sequence number is passed into that function alongside the extent map by the caller, because when we format the extent into an iomap the caller has already dropped the allocation lock. We must sample the sequence number after the allocation but before we drop the allocation lock so that the sequence number -exactly- matches the extent tree and the extent map we are going to format into the iomap, otherwise the extent we map into the iomap may already be stale and the validity cookie won't tell us that. i.e. the coherency relationship between the validity cookie and the iomap must be set up from a synchronised state. If the sequence number is allowed to change between mapping the extent and sampling the sequence number, then the extent we map and cache may already be invalid and this introduces the possibility that the validity cookie won't always catch that... > + if (ret < 0) > + return ret; > + /* > + * ret can be 0 in case of a hole which is possible for mmaped writes. > + */ > + ret = ret ? ret : 1; > + return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits, > IOMAP_WRITE, &wpc->iomap, NULL); So calling ext2_iomap_begin() here might have to change.
Indeed, given that we just allocated the blocks and we know where they are located, calling ext2_iomap_begin()->ext2_get_blocks() to look them up again just to format them into the iomap seems .... convoluted. It would be better to factor ext2_iomap_begin() into a call to ext2_get_blocks() and then a "ext2_blocks_to_iomap()" function to format the returned blocks into the iomap that is returned. Then ext2_write_map_blocks() can just call ext2_blocks_to_iomap() here and it skips the unnecessary block mapping lookup.... It also means that the ib_seq returned by the allocation can be fed directly into the iomap->validity_cookie... -Dave.
Dave Chinner <david@fromorbit.com> writes: > On Thu, Nov 30, 2023 at 08:54:31AM +0530, Ritesh Harjani wrote: >> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes: >> >> > Christoph Hellwig <hch@infradead.org> writes: >> > >> >> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote: >> >>> writeback bit set. XFS plays the revalidation sequence counter games >> >>> because of this so we'd have to do something similar for ext2. Not that I'd >> >>> care as much about ext2 writeback performance but it should not be that >> >>> hard and we'll definitely need some similar solution for ext4 anyway. Can >> >>> you give that a try (as a followup "performance improvement" patch). >> >> >> >> Darrick has mentioned that he is looking into lifting more of the >> >> validation sequence counter validation into iomap. >> >> >> >> In the meantime I have a series here that at least maps multiple blocks >> >> inside a folio in a single go, which might be worth trying with ext2 as >> >> well: >> >> >> >> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iomap-map-multiple-blocks >> > >> > Sure, thanks for providing details. I will check this. >> > >> >> So I took a look at this. Here is what I think - >> So this is useful of-course when we have a large folio. Because >> otherwise it's just one block at a time for each folio. This is not a >> problem once FS buffered-io handling code moves to iomap (because we >> can then enable large folio support to it). >> >> However, this would still require us to pass a folio to ->map_blocks >> call to determine the size of the folio (which I am not saying can't be >> done but just stating my observations here). >> >> Now coming to implementing validation seq check. I am hoping, it should >> be straight forward at least for ext2, because it mostly just have to >> protect against truncate operation (which can change the block mapping)... >> >> ...ok so here is the PoC for seq counter check for ext2. 
Next let me >> try to see if we can lift this up from the FS side to iomap - >> >> >> From 86b7bdf79107c3d0a4f37583121d7c136db1bc5c Mon Sep 17 00:00:00 2001 >> From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> >> Subject: [PATCH] ext2: Implement seq check for mapping multiblocks in ->map_blocks >> >> This implements inode block seq counter (ib_seq) which helps in validating >> whether the cached wpc (struct iomap_writepage_ctx) still has the valid >> entries or not. Block mapping can get changed say due to a racing truncate. >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> >> --- >> fs/ext2/balloc.c | 1 + >> fs/ext2/ext2.h | 6 ++++++ >> fs/ext2/inode.c | 51 +++++++++++++++++++++++++++++++++++++++++++----- >> fs/ext2/super.c | 2 +- >> 4 files changed, 54 insertions(+), 6 deletions(-) >> >> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c >> index e124f3d709b2..e97040194da4 100644 >> --- a/fs/ext2/balloc.c >> +++ b/fs/ext2/balloc.c >> @@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block, >> } >> >> ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1); >> + ext2_inc_ib_seq(EXT2_I(inode)); >> >> do_more: >> overflow = 0; >> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h >> index 677a9ad45dcb..882c14d20183 100644 >> --- a/fs/ext2/ext2.h >> +++ b/fs/ext2/ext2.h >> @@ -663,6 +663,7 @@ struct ext2_inode_info { >> struct rw_semaphore xattr_sem; >> #endif >> rwlock_t i_meta_lock; >> + unsigned int ib_seq; >> >> /* >> * truncate_mutex is for serialising ext2_truncate() against >> @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode) >> return container_of(inode, struct ext2_inode_info, vfs_inode); >> } >> >> +static inline void ext2_inc_ib_seq(struct ext2_inode_info *ei) >> +{ >> + WRITE_ONCE(ei->ib_seq, READ_ONCE(ei->ib_seq) + 1); >> +} >> + >> /* balloc.c */ >> extern int ext2_bg_has_super(struct super_block *sb, int group); >> extern unsigned long ext2_bg_num_gdb(struct 
super_block *sb, int group); >> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c >> index f4e0b9a93095..a5490d641c26 100644 >> --- a/fs/ext2/inode.c >> +++ b/fs/ext2/inode.c >> @@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode, >> ext2_fsblk_t current_block = 0; >> int ret = 0; >> >> + ext2_inc_ib_seq(EXT2_I(inode)); >> + >> /* >> * Here we try to allocate the requested multiple blocks at once, >> * on a best-effort basis. >> @@ -966,14 +968,53 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) >> return mpage_writepages(mapping, wbc, ext2_get_block); >> } >> >> +struct ext2_writepage_ctx { >> + struct iomap_writepage_ctx ctx; >> + unsigned int ib_seq; >> +}; > > Why copy this structure from XFS? The iomap held in ctx->iomap now > has a 'u64 validity_cookie;' which is expressly intended to be used > for holding this information. > > And then this becomes: > Right. Thanks for pointing out. >> +static inline >> +struct ext2_writepage_ctx *EXT2_WPC(struct iomap_writepage_ctx *ctx) >> +{ >> + return container_of(ctx, struct ext2_writepage_ctx, ctx); >> +} >> + >> +static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode, >> + loff_t offset) >> +{ >> + if (offset < wpc->iomap.offset || >> + offset >= wpc->iomap.offset + wpc->iomap.length) >> + return false; >> + >> + if (EXT2_WPC(wpc)->ib_seq != READ_ONCE(EXT2_I(inode)->ib_seq)) >> + return false; > > if (wpc->iomap->validity_cookie != EXT2_I(inode)->ib_seq) > return false; > >> + >> + return true; >> +} >> + >> static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc, >> struct inode *inode, loff_t offset) >> { >> - if (offset >= wpc->iomap.offset && >> - offset < wpc->iomap.offset + wpc->iomap.length) >> + loff_t maxblocks = (loff_t)INT_MAX; >> + u8 blkbits = inode->i_blkbits; >> + u32 bno; >> + bool new, boundary; >> + int ret; >> + >> + if (ext2_imap_valid(wpc, inode, offset)) >> return 0; >> >> - return ext2_iomap_begin(inode, offset, 
inode->i_sb->s_blocksize, >> + EXT2_WPC(wpc)->ib_seq = READ_ONCE(EXT2_I(inode)->ib_seq); >> + >> + ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits, >> + &bno, &new, &boundary, 0); > > This is incorrectly ordered. ext2_get_blocks() bumps ib_seq when it > does allocation, so the newly stored EXT2_WPC(wpc)->ib_seq is > immediately staled and the very next call to ext2_imap_valid() will > fail, causing a new iomap to be mapped even when not necessary. In case of ext2 here, the allocation happens at the write time itself for buffered writes. So what we are essentially doing here (at the time of writeback) is querying ->get_blocks(..., create=0) and passing those no. of blocks (ret) further. So it is unlikely that the block allocations will happen right after we sample ib_seq (unless we race with truncate). For mmapped writes, we expect to find a hole and in case of a hole at the offset, we only pass & cache 1 block in wpc. For mmapped writes case since we go and allocate 1 block, so I agree that the ib_seq will change right after in ->get_blocks. But since in this case we only alloc and cache 1 block, we anyway will have to call ->get_blocks irrespective of ib_seq checking. > > Indeed, we also don't hold the ei->truncate_mutex here, so the > sampling of ib_seq could race with other allocation or truncation > calls on this inode. That's a landmine that will eventually bite > hard, because the sequence number returned with the iomap does not > reflect the state of the extent tree when the iomap is created. > So by sampling the ib_seq before calling ext2_get_blocks(), we ensure any change in block mapping after the sampling gets recorded like race of writeback with truncate (which can therefore change ib_seq value) > If you look at the XFS code, we set iomap->validity_cookie whenever > we call xfs_bmbt_to_iomap() to format the found/allocated extent > into an iomap to return to the caller. 
The sequence number is passed > into that function alongside the extent map by the caller, because > when we format the extent into an iomap the caller has already dropped the > allocation lock. We must sample the sequence number after the allocation > but before we drop the allocation lock so that the sequence number > -exactly- matches the extent tree and the extent map we are going to > format into the iomap, otherwise the extent we map into the iomap > may already be stale and the validity cookie won't tell us that. > > i.e. the coherency relationship between the validity cookie and the > iomap must be set up from a synchronised state. If the sequence > number is allowed to change between mapping the extent and sampling > the sequence number, then the extent we map and cache may already be > invalid and this introduces the possibility that the validity cookie > won't always catch that... > Yes, Dave. Thanks for sharing the details. In the current PoC, I believe it is synchronized w.r.t ib_seq change. >> + if (ret < 0) >> + return ret; >> + /* >> + * ret can be 0 in case of a hole which is possible for mmaped writes. >> + */ >> + ret = ret ? ret : 1; >> + return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits, >> IOMAP_WRITE, &wpc->iomap, NULL); > > So calling ext2_iomap_begin() here might have to change. Indeed, > given that we just allocated the blocks and we know where they are > located, calling ext2_iomap_begin()->ext2_get_blocks() to look them > up again just to format them into the iomap seems .... convoluted. > > It would be better to factor ext2_iomap_begin() into a call to > ext2_get_blocks() and then a "ext2_blocks_to_iomap()" function to > format the returned blocks into the iomap that is returned. Then > ext2_write_map_blocks() can just call ext2_blocks_to_iomap() here > and it skips the unnecessary block mapping lookup.... > > It also means that the ib_seq returned by the allocation can be fed > directly into the iomap->validity_cookie...
> Thanks for your extensive review and sharing more details around this. Let me spend more time to see if this can bring benefit in this case for ext2. -ritesh
On Tue 05-12-23 20:52:26, Ritesh Harjani wrote: > Dave Chinner <david@fromorbit.com> writes: > >> static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc, > >> struct inode *inode, loff_t offset) > >> { > >> - if (offset >= wpc->iomap.offset && > >> - offset < wpc->iomap.offset + wpc->iomap.length) > >> + loff_t maxblocks = (loff_t)INT_MAX; > >> + u8 blkbits = inode->i_blkbits; > >> + u32 bno; > >> + bool new, boundary; > >> + int ret; > >> + > >> + if (ext2_imap_valid(wpc, inode, offset)) > >> return 0; > >> > >> - return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize, > >> + EXT2_WPC(wpc)->ib_seq = READ_ONCE(EXT2_I(inode)->ib_seq); > >> + > >> + ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits, > >> + &bno, &new, &boundary, 0); > > > > This is incorrectly ordered. ext2_get_blocks() bumps ib_seq when it > > does allocation, so the newly stored EXT2_WPC(wpc)->ib_seq is > > immediately staled and the very next call to ext2_imap_valid() will > > fail, causing a new iomap to be mapped even when not necessary. > > In case of ext2 here, the allocation happens at the write time itself > for buffered writes. So what we are essentially doing here (at the time > of writeback) is querying ->get_blocks(..., create=0) and passing those > no. of blocks (ret) further. So it is unlikely that the block > allocations will happen right after we sample ib_seq > (unless we race with truncate). > > For mmapped writes, we expect to find a hole and in case of a hole at > the offset, we only pass & cache 1 block in wpc. > > For mmapped writes case since we go and allocate 1 block, so I agree > that the ib_seq will change right after in ->get_blocks. But since in > this case we only alloc and cache 1 block, we anyway will have to call > ->get_blocks irrespective of ib_seq checking. I agree with your reasoning Ritesh but I guess it would deserve a comment because it is a bit subtle and easily forgotten detail. Honza
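The comment Jan asks for might read something like the following. This is hypothetical wording sketched from Ritesh's explanation above, intended to sit just above the ib_seq sampling in ext2_write_map_blocks():

```c
/*
 * Sample ib_seq before looking up blocks. For buffered writes the
 * blocks were already allocated at write() time, so the create=0
 * ext2_get_blocks() call below does not bump ib_seq itself. A hole
 * (mmap write) leads to a single-block allocation that does bump
 * ib_seq right after this sample, but since only one block gets
 * cached in wpc, the resulting revalidation failure costs nothing:
 * we would have had to call ->get_blocks for the next block anyway.
 * Only a racing truncate invalidates the cached mapping for real,
 * and catching that is exactly what the seq check is for.
 */
```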
diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 4ddc36f4dbd4..ee5cd4a2f24f 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -252,7 +252,7 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) iocb->ki_flags &= ~IOCB_DIRECT; pos = iocb->ki_pos; - status = generic_perform_write(iocb, from); + status = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops); if (unlikely(status < 0)) { ret = status; goto out_unlock; @@ -278,6 +278,22 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) return ret; } +static ssize_t ext2_buffered_write_iter(struct kiocb *iocb, + struct iov_iter *from) +{ + ssize_t ret = 0; + struct inode *inode = file_inode(iocb->ki_filp); + + inode_lock(inode); + ret = generic_write_checks(iocb, from); + if (ret > 0) + ret = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops); + inode_unlock(inode); + if (ret > 0) + ret = generic_write_sync(iocb, ret); + return ret; +} + static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) { #ifdef CONFIG_FS_DAX @@ -299,7 +315,7 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from) if (iocb->ki_flags & IOCB_DIRECT) return ext2_dio_write_iter(iocb, from); - return generic_file_write_iter(iocb, from); + return ext2_buffered_write_iter(iocb, from); } const struct file_operations ext2_file_operations = { diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 464faf6c217e..b6224d94a7dd 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -879,10 +879,14 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length, if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE) && written == 0) return -ENOTBLK; - if (iomap->type == IOMAP_MAPPED && - written < length && - (flags & IOMAP_WRITE)) + if (iomap->type == IOMAP_MAPPED && written < length && + (flags & IOMAP_WRITE)) { ext2_write_failed(inode->i_mapping, offset + length); + return 0; + } + + if (iomap->flags & IOMAP_F_SIZE_CHANGED) + mark_inode_dirty(inode); 
return 0; } @@ -914,6 +918,16 @@ static void ext2_readahead(struct readahead_control *rac) mpage_readahead(rac, ext2_get_block); } +static int ext2_file_read_folio(struct file *file, struct folio *folio) +{ + return iomap_read_folio(folio, &ext2_iomap_ops); +} + +static void ext2_file_readahead(struct readahead_control *rac) +{ + return iomap_readahead(rac, &ext2_iomap_ops); +} + static int ext2_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, struct page **pagep, void **fsdata) @@ -943,12 +957,40 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block) return generic_block_bmap(mapping,block,ext2_get_block); } +static sector_t ext2_file_bmap(struct address_space *mapping, sector_t block) +{ + return iomap_bmap(mapping, block, &ext2_iomap_ops); +} + static int ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) { return mpage_writepages(mapping, wbc, ext2_get_block); } +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc, + struct inode *inode, loff_t offset) +{ + if (offset >= wpc->iomap.offset && + offset < wpc->iomap.offset + wpc->iomap.length) + return 0; + + return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize, + IOMAP_WRITE, &wpc->iomap, NULL); +} + +static const struct iomap_writeback_ops ext2_writeback_ops = { + .map_blocks = ext2_write_map_blocks, +}; + +static int ext2_file_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct iomap_writepage_ctx wpc = { }; + + return iomap_writepages(mapping, wbc, &wpc, &ext2_writeback_ops); +} + static int ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc) { @@ -957,6 +999,20 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc); } +const struct address_space_operations ext2_file_aops = { + .dirty_folio = iomap_dirty_folio, + .release_folio = 
iomap_release_folio, + .invalidate_folio = iomap_invalidate_folio, + .read_folio = ext2_file_read_folio, + .readahead = ext2_file_readahead, + .bmap = ext2_file_bmap, + .direct_IO = noop_direct_IO, + .writepages = ext2_file_writepages, + .migrate_folio = filemap_migrate_folio, + .is_partially_uptodate = iomap_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, +}; + const struct address_space_operations ext2_aops = { .dirty_folio = block_dirty_folio, .invalidate_folio = block_invalidate_folio, @@ -1281,8 +1337,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize) error = dax_truncate_page(inode, newsize, NULL, &ext2_iomap_ops); else - error = block_truncate_page(inode->i_mapping, - newsize, ext2_get_block); + error = iomap_truncate_page(inode, newsize, NULL, + &ext2_iomap_ops); if (error) return error; @@ -1372,7 +1428,7 @@ void ext2_set_file_ops(struct inode *inode) if (IS_DAX(inode)) inode->i_mapping->a_ops = &ext2_dax_aops; else - inode->i_mapping->a_ops = &ext2_aops; + inode->i_mapping->a_ops = &ext2_file_aops; } struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
This patch converts ext2 regular file's buffered-io path to use iomap. - buffered-io path using iomap_file_buffered_write - DIO fallback to buffered-io now uses iomap_file_buffered_write - writeback path now uses a new aops - ext2_file_aops - truncate now uses iomap_truncate_page - mmap path of ext2 continues to use generic_file_vm_ops Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> --- fs/ext2/file.c | 20 +++++++++++++-- fs/ext2/inode.c | 68 ++++++++++++++++++++++++++++++++++++++++++++----- 2 files changed, 80 insertions(+), 8 deletions(-)