Message ID | 20221115013043.360610-6-david@fromorbit.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | xfs, iomap: fix data corrupton due to stale cached iomaps | expand |
> +static int > +xfs_buffered_write_delalloc_scan( As pointed out last time, this is generic iomap (or in fact filemap.c, but maybe start small) logic, so move it there and just pass xfs_buffered_write_delalloc_punch as a callback.
On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > xfs_buffered_write_iomap_end() currently invalidates the page cache > over the unused range of the delalloc extent it allocated. While the > write allocated the delalloc extent, it does not own it exclusively > as the write does not hold any locks that prevent either writeback > or mmap page faults from changing the state of either the page cache > or the extent state backing this range. > > Whilst xfs_bmap_punch_delalloc_range() already handles races in > extent conversion - it will only punch out delalloc extents and it > ignores any other type of extent - the page cache truncate does not > discriminate between data written by this write or some other task. > As a result, truncating the page cache can result in data corruption > if the write races with mmap modifications to the file over the same > range. > > generic/346 exercises this workload, and if we randomly fail writes > (as will happen when iomap gets stale iomap detection later in the > patchset), it will randomly corrupt the file data because it removes > data written by mmap() in the same page as the write() that failed. > > Hence we do not want to punch out the page cache over the range of > the extent we failed to write to - what we actually need to do is > detect the ranges that have dirty data in cache over them and *not > punch them out*. > > TO do this, we have to walk the page cache over the range of the > delalloc extent we want to remove. This is made complex by the fact > we have to handle partially up-to-date folios correctly and this can > happen even when the FSB size == PAGE_SIZE because we now support > multi-page folios in the page cache. > > Because we are only interested in discovering the edges of data > ranges in the page cache (i.e. hole-data boundaries) we can make use > of mapping_seek_hole_data() to find those transitions in the page > cache. As we hold the invalidate_lock, we know that the boundaries > are not going to change while we walk the range. This interface is > also byte-based and is sub-page block aware, so we can find the data > ranges in the cache based on byte offsets rather than page, folio or > fs block sized chunks. This greatly simplifies the logic of finding > dirty cached ranges in the page cache. > > Once we've identified a range that contains cached data, we can then > iterate the range folio by folio. This allows us to determine if the > data is dirty and hence perform the correct delalloc extent punching > operations. The seek interface we use to iterate data ranges will > give us sub-folio start/end granularity, so we may end up looking up > the same folio multiple times as the seek interface iterates across > each discontiguous data region in the folio. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 141 insertions(+), 10 deletions(-) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 7bb55dbc19d3..2d48fcc7bd6f 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch( > end_fsb - start_fsb); > } > > +/* > + * Scan the data range passed to us for dirty page cache folios. If we find a > + * dirty folio, punch out the preceeding range and update the offset from which > + * the next punch will start from. > + * > + * We can punch out clean pages because they either contain data that has been > + * written back - in which case the delalloc punch over that range is a no-op - > + * or they have been read faults in which case they contain zeroes and we can > + * remove the delalloc backing range and any new writes to those pages will do > + * the normal hole filling operation... > + * > + * This makes the logic simple: we only need to keep the delalloc extents only > + * over the dirty ranges of the page cache. > + */ > +static int > +xfs_buffered_write_delalloc_scan( > + struct inode *inode, > + loff_t *punch_start_byte, > + loff_t start_byte, > + loff_t end_byte) > +{ > + loff_t offset = start_byte; > + > + while (offset < end_byte) { > + struct folio *folio; > + > + /* grab locked page */ > + folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT); > + if (!folio) { > + offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE; > + continue; > + } > + > + /* if dirty, punch up to offset */ > + if (folio_test_dirty(folio)) { > + if (offset > *punch_start_byte) { > + int error; > + > + error = xfs_buffered_write_delalloc_punch(inode, > + *punch_start_byte, offset); This sounds an awful lot like what iomap_writeback_ops.discard_folio() does, albeit without the xfs_alert screaming everywhere. Moving along... so we punch out delalloc reservations for any part of the page cache that is clean. "punch_start_byte" is the start pos of the last range of clean cache, and we want to punch up to the start of this dirty folio... > + if (error) { > + folio_unlock(folio); > + folio_put(folio); > + return error; > + } > + } > + > + /* > + * Make sure the next punch start is correctly bound to > + * the end of this data range, not the end of the folio. > + */ > + *punch_start_byte = min_t(loff_t, end_byte, > + folio_next_index(folio) << PAGE_SHIFT); ...and then start a new clean range after this folio (or at the end_byte to signal that we're done)... > + } > + > + /* move offset to start of next folio in range */ > + offset = folio_next_index(folio) << PAGE_SHIFT; > + folio_unlock(folio); > + folio_put(folio); > + } > + return 0; > +} > + > +/* > + * Punch out all the delalloc blocks in the range given except for those that > + * have dirty data still pending in the page cache - those are going to be > + * written and so must still retain the delalloc backing for writeback. > + * > + * As we are scanning the page cache for data, we don't need to reimplement the > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the > + * start and end of data ranges correctly even for sub-folio block sizes. This > + * byte range based iteration is especially convenient because it means we don't > + * have to care about variable size folios, nor where the start or end of the > + * data range lies within a folio, if they lie within the same folio or even if > + * there are multiple discontiguous data ranges within the folio. > + */ > +static int > +xfs_buffered_write_delalloc_release( > + struct inode *inode, > + loff_t start_byte, > + loff_t end_byte) > +{ > + loff_t punch_start_byte = start_byte; > + int error = 0; > + > + /* > + * Lock the mapping to avoid races with page faults re-instantiating > + * folios and dirtying them via ->page_mkwrite whilst we walk the > + * cache and perform delalloc extent removal. Failing to do this can > + * leave dirty pages with no space reservation in the cache. > + */ > + filemap_invalidate_lock(inode->i_mapping); > + while (start_byte < end_byte) { > + loff_t data_end; > + > + start_byte = mapping_seek_hole_data(inode->i_mapping, > + start_byte, end_byte, SEEK_DATA); > + /* > + * If there is no more data to scan, all that is left is to > + * punch out the remaining range. > + */ > + if (start_byte == -ENXIO || start_byte == end_byte) > + break; > + if (start_byte < 0) { > + error = start_byte; > + goto out_unlock; > + } > + ASSERT(start_byte >= punch_start_byte); > + ASSERT(start_byte < end_byte); > + > + /* > + * We find the end of this contiguous cached data range by > + * seeking from start_byte to the beginning of the next hole. > + */ > + data_end = mapping_seek_hole_data(inode->i_mapping, start_byte, > + end_byte, SEEK_HOLE); > + if (data_end < 0) { > + error = data_end; > + goto out_unlock; > + } > + ASSERT(data_end > start_byte); > + ASSERT(data_end <= end_byte); > + > + error = xfs_buffered_write_delalloc_scan(inode, > + &punch_start_byte, start_byte, data_end); ...and we use SEEK_HOLE/SEEK_DATA to find the ranges of pagecache where there's even going to be folios mapped. But in structuring the code like this, xfs now has to know details about folio state again, and the point of iomap/buffered-io.c is to delegate handling of the pagecache away from XFS, at least for filesystems that want to manage buffered IO the same way XFS does. IOWs, I agree with Christoph that these two functions that compute the ranges that need delalloc-punching really ought to be in the iomap code. TBH I wonder if this isn't better suited for being called by iomap_write_iter after a short write on an IOMAP_F_NEW iomap, with an function pointer in iomap page_ops for iomap to tell xfs to drop the delalloc reservations. --D > + if (error) > + goto out_unlock; > + > + /* The next data search starts at the end of this one. */ > + start_byte = data_end; > + } > + > + if (punch_start_byte <1 end_byte) > + error = xfs_buffered_write_delalloc_punch(inode, > + punch_start_byte, end_byte); > +out_unlock: > + filemap_invalidate_unlock(inode->i_mapping); > + return error; > +} > + > static int > xfs_buffered_write_iomap_end( > struct inode *inode, > @@ -1179,16 +1319,7 @@ xfs_buffered_write_iomap_end( > if (start_byte >= end_byte) > return 0; > > - /* > - * Lock the mapping to avoid races with page faults re-instantiating > - * folios and dirtying them via ->page_mkwrite between the page cache > - * truncation and the delalloc extent removal. Failing to do this can > - * leave dirty pages with no space reservation in the cache. > - */ > - filemap_invalidate_lock(inode->i_mapping); > - truncate_pagecache_range(inode, start_byte, end_byte - 1); > - error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte); > - filemap_invalidate_unlock(inode->i_mapping); > + error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte); > if (error && !xfs_is_shutdown(mp)) { > xfs_alert(mp, "%s: unable to clean up ino 0x%llx", > __func__, XFS_I(inode)->i_ino); > -- > 2.37.2 >
On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > ... > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 141 insertions(+), 10 deletions(-) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 7bb55dbc19d3..2d48fcc7bd6f 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch( > end_fsb - start_fsb); > } > ... > +/* > + * Punch out all the delalloc blocks in the range given except for those that > + * have dirty data still pending in the page cache - those are going to be > + * written and so must still retain the delalloc backing for writeback. > + * > + * As we are scanning the page cache for data, we don't need to reimplement the > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the > + * start and end of data ranges correctly even for sub-folio block sizes. This > + * byte range based iteration is especially convenient because it means we don't > + * have to care about variable size folios, nor where the start or end of the > + * data range lies within a folio, if they lie within the same folio or even if > + * there are multiple discontiguous data ranges within the folio. > + */ > +static int > +xfs_buffered_write_delalloc_release( > + struct inode *inode, > + loff_t start_byte, > + loff_t end_byte) > +{ > + loff_t punch_start_byte = start_byte; > + int error = 0; > + > + /* > + * Lock the mapping to avoid races with page faults re-instantiating > + * folios and dirtying them via ->page_mkwrite whilst we walk the > + * cache and perform delalloc extent removal. Failing to do this can > + * leave dirty pages with no space reservation in the cache. > + */ > + filemap_invalidate_lock(inode->i_mapping); > + while (start_byte < end_byte) { > + loff_t data_end; > + > + start_byte = mapping_seek_hole_data(inode->i_mapping, > + start_byte, end_byte, SEEK_DATA); FWIW, the fact that mapping seek data is based on uptodate status means that seek behavior can change based on prior reads. For example, see how seek hole/data presents reads of unwritten ranges as data [1]. The same thing isn't observable for holes because iomap doesn't check the mapping in that case, but underlying iop state is the same and that is what this code is looking at. The filtering being done here means we essentially only care about dirty pages backed by delalloc blocks. That means if you get here with a dirty page and the portion of the page affected by this failed write is uptodate, this won't punch an underlying delalloc block even though nothing else may have written to it in the meantime. That sort of state can be created by a prior read of the range on a sub-page block size fs, or perhaps a racing async readahead (via read fault of a lower offset..?), etc. I suspect this is not a serious error because the page is dirty and writeback will thus convert the block. The only exception to that I can see is if the block is beyond EOF (consider a mapped read to a page that straddles EOF, followed by a post-eof write that fails), writeback won't actually map the block directly. It may convert if contiguous with delalloc blocks inside EOF (and sufficiently sized physical extents exist), or even if not, should still otherwise be cleaned up by the various other means we already have to manage post-eof blocks. So IOW there's a tradeoff being made here for possible spurious allocation and I/O and a subtle dependency on writeback that should probably be documented somewhere. The larger concern is that if writeback eventually changes based on dirty range tracking in a way that breaks this dependency, that introduces yet another stale delalloc block landmine associated with this error handling code (regardless of whether you want to call that a bug in this code, seek data, whatever), and those problems are difficult enough to root cause as it is. Brian [1] # xfs_io -fc "falloc 0 4k" -c "seek -a 0" -c "pread 0 4k" -c "seek -a 0" <file> Whence Result HOLE 0 read 4096/4096 bytes at offset 0 4 KiB, 4 ops; 0.0000 sec (156 MiB/sec and 160000.0000 ops/sec) Whence Result DATA 0 HOLE 4096 > + /* > + * If there is no more data to scan, all that is left is to > + * punch out the remaining range. > + */ > + if (start_byte == -ENXIO || start_byte == end_byte) > + break; > + if (start_byte < 0) { > + error = start_byte; > + goto out_unlock; > + } > + ASSERT(start_byte >= punch_start_byte); > + ASSERT(start_byte < end_byte); > + > + /* > + * We find the end of this contiguous cached data range by > + * seeking from start_byte to the beginning of the next hole. > + */ > + data_end = mapping_seek_hole_data(inode->i_mapping, start_byte, > + end_byte, SEEK_HOLE); > + if (data_end < 0) { > + error = data_end; > + goto out_unlock; > + } > + ASSERT(data_end > start_byte); > + ASSERT(data_end <= end_byte); > + > + error = xfs_buffered_write_delalloc_scan(inode, > + &punch_start_byte, start_byte, data_end); > + if (error) > + goto out_unlock; > + > + /* The next data search starts at the end of this one. */ > + start_byte = data_end; > + } > + > + if (punch_start_byte < end_byte) > + error = xfs_buffered_write_delalloc_punch(inode, > + punch_start_byte, end_byte); > +out_unlock: > + filemap_invalidate_unlock(inode->i_mapping); > + return error; > +} > + > static int > xfs_buffered_write_iomap_end( > struct inode *inode, > @@ -1179,16 +1319,7 @@ xfs_buffered_write_iomap_end( > if (start_byte >= end_byte) > return 0; > > - /* > - * Lock the mapping to avoid races with page faults re-instantiating > - * folios and dirtying them via ->page_mkwrite between the page cache > - * truncation and the delalloc extent removal. Failing to do this can > - * leave dirty pages with no space reservation in the cache. > - */ > - filemap_invalidate_lock(inode->i_mapping); > - truncate_pagecache_range(inode, start_byte, end_byte - 1); > - error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte); > - filemap_invalidate_unlock(inode->i_mapping); > + error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte); > if (error && !xfs_is_shutdown(mp)) { > xfs_alert(mp, "%s: unable to clean up ino 0x%llx", > __func__, XFS_I(inode)->i_ino); > -- > 2.37.2 > >
On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote: > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > ... > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++--- > > 1 file changed, 141 insertions(+), 10 deletions(-) > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > index 7bb55dbc19d3..2d48fcc7bd6f 100644 > > --- a/fs/xfs/xfs_iomap.c > > +++ b/fs/xfs/xfs_iomap.c > > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch( > > end_fsb - start_fsb); > > } > > > ... > > +/* > > + * Punch out all the delalloc blocks in the range given except for those that > > + * have dirty data still pending in the page cache - those are going to be > > + * written and so must still retain the delalloc backing for writeback. > > + * > > + * As we are scanning the page cache for data, we don't need to reimplement the > > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the > > + * start and end of data ranges correctly even for sub-folio block sizes. This > > + * byte range based iteration is especially convenient because it means we don't > > + * have to care about variable size folios, nor where the start or end of the > > + * data range lies within a folio, if they lie within the same folio or even if > > + * there are multiple discontiguous data ranges within the folio. > > + */ > > +static int > > +xfs_buffered_write_delalloc_release( > > + struct inode *inode, > > + loff_t start_byte, > > + loff_t end_byte) > > +{ > > + loff_t punch_start_byte = start_byte; > > + int error = 0; > > + > > + /* > > + * Lock the mapping to avoid races with page faults re-instantiating > > + * folios and dirtying them via ->page_mkwrite whilst we walk the > > + * cache and perform delalloc extent removal. Failing to do this can > > + * leave dirty pages with no space reservation in the cache. > > + */ > > + filemap_invalidate_lock(inode->i_mapping); > > + while (start_byte < end_byte) { > > + loff_t data_end; > > + > > + start_byte = mapping_seek_hole_data(inode->i_mapping, > > + start_byte, end_byte, SEEK_DATA); > > FWIW, the fact that mapping seek data is based on uptodate status means > that seek behavior can change based on prior reads. Yup. It should be obvious that any page cache scan based algorithm will change based on changing page cache residency. > For example, see how > seek hole/data presents reads of unwritten ranges as data [1]. The same > thing isn't observable for holes because iomap doesn't check the mapping > in that case, but underlying iop state is the same and that is what this > code is looking at. Well, yes. That's the fundamental, underlying issue that this patchset is addressing for the write() operation: that the page cache contents and the underlying filesystem extent map are not guaranteed to be coherent and can be changed independently of each other. The whole problem with looking exclusively at filesystem level extent state (and hence FIEMAP) is that the extent state doesn't tell us whether the is uncommitted data over the range of the extent in the page cache. The filesystem extent state and page cache data *can't be coherent* in a writeback caching environment. This is the fundamental difference between what the filesystem extent map tells us (FIEMAP) and what querying the page cache tells us (SEEK_DATA/SEEK_HOLE). This is also the underlying problem with iomap_truncate_page() - it fails to query the page cache for data over unwritten extents, so fails to zero the post-EOF part of dirty folios over unwritten extents and so it all goes wrong... > The filtering being done here means we essentially only care about dirty > pages backed by delalloc blocks. That means if you get here with a dirty > page and the portion of the page affected by this failed write is > uptodate, this won't punch an underlying delalloc block even though > nothing else may have written to it in the meantime. Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc iomap will not span ranges that have pre-existing *dirty* data in the page cache. Those *must* already have (del)allocated extents, hence the iomap for the newly allocated delalloc extent will always end before pre-existing dirty data in the page cache starts. Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map precludes stepping into ranges that have pre-existing cached dirty data. We also can't get a racing write() to the same range right now because this is all under IOLOCK_EXCL, hence we only ever see dirty folios as a result of race with page faults. page faults zero the entire folio they insert into the page cache and iomap_folio_mkwrite_iter() asserts that the entire folio is marked up to date. Hence if we find a dirty folio outside the range the write() dirtied, we are guaranteed that the entire dirty folio is up to date.... Yes, there can be pre-existing *clean* folios (and clean partially up to date folios) in the page cache, but we won't have dirty partially up to date pages in the middle of the range we are scanning. Hence we only need to care about the edge cases (folios that overlap start and ends). We skip the partially written start block, and we always punch up to the end block if it is different from the last block we punched up to. If the end of the data spans into a dirty folio, we know that dirty range is up to date because the seek scan only returns ranges that are up to date. Hence we don't punch those partial blocks out.... Regardless, let's assume we have a racing write that has partially updated and dirtied a folio (because we've moved to XFS_IOLOCK_SHARED locking for writes). This case is already handled by the mapping_seek_hole_data() based iteration. That is, the mapping_seek_hole_data() loop provides us with *discrete ranges of up to date data* that are independent of folio size, up-to-date range granularity, dirty range tracking, filesystem block size, etc. Hence if the next discrete range we discover is in the same dirty folio as the previous discrete range of up to date data, we know we have a sub-folio sized hole in the data that is not up to date. Because there is no data over this range, we have to punch out the underlying delalloc extent over that range. IOWs, the dirty state of the folio and/or the granularity of the dirty range tracking is irrelevant here - we know there was no data in the cache (dlean or dirty) over this range because it is discontiguous with the previous range of data returned. IOWs, if we have this "up to date" map on a dirty folio like this: Data +-------+UUUUUUU+-------+UUUUUUU+-------+ Extent map +DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+ Then the unrolled iteration and punching we do would look like this: First iteration of the range: punch_start: V +-------+UUUUUUU+-------+UUUUUUU+-------+ SEEK_DATA: V +-------+UUUUUUU+-------+UUUUUUU+-------+ SEEK_HOLE: ^ Data range: +UUUUUUU+ Punch range: +-------+ Extent map: +-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+ Second iteration: punch_start V +-------+UUUUUUU+-------+UUUUUUU+-------+ SEEK_DATA: V +-------+UUUUUUU+-------+UUUUUUU+-------+ SEEK_HOLE: ^ Data range: +UUUUUUU+ Punch range: +-------+ Extent map: +-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+ Third iteration: punch_start V +-------+UUUUUUU+-------+UUUUUUU+-------+ SEEK_DATA: - moves into next folio in cache .... Punch range: +-------+ ...... Extent map: +-------+DDDDDDD+-------+DDDDDDD+-------+ ...... (to end of scan range or start of next data) As you can see, this scan does not care about folio size, sub-folio range granularity or filesystem block sizes. It also matches exactly how writeback handles dirty, partially up to date folios, so there's no stray delalloc blocks left around to be tripped over after failed or short writes occur. Indeed, if we move to sub-folio dirty range tracking, we can simply add a mapping_seek_hole_data() variant that walks dirty ranges in the page cache rather than up to date ranges. Then we can remove the inner loop from this code that looks up folios to determine dirty state. The above algorithm does not change - we just walk from discrete range to discrete range punching the gaps between them.... IOWs, the algorithm is largely future proof - the only thing that needs to change if we change iomap to track sub-folio dirty ranges is how we check the data range for being dirty. That should be no surprise, really, the surprise should be that we can make some simple mods to page cache seek to remove the need for checking dirty state in this code altogether.... > That sort of state > can be created by a prior read of the range on a sub-page block size fs, > or perhaps a racing async readahead (via read fault of a lower > offset..?), etc. Yup, generic/346 exercises this racing unaligned, sub-folio mmap write vs write() case. This test, specifically, was the reason I moved to using mapping_seek_hole_data() - g/346 found an endless stream of bugs in the sub-multi-page-folio range iteration code I kept trying to write.... > I suspect this is not a serious error because the page is dirty > and writeback will thus convert the block. The only exception to > that I can see is if the block is beyond EOF (consider a mapped > read to a page that straddles EOF, followed by a post-eof write > that fails), writeback won't actually map the block directly. I don't think that can happen. iomap_write_failed() calls truncate_pagecache_range() to remove any newly instantiated cached data beyond the original EOF. Hence the delalloc punch will remove everything beyond the original EOF that was allocated for the failed write. Hence when we get to writeback we're not going to find any up-to-date data beyond the EOF block in the page cache, nor any stray delalloc blocks way beyond EOF.... > It may convert if contiguous with delalloc blocks inside EOF (and > sufficiently sized physical extents exist), or even if not, should > still otherwise be cleaned up by the various other means we > already have to manage post-eof blocks. > > So IOW there's a tradeoff being made here for possible spurious > allocation and I/O and a subtle dependency on writeback that > should probably be documented somewhere. As per above, I don't think there is any spurious/stale allocation left behind by the punch code, nor is there any dependency on writeback to ignore it such issues. > The larger concern is that if > writeback eventually changes based on dirty range tracking in a way that > breaks this dependency, that introduces yet another stale delalloc block > landmine associated with this error handling code (regardless of whether > you want to call that a bug in this code, seek data, whatever), and > those problems are difficult enough to root cause as it is. If iomap changes how it tracks dirty ranges, this punch code only needs small changes to work with that correctly. There aren't any unknown landmines here - if we change dirty tracking, we know that we have to update the code that depends on the existing dirty tracking mechanisms to work correctly with the new infrastructure... Cheers, Dave.
On Tue, Nov 15, 2022 at 04:48:58PM -0800, Darrick J. Wong wrote: > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote: > > +/* > > + * Scan the data range passed to us for dirty page cache folios. If we find a > > + * dirty folio, punch out the preceeding range and update the offset from which > > + * the next punch will start from. > > + * > > + * We can punch out clean pages because they either contain data that has been > > + * written back - in which case the delalloc punch over that range is a no-op - > > + * or they have been read faults in which case they contain zeroes and we can > > + * remove the delalloc backing range and any new writes to those pages will do > > + * the normal hole filling operation... > > + * > > + * This makes the logic simple: we only need to keep the delalloc extents only > > + * over the dirty ranges of the page cache. > > + */ > > +static int > > +xfs_buffered_write_delalloc_scan( > > + struct inode *inode, > > + loff_t *punch_start_byte, > > + loff_t start_byte, > > + loff_t end_byte) > > +{ > > + loff_t offset = start_byte; > > + > > + while (offset < end_byte) { > > + struct folio *folio; > > + > > + /* grab locked page */ > > + folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT); > > + if (!folio) { > > + offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE; > > + continue; > > + } > > + > > + /* if dirty, punch up to offset */ > > + if (folio_test_dirty(folio)) { > > + if (offset > *punch_start_byte) { > > + int error; > > + > > + error = xfs_buffered_write_delalloc_punch(inode, > > + *punch_start_byte, offset); > > This sounds an awful lot like what iomap_writeback_ops.discard_folio() > does, albeit without the xfs_alert screaming everywhere. similar, but .discard_folio() is trashing uncommitted written data, whilst this loop is explicitly preserving uncommitted written data.... > Moving along... so we punch out delalloc reservations for any part of > the page cache that is clean. "punch_start_byte" is the start pos of > the last range of clean cache, and we want to punch up to the start of > this dirty folio... > > > + if (error) { > > + folio_unlock(folio); > > + folio_put(folio); > > + return error; > > + } > > + } > > + > > + /* > > + * Make sure the next punch start is correctly bound to > > + * the end of this data range, not the end of the folio. > > + */ > > + *punch_start_byte = min_t(loff_t, end_byte, > > + folio_next_index(folio) << PAGE_SHIFT); > > ...and then start a new clean range after this folio (or at the end_byte > to signal that we're done)... Yes. > > + filemap_invalidate_lock(inode->i_mapping); > > + while (start_byte < end_byte) { > > + loff_t data_end; > > + > > + start_byte = mapping_seek_hole_data(inode->i_mapping, > > + start_byte, end_byte, SEEK_DATA); > > + /* > > + * If there is no more data to scan, all that is left is to > > + * punch out the remaining range. > > + */ > > + if (start_byte == -ENXIO || start_byte == end_byte) > > + break; > > + if (start_byte < 0) { > > + error = start_byte; > > + goto out_unlock; > > + } > > + ASSERT(start_byte >= punch_start_byte); > > + ASSERT(start_byte < end_byte); > > + > > + /* > > + * We find the end of this contiguous cached data range by > > + * seeking from start_byte to the beginning of the next hole. > > + */ > > + data_end = mapping_seek_hole_data(inode->i_mapping, start_byte, > > + end_byte, SEEK_HOLE); > > + if (data_end < 0) { > > + error = data_end; > > + goto out_unlock; > > + } > > + ASSERT(data_end > start_byte); > > + ASSERT(data_end <= end_byte); > > + > > + error = xfs_buffered_write_delalloc_scan(inode, > > + &punch_start_byte, start_byte, data_end); > > ...and we use SEEK_HOLE/SEEK_DATA to find the ranges of pagecache where > there's even going to be folios mapped. But in structuring the code > like this, xfs now has to know details about folio state again, and the > point of iomap/buffered-io.c is to delegate handling of the pagecache > away from XFS, at least for filesystems that want to manage buffered IO > the same way XFS does. > > IOWs, I agree with Christoph that these two functions that compute the > ranges that need delalloc-punching really ought to be in the iomap code. Which I've already done. > TBH I wonder if this isn't better suited for being called by > iomap_write_iter after a short write on an IOMAP_F_NEW iomap, with an > function pointer in iomap page_ops for iomap to tell xfs to drop the > delalloc reservations. IIRC, the whole point of the iomap_begin()/iomap_end() operation pairs is that the iter functions don't need to concern themselves with the filesystem operations needed to manipulate extents and clean up filesystem state as a result of failed operations. I'm pretty sure we need this same error handling for zeroing and unshare operations, because they could also detect stale cached iomaps and have to go remap their extents. Maybe they don't allocate new extents, but that is beside the point - the error handling that is necessary is common to multiple functions, and right now they don't have to care about cleaning up iomap/filesystem state when short writes/errors occur. Hence I don't think we want to break the architectural layering by doing this - I think it would just lead to having to handle the same issues in multiple places, and we don't gain any extra control over or information about how to perform the cleanup of the unused portion of the iomap.... Cheers, Dave.
On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote: > On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote: > > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > ... > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++--- > > > 1 file changed, 141 insertions(+), 10 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > index 7bb55dbc19d3..2d48fcc7bd6f 100644 > > > --- a/fs/xfs/xfs_iomap.c > > > +++ b/fs/xfs/xfs_iomap.c > > > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch( > > > end_fsb - start_fsb); > > > } > > > > > ... > > > +/* > > > + * Punch out all the delalloc blocks in the range given except for those that > > > + * have dirty data still pending in the page cache - those are going to be > > > + * written and so must still retain the delalloc backing for writeback. > > > + * > > > + * As we are scanning the page cache for data, we don't need to reimplement the > > > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the > > > + * start and end of data ranges correctly even for sub-folio block sizes. This > > > + * byte range based iteration is especially convenient because it means we don't > > > + * have to care about variable size folios, nor where the start or end of the > > > + * data range lies within a folio, if they lie within the same folio or even if > > > + * there are multiple discontiguous data ranges within the folio. > > > + */ > > > +static int > > > +xfs_buffered_write_delalloc_release( > > > + struct inode *inode, > > > + loff_t start_byte, > > > + loff_t end_byte) > > > +{ > > > + loff_t punch_start_byte = start_byte; > > > + int error = 0; > > > + > > > + /* > > > + * Lock the mapping to avoid races with page faults re-instantiating > > > + * folios and dirtying them via ->page_mkwrite whilst we walk the > > > + * cache and perform delalloc extent removal. Failing to do this can > > > + * leave dirty pages with no space reservation in the cache. > > > + */ > > > + filemap_invalidate_lock(inode->i_mapping); > > > + while (start_byte < end_byte) { > > > + loff_t data_end; > > > + > > > + start_byte = mapping_seek_hole_data(inode->i_mapping, > > > + start_byte, end_byte, SEEK_DATA); > > > > FWIW, the fact that mapping seek data is based on uptodate status means > > that seek behavior can change based on prior reads. > > Yup. It should be obvious that any page cache scan based algorithm > will change based on changing page cache residency. > > > For example, see how > > seek hole/data presents reads of unwritten ranges as data [1]. The same > > thing isn't observable for holes because iomap doesn't check the mapping > > in that case, but underlying iop state is the same and that is what this > > code is looking at. > > Well, yes. That's the fundamental, underlying issue that this > patchset is addressing for the write() operation: that the page > cache contents and the underlying filesystem extent map are not > guaranteed to be coherent and can be changed independently of each > other. > > The whole problem with looking exclusively at filesystem level > extent state (and hence FIEMAP) is that the extent state doesn't > tell us whether the is uncommitted data over the range of the extent > in the page cache. The filesystem extent state and page cache data > *can't be coherent* in a writeback caching environment. This is the > fundamental difference between what the filesystem extent map tells > us (FIEMAP) and what querying the page cache tells us > (SEEK_DATA/SEEK_HOLE). > > This is also the underlying problem with iomap_truncate_page() - it > fails to query the page cache for data over unwritten extents, so > fails to zero the post-EOF part of dirty folios over unwritten > extents and so it all goes wrong... > > > The filtering being done here means we essentially only care about dirty > > pages backed by delalloc blocks. That means if you get here with a dirty > > page and the portion of the page affected by this failed write is > > uptodate, this won't punch an underlying delalloc block even though > > nothing else may have written to it in the meantime. > > Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc iomap > will not span ranges that have pre-existing *dirty* data in the > page cache. Those *must* already have (del)allocated extents, hence > the iomap for the newly allocated delalloc extent will always end > before pre-existing dirty data in the page cache starts. > > Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map > precludes stepping into ranges that have pre-existing cached dirty > data. > > We also can't get a racing write() to the same range right now > because this is all under IOLOCK_EXCL, hence we only ever see dirty > folios as a result of race with page faults. page faults zero the > entire folio they insert into the page cache and > iomap_folio_mkwrite_iter() asserts that the entire folio is marked > up to date. Hence if we find a dirty folio outside the range the > write() dirtied, we are guaranteed that the entire dirty folio is up > to date.... > > Yes, there can be pre-existing *clean* folios (and clean partially > up to date folios) in the page cache, but we won't have dirty > partially up to date pages in the middle of the range we are > scanning. Hence we only need to care about the edge cases (folios > that overlap start and ends). We skip the partially written start > block, and we always punch up to the end block if it is different > from the last block we punched up to. If the end of the data spans > into a dirty folio, we know that dirty range is up to date because > the seek scan only returns ranges that are up to date. Hence we > don't punch those partial blocks out.... > > Regardless, let's assume we have a racing write that has partially > updated and dirtied a folio (because we've moved to > XFS_IOLOCK_SHARED locking for writes). This case is already handled > by the mapping_seek_hole_data() based iteration. > > That is, the mapping_seek_hole_data() loop provides us with > *discrete ranges of up to date data* that are independent of folio > size, up-to-date range granularity, dirty range tracking, filesystem > block size, etc. > > Hence if the next discrete range we discover is in the same dirty > folio as the previous discrete range of up to date data, we know we > have a sub-folio sized hole in the data that is not up to date. > Because there is no data over this range, we have to punch out the > underlying delalloc extent over that range. > > IOWs, the dirty state of the folio and/or the granularity of the > dirty range tracking is irrelevant here - we know there was no data > in the cache (dlean or dirty) over this range because it is > discontiguous with the previous range of data returned. > > IOWs, if we have this "up to date" map on a dirty folio like this: > > Data +-------+UUUUUUU+-------+UUUUUUU+-------+ > Extent map +DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+ > > Then the unrolled iteration and punching we do would look like this: > > First iteration of the range: > > punch_start: > V > +-------+UUUUUUU+-------+UUUUUUU+-------+ > > SEEK_DATA: V > +-------+UUUUUUU+-------+UUUUUUU+-------+ > SEEK_HOLE: ^ > Data range: +UUUUUUU+ > Punch range: +-------+ > Extent map: +-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+ > > Second iteration: > > punch_start V > +-------+UUUUUUU+-------+UUUUUUU+-------+ > SEEK_DATA: V > +-------+UUUUUUU+-------+UUUUUUU+-------+ > SEEK_HOLE: ^ > Data range: +UUUUUUU+ > Punch range: +-------+ > Extent map: +-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+ > > Third iteration: > > punch_start V > +-------+UUUUUUU+-------+UUUUUUU+-------+ > SEEK_DATA: - moves into next folio in cache > .... > Punch range: +-------+ ...... > Extent map: +-------+DDDDDDD+-------+DDDDDDD+-------+ ...... > (to end of scan range or start of next data) > > As you can see, this scan does not care about folio size, sub-folio > range granularity or filesystem block sizes. It also matches > exactly how writeback handles dirty, partially up to date folios, so > there's no stray delalloc blocks left around to be tripped over > after failed or short writes occur. I had been wondering about the particular case of dealing with partially uptodate folios, so it was really helpful to watch this all in ASCIIVision(tm). > Indeed, if we move to sub-folio dirty range tracking, we can simply > add a mapping_seek_hole_data() variant that walks dirty ranges in > the page cache rather than up to date ranges. Then we can remove the > inner loop from this code that looks up folios to determine dirty > state. The above algorithm does not change - we just walk from > discrete range to discrete range punching the gaps between them.... > > IOWs, the algorithm is largely future proof - the only thing that > needs to change if we change iomap to track sub-folio dirty ranges > is how we check the data range for being dirty. That should be no > surprise, really, the surprise should be that we can make some > simple mods to page cache seek to remove the need for checking dirty > state in this code altogether.... > > > That sort of state > > can be created by a prior read of the range on a sub-page block size fs, > > or perhaps a racing async readahead (via read fault of a lower > > offset..?), etc. > > Yup, generic/346 exercises this racing unaligned, sub-folio mmap > write vs write() case. This test, specifically, was the reason I > moved to using mapping_seek_hole_data() - g/346 found an endless > stream of bugs in the sub-multi-page-folio range iteration code I > kept trying to write.... > > > I suspect this is not a serious error because the page is dirty > > and writeback will thus convert the block. The only exception to > > that I can see is if the block is beyond EOF (consider a mapped > > read to a page that straddles EOF, followed by a post-eof write > > that fails), writeback won't actually map the block directly. > > I don't think that can happen. iomap_write_failed() calls > truncate_pagecache_range() to remove any newly instantiated cached > data beyond the original EOF. Hence the delalloc punch will remove > everything beyond the original EOF that was allocated for the failed > write. Hence when we get to writeback we're not going to find any > up-to-date data beyond the EOF block in the page cache, nor any > stray delalloc blocks way beyond EOF.... > > > It may convert if contiguous with delalloc blocks inside EOF (and > > sufficiently sized physical extents exist), or even if not, should > > still otherwise be cleaned up by the various other means we > > already have to manage post-eof blocks. > > > > So IOW there's a tradeoff being made here for possible spurious > > allocation and I/O and a subtle dependency on writeback that > > should probably be documented somewhere. > > As per above, I don't think there is any spurious/stale allocation > left behind by the punch code, nor is there any dependency on > writeback to ignore it such issues. > > > The larger concern is that if > > writeback eventually changes based on dirty range tracking in a way that > > breaks this dependency, that introduces yet another stale delalloc block > > landmine associated with this error handling code (regardless of whether > > you want to call that a bug in this code, seek data, whatever), and > > those problems are difficult enough to root cause as it is. > > If iomap changes how it tracks dirty ranges, this punch code only > needs small changes to work with that correctly. There aren't any > unknown landmines here - if we change dirty tracking, we know that > we have to update the code that depends on the existing dirty > tracking mechanisms to work correctly with the new infrastructure... I hope that anyone trying to change the iomap dirty tracking code will find it easier to do with the iteration code in buffered-io.c. Thanks for taking the time to rework that part for v3. --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote: > On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote: > > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > ... > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++--- > > > 1 file changed, 141 insertions(+), 10 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > index 7bb55dbc19d3..2d48fcc7bd6f 100644 > > > --- a/fs/xfs/xfs_iomap.c > > > +++ b/fs/xfs/xfs_iomap.c > > > @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch( > > > end_fsb - start_fsb); > > > } > > > > > ... > > > +/* > > > + * Punch out all the delalloc blocks in the range given except for those that > > > + * have dirty data still pending in the page cache - those are going to be > > > + * written and so must still retain the delalloc backing for writeback. > > > + * > > > + * As we are scanning the page cache for data, we don't need to reimplement the > > > + * wheel - mapping_seek_hole_data() does exactly what we need to identify the > > > + * start and end of data ranges correctly even for sub-folio block sizes. This > > > + * byte range based iteration is especially convenient because it means we don't > > > + * have to care about variable size folios, nor where the start or end of the > > > + * data range lies within a folio, if they lie within the same folio or even if > > > + * there are multiple discontiguous data ranges within the folio. > > > + */ > > > +static int > > > +xfs_buffered_write_delalloc_release( > > > + struct inode *inode, > > > + loff_t start_byte, > > > + loff_t end_byte) > > > +{ > > > + loff_t punch_start_byte = start_byte; > > > + int error = 0; > > > + > > > + /* > > > + * Lock the mapping to avoid races with page faults re-instantiating > > > + * folios and dirtying them via ->page_mkwrite whilst we walk the > > > + * cache and perform delalloc extent removal. Failing to do this can > > > + * leave dirty pages with no space reservation in the cache. > > > + */ > > > + filemap_invalidate_lock(inode->i_mapping); > > > + while (start_byte < end_byte) { > > > + loff_t data_end; > > > + > > > + start_byte = mapping_seek_hole_data(inode->i_mapping, > > > + start_byte, end_byte, SEEK_DATA); > > > > FWIW, the fact that mapping seek data is based on uptodate status means > > that seek behavior can change based on prior reads. > > Yup. It should be obvious that any page cache scan based algorithm > will change based on changing page cache residency. > > > For example, see how > > seek hole/data presents reads of unwritten ranges as data [1]. The same > > thing isn't observable for holes because iomap doesn't check the mapping > > in that case, but underlying iop state is the same and that is what this > > code is looking at. > > Well, yes. That's the fundamental, underlying issue that this > patchset is addressing for the write() operation: that the page > cache contents and the underlying filesystem extent map are not > guaranteed to be coherent and can be changed independently of each > other. > > The whole problem with looking exclusively at filesystem level > extent state (and hence FIEMAP) is that the extent state doesn't > tell us whether the is uncommitted data over the range of the extent > in the page cache. The filesystem extent state and page cache data > *can't be coherent* in a writeback caching environment. This is the > fundamental difference between what the filesystem extent map tells > us (FIEMAP) and what querying the page cache tells us > (SEEK_DATA/SEEK_HOLE). > > This is also the underlying problem with iomap_truncate_page() - it > fails to query the page cache for data over unwritten extents, so > fails to zero the post-EOF part of dirty folios over unwritten > extents and so it all goes wrong... > > > The filtering being done here means we essentially only care about dirty > > pages backed by delalloc blocks. That means if you get here with a dirty > > page and the portion of the page affected by this failed write is > > uptodate, this won't punch an underlying delalloc block even though > > nothing else may have written to it in the meantime. > > Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc iomap > will not span ranges that have pre-existing *dirty* data in the > page cache. Those *must* already have (del)allocated extents, hence > the iomap for the newly allocated delalloc extent will always end > before pre-existing dirty data in the page cache starts. > > Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map > precludes stepping into ranges that have pre-existing cached dirty > data. > > We also can't get a racing write() to the same range right now > because this is all under IOLOCK_EXCL, hence we only ever see dirty > folios as a result of race with page faults. page faults zero the > entire folio they insert into the page cache and > iomap_folio_mkwrite_iter() asserts that the entire folio is marked > up to date. Hence if we find a dirty folio outside the range the > write() dirtied, we are guaranteed that the entire dirty folio is up > to date.... > > Yes, there can be pre-existing *clean* folios (and clean partially > up to date folios) in the page cache, but we won't have dirty > partially up to date pages in the middle of the range we are > scanning. Hence we only need to care about the edge cases (folios > that overlap start and ends). We skip the partially written start > block, and we always punch up to the end block if it is different > from the last block we punched up to. If the end of the data spans > into a dirty folio, we know that dirty range is up to date because > the seek scan only returns ranges that are up to date. Hence we > don't punch those partial blocks out.... > > Regardless, let's assume we have a racing write that has partially > updated and dirtied a folio (because we've moved to > XFS_IOLOCK_SHARED locking for writes). This case is already handled > by the mapping_seek_hole_data() based iteration. > > That is, the mapping_seek_hole_data() loop provides us with > *discrete ranges of up to date data* that are independent of folio > size, up-to-date range granularity, dirty range tracking, filesystem > block size, etc. > > Hence if the next discrete range we discover is in the same dirty > folio as the previous discrete range of up to date data, we know we > have a sub-folio sized hole in the data that is not up to date. > Because there is no data over this range, we have to punch out the > underlying delalloc extent over that range. > > IOWs, the dirty state of the folio and/or the granularity of the > dirty range tracking is irrelevant here - we know there was no data > in the cache (dlean or dirty) over this range because it is > discontiguous with the previous range of data returned. > > IOWs, if we have this "up to date" map on a dirty folio like this: > > Data +-------+UUUUUUU+-------+UUUUUUU+-------+ > Extent map +DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+ > > Then the unrolled iteration and punching we do would look like this: > > First iteration of the range: > > punch_start: > V > +-------+UUUUUUU+-------+UUUUUUU+-------+ > > SEEK_DATA: V > +-------+UUUUUUU+-------+UUUUUUU+-------+ > SEEK_HOLE: ^ > Data range: +UUUUUUU+ > Punch range: +-------+ > Extent map: +-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+ > > Second iteration: > > punch_start V > +-------+UUUUUUU+-------+UUUUUUU+-------+ > SEEK_DATA: V > +-------+UUUUUUU+-------+UUUUUUU+-------+ > SEEK_HOLE: ^ > Data range: +UUUUUUU+ > Punch range: +-------+ > Extent map: +-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+ > > Third iteration: > > punch_start V > +-------+UUUUUUU+-------+UUUUUUU+-------+ > SEEK_DATA: - moves into next folio in cache > .... > Punch range: +-------+ ...... > Extent map: +-------+DDDDDDD+-------+DDDDDDD+-------+ ...... > (to end of scan range or start of next data) > > As you can see, this scan does not care about folio size, sub-folio > range granularity or filesystem block sizes. It also matches > exactly how writeback handles dirty, partially up to date folios, so > there's no stray delalloc blocks left around to be tripped over > after failed or short writes occur. > > Indeed, if we move to sub-folio dirty range tracking, we can simply > add a mapping_seek_hole_data() variant that walks dirty ranges in > the page cache rather than up to date ranges. Then we can remove the > inner loop from this code that looks up folios to determine dirty > state. The above algorithm does not change - we just walk from > discrete range to discrete range punching the gaps between them.... > > IOWs, the algorithm is largely future proof - the only thing that > needs to change if we change iomap to track sub-folio dirty ranges > is how we check the data range for being dirty. That should be no > surprise, really, the surprise should be that we can make some > simple mods to page cache seek to remove the need for checking dirty > state in this code altogether.... > > > That sort of state > > can be created by a prior read of the range on a sub-page block size fs, > > or perhaps a racing async readahead (via read fault of a lower > > offset..?), etc. > > Yup, generic/346 exercises this racing unaligned, sub-folio mmap > write vs write() case. This test, specifically, was the reason I > moved to using mapping_seek_hole_data() - g/346 found an endless > stream of bugs in the sub-multi-page-folio range iteration code I > kept trying to write.... > > > I suspect this is not a serious error because the page is dirty > > and writeback will thus convert the block. The only exception to > > that I can see is if the block is beyond EOF (consider a mapped > > read to a page that straddles EOF, followed by a post-eof write > > that fails), writeback won't actually map the block directly. > > I don't think that can happen. iomap_write_failed() calls > truncate_pagecache_range() to remove any newly instantiated cached > data beyond the original EOF. Hence the delalloc punch will remove > everything beyond the original EOF that was allocated for the failed > write. Hence when we get to writeback we're not going to find any > up-to-date data beyond the EOF block in the page cache, nor any > stray delalloc blocks way beyond EOF.... > It can happen if the page straddles eof. I don't think much of the above relates to the behavior I'm describing. This doesn't require racing writes, theoretical shared write locking, the IOMAP_F_NEW flag or core iteration algorithm are not major factors, etc. It's easier to just use examples. Consider the following sequence on a bsize=1k fs on a kernel with this patch series. Note that my xfs_io is hacked up to turn 'pwrite -f' into a "fail" flag that triggers the -EFAULT error check in iomap_write_iter() by passing a bad user buffer: # set eof and make sure entire page is uptodate (even posteof) $XFS_IO_PROG -fc "truncate 1k" -c "mmap 0 4k" -c "mread 0 4k" $file # do a delalloc and open/close cycle to set dirty release $XFS_IO_PROG -c "pwrite 0 1" -c "close" -c "open $file" $file # fail an appending write to a posteof, discontig, uptodate range on already dirty page $XFS_IO_PROG -c "pwrite -f 2k 1" -c fsync -c "fiemap -v" $file wrote 1/1 bytes at offset 0 1.000000 bytes, 1 ops; 0.0003 sec (3.255 KiB/sec and 3333.3333 ops/sec) pwrite: Bad address /mnt/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..1]: 22..23 2 0x0 1: [2..3]: hole 2 2: [4..5]: 0..1 2 0x7 So clearly there's a posteof delalloc (FIEMAP_EXTENT_DELALLOC == 0x4) block there that hasn't been written to. I call this a landmine because it's an unexpected/untracked instance of post-eof delalloc. I.e., post-eof block reclamation won't pick it up because the inode was never tagged for speculative preallocation: $ xfs_spaceman -c "prealloc -s" /mnt $ xfs_io -c "fiemap -v" /mnt/file /mnt/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..1]: 22..23 2 0x0 1: [2..3]: hole 2 2: [4..5]: 0..1 2 0x7 At the same time, it's not an observable problem in XFS because inode reclaim will eventually truncate that block off due to the delalloc block check, which will prevent any sort of space accounting corruption. I also don't think anybody will miss a single block dangling off like that in the meantime. The only other odd side effect I can think of is if this happened and the user performed an explicit post-eof fallocate, reclaim could now lop that off due to the last resort delalloc block check. That can still technically happen anyways as a side effect of speculative prealloc so I don't see that as a major problem either. The more important point here is that just because XFS can handle this situation reasonably gracefully doesn't mean every other filesystem expects it. > > It may convert if contiguous with delalloc blocks inside EOF (and > > sufficiently sized physical extents exist), or even if not, should > > still otherwise be cleaned up by the various other means we > > already have to manage post-eof blocks. > > > > So IOW there's a tradeoff being made here for possible spurious > > allocation and I/O and a subtle dependency on writeback that > > should probably be documented somewhere. > > As per above, I don't think there is any spurious/stale allocation > left behind by the punch code, nor is there any dependency on > writeback to ignore it such issues. > The same sort of example as above (but within eof) demonstrates the writeback dependency. I.e., with this series alone the punch code leaves behind an in-eof uptodate delalloc block, but writeback comes along and maps it regardless because it also only has uptodate granularity within the dirty folio. When writeback changes to only map dirty sub-ranges, clearly that no longer happens and the landmine turns into a stale delalloc leak. This is also easy to reproduce with the last dirty range RFC patches pulled on top of this series. The changes you describe above to seek out and punch all !dirty delalloc ranges should address both of these problems, because the error handling code doesn't consider i_size the way writeback does. > > The larger concern is that if > > writeback eventually changes based on dirty range tracking in a way that > > breaks this dependency, that introduces yet another stale delalloc block > > landmine associated with this error handling code (regardless of whether > > you want to call that a bug in this code, seek data, whatever), and > > those problems are difficult enough to root cause as it is. > > If iomap changes how it tracks dirty ranges, this punch code only > needs small changes to work with that correctly. There aren't any > unknown landmines here - if we change dirty tracking, we know that > we have to update the code that depends on the existing dirty > tracking mechanisms to work correctly with the new infrastructure... > I'm just pointing out there are side effects / corner cases that don't depend on the complicated shared locking / racing write fault scenarios described above. Therefore, this code probably needs to be fixed at the same time dirty range tracking is enabled to not introduce stale delalloc problems for XFS. In the meantime, I don't think any fs developer just looking to use or debug the iomap layer and considering this punch error handling mechanism should be expected to immediately connect the dots between it using pagecache seek, that depending on Uptodate, the pagecache seek behavior implications of that, writeback potentially not mapping post-eof ranges, and therefore possibly needing to know that the fs in question may need its own custom post-eof error handling if dangling post-eof blocks aren't expected anywhere else for that fs. It may be possible to address that by doing more aggressive punching past the i_size boundary here, but that comes with other tradeoffs. Short of that, I was really just asking for a more descriptive comment for the punch mechanism so these quirks are not lost before they can be properly fixed up or are apparent to any new users besides XFS this might attract in the meantime. For example, something like the following added to the _punch_delalloc() comment (from v3): " ... Note that since this depends on pagecache seek data, and that depends on folio Uptodate state to identify data regions, the scanning behavior here can be non-deterministic based on the state of cache at the time of the failed operation. For example, if the target folio of a failed write was already dirty and the target (sub-folio) range was previously an Uptodate hole before the failing write may have performed delayed allocation, we have no choice but to skip the punch for that sub-range of the folio because we can't tell whether sub-folio Uptodate ranges were actually written to or not. This means that delalloc block allocations may persist for writes that never succeeded. This can similarly skip punches beyond EOF for failed post-eof writes, except note that writeback will not explicitly map blocks beyond i_size. Therefore, filesystems that cannot gracefully accommodate dangling post-eof blocks via other means may need to check for and handle that case directly after calling this function. These quirks should all be addressed with proper sub-folio dirty range tracking, at which point we can deterministically scan and punch non-dirty delalloc ranges at the time of write failure. " ... but I also don't want to go in circles just over a comment, so I'm content that it's at least documented on the list here. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Fri, Nov 18, 2022 at 12:20:45PM -0500, Brian Foster wrote: > On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote: > > On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote: > > > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote: > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > ... > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> --- > > > > fs/xfs/xfs_iomap.c | 151 > > > > ++++++++++++++++++++++++++++++++++++++++++--- 1 file > > > > changed, 141 insertions(+), 10 deletions(-) > > > > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index > > > > 7bb55dbc19d3..2d48fcc7bd6f 100644 --- a/fs/xfs/xfs_iomap.c > > > > +++ b/fs/xfs/xfs_iomap.c @@ -1134,6 +1134,146 @@ > > > > xfs_buffered_write_delalloc_punch( end_fsb - start_fsb); } > > > > > > > ... > > > > +/* + * Punch out all the delalloc blocks in the range given > > > > except for those that + * have dirty data still pending in > > > > the page cache - those are going to be + * written and so > > > > must still retain the delalloc backing for writeback. + * + > > > > * As we are scanning the page cache for data, we don't need > > > > to reimplement the + * wheel - mapping_seek_hole_data() does > > > > exactly what we need to identify the + * start and end of > > > > data ranges correctly even for sub-folio block sizes. This + > > > > * byte range based iteration is especially convenient > > > > because it means we don't + * have to care about variable > > > > size folios, nor where the start or end of the + * data > > > > range lies within a folio, if they lie within the same folio > > > > or even if + * there are multiple discontiguous data ranges > > > > within the folio. + */ +static int > > > > +xfs_buffered_write_delalloc_release( + struct inode > > > > *inode, + loff_t start_byte, + > > > > loff_t end_byte) +{ + loff_t > > > > punch_start_byte = start_byte; + int > > > > error = 0; + + /* + * Lock the mapping to avoid races > > > > with page faults re-instantiating + * folios and > > > > dirtying them via ->page_mkwrite whilst we walk the + * > > > > cache and perform delalloc extent removal. Failing to do > > > > this can + * leave dirty pages with no space > > > > reservation in the cache. + */ + > > > > filemap_invalidate_lock(inode->i_mapping); + while > > > > (start_byte < end_byte) { + loff_t > > > > data_end; + + start_byte = > > > > mapping_seek_hole_data(inode->i_mapping, + > > > > start_byte, end_byte, SEEK_DATA); > > > > > > FWIW, the fact that mapping seek data is based on uptodate > > > status means that seek behavior can change based on prior > > > reads. > > > > Yup. It should be obvious that any page cache scan based > > algorithm will change based on changing page cache residency. > > > > > For example, see how seek hole/data presents reads of > > > unwritten ranges as data [1]. The same thing isn't observable > > > for holes because iomap doesn't check the mapping in that > > > case, but underlying iop state is the same and that is what > > > this code is looking at. > > > > Well, yes. That's the fundamental, underlying issue that this > > patchset is addressing for the write() operation: that the page > > cache contents and the underlying filesystem extent map are not > > guaranteed to be coherent and can be changed independently of > > each other. > > > > The whole problem with looking exclusively at filesystem level > > extent state (and hence FIEMAP) is that the extent state doesn't > > tell us whether the is uncommitted data over the range of the > > extent in the page cache. The filesystem extent state and page > > cache data *can't be coherent* in a writeback caching > > environment. This is the fundamental difference between what the > > filesystem extent map tells us (FIEMAP) and what querying the > > page cache tells us (SEEK_DATA/SEEK_HOLE). > > > > This is also the underlying problem with iomap_truncate_page() - > > it fails to query the page cache for data over unwritten > > extents, so fails to zero the post-EOF part of dirty folios over > > unwritten extents and so it all goes wrong... > > > > > The filtering being done here means we essentially only care > > > about dirty pages backed by delalloc blocks. That means if you > > > get here with a dirty page and the portion of the page > > > affected by this failed write is uptodate, this won't punch an > > > underlying delalloc block even though nothing else may have > > > written to it in the meantime. > > > > Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc > > iomap will not span ranges that have pre-existing *dirty* data > > in the page cache. Those *must* already have (del)allocated > > extents, hence the iomap for the newly allocated delalloc extent > > will always end before pre-existing dirty data in the page cache > > starts. > > > > Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map > > precludes stepping into ranges that have pre-existing cached > > dirty data. > > > > We also can't get a racing write() to the same range right now > > because this is all under IOLOCK_EXCL, hence we only ever see > > dirty folios as a result of race with page faults. page faults > > zero the entire folio they insert into the page cache and > > iomap_folio_mkwrite_iter() asserts that the entire folio is > > marked up to date. Hence if we find a dirty folio outside the > > range the write() dirtied, we are guaranteed that the entire > > dirty folio is up to date.... > > > > Yes, there can be pre-existing *clean* folios (and clean > > partially up to date folios) in the page cache, but we won't > > have dirty partially up to date pages in the middle of the range > > we are scanning. Hence we only need to care about the edge cases > > (folios that overlap start and ends). We skip the partially > > written start block, and we always punch up to the end block if > > it is different from the last block we punched up to. If the end > > of the data spans into a dirty folio, we know that dirty range > > is up to date because the seek scan only returns ranges that are > > up to date. Hence we don't punch those partial blocks out.... > > > > Regardless, let's assume we have a racing write that has > > partially updated and dirtied a folio (because we've moved to > > XFS_IOLOCK_SHARED locking for writes). This case is already > > handled by the mapping_seek_hole_data() based iteration. > > > > That is, the mapping_seek_hole_data() loop provides us with > > *discrete ranges of up to date data* that are independent of > > folio size, up-to-date range granularity, dirty range tracking, > > filesystem block size, etc. > > > > Hence if the next discrete range we discover is in the same > > dirty folio as the previous discrete range of up to date data, > > we know we have a sub-folio sized hole in the data that is not > > up to date. Because there is no data over this range, we have > > to punch out the underlying delalloc extent over that range. > > > > IOWs, the dirty state of the folio and/or the granularity of the > > dirty range tracking is irrelevant here - we know there was no > > data in the cache (dlean or dirty) over this range because it is > > discontiguous with the previous range of data returned. > > > > IOWs, if we have this "up to date" map on a dirty folio like > > this: > > > > Data +-------+UUUUUUU+-------+UUUUUUU+-------+ > > Extent map +DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+ > > > > Then the unrolled iteration and punching we do would look like > > this: > > > > First iteration of the range: > > > > punch_start: V +-------+UUUUUUU+-------+UUUUUUU+-------+ > > > > SEEK_DATA: V +-------+UUUUUUU+-------+UUUUUUU+-------+ > > SEEK_HOLE: ^ Data range: +UUUUUUU+ > > Punch range: +-------+ Extent map: > > +-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+ > > > > Second iteration: > > > > punch_start V > > +-------+UUUUUUU+-------+UUUUUUU+-------+ SEEK_DATA: > > V +-------+UUUUUUU+-------+UUUUUUU+-------+ SEEK_HOLE: > > ^ Data range: +UUUUUUU+ Punch > > range: +-------+ Extent map: > > +-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+ > > > > Third iteration: > > > > punch_start V > > +-------+UUUUUUU+-------+UUUUUUU+-------+ SEEK_DATA: - moves > > into next folio in cache .... Punch range: > > +-------+ ...... Extent map: > > +-------+DDDDDDD+-------+DDDDDDD+-------+ ...... (to end of > > scan range or start of next data) > > > > As you can see, this scan does not care about folio size, > > sub-folio range granularity or filesystem block sizes. It also > > matches exactly how writeback handles dirty, partially up to > > date folios, so there's no stray delalloc blocks left around to > > be tripped over after failed or short writes occur. > > > > Indeed, if we move to sub-folio dirty range tracking, we can > > simply add a mapping_seek_hole_data() variant that walks dirty > > ranges in the page cache rather than up to date ranges. Then we > > can remove the inner loop from this code that looks up folios to > > determine dirty state. The above algorithm does not change - we > > just walk from discrete range to discrete range punching the > > gaps between them.... > > > > IOWs, the algorithm is largely future proof - the only thing > > that needs to change if we change iomap to track sub-folio dirty > > ranges is how we check the data range for being dirty. That > > should be no surprise, really, the surprise should be that we > > can make some simple mods to page cache seek to remove the need > > for checking dirty state in this code altogether.... > > > > > That sort of state can be created by a prior read of the range > > > on a sub-page block size fs, or perhaps a racing async > > > readahead (via read fault of a lower offset..?), etc. > > > > Yup, generic/346 exercises this racing unaligned, sub-folio mmap > > write vs write() case. This test, specifically, was the reason I > > moved to using mapping_seek_hole_data() - g/346 found an endless > > stream of bugs in the sub-multi-page-folio range iteration code > > I kept trying to write.... > > > > > I suspect this is not a serious error because the page is > > > dirty and writeback will thus convert the block. The only > > > exception to that I can see is if the block is beyond EOF > > > (consider a mapped read to a page that straddles EOF, followed > > > by a post-eof write that fails), writeback won't actually map > > > the block directly. > > > > I don't think that can happen. iomap_write_failed() calls > > truncate_pagecache_range() to remove any newly instantiated > > cached data beyond the original EOF. Hence the delalloc punch > > will remove everything beyond the original EOF that was > > allocated for the failed write. Hence when we get to writeback > > we're not going to find any up-to-date data beyond the EOF block > > in the page cache, nor any stray delalloc blocks way beyond > > EOF.... > > > > It can happen if the page straddles eof. I don't think much of the > above relates to the behavior I'm describing. This doesn't require > racing writes, theoretical shared write locking, the IOMAP_F_NEW > flag or core iteration algorithm are not major factors, etc. > > It's easier to just use examples. Consider the following sequence > on a bsize=1k fs on a kernel with this patch series. Note that my > xfs_io is hacked up to turn 'pwrite -f' into a "fail" flag that > triggers the -EFAULT error check in iomap_write_iter() by passing > a bad user buffer: > > # set eof and make sure entire page is uptodate (even posteof) > $XFS_IO_PROG -fc "truncate 1k" -c "mmap 0 4k" -c "mread 0 4k" > $file # do a delalloc and open/close cycle to set dirty release > $XFS_IO_PROG -c "pwrite 0 1" -c "close" -c "open $file" $file # > fail an appending write to a posteof, discontig, uptodate range on > already dirty page $XFS_IO_PROG -c "pwrite -f 2k 1" -c fsync -c > "fiemap -v" $file > Firstly, can you tell me which patchset you reproduced this on? Just a vanilla kernel, thev version of the patchset in this thread or the latest one I sent out? Secondly, can you please send a patch with that failed write functionality upstream for xfs_io - it is clearly useful. > wrote 1/1 bytes at offset 0 > 1.000000 bytes, 1 ops; 0.0003 sec (3.255 KiB/sec and 3333.3333 ops/sec) > pwrite: Bad address > /mnt/file: > EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS > 0: [0..1]: 22..23 2 0x0 > 1: [2..3]: hole 2 > 2: [4..5]: 0..1 2 0x7 > > So clearly there's a posteof delalloc (FIEMAP_EXTENT_DELALLOC == 0x4) block > there that hasn't been written to. Ok, so this is failing before even iomap_write_begin() is called, so it's a failed write that has not touched page cache state, but has allocated a delalloc block beyond EOF. This is a case that writeback handles because writeback is supposed skip writeback for pages and blocks beyond EOF. i.e. iomap_writepage_map() is passed the current EOF to terminate the writeback mapping loop when the folio being written back spans EOF. Hence iomap_writepage_map never walks past EOF and so does not ever iterate filesystem blocks that contain up to date data beyond in the page cache beyond EOF. That's why writeback leaves that stray delalloc block beyond, even though it's covered by up to date data. IOWs, it appears that the problem here is that the page cache considers data in the folio beyond EOF to be up to date and contain data, whilst most filesystems do not consider data beyond EOF to be valid. This means there is a boundary condition that mapping_seek_hole_data() doesn't handle correctly - it completely ignores EOF and so SEEK_DATA for an offset past EOF will report data beyond EOF if the file has been mmap()d and the last folio faulted in. This specific boundary condition behaviour is hidden by iomap_seek_data() and shmem_file_llseek() by not allowing seeks to start and/or end beyond EOF. Hence they never calls mapping_seek_hole_data() with a start byte beyond EOF and so will not ever try to find data in the page caceh beyond EOF. That looks easy enough to fix - just consider any data range returned that is beyond i_size_read() to require punching, and then the semantics match writeback ignoring anything beyond EOF.... > I call this a landmine because it's an > unexpected/untracked instance of post-eof delalloc. I.e., post-eof block > reclamation won't pick it up because the inode was never tagged for speculative > preallocation: Of course - this wasn't allocated by speculative preallocation - that doesn't kick in until the file is >64kB in length, so xfs_bmapi_reserve_delalloc() doesn't set the flag saying there's post-eof preallocation to trim away on the inode. > > > It may convert if contiguous with delalloc blocks inside EOF (and > > > sufficiently sized physical extents exist), or even if not, should > > > still otherwise be cleaned up by the various other means we > > > already have to manage post-eof blocks. > > > > > > So IOW there's a tradeoff being made here for possible spurious > > > allocation and I/O and a subtle dependency on writeback that > > > should probably be documented somewhere. > > > > As per above, I don't think there is any spurious/stale allocation > > left behind by the punch code, nor is there any dependency on > > writeback to ignore it such issues. > > > > The same sort of example as above (but within eof) demonstrates the > writeback dependency. I.e., with this series alone the punch code leaves > behind an in-eof uptodate delalloc block, but writeback comes along and > maps it regardless because it also only has uptodate granularity within > the dirty folio. This "unflushable delalloc extent" case can't happen within EOF. The page fault initialises the folio full of zeroes, so when we get the failed write, the only change is that we now have a single delalloc block within EOF that writeback will iterate over, allocate and write all the up to date zeros to it. Sure, this error path might cause a small amount of extra IO over the dirty folio, but it does not cause any other sort of issue. THere are no stray delalloc blocks, there are no data corruption issues, there are no stale data exposure issues, etc. Leaving a delalloc block under up-to-date page cache data should not cause a user vsible issue. > When writeback changes to only map dirty sub-ranges, clearly that no > longer happens and the landmine turns into a stale delalloc leak. This > is also easy to reproduce with the last dirty range RFC patches pulled > on top of this series. No, not even then. If we have sub-folio dirty ranges, we know this failed write range is *still clean* and so we'll punch it out, regardless of whether it is inside or outside EOF. > It may be possible to address that by doing more aggressive punching > past the i_size boundary here, but that comes with other tradeoffs. Tradeoffs such as .... ? > For example, if the target folio of a failed write > was already dirty and the target (sub-folio) range was previously an > Uptodate hole before the failing write may have performed delayed > allocation, we have no choice but to skip the punch for that sub-range > of the folio because we can't tell whether sub-folio Uptodate ranges > were actually written to or not. This means that delalloc block > allocations may persist for writes that never succeeded. I can add something like that, though I'd need to make sure it is clear that this is not a data corruption vector, just an side effect of a highly unlikely set of corner case conditions we only care about handling without corruption, not efficiency or performance. Cheers, Dave.
On Tue, Nov 22, 2022 at 10:13:04AM +1100, Dave Chinner wrote: > On Fri, Nov 18, 2022 at 12:20:45PM -0500, Brian Foster wrote: > > On Thu, Nov 17, 2022 at 11:41:33AM +1100, Dave Chinner wrote: > > > On Wed, Nov 16, 2022 at 08:57:19AM -0500, Brian Foster wrote: > > > > On Tue, Nov 15, 2022 at 12:30:39PM +1100, Dave Chinner wrote: > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > > ... > > > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> --- > > > > > fs/xfs/xfs_iomap.c | 151 > > > > > ++++++++++++++++++++++++++++++++++++++++++--- 1 file > > > > > changed, 141 insertions(+), 10 deletions(-) > > > > > > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index > > > > > 7bb55dbc19d3..2d48fcc7bd6f 100644 --- a/fs/xfs/xfs_iomap.c > > > > > +++ b/fs/xfs/xfs_iomap.c @@ -1134,6 +1134,146 @@ > > > > > xfs_buffered_write_delalloc_punch( end_fsb - start_fsb); } > > > > > > > > > ... > > > > > +/* + * Punch out all the delalloc blocks in the range given > > > > > except for those that + * have dirty data still pending in > > > > > the page cache - those are going to be + * written and so > > > > > must still retain the delalloc backing for writeback. + * + > > > > > * As we are scanning the page cache for data, we don't need > > > > > to reimplement the + * wheel - mapping_seek_hole_data() does > > > > > exactly what we need to identify the + * start and end of > > > > > data ranges correctly even for sub-folio block sizes. This + > > > > > * byte range based iteration is especially convenient > > > > > because it means we don't + * have to care about variable > > > > > size folios, nor where the start or end of the + * data > > > > > range lies within a folio, if they lie within the same folio > > > > > or even if + * there are multiple discontiguous data ranges > > > > > within the folio. + */ +static int > > > > > +xfs_buffered_write_delalloc_release( + struct inode > > > > > *inode, + loff_t start_byte, + > > > > > loff_t end_byte) +{ + loff_t > > > > > punch_start_byte = start_byte; + int > > > > > error = 0; + + /* + * Lock the mapping to avoid races > > > > > with page faults re-instantiating + * folios and > > > > > dirtying them via ->page_mkwrite whilst we walk the + * > > > > > cache and perform delalloc extent removal. Failing to do > > > > > this can + * leave dirty pages with no space > > > > > reservation in the cache. + */ + > > > > > filemap_invalidate_lock(inode->i_mapping); + while > > > > > (start_byte < end_byte) { + loff_t > > > > > data_end; + + start_byte = > > > > > mapping_seek_hole_data(inode->i_mapping, + > > > > > start_byte, end_byte, SEEK_DATA); > > > > > > > > FWIW, the fact that mapping seek data is based on uptodate > > > > status means that seek behavior can change based on prior > > > > reads. > > > > > > Yup. It should be obvious that any page cache scan based > > > algorithm will change based on changing page cache residency. > > > > > > > For example, see how seek hole/data presents reads of > > > > unwritten ranges as data [1]. The same thing isn't observable > > > > for holes because iomap doesn't check the mapping in that > > > > case, but underlying iop state is the same and that is what > > > > this code is looking at. > > > > > > Well, yes. That's the fundamental, underlying issue that this > > > patchset is addressing for the write() operation: that the page > > > cache contents and the underlying filesystem extent map are not > > > guaranteed to be coherent and can be changed independently of > > > each other. > > > > > > The whole problem with looking exclusively at filesystem level > > > extent state (and hence FIEMAP) is that the extent state doesn't > > > tell us whether the is uncommitted data over the range of the > > > extent in the page cache. The filesystem extent state and page > > > cache data *can't be coherent* in a writeback caching > > > environment. This is the fundamental difference between what the > > > filesystem extent map tells us (FIEMAP) and what querying the > > > page cache tells us (SEEK_DATA/SEEK_HOLE). > > > > > > This is also the underlying problem with iomap_truncate_page() - > > > it fails to query the page cache for data over unwritten > > > extents, so fails to zero the post-EOF part of dirty folios over > > > unwritten extents and so it all goes wrong... > > > > > > > The filtering being done here means we essentially only care > > > > about dirty pages backed by delalloc blocks. That means if you > > > > get here with a dirty page and the portion of the page > > > > affected by this failed write is uptodate, this won't punch an > > > > underlying delalloc block even though nothing else may have > > > > written to it in the meantime. > > > > > > Hmmm. IOMAP_F_NEW implies that the newly allocated delalloc > > > iomap will not span ranges that have pre-existing *dirty* data > > > in the page cache. Those *must* already have (del)allocated > > > extents, hence the iomap for the newly allocated delalloc extent > > > will always end before pre-existing dirty data in the page cache > > > starts. > > > > > > Hence the seek scan range over an IOMAP_F_NEW IOMAP_DELALLOC map > > > precludes stepping into ranges that have pre-existing cached > > > dirty data. > > > > > > We also can't get a racing write() to the same range right now > > > because this is all under IOLOCK_EXCL, hence we only ever see > > > dirty folios as a result of race with page faults. page faults > > > zero the entire folio they insert into the page cache and > > > iomap_folio_mkwrite_iter() asserts that the entire folio is > > > marked up to date. Hence if we find a dirty folio outside the > > > range the write() dirtied, we are guaranteed that the entire > > > dirty folio is up to date.... > > > > > > Yes, there can be pre-existing *clean* folios (and clean > > > partially up to date folios) in the page cache, but we won't > > > have dirty partially up to date pages in the middle of the range > > > we are scanning. Hence we only need to care about the edge cases > > > (folios that overlap start and ends). We skip the partially > > > written start block, and we always punch up to the end block if > > > it is different from the last block we punched up to. If the end > > > of the data spans into a dirty folio, we know that dirty range > > > is up to date because the seek scan only returns ranges that are > > > up to date. Hence we don't punch those partial blocks out.... > > > > > > Regardless, let's assume we have a racing write that has > > > partially updated and dirtied a folio (because we've moved to > > > XFS_IOLOCK_SHARED locking for writes). This case is already > > > handled by the mapping_seek_hole_data() based iteration. > > > > > > That is, the mapping_seek_hole_data() loop provides us with > > > *discrete ranges of up to date data* that are independent of > > > folio size, up-to-date range granularity, dirty range tracking, > > > filesystem block size, etc. > > > > > > Hence if the next discrete range we discover is in the same > > > dirty folio as the previous discrete range of up to date data, > > > we know we have a sub-folio sized hole in the data that is not > > > up to date. Because there is no data over this range, we have > > > to punch out the underlying delalloc extent over that range. > > > > > > IOWs, the dirty state of the folio and/or the granularity of the > > > dirty range tracking is irrelevant here - we know there was no > > > data in the cache (dlean or dirty) over this range because it is > > > discontiguous with the previous range of data returned. > > > > > > IOWs, if we have this "up to date" map on a dirty folio like > > > this: > > > > > > Data +-------+UUUUUUU+-------+UUUUUUU+-------+ > > > Extent map +DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+ > > > > > > Then the unrolled iteration and punching we do would look like > > > this: > > > > > > First iteration of the range: > > > > > > punch_start: V +-------+UUUUUUU+-------+UUUUUUU+-------+ > > > > > > SEEK_DATA: V +-------+UUUUUUU+-------+UUUUUUU+-------+ > > > SEEK_HOLE: ^ Data range: +UUUUUUU+ > > > Punch range: +-------+ Extent map: > > > +-------+DDDDDDD+DDDDDDD+DDDDDDD+DDDDDDD+ > > > > > > Second iteration: > > > > > > punch_start V > > > +-------+UUUUUUU+-------+UUUUUUU+-------+ SEEK_DATA: > > > V +-------+UUUUUUU+-------+UUUUUUU+-------+ SEEK_HOLE: > > > ^ Data range: +UUUUUUU+ Punch > > > range: +-------+ Extent map: > > > +-------+DDDDDDD+-------+DDDDDDD+DDDDDDD+ > > > > > > Third iteration: > > > > > > punch_start V > > > +-------+UUUUUUU+-------+UUUUUUU+-------+ SEEK_DATA: - moves > > > into next folio in cache .... Punch range: > > > +-------+ ...... Extent map: > > > +-------+DDDDDDD+-------+DDDDDDD+-------+ ...... (to end of > > > scan range or start of next data) > > > > > > As you can see, this scan does not care about folio size, > > > sub-folio range granularity or filesystem block sizes. It also > > > matches exactly how writeback handles dirty, partially up to > > > date folios, so there's no stray delalloc blocks left around to > > > be tripped over after failed or short writes occur. > > > > > > Indeed, if we move to sub-folio dirty range tracking, we can > > > simply add a mapping_seek_hole_data() variant that walks dirty > > > ranges in the page cache rather than up to date ranges. Then we > > > can remove the inner loop from this code that looks up folios to > > > determine dirty state. The above algorithm does not change - we > > > just walk from discrete range to discrete range punching the > > > gaps between them.... > > > > > > IOWs, the algorithm is largely future proof - the only thing > > > that needs to change if we change iomap to track sub-folio dirty > > > ranges is how we check the data range for being dirty. That > > > should be no surprise, really, the surprise should be that we > > > can make some simple mods to page cache seek to remove the need > > > for checking dirty state in this code altogether.... > > > > > > > That sort of state can be created by a prior read of the range > > > > on a sub-page block size fs, or perhaps a racing async > > > > readahead (via read fault of a lower offset..?), etc. > > > > > > Yup, generic/346 exercises this racing unaligned, sub-folio mmap > > > write vs write() case. This test, specifically, was the reason I > > > moved to using mapping_seek_hole_data() - g/346 found an endless > > > stream of bugs in the sub-multi-page-folio range iteration code > > > I kept trying to write.... > > > > > > > I suspect this is not a serious error because the page is > > > > dirty and writeback will thus convert the block. The only > > > > exception to that I can see is if the block is beyond EOF > > > > (consider a mapped read to a page that straddles EOF, followed > > > > by a post-eof write that fails), writeback won't actually map > > > > the block directly. > > > > > > I don't think that can happen. iomap_write_failed() calls > > > truncate_pagecache_range() to remove any newly instantiated > > > cached data beyond the original EOF. Hence the delalloc punch > > > will remove everything beyond the original EOF that was > > > allocated for the failed write. Hence when we get to writeback > > > we're not going to find any up-to-date data beyond the EOF block > > > in the page cache, nor any stray delalloc blocks way beyond > > > EOF.... > > > > > > > It can happen if the page straddles eof. I don't think much of the > > above relates to the behavior I'm describing. This doesn't require > > racing writes, theoretical shared write locking, the IOMAP_F_NEW > > flag or core iteration algorithm are not major factors, etc. > > > > It's easier to just use examples. Consider the following sequence > > on a bsize=1k fs on a kernel with this patch series. Note that my > > xfs_io is hacked up to turn 'pwrite -f' into a "fail" flag that > > triggers the -EFAULT error check in iomap_write_iter() by passing > > a bad user buffer: > > > > # set eof and make sure entire page is uptodate (even posteof) > > $XFS_IO_PROG -fc "truncate 1k" -c "mmap 0 4k" -c "mread 0 4k" > > $file # do a delalloc and open/close cycle to set dirty release > > $XFS_IO_PROG -c "pwrite 0 1" -c "close" -c "open $file" $file # > > fail an appending write to a posteof, discontig, uptodate range on > > already dirty page $XFS_IO_PROG -c "pwrite -f 2k 1" -c fsync -c > > "fiemap -v" $file > > > Firstly, can you tell me which patchset you reproduced this on? Just > a vanilla kernel, thev version of the patchset in this thread or the > latest one I sent out? > This series. > Secondly, can you please send a patch with that failed write > functionality upstream for xfs_io - it is clearly useful. > Sure. It trivially passes a NULL buffer, but I can post an RFC for now at least. > > wrote 1/1 bytes at offset 0 > > 1.000000 bytes, 1 ops; 0.0003 sec (3.255 KiB/sec and 3333.3333 ops/sec) > > pwrite: Bad address > > /mnt/file: > > EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS > > 0: [0..1]: 22..23 2 0x0 > > 1: [2..3]: hole 2 > > 2: [4..5]: 0..1 2 0x7 > > > > So clearly there's a posteof delalloc (FIEMAP_EXTENT_DELALLOC == 0x4) block > > there that hasn't been written to. > > Ok, so this is failing before even iomap_write_begin() is called, > so it's a failed write that has not touched page cache state, but > has allocated a delalloc block beyond EOF. > Yes. > This is a case that writeback handles because writeback > is supposed skip writeback for pages and blocks beyond EOF. i.e. > iomap_writepage_map() is passed the current EOF to terminate the > writeback mapping loop when the folio being written back spans EOF. > Hence iomap_writepage_map never walks past EOF and so does not ever > iterate filesystem blocks that contain up to date data beyond in the > page cache beyond EOF. That's why writeback leaves that stray > delalloc block beyond, even though it's covered by up to date data. > > IOWs, it appears that the problem here is that the page cache > considers data in the folio beyond EOF to be up to date and contain > data, whilst most filesystems do not consider data beyond EOF to be > valid. This means there is a boundary condition that > mapping_seek_hole_data() doesn't handle correctly - it completely > ignores EOF and so SEEK_DATA for an offset past EOF will report data > beyond EOF if the file has been mmap()d and the last folio faulted > in. > > This specific boundary condition behaviour is hidden by > iomap_seek_data() and shmem_file_llseek() by not allowing seeks to > start and/or end beyond EOF. Hence they never calls > mapping_seek_hole_data() with a start byte beyond EOF and so will > not ever try to find data in the page caceh beyond EOF. > > That looks easy enough to fix - just consider any data range > returned that is beyond i_size_read() to require punching, and then > the semantics match writeback ignoring anything beyond EOF.... > > > I call this a landmine because it's an > > unexpected/untracked instance of post-eof delalloc. I.e., post-eof block > > reclamation won't pick it up because the inode was never tagged for speculative > > preallocation: > > Of course - this wasn't allocated by speculative preallocation - > that doesn't kick in until the file is >64kB in length, so > xfs_bmapi_reserve_delalloc() doesn't set the flag saying there's > post-eof preallocation to trim away on the inode. > > > > > It may convert if contiguous with delalloc blocks inside EOF (and > > > > sufficiently sized physical extents exist), or even if not, should > > > > still otherwise be cleaned up by the various other means we > > > > already have to manage post-eof blocks. > > > > > > > > So IOW there's a tradeoff being made here for possible spurious > > > > allocation and I/O and a subtle dependency on writeback that > > > > should probably be documented somewhere. > > > > > > As per above, I don't think there is any spurious/stale allocation > > > left behind by the punch code, nor is there any dependency on > > > writeback to ignore it such issues. > > > > > > > The same sort of example as above (but within eof) demonstrates the > > writeback dependency. I.e., with this series alone the punch code leaves > > behind an in-eof uptodate delalloc block, but writeback comes along and > > maps it regardless because it also only has uptodate granularity within > > the dirty folio. > > This "unflushable delalloc extent" case can't happen within EOF. The > page fault initialises the folio full of zeroes, so when we get the > failed write, the only change is that we now have a single delalloc > block within EOF that writeback will iterate over, allocate and > write all the up to date zeros to it. > > Sure, this error path might cause a small amount of extra IO over > the dirty folio, but it does not cause any other sort of issue. > THere are no stray delalloc blocks, there are no data corruption > issues, there are no stale data exposure issues, etc. Leaving a > delalloc block under up-to-date page cache data should not cause > a user vsible issue. > This is why I suggested a comment might be sufficient. > > When writeback changes to only map dirty sub-ranges, clearly that no > > longer happens and the landmine turns into a stale delalloc leak. This > > is also easy to reproduce with the last dirty range RFC patches pulled > > on top of this series. > > No, not even then. If we have sub-folio dirty ranges, we know this > failed write range is *still clean* and so we'll punch it out, > regardless of whether it is inside or outside EOF. > At this point I suspect we're talking past eachother and based on previous discussion, I think we're on the same page wrt to what actually needs to happen and why. > > It may be possible to address that by doing more aggressive punching > > past the i_size boundary here, but that comes with other tradeoffs. > > Tradeoffs such as .... ? > When I wrote that I was thinking about the potential for odd interactions with speculative preallocation. For example, consider the case where the write that allocates speculative prealloc happens to fail, how xfs_can_free_eofblocks() is implemented, etc. I'm not explicitly saying it's some catastrophic problem, corruption vector, etc., or anything like that. I'm just saying it's replacing one potential wonky interaction with another, and thus introduces different considerations for things that probably don't have great test coverage to begin with. I suppose it might make some sense for the iomap mechanism to invoke the callback for post-eof ranges, but then have the fs handle that particular case specially if appropriate. So for example if an XFS inode sees a post-eof punch and is tagged with post-eof prealloc, it might make more sense to either punch the whole of it out or just leave it around to be cleaned up later vs. just punch out a subset of it. But I don't have a strong opinion either way and still no objection to simply documenting the quirk for the time being. > > For example, if the target folio of a failed write > > was already dirty and the target (sub-folio) range was previously an > > Uptodate hole before the failing write may have performed delayed > > allocation, we have no choice but to skip the punch for that sub-range > > of the folio because we can't tell whether sub-folio Uptodate ranges > > were actually written to or not. This means that delalloc block > > allocations may persist for writes that never succeeded. > > I can add something like that, though I'd need to make sure it is > clear that this is not a data corruption vector, just an side effect > of a highly unlikely set of corner case conditions we only care > about handling without corruption, not efficiency or performance. > I couldn't think of any such issues for upstream XFS, at least. One reason is that delalloc blocks convert to unwritten extents by default, so anything converted that wasn't directly targeted with writes should remain in that state. Another fallback is that iomap explicitly zeroes cache pages for post-eof blocks on reads, regardless of extent type. Of course those were just initial observations and don't necessarily mean problems don't exist. :) FWIW, the more subtle reason I commented on these quirks, but isn't so relevant upstream, is that if this work is going to get backported to older/stable/distro kernels it might be possible for this to land in kernels without one or both of those previously mentioned behaviors (at which point there very well might be a vector for stale data exposure or a subtle feature dependency, depending on context, what the final patch series looks like, etc.). Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 7bb55dbc19d3..2d48fcc7bd6f 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch( end_fsb - start_fsb); } +/* + * Scan the data range passed to us for dirty page cache folios. If we find a + * dirty folio, punch out the preceeding range and update the offset from which + * the next punch will start from. + * + * We can punch out clean pages because they either contain data that has been + * written back - in which case the delalloc punch over that range is a no-op - + * or they have been read faults in which case they contain zeroes and we can + * remove the delalloc backing range and any new writes to those pages will do + * the normal hole filling operation... + * + * This makes the logic simple: we only need to keep the delalloc extents only + * over the dirty ranges of the page cache. + */ +static int +xfs_buffered_write_delalloc_scan( + struct inode *inode, + loff_t *punch_start_byte, + loff_t start_byte, + loff_t end_byte) +{ + loff_t offset = start_byte; + + while (offset < end_byte) { + struct folio *folio; + + /* grab locked page */ + folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT); + if (!folio) { + offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE; + continue; + } + + /* if dirty, punch up to offset */ + if (folio_test_dirty(folio)) { + if (offset > *punch_start_byte) { + int error; + + error = xfs_buffered_write_delalloc_punch(inode, + *punch_start_byte, offset); + if (error) { + folio_unlock(folio); + folio_put(folio); + return error; + } + } + + /* + * Make sure the next punch start is correctly bound to + * the end of this data range, not the end of the folio. + */ + *punch_start_byte = min_t(loff_t, end_byte, + folio_next_index(folio) << PAGE_SHIFT); + } + + /* move offset to start of next folio in range */ + offset = folio_next_index(folio) << PAGE_SHIFT; + folio_unlock(folio); + folio_put(folio); + } + return 0; +} + +/* + * Punch out all the delalloc blocks in the range given except for those that + * have dirty data still pending in the page cache - those are going to be + * written and so must still retain the delalloc backing for writeback. + * + * As we are scanning the page cache for data, we don't need to reimplement the + * wheel - mapping_seek_hole_data() does exactly what we need to identify the + * start and end of data ranges correctly even for sub-folio block sizes. This + * byte range based iteration is especially convenient because it means we don't + * have to care about variable size folios, nor where the start or end of the + * data range lies within a folio, if they lie within the same folio or even if + * there are multiple discontiguous data ranges within the folio. + */ +static int +xfs_buffered_write_delalloc_release( + struct inode *inode, + loff_t start_byte, + loff_t end_byte) +{ + loff_t punch_start_byte = start_byte; + int error = 0; + + /* + * Lock the mapping to avoid races with page faults re-instantiating + * folios and dirtying them via ->page_mkwrite whilst we walk the + * cache and perform delalloc extent removal. Failing to do this can + * leave dirty pages with no space reservation in the cache. + */ + filemap_invalidate_lock(inode->i_mapping); + while (start_byte < end_byte) { + loff_t data_end; + + start_byte = mapping_seek_hole_data(inode->i_mapping, + start_byte, end_byte, SEEK_DATA); + /* + * If there is no more data to scan, all that is left is to + * punch out the remaining range. + */ + if (start_byte == -ENXIO || start_byte == end_byte) + break; + if (start_byte < 0) { + error = start_byte; + goto out_unlock; + } + ASSERT(start_byte >= punch_start_byte); + ASSERT(start_byte < end_byte); + + /* + * We find the end of this contiguous cached data range by + * seeking from start_byte to the beginning of the next hole. + */ + data_end = mapping_seek_hole_data(inode->i_mapping, start_byte, + end_byte, SEEK_HOLE); + if (data_end < 0) { + error = data_end; + goto out_unlock; + } + ASSERT(data_end > start_byte); + ASSERT(data_end <= end_byte); + + error = xfs_buffered_write_delalloc_scan(inode, + &punch_start_byte, start_byte, data_end); + if (error) + goto out_unlock; + + /* The next data search starts at the end of this one. */ + start_byte = data_end; + } + + if (punch_start_byte < end_byte) + error = xfs_buffered_write_delalloc_punch(inode, + punch_start_byte, end_byte); +out_unlock: + filemap_invalidate_unlock(inode->i_mapping); + return error; +} + static int xfs_buffered_write_iomap_end( struct inode *inode, @@ -1179,16 +1319,7 @@ xfs_buffered_write_iomap_end( if (start_byte >= end_byte) return 0; - /* - * Lock the mapping to avoid races with page faults re-instantiating - * folios and dirtying them via ->page_mkwrite between the page cache - * truncation and the delalloc extent removal. Failing to do this can - * leave dirty pages with no space reservation in the cache. - */ - filemap_invalidate_lock(inode->i_mapping); - truncate_pagecache_range(inode, start_byte, end_byte - 1); - error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte); - filemap_invalidate_unlock(inode->i_mapping); + error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte); if (error && !xfs_is_shutdown(mp)) { xfs_alert(mp, "%s: unable to clean up ino 0x%llx", __func__, XFS_I(inode)->i_ino);