Message ID | 20171027125328.25001-1-eguan@redhat.com (mailing list archive) |
---|---|
State | Deferred, archived |
Headers | show |
On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote: > But it's possible that a buffer write overwrites the unwritten > extent, which won't be converted to a normal extent until I/O > completion, and iomap_truncate_page() skips zeroing wrongly because > of the not-converted unwritten extent. This would cause a subsequent > mmap read sees non-zeros beyond EOF. I suspect the right fix is to look at the in-core state im the iomap truncate helpers instead of doing a duplicate flush. -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Oct 27, 2017 at 11:05:29PM -0700, Christoph Hellwig wrote: > On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote: > > But it's possible that a buffer write overwrites the unwritten > > extent, which won't be converted to a normal extent until I/O > > completion, and iomap_truncate_page() skips zeroing wrongly because > > of the not-converted unwritten extent. This would cause a subsequent > > mmap read sees non-zeros beyond EOF. > > I suspect the right fix is to look at the in-core state im the iomap > truncate helpers instead of doing a duplicate flush. I may (and very likely) miss something, but my understanding is that iomap_truncate_page() already looks at the in-core extent state provided by xfs_file_iomap_begin() xfs_setattr_size() iomap_truncate_page() iomap_zero_range() iomap_apply() xfs_file_iomap_begin() # finds extent according to the range iomap_zero_range_actor() # sees iomap->type == IOMAP_UNWRITTEN And this in-core extent state won't be converted to XFS_EXT_NORM until writeback & I/O completion, so I think a flush is required. Thanks, Eryu -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Oct 31, 2017 at 06:09:58PM +0800, Eryu Guan wrote: > On Fri, Oct 27, 2017 at 11:05:29PM -0700, Christoph Hellwig wrote: > > On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote: > > > But it's possible that a buffer write overwrites the unwritten > > > extent, which won't be converted to a normal extent until I/O > > > completion, and iomap_truncate_page() skips zeroing wrongly because > > > of the not-converted unwritten extent. This would cause a subsequent > > > mmap read sees non-zeros beyond EOF. > > > > I suspect the right fix is to look at the in-core state im the iomap > > truncate helpers instead of doing a duplicate flush. > > I may (and very likely) miss something, but my understanding is that > iomap_truncate_page() already looks at the in-core extent state provided > by xfs_file_iomap_begin() > > xfs_setattr_size() > iomap_truncate_page() > iomap_zero_range() > iomap_apply() > xfs_file_iomap_begin() # finds extent according to the range > iomap_zero_range_actor() # sees iomap->type == IOMAP_UNWRITTEN > > And this in-core extent state won't be converted to XFS_EXT_NORM until > writeback & I/O completion, so I think a flush is required. If we get to iomap_zero_range_actor & type == UNWRITTEN, can we ask the pagecache if there's a page backing the (unwritten) range and if so zero the appropriate parts of that page? --D > Thanks, > Eryu > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote: > On truncate down, if new size is not block size aligned, we zero the > rest of block via iomap_truncate_page() to avoid exposing stale data > to user, and iomap_truncate_page() skips zeroing if the range is > already in unwritten status or a hole. Unless the page is in the page cache already, and then it gets zeroed in memory as part of truncate_setsize() call. > But it's possible that a buffer write overwrites the unwritten > extent, which won't be converted to a normal extent until I/O > completion, and iomap_truncate_page() skips zeroing wrongly because > of the not-converted unwritten extent. This would cause a subsequent > mmap read sees non-zeros beyond EOF. Yes, it should skip the zeroing on disk. The page in the page cache over the unwritten extent will be zeroed on read. The real question is this: where are the zeros in the page that fsx is complaining about? > I occasionally see this in fsx runs in fstests generic/112, a > simplified fsx operation sequence is like (assuming 4k block size > xfs): What should have is: > fallocate 0x0 0x1000 0x0 keep_size Unwritten, no data. > write 0x0 0x1000 0x0 Unwritten, contains data in page cache. > truncate 0x0 0x800 0x1000 Unwritten, page contains data 0-0x800, zeros 0x800-0x1000 > punch_hole 0x0 0x800 0x800 Unwritten, page contains zeros 0x0-0x1000 > mapread 0x0 0x800 0x800 Should map a page full of zeros as it is either a read over an unwritten extent or a hole, or it finds a page cache page that is fully zeroed. The only wrinkle in this is if the write is direct IO, but then the truncate would see a written extent and this whole problem doesn't occur. So, more info required. :P Cheers, Dave.
On Tue, Oct 31, 2017 at 10:11:31AM -0700, Darrick J. Wong wrote: > On Tue, Oct 31, 2017 at 06:09:58PM +0800, Eryu Guan wrote: > > On Fri, Oct 27, 2017 at 11:05:29PM -0700, Christoph Hellwig wrote: > > > On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote: > > > > But it's possible that a buffer write overwrites the unwritten > > > > extent, which won't be converted to a normal extent until I/O > > > > completion, and iomap_truncate_page() skips zeroing wrongly because > > > > of the not-converted unwritten extent. This would cause a subsequent > > > > mmap read sees non-zeros beyond EOF. > > > > > > I suspect the right fix is to look at the in-core state im the iomap > > > truncate helpers instead of doing a duplicate flush. > > > > I may (and very likely) miss something, but my understanding is that > > iomap_truncate_page() already looks at the in-core extent state provided > > by xfs_file_iomap_begin() > > > > xfs_setattr_size() > > iomap_truncate_page() > > iomap_zero_range() > > iomap_apply() > > xfs_file_iomap_begin() # finds extent according to the range > > iomap_zero_range_actor() # sees iomap->type == IOMAP_UNWRITTEN > > > > And this in-core extent state won't be converted to XFS_EXT_NORM until > > writeback & I/O completion, so I think a flush is required. > > If we get to iomap_zero_range_actor & type == UNWRITTEN, can we ask the > pagecache if there's a page backing the (unwritten) range and if so zero > the appropriate parts of that page? The partial page zeroing in the page cache should be taken care of by truncate_setsize truncate_pagecache() truncate_inode_pages() truncate_inode_pages_range() that follows up the on-disk EOF zeroing. See the partial page zeroing in that function.... Cheers, Dave.
On Wed, Nov 01, 2017 at 09:58:04AM +1100, Dave Chinner wrote: > On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote: > > On truncate down, if new size is not block size aligned, we zero the > > rest of block via iomap_truncate_page() to avoid exposing stale data > > to user, and iomap_truncate_page() skips zeroing if the range is > > already in unwritten status or a hole. > > Unless the page is in the page cache already, and then it gets > zeroed in memory as part of truncate_setsize() call. > > > But it's possible that a buffer write overwrites the unwritten > > extent, which won't be converted to a normal extent until I/O > > completion, and iomap_truncate_page() skips zeroing wrongly because > > of the not-converted unwritten extent. This would cause a subsequent > > mmap read sees non-zeros beyond EOF. > > Yes, it should skip the zeroing on disk. The page in the page cache > over the unwritten extent will be zeroed on read. > > The real question is this: where are the zeros in the page that fsx > is complaining about? The partial block that iomap_truncate_page() skipped zeroing was latter written back to disk, and the punch_hole before mmap read invalidated the page cache so mmap read from disk and saw non-zeros. This is a hard-to-hit sequence, it took me almost 2000 iterations of generic/112 runs to hit one failure. I'll provide more details below. > > > I occasionally see this in fsx runs in fstests generic/112, a > > simplified fsx operation sequence is like (assuming 4k block size > > xfs): > > What should have is: > > > fallocate 0x0 0x1000 0x0 keep_size > > Unwritten, no data. Yes, assuming 4k block size and 4k page size, unwritten extent with 1 block allocated, i_size stays 0. > > > write 0x0 0x1000 0x0 > > Unwritten, contains data in page cache. Exactly, and in-core i_size is 4k now, but on-disk di_size is still 0. > > > truncate 0x0 0x800 0x1000 > > Unwritten, page contains data 0-0x800, zeros 0x800-0x1000 Yes, the page cache after truncate is correct. But before we zero the page cache (in truncate_setsize()), we skipped zeroing the partial block range 0x800-0x1000 and then triggered a writeback on range [di_size, newsize], which was 0-0x800, and 0x800-0x1000 was written back to disk too, which contained non-zeros. (newsize(2k) > di_size(0) && oldsize(4k) != di_size(0)) was true. if (did_zeroing || (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) { error = filemap_write_and_wait_range(mapping, ip->i_d.di_size, newsize - 1); if (error) return error; } > > > punch_hole 0x0 0x800 0x800 > > Unwritten, page contains zeros 0x0-0x1000 i_mapping had no pages (nrpages == 0) after punch_hole. > > > mapread 0x0 0x800 0x800 pagefault read block from disk, 0-0x7ff was zero, but 0x800-0xfff was non-zero. > > Should map a page full of zeros as it is either a read over an > unwritten extent or a hole, or it finds a page cache page that is > fully zeroed. > > The only wrinkle in this is if the write is direct IO, but > then the truncate would see a written extent and this whole problem > doesn't occur. > > So, more info required. :P Above is what I observed in debugging, maybe I should have provided more details in the first place. (I did add some comments to the fstests case though..). Thanks, Eryu > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 01, 2017 at 11:46:39AM +0800, Eryu Guan wrote: > On Wed, Nov 01, 2017 at 09:58:04AM +1100, Dave Chinner wrote: > > On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote: > > > On truncate down, if new size is not block size aligned, we zero the > > > rest of block via iomap_truncate_page() to avoid exposing stale data > > > to user, and iomap_truncate_page() skips zeroing if the range is > > > already in unwritten status or a hole. > > > > Unless the page is in the page cache already, and then it gets > > zeroed in memory as part of truncate_setsize() call. > > > > > But it's possible that a buffer write overwrites the unwritten > > > extent, which won't be converted to a normal extent until I/O > > > completion, and iomap_truncate_page() skips zeroing wrongly because > > > of the not-converted unwritten extent. This would cause a subsequent > > > mmap read sees non-zeros beyond EOF. > > > > Yes, it should skip the zeroing on disk. The page in the page cache > > over the unwritten extent will be zeroed on read. > > > > The real question is this: where are the zeros in the page that fsx > > is complaining about? > > The partial block that iomap_truncate_page() skipped zeroing was latter > written back to disk, and the punch_hole before mmap read invalidated > the page cache so mmap read from disk and saw non-zeros. This is a > hard-to-hit sequence, it took me almost 2000 iterations of generic/112 > runs to hit one failure. I'll provide more details below. Oh, ok, so they weren't close together operations but far apart in the trace. I usually indicate that by showing [....] lines between the operations if there's stuff that occurred between them. > > > simplified fsx operation sequence is like (assuming 4k block size > > > xfs): > > > > What should have is: > > > > > fallocate 0x0 0x1000 0x0 keep_size > > > > Unwritten, no data. > > Yes, assuming 4k block size and 4k page size, unwritten extent with 1 > block allocated, i_size stays 0. > > > > > > write 0x0 0x1000 0x0 > > > > Unwritten, contains data in page cache. > > Exactly, and in-core i_size is 4k now, but on-disk di_size is still 0. > > > > > > truncate 0x0 0x800 0x1000 > > > > Unwritten, page contains data 0-0x800, zeros 0x800-0x1000 > > Yes, the page cache after truncate is correct. But before we zero the > page cache (in truncate_setsize()), we skipped zeroing the partial block > range 0x800-0x1000 and then triggered a writeback on range > [di_size, newsize], which was 0-0x800, and 0x800-0x1000 was written back > to disk too, which contained non-zeros. > > (newsize(2k) > di_size(0) && oldsize(4k) != di_size(0)) was true. > > if (did_zeroing || > (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) { > error = filemap_write_and_wait_range(mapping, ip->i_d.di_size, > newsize - 1); > if (error) > return error; > } Ok, so we're writing data between di_size and newsize before removing the page cache beyond newsize. As such, the page of data that newsize lies in has not been zeroed by page cache invalidation before it is written. Ok, that explains why the EOF page zeroing in xfs_do_writepage() isn't catching this - we haven't updated the inode size yet. IOWs, the /three places/ where we normally catch this and zero the partial tail page beyond EOF are not doing it because: 1. iomap_truncate_page() sees unwritten and skips. 2. truncate_setsize() has not yet been called so can't zero the tail of the page. 3. we haven't changed where EOF is yet, so xfs_do_writepage() hasn't triggered it's "zero data beyond EOF" case before it sends the page to disk. So, we have three options here: 1. iomap_truncate_page() always zeros 2. update inode size before writeback after zeroing so the xfs_do_writepage() zeros the tail page, or 3. move truncate_setsize() to before writeback so the page cache invalidation zeros the part page at the new EOF. I think 1) is a no-go for performance reasons. 2) is better, but I don't like the idea of separating the page cache invalidation from the size truncation. That leaves 3) - moving truncate_setsize(). I think I prefer 3) because it triggers multiple layers of defense against writing stale data past EOF, and from an crash behaviour point of view it makes no difference whether we truncate the page cache before or after triggering writeback because it will just make the result the same as if we were zeroing a written extent.... Cheers, Dave.
On Wed, Nov 01, 2017 at 03:44:51PM +1100, Dave Chinner wrote: > On Wed, Nov 01, 2017 at 11:46:39AM +0800, Eryu Guan wrote: > > On Wed, Nov 01, 2017 at 09:58:04AM +1100, Dave Chinner wrote: > > > On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote: > > > > On truncate down, if new size is not block size aligned, we zero the > > > > rest of block via iomap_truncate_page() to avoid exposing stale data > > > > to user, and iomap_truncate_page() skips zeroing if the range is > > > > already in unwritten status or a hole. > > > > > > Unless the page is in the page cache already, and then it gets > > > zeroed in memory as part of truncate_setsize() call. > > > > > > > But it's possible that a buffer write overwrites the unwritten > > > > extent, which won't be converted to a normal extent until I/O > > > > completion, and iomap_truncate_page() skips zeroing wrongly because > > > > of the not-converted unwritten extent. This would cause a subsequent > > > > mmap read sees non-zeros beyond EOF. > > > > > > Yes, it should skip the zeroing on disk. The page in the page cache > > > over the unwritten extent will be zeroed on read. > > > > > > The real question is this: where are the zeros in the page that fsx > > > is complaining about? > > > > The partial block that iomap_truncate_page() skipped zeroing was latter > > written back to disk, and the punch_hole before mmap read invalidated > > the page cache so mmap read from disk and saw non-zeros. This is a > > hard-to-hit sequence, it took me almost 2000 iterations of generic/112 > > runs to hit one failure. I'll provide more details below. > > Oh, ok, so they weren't close together operations but far apart in > the trace. I usually indicate that by showing [....] lines between > the operations if there's stuff that occurred between them. They are not strictly one-by-one operations in the original fsxops log, but are close enough. Then I tailored the ops into a minimal step-by-step reproducer. > > > > > simplified fsx operation sequence is like (assuming 4k block size > > > > xfs): > > > > > > What should have is: > > > > > > > fallocate 0x0 0x1000 0x0 keep_size > > > > > > Unwritten, no data. > > > > Yes, assuming 4k block size and 4k page size, unwritten extent with 1 > > block allocated, i_size stays 0. > > > > > > > > > write 0x0 0x1000 0x0 > > > > > > Unwritten, contains data in page cache. > > > > Exactly, and in-core i_size is 4k now, but on-disk di_size is still 0. > > > > > > > > > truncate 0x0 0x800 0x1000 > > > > > > Unwritten, page contains data 0-0x800, zeros 0x800-0x1000 > > > > Yes, the page cache after truncate is correct. But before we zero the > > page cache (in truncate_setsize()), we skipped zeroing the partial block > > range 0x800-0x1000 and then triggered a writeback on range > > [di_size, newsize], which was 0-0x800, and 0x800-0x1000 was written back > > to disk too, which contained non-zeros. > > > > (newsize(2k) > di_size(0) && oldsize(4k) != di_size(0)) was true. > > > > if (did_zeroing || > > (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) { > > error = filemap_write_and_wait_range(mapping, ip->i_d.di_size, > > newsize - 1); > > if (error) > > return error; > > } > > Ok, so we're writing data between di_size and newsize before > removing the page cache beyond newsize. As such, the page of data > that newsize lies in has not been zeroed by page cache invalidation > before it is written. > > Ok, that explains why the EOF page zeroing in xfs_do_writepage() > isn't catching this - we haven't updated the inode size yet. > > IOWs, the /three places/ where we normally catch this and zero the > partial tail page beyond EOF are not doing it because: > > 1. iomap_truncate_page() sees unwritten and skips. > 2. truncate_setsize() has not yet been called so can't > zero the tail of the page. > 3. we haven't changed where EOF is yet, so > xfs_do_writepage() hasn't triggered it's "zero data > beyond EOF" case before it sends the page to disk. > > So, we have three options here: > > 1. iomap_truncate_page() always zeros > 2. update inode size before writeback after zeroing so the > xfs_do_writepage() zeros the tail page, or > 3. move truncate_setsize() to before writeback so the page > cache invalidation zeros the part page at the new EOF. This really helps summarize the problem and solution, thanks! Yeah, I started to realize moving the order of writeback vs setsize around might be a fix when I was writing my last reply - explaining the problem to someone else really helps understand the problem itself :) > > I think 1) is a no-go for performance reasons. 2) is better, but > I don't like the idea of separating the page cache invalidation > from the size truncation. That leaves 3) - moving > truncate_setsize(). > > I think I prefer 3) because it triggers multiple layers of defense > against writing stale data past EOF, and from an crash behaviour > point of view it makes no difference whether we truncate the page > cache before or after triggering writeback because it will just make > the result the same as if we were zeroing a written extent.... I'm testing an updated patch based on option 3 now, the finished tests look good. I'll send the new version out for review soon. Thanks a lot for the suggestion and review! Eryu -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 17081c77ef86..cd4c612cc23c 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -824,6 +824,7 @@ xfs_setattr_size( { struct xfs_mount *mp = ip->i_mount; struct inode *inode = VFS_I(ip); + struct address_space *mapping = inode->i_mapping; xfs_off_t oldsize, newsize; struct xfs_trans *tp; int error; @@ -878,8 +879,22 @@ xfs_setattr_size( if (newsize > oldsize) { error = xfs_zero_eof(ip, newsize, oldsize, &did_zeroing); } else { - error = iomap_truncate_page(inode, newsize, &did_zeroing, - &xfs_iomap_ops); + /* + * Firstly write out the range that will be zeroed, otherwise + * iomap_truncate_page() may skip zeroing this range wrongly. + * Because the range could still be dirty after buffer writes + * over unwritten extents, and the extent still stays in + * UNWRITTEN state, because the conversion only happens at I/O + * completion. + */ + unsigned int blocksize = i_blocksize(inode); + unsigned int off = newsize & (blocksize - 1); + if (off && mapping->nrpages) + error = filemap_write_and_wait_range(mapping, newsize, + newsize + (blocksize - off) - 1); + if (!error) + error = iomap_truncate_page(inode, newsize, + &did_zeroing, &xfs_iomap_ops); } if (error) @@ -895,8 +910,8 @@ xfs_setattr_size( */ if (did_zeroing || (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) { - error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, - ip->i_d.di_size, newsize); + error = filemap_write_and_wait_range(mapping, ip->i_d.di_size, + newsize - 1); if (error) return error; }
On truncate down, if new size is not block size aligned, we zero the rest of block via iomap_truncate_page() to avoid exposing stale data to user, and iomap_truncate_page() skips zeroing if the range is already in unwritten status or a hole. But it's possible that a buffer write overwrites the unwritten extent, which won't be converted to a normal extent until I/O completion, and iomap_truncate_page() skips zeroing wrongly because of the not-converted unwritten extent. This would cause a subsequent mmap read sees non-zeros beyond EOF. I occasionally see this in fsx runs in fstests generic/112, a simplified fsx operation sequence is like (assuming 4k block size xfs): fallocate 0x0 0x1000 0x0 keep_size write 0x0 0x1000 0x0 truncate 0x0 0x800 0x1000 punch_hole 0x0 0x800 0x800 mapread 0x0 0x800 0x800 This is introduced along with the switch to iomap based buffer write support, commit 68a9f5e7007c ("xfs: implement iomap based buffered write path"), because previously, block_truncate_page() did fs block based check and only skipped holes. Fix it by flushing the range we're going to zero to ensure iomap_truncate_page() sees the final extent state and zeros the range correctly. Also fixed the wrong 'end' param of filemap_write_and_wait_range() call while we're at it, the 'end' is inclusive and should be 'newsize - 1'. Signed-off-by: Eryu Guan <eguan@redhat.com> --- An fstests test case followed, I've verified it could reproduce the bug reliably and quickly. fs/xfs/xfs_iops.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-)