Message ID | 20221028131109.977581-1-bfoster@redhat.com (mailing list archive)
---|---
State | Deferred, archived
Series | xfs: redirty eof folio on truncate to avoid filemap flush
On Fri, Oct 28, 2022 at 09:11:09AM -0400, Brian Foster wrote:
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>
> Here's a quick prototype of "option 3" described in my previous mail.
> This has been spot tested and confirmed to prevent the original stale
> data exposure problem. More thorough regression testing is still
> required. Barring unforeseen issues with that, however, I think this is
> tentatively my new preferred option. The primary reason for that is it
> avoids looking at extent state and is more in line with what iomap based
> zeroing should be doing more generically.
>
> Because of that, I think this provides a bit more opportunity for follow
> on fixes (there are other truncate/zeroing problems I've come across
> during this investigation that still need fixing), cleanup and
> consolidation of the zeroing code. For example, I think the trajectory
> of this could look something like:
>
> - Genericize a bit more to handle all truncates.
> - Repurpose iomap_truncate_page() (currently only used by XFS) into a
>   unique implementation from zero range that does explicit zeroing
>   instead of relying on pagecache truncate.
> - Refactor XFS ranged zeroing to an abstraction that uses a combination
>   of iomap_zero_range() and the new iomap_truncate_page().
>

After playing with this and thinking a bit more about the above, I think
I managed to come up with an iomap_truncate_page() prototype that DTRT
based on this. Only spot tested so far, needs to pass iomap_flags to the
other bmbt_to_iomap() calls to handle the cow fork, undoubtedly has
other bugs/warts, etc. etc. This is just a quick prototype to
demonstrate the idea, which is essentially to check dirty state along
with extent state while under lock and transfer that state back to iomap
so it can decide whether it can shortcut or forcibly perform the zero.

In a nutshell, IOMAP_TRUNC_PAGE asks the fs to check dirty state while
under lock and implies that the range is sub-block (single page).
IOMAP_F_TRUNC_PAGE on the imap informs iomap that the range was in fact
dirty, so perform the zero via buffered write regardless of extent
state.

Brian

--- 8< ---

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 91ee0b308e13..14a9734b2838 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -899,7 +899,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	loff_t written = 0;
 
 	/* already zeroed? we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if ((srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) &&
+	    !(srcmap->flags & IOMAP_F_TRUNC_PAGE))
 		return length;
 
 	do {
@@ -916,6 +917,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
 
+		trace_printk("%d: ino 0x%lx offset 0x%lx bytes 0x%lx\n",
+			     __LINE__, folio->mapping->host->i_ino, offset, bytes);
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
@@ -933,6 +936,17 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	return written;
 }
 
+static int
+__iomap_zero_range(struct iomap_iter *iter, bool *did_zero,
+		const struct iomap_ops *ops)
+{
+	int ret;
+
+	while ((ret = iomap_iter(iter, ops)) > 0)
+		iter->processed = iomap_zero_iter(iter, did_zero);
+	return ret;
+}
+
 int
 iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		const struct iomap_ops *ops)
@@ -943,11 +957,8 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		.len		= len,
 		.flags		= IOMAP_ZERO,
 	};
-	int ret;
 
-	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_zero_iter(&iter, did_zero);
-	return ret;
+	return __iomap_zero_range(&iter, did_zero, ops);
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
 
@@ -957,11 +968,17 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 {
 	unsigned int blocksize = i_blocksize(inode);
 	unsigned int off = pos & (blocksize - 1);
+	struct iomap_iter iter = {
+		.inode		= inode,
+		.pos		= pos,
+		.len		= blocksize - off,
+		.flags		= IOMAP_ZERO | IOMAP_TRUNC_PAGE,
+	};
 
 	/* Block boundary? Nothing to do */
 	if (!off)
 		return 0;
-	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
+	return __iomap_zero_range(&iter, did_zero, ops);
 }
 EXPORT_SYMBOL_GPL(iomap_truncate_page);
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 07da03976ec1..16d9b838e82d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -915,6 +915,7 @@ xfs_buffered_write_iomap_begin(
 	int			allocfork = XFS_DATA_FORK;
 	int			error = 0;
 	unsigned int		lockmode = XFS_ILOCK_EXCL;
+	u16			iomap_flags = 0;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -942,6 +943,10 @@ xfs_buffered_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
+	if ((flags & IOMAP_TRUNC_PAGE) &&
+	    filemap_range_needs_writeback(VFS_I(ip)->i_mapping, offset, offset))
+		iomap_flags |= IOMAP_F_TRUNC_PAGE;
+
 	/*
 	 * Search the data fork first to look up our source mapping. We
 	 * always need the data fork map, as we have to return it to the
@@ -1100,7 +1105,7 @@ xfs_buffered_write_iomap_begin(
 
 found_imap:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0);
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags);
 
 found_cow:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2e10e1c66ad6..3c40a81d6da0 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -840,16 +840,6 @@ xfs_setattr_size(
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
 	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-				newsize);
-		if (error)
-			return error;
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 238a03087e17..6f9b4f991720 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -56,6 +56,7 @@ struct vm_fault;
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
 #define IOMAP_F_ZONE_APPEND	0x20
+#define IOMAP_F_TRUNC_PAGE	0x40
 
 /*
  * Flags set by the core iomap code during operations:
@@ -146,6 +147,7 @@ struct iomap_page_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
+#define IOMAP_TRUNC_PAGE	(1 << 9)
 
 struct iomap_ops {
 	/*
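(To make the sub-block range math in iomap_truncate_page() above concrete,
here is a worked example with illustrative numbers only; it is not part of
the patch. For a truncate to a position part way into a block, the range
asked of iomap runs from the new EOF to the end of that block:)

	#include <stdio.h>

	/* Worked example of the iomap_truncate_page() range math,
	 * assuming a 4096-byte block size (illustrative only). */
	int main(void)
	{
		unsigned int blocksize = 4096;
		long long pos = 0x1234;			/* new EOF from the truncate */
		unsigned int off = pos & (blocksize - 1);	/* 0x234: offset into the EOF block */

		if (!off)	/* block-aligned EOF: nothing to zero */
			return 0;
		printf("zero [0x%llx, 0x%llx): 0x%x bytes to the end of the block\n",
		       pos, pos + (blocksize - off), blocksize - off);
		return 0;
	}

(The zeroed range is always smaller than one block, which is why the
prototype can treat IOMAP_TRUNC_PAGE as implying a sub-block, single-page
range.)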
On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote:
> On Fri, Oct 28, 2022 at 09:11:09AM -0400, Brian Foster wrote:
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
[...]
> After playing with this and thinking a bit more about the above, I think
> I managed to come up with an iomap_truncate_page() prototype that DTRT
> based on this. Only spot tested so far, needs to pass iomap_flags to the
> other bmbt_to_iomap() calls to handle the cow fork, undoubtedly has
> other bugs/warts, etc. etc. This is just a quick prototype to
> demonstrate the idea, which is essentially to check dirty state along
> with extent state while under lock and transfer that state back to iomap
> so it can decide whether it can shortcut or forcibly perform the zero.
>
> In a nutshell, IOMAP_TRUNC_PAGE asks the fs to check dirty state while
> under lock and implies that the range is sub-block (single page).
> IOMAP_F_TRUNC_PAGE on the imap informs iomap that the range was in fact
> dirty, so perform the zero via buffered write regardless of extent
> state.

I'd much prefer we fix this in the iomap infrastructure - failing to
zero dirty data in memory over an unwritten extent isn't an XFS bug,
so we shouldn't be working around it in XFS like we did previously.

I don't think this should be called "IOMAP_TRUNC_PAGE", though,
because that indicates the caller context, not what we are asking
the internal iomap code to do. What we are really asking
iomap_zero_iter() to do is zero the page cache if it exists in
memory, otherwise ignore unwritten/hole pages. Hence I think a name
like IOMAP_ZERO_PAGECACHE is more appropriate.

[...]

> @@ -899,7 +899,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  	loff_t written = 0;
>  
>  	/* already zeroed? we're done. */
> -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> +	if ((srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) &&
> +	    !(srcmap->flags & IOMAP_F_TRUNC_PAGE))
>  		return length;

Why even involve the filesystem in this? We can do this directly
in iomap_zero_iter() with:

	if (srcmap->type == IOMAP_HOLE)
		return;
	if (srcmap->type == IOMAP_UNWRITTEN) {
		if (!(iter->flags & IOMAP_ZERO_PAGECACHE))
			return;
		if (!filemap_range_needs_writeback(inode->i_mapping,
				iomap->offset, iomap->offset + iomap->length))
			return;
	}

It probably also warrants a comment that a clean folio over EOF on an
unwritten extent already contains zeros, so we're only interested in
folios that *have been dirtied* over this extent. If it's under
writeback, we should still be zeroing because it will shortly
contain real data on disk and so it needs to be zeroed and
redirtied....

> +static int
> +__iomap_zero_range(struct iomap_iter *iter, bool *did_zero,
> +		const struct iomap_ops *ops)
> +{
> +	int ret;
> +
> +	while ((ret = iomap_iter(iter, ops)) > 0)
> +		iter->processed = iomap_zero_iter(iter, did_zero);
> +	return ret;
> +}

I'd just leave this simple loop open coded in the two callers.

[...]

> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 07da03976ec1..16d9b838e82d 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -942,6 +943,10 @@ xfs_buffered_write_iomap_begin(
>  	if (error)
>  		goto out_unlock;
>  
> +	if ((flags & IOMAP_TRUNC_PAGE) &&
> +	    filemap_range_needs_writeback(VFS_I(ip)->i_mapping, offset, offset))
> +		iomap_flags |= IOMAP_F_TRUNC_PAGE;

As per above, I don't think we should be putting this check in the
filesystem. That simplifies this a lot as filesystems don't need to
know anything about how iomap manages the page cache for the
filesystem...

Cheers,

Dave.
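(Putting Dave's sketch and the comment he asks for together, the check
might end up looking like this in iomap_zero_iter(). This is an editorial
sketch assembled from his reply, adjusted to return `length` the way the
surrounding function does; IOMAP_ZERO_PAGECACHE is only the flag name
proposed in this thread, not an existing iomap flag:)

	const struct iomap *srcmap = iomap_iter_srcmap(iter);
	loff_t length = iomap_length(iter);

	if (srcmap->type == IOMAP_HOLE)
		return length;
	if (srcmap->type == IOMAP_UNWRITTEN) {
		if (!(iter->flags & IOMAP_ZERO_PAGECACHE))
			return length;
		/*
		 * A clean folio over EOF on an unwritten extent already
		 * contains zeros, so we are only interested in folios that
		 * have been dirtied over this extent. A folio still under
		 * writeback will shortly contain real data on disk, so it
		 * must be zeroed and redirtied rather than skipped.
		 */
		if (!filemap_range_needs_writeback(iter->inode->i_mapping,
				iter->pos, iter->pos + length - 1))
			return length;
	}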
Hi Brian,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on linus/master v6.1-rc2 next-20221028]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Brian-Foster/xfs-redirty-eof-folio-on-truncate-to-avoid-filemap-flush/20221028-211253
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
patch link:    https://lore.kernel.org/r/20221028131109.977581-1-bfoster%40redhat.com
patch subject: [PATCH] xfs: redirty eof folio on truncate to avoid filemap flush
config: x86_64-randconfig-a014
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/b16e2c202472f3fc46e66c37fdc7c381b968af75
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Brian-Foster/xfs-redirty-eof-folio-on-truncate-to-avoid-filemap-flush/20221028-211253
        git checkout b16e2c202472f3fc46e66c37fdc7c381b968af75
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash fs/xfs/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/xfs/xfs_iops.c:863:8: warning: variable 'eof_dirty' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
                   if (folio_test_dirty(eof_folio) ||
                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/xfs/xfs_iops.c:867:9: note: uninitialized use occurs here
                   if (!eof_dirty) {
                       ^~~~~~~~~
   fs/xfs/xfs_iops.c:863:4: note: remove the 'if' if its condition is always true
                   if (folio_test_dirty(eof_folio) ||
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/xfs/xfs_iops.c:793:18: note: initialize the variable 'eof_dirty' to silence this warning
                   bool eof_dirty;
                                 ^
                                  = 0
   1 warning generated.


vim +863 fs/xfs/xfs_iops.c

   772	
   773	/*
   774	 * Truncate file.  Must have write permission and not be a directory.
   775	 *
   776	 * Caution: The caller of this function is responsible for calling
   777	 * setattr_prepare() or otherwise verifying the change is fine.
   778	 */
   779	STATIC int
   780	xfs_setattr_size(
   781		struct user_namespace	*mnt_userns,
   782		struct xfs_inode	*ip,
   783		struct iattr		*iattr)
   784	{
   785		struct xfs_mount	*mp = ip->i_mount;
   786		struct inode		*inode = VFS_I(ip);
   787		struct folio		*eof_folio = NULL;
   788		xfs_off_t		oldsize, newsize;
   789		struct xfs_trans	*tp;
   790		int			error;
   791		uint			lock_flags = 0;
   792		bool			did_zeroing = false;
   793		bool			eof_dirty;
   794	
   795		ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
   796		ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
   797		ASSERT(S_ISREG(inode->i_mode));
   798		ASSERT((iattr->ia_valid & (ATTR_UID|ATTR_GID|ATTR_ATIME|ATTR_ATIME_SET|
   799			ATTR_MTIME_SET|ATTR_TIMES_SET)) == 0);
   800	
   801		oldsize = inode->i_size;
   802		newsize = iattr->ia_size;
   803	
   804		/*
   805		 * Short circuit the truncate case for zero length files.
   806		 */
   807		if (newsize == 0 && oldsize == 0 && ip->i_df.if_nextents == 0) {
   808			if (!(iattr->ia_valid & (ATTR_CTIME|ATTR_MTIME)))
   809				return 0;
   810	
   811			/*
   812			 * Use the regular setattr path to update the timestamps.
   813			 */
   814			iattr->ia_valid &= ~ATTR_SIZE;
   815			return xfs_setattr_nonsize(mnt_userns, ip, iattr);
   816		}
   817	
   818		/*
   819		 * Make sure that the dquots are attached to the inode.
   820		 */
   821		error = xfs_qm_dqattach(ip);
   822		if (error)
   823			return error;
   824	
   825		/*
   826		 * Wait for all direct I/O to complete.
   827		 */
   828		inode_dio_wait(inode);
   829	
   830		/*
   831		 * File data changes must be complete before we start the transaction to
   832		 * modify the inode. This needs to be done before joining the inode to
   833		 * the transaction because the inode cannot be unlocked once it is a
   834		 * part of the transaction.
   835		 *
   836		 * Start with zeroing any data beyond EOF that we may expose on file
   837		 * extension, or zeroing out the rest of the block on a downward
   838		 * truncate.
   839		 */
   840		if (newsize > oldsize) {
   841			trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
   842			error = xfs_zero_range(ip, oldsize, newsize - oldsize,
   843					&did_zeroing);
   844		} else {
   845			/*
   846			 * iomap won't detect a dirty folio over an unwritten block (or
   847			 * a cow block over a hole) and subsequently skips zeroing the
   848			 * newly post-EOF portion of the folio. Doing a flush here (i.e.
   849			 * as is done for fallocate ZERO_RANGE) updates extent state for
   850			 * iomap, but has too much overhead for the truncate path.
   851			 *
   852			 * Instead, check whether the new EOF is dirty in pagecache. If
   853			 * so, hold a reference across the pagecache truncate and dirty
   854			 * the folio. This ensures that partial folio zeroing from the
   855			 * truncate makes it to disk in the rare event that iomap skips
   856			 * zeroing and writeback happens to complete before the
   857			 * pagecache truncate. Note that this really should be handled
   858			 * properly by iomap zero range.
   859			 */
   860			eof_folio = filemap_lock_folio(inode->i_mapping,
   861					(newsize - 1) >> PAGE_SHIFT);
   862			if (eof_folio) {
 > 863				if (folio_test_dirty(eof_folio) ||
   864				    folio_test_writeback(eof_folio))
   865					eof_dirty = true;
   866				folio_unlock(eof_folio);
   867				if (!eof_dirty) {
   868					folio_put(eof_folio);
   869					eof_folio = NULL;
   870				}
   871			}
   872			error = xfs_truncate_page(ip, newsize, &did_zeroing);
   873		}
   874	
   875		if (error) {
   876			if (eof_folio)
   877				folio_put(eof_folio);
   878			return error;
   879		}
   880	
   881		/*
   882		 * We've already locked out new page faults, so now we can safely remove
   883		 * pages from the page cache knowing they won't get refaulted until we
   884		 * drop the XFS_MMAP_EXCL lock after the extent manipulations are
   885		 * complete. The truncate_setsize() call also cleans partial EOF page
   886		 * PTEs on extending truncates and hence ensures sub-page block size
   887		 * filesystems are correctly handled, too.
   888		 *
   889		 * We have to do all the page cache truncate work outside the
   890		 * transaction context as the "lock" order is page lock->log space
   891		 * reservation as defined by extent allocation in the writeback path.
   892		 * Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but
   893		 * having already truncated the in-memory version of the file (i.e. made
   894		 * user visible changes). There's not much we can do about this, except
   895		 * to hope that the caller sees ENOMEM and retries the truncate
   896		 * operation.
   897		 *
   898		 * And we update in-core i_size and truncate page cache beyond newsize
   899		 * before writeback the [i_disk_size, newsize] range, so we're
   900		 * guaranteed not to write stale data past the new EOF on truncate down.
   901		 */
   902		truncate_setsize(inode, newsize);
   903		if (eof_folio) {
   904			trace_printk("%d: ino 0x%llx newsize 0x%llx folio idx 0x%lx did_zeroing %d\n",
   905				__LINE__, ip->i_ino, newsize, folio_index(eof_folio), did_zeroing);
   906			if (!did_zeroing) {
   907				filemap_dirty_folio(inode->i_mapping, eof_folio);
   908			 	did_zeroing = true;
   909			}
   910			folio_put(eof_folio);
   911		}
   912	
   913		/*
   914		 * We are going to log the inode size change in this transaction so
   915		 * any previous writes that are beyond the on disk EOF and the new
   916		 * EOF that have not been written out need to be written here.  If we
   917		 * do not write the data out, we expose ourselves to the null files
   918		 * problem. Note that this includes any block zeroing we did above;
   919		 * otherwise those blocks may not be zeroed after a crash.
   920		 */
   921		if (did_zeroing ||
   922		    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
   923			error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
   924							ip->i_disk_size, newsize - 1);
   925			if (error)
   926				return error;
   927		}
   928	
   929		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
   930		if (error)
   931			return error;
   932	
   933		lock_flags |= XFS_ILOCK_EXCL;
   934		xfs_ilock(ip, XFS_ILOCK_EXCL);
   935		xfs_trans_ijoin(tp, ip, 0);
   936	
   937		/*
   938		 * Only change the c/mtime if we are changing the size or we are
   939		 * explicitly asked to change it.  This handles the semantic difference
   940		 * between truncate() and ftruncate() as implemented in the VFS.
   941		 *
   942		 * The regular truncate() case without ATTR_CTIME and ATTR_MTIME is a
   943		 * special case where we need to update the times despite not having
   944		 * these flags set.  For all other operations the VFS set these flags
   945		 * explicitly if it wants a timestamp update.
   946		 */
   947		if (newsize != oldsize &&
   948		    !(iattr->ia_valid & (ATTR_CTIME | ATTR_MTIME))) {
   949			iattr->ia_ctime = iattr->ia_mtime =
   950				current_time(inode);
   951			iattr->ia_valid |= ATTR_CTIME | ATTR_MTIME;
   952		}
   953	
   954		/*
   955		 * The first thing we do is set the size to new_size permanently on
   956		 * disk.  This way we don't have to worry about anyone ever being able
   957		 * to look at the data being freed even in the face of a crash.
   958		 * What we're getting around here is the case where we free a block, it
   959		 * is allocated to another file, it is written to, and then we crash.
   960		 * If the new data gets written to the file but the log buffers
   961		 * containing the free and reallocation don't, then we'd end up with
   962		 * garbage in the blocks being freed.  As long as we make the new size
   963		 * permanent before actually freeing any blocks it doesn't matter if
   964		 * they get written to.
   965		 */
   966		ip->i_disk_size = newsize;
   967		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
   968	
   969		if (newsize <= oldsize) {
   970			error = xfs_itruncate_extents(&tp, ip, XFS_DATA_FORK, newsize);
   971			if (error)
   972				goto out_trans_cancel;
   973	
   974			/*
   975			 * Truncated "down", so we're removing references to old data
   976			 * here - if we delay flushing for a long time, we expose
   977			 * ourselves unduly to the notorious NULL files problem.  So,
   978			 * we mark this inode and flush it when the file is closed,
   979			 * and do not wait the usual (long) time for writeout.
   980			 */
   981			xfs_iflags_set(ip, XFS_ITRUNCATED);
   982	
   983			/* A truncate down always removes post-EOF blocks. */
   984			xfs_inode_clear_eofblocks_tag(ip);
   985		}
   986	
   987		ASSERT(!(iattr->ia_valid & (ATTR_UID | ATTR_GID)));
   988		setattr_copy(mnt_userns, inode, iattr);
   989		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
   990	
   991		XFS_STATS_INC(mp, xs_ig_attrchg);
   992	
   993		if (xfs_has_wsync(mp))
   994			xfs_trans_set_sync(tp);
   995	
   996		error = xfs_trans_commit(tp);
   997	out_unlock:
   998		if (lock_flags)
   999			xfs_iunlock(ip, lock_flags);
  1000		return error;
  1001	
  1002	out_trans_cancel:
  1003		xfs_trans_cancel(tp);
  1004		goto out_unlock;
  1005	}
  1006	
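(The fix the robot is asking for is the one-line initialization it hints
at: eof_dirty is only assigned inside the `if (eof_folio)` branch when the
folio is dirty or under writeback, so the later `!eof_dirty` test otherwise
reads an uninitialized value. A minimal fix against the patch above:)

	-	bool			eof_dirty;
	+	bool			eof_dirty = false;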
On Sat, Oct 29, 2022 at 08:30:14AM +1100, Dave Chinner wrote:
> On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote:
[...]
> I'd much prefer we fix this in the iomap infrastructure - failing to
> zero dirty data in memory over an unwritten extent isn't an XFS bug,
> so we shouldn't be working around it in XFS like we did previously.

Hmm, I think I agree, given that this is really a bug in cache handling.
Or so I gather; reading on...

> I don't think this should be called "IOMAP_TRUNC_PAGE", though,
> because that indicates the caller context, not what we are asking
> the internal iomap code to do. What we are really asking
> iomap_zero_iter() to do is zero the page cache if it exists in
> memory, otherwise ignore unwritten/hole pages. Hence I think a name
> like IOMAP_ZERO_PAGECACHE is more appropriate.

I don't even like ZERO_PAGECACHE -- in my mind that implies that it
unconditionally zeroes any page it finds, whereas we really only want it
to zero dirty cache contents. IOMAP_ZERO_DIRTY_CACHE?

[...]

> It probably also warrants a comment that a clean folio over EOF on an
> unwritten extent already contains zeros, so we're only interested in
> folios that *have been dirtied* over this extent. If it's under
> writeback, we should still be zeroing because it will shortly
> contain real data on disk and so it needs to be zeroed and
> redirtied....

Yes, please. I think we've screwed up eofpage handling enough in the
past to warrant good documentation of these corner cases.

[...]

> As per above, I don't think we should be putting this check in the
> filesystem. That simplifies this a lot as filesystems don't need to
> know anything about how iomap manages the page cache for the
> filesystem...

From the bug description, this appears to me to be a problem with how we
manage the page cache during a truncation when the eofpage is backed by
unwritten extents. As such, I think that this should be a fix within
iomap.

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Fri, Oct 28, 2022 at 04:49:37PM -0700, Darrick J. Wong wrote:
> On Sat, Oct 29, 2022 at 08:30:14AM +1100, Dave Chinner wrote:
[...]
> > I don't think this should be called "IOMAP_TRUNC_PAGE", though,
> > because that indicates the caller context, not what we are asking
> > the internal iomap code to do. What we are really asking
> > iomap_zero_iter() to do is zero the page cache if it exists in
> > memory, otherwise ignore unwritten/hole pages. Hence I think a name
> > like IOMAP_ZERO_PAGECACHE is more appropriate.
>
> I don't even like ZERO_PAGECACHE -- in my mind that implies that it
> unconditionally zeroes any page it finds, whereas we really only want it
> to zero dirty cache contents. IOMAP_ZERO_DIRTY_CACHE?

Fine by me, the name just needs to describe the action that needs to
be performed....

[...]

> From the bug description, this appears to me to be a problem with how we
> manage the page cache during a truncation when the eofpage is backed by
> unwritten extents.

Right, think of iomap_truncate_page() as having exactly the same
responsibilities as block_truncate_page() has for filesystems using
bufferheads. i.e. both functions need to ensure the disk contents
are correctly zeroed such that the caller can safely call
truncate_setsize() afterwards resulting in both the on-disk state
and in-memory state remaining coherent.

Hence iomap_truncate_page() needs to ensure that we handle dirty
data over unwritten extents correctly.

If we look further, it is obvious that iomap already has this
responsibility: SEEK_HOLE/SEEK_DATA does page cache lookups to find
data over unwritten extents. Hence it makes no sense for one part of
iomap to take responsibility for managing data over unwritten
extents, whilst another part ignores it...

If there was another filesystem using iomap and unwritten extents,
it would have exactly the same issues with iomap_truncate_page().
Hence:

> As such, I think that this should be a fix within iomap.

This.

-Dave.
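(For reference, this is the asymmetry Dave is pointing at: the seek code
already treats an unwritten extent as data only where the page cache
actually holds something. A rough sketch of that SEEK_DATA logic follows,
paraphrased from memory of fs/iomap/seek.c around this time - treat the
names as approximate, not a verbatim quote:)

	/*
	 * Paraphrased sketch of iomap's SEEK_DATA iteration: written
	 * extents are always data, holes never are, and unwritten
	 * extents are data only where the page cache has folios.
	 */
	static loff_t iomap_seek_data_iter(const struct iomap_iter *iter,
			loff_t *found_pos)
	{
		loff_t length = iomap_length(iter);

		switch (iter->iomap.type) {
		case IOMAP_HOLE:
			return length;		/* keep searching */
		case IOMAP_UNWRITTEN:
			/* ask the page cache whether any of this range is data */
			*found_pos = mapping_seek_hole_data(iter->inode->i_mapping,
					iter->pos, iter->pos + length, SEEK_DATA);
			if (*found_pos < 0)	/* nothing cached: not data */
				return length;
			return 0;
		default:			/* written extent: always data */
			*found_pos = iter->pos;
			return 0;
		}
	}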
On Sun, Oct 30, 2022 at 09:01:31AM +1100, Dave Chinner wrote:
> Right, think of iomap_truncate_page() as having exactly the same
> responsibilities as block_truncate_page() has for filesystems using
> bufferheads. i.e. both functions need to ensure the disk contents
> are correctly zeroed such that the caller can safely call
> truncate_setsize() afterwards resulting in both the on-disk state
> and in-memory state remaining coherent.

XFS always had these kinds of hacks even when it was using
block_truncate_page. That being said, I fully agree that handling it
in iomap is the absolutely right thing to do and we should have done
it earlier.
On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote:
> index 91ee0b308e13..14a9734b2838 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -899,7 +899,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  	loff_t written = 0;
>  
>  	/* already zeroed? we're done. */
> -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> +	if ((srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) &&
> +	    !(srcmap->flags & IOMAP_F_TRUNC_PAGE))

As mentioned elsewhere in the thread I think we can just move the
filemap_range_needs_writeback here, which removes the need for
IOMAP_F_TRUNC_PAGE.

> +static int
> +__iomap_zero_range(struct iomap_iter *iter, bool *did_zero,
> +		const struct iomap_ops *ops)
> +{
> +	int ret;
> +
> +	while ((ret = iomap_iter(iter, ops)) > 0)
> +		iter->processed = iomap_zero_iter(iter, did_zero);
> +	return ret;
> +}

I'd be tempted to just duplicate this two line loop instead of adding
such a trivial helper, but that's just a matter of taste.
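(Concretely, with the helper dropped as Christoph suggests,
iomap_truncate_page() would carry its own copy of the two-line loop.
An editorial sketch combining both of his comments - so the
IOMAP_TRUNC_PAGE flag is gone as well - not a posted patch:)

	int
	iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
			const struct iomap_ops *ops)
	{
		unsigned int blocksize = i_blocksize(inode);
		unsigned int off = pos & (blocksize - 1);
		struct iomap_iter iter = {
			.inode	= inode,
			.pos	= pos,
			.len	= blocksize - off,
			.flags	= IOMAP_ZERO,
		};
		int ret;

		/* Block boundary? Nothing to do */
		if (!off)
			return 0;

		/* the loop duplicated here instead of a shared helper */
		while ((ret = iomap_iter(&iter, ops)) > 0)
			iter.processed = iomap_zero_iter(&iter, did_zero);
		return ret;
	}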
On Sat, Oct 29, 2022 at 08:30:14AM +1100, Dave Chinner wrote:
> On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote:
[...]
> I'd much prefer we fix this in the iomap infrastructure - failing to
> zero dirty data in memory over an unwritten extent isn't an XFS bug,
> so we shouldn't be working around it in XFS like we did previously.
>

I agree, but that was the original goal from the start. It's easier said
than done to just have iomap accurately write/skip the appropriate
ranges...

> I don't think this should be called "IOMAP_TRUNC_PAGE", though,
> because that indicates the caller context, not what we are asking
> the internal iomap code to do. What we are really asking
> iomap_zero_iter() to do is zero the page cache if it exists in
> memory, otherwise ignore unwritten/hole pages. Hence I think a name
> like IOMAP_ZERO_PAGECACHE is more appropriate.
>

That was kind of the point for this prototype. The flag isn't just
asking iomap to perform some generic write behavior. It also indicates
this is a special snowflake mode with assumptions from the caller (i.e.,
unflushed, sub-block/partial range) that facilitate forced zeroing
internally.

I've since come to the conclusion that this approach is just premature.
It really only does the right thing in this very particular case,
otherwise there is potential for odd/unexpected behavior in sub-page
blocksize scenarios. It could be made to work more appropriately
eventually, but more thought and work is required and there's a jump in
complexity that isn't required to fix the immediate performance problem
and additional stale data exposure problems.

[...]

> Why even involve the filesystem in this? We can do this directly
> in iomap_zero_iter() with:
>
> 	if (srcmap->type == IOMAP_HOLE)
> 		return;
> 	if (srcmap->type == IOMAP_UNWRITTEN) {
> 		if (!(iter->flags & IOMAP_ZERO_PAGECACHE))
> 			return;
> 		if (!filemap_range_needs_writeback(inode->i_mapping,
> 				iomap->offset, iomap->offset + iomap->length))
> 			return;
> 	}
>

This reintroduces the same stale data exposure race fixed by the
original patch. folio writeback can complete and convert an extent
reported as unwritten such that zeroing will not occur when it should.

The writeback check either needs to be in the fs code where it can
prevent writeback completion from converting extents (under ilock), or
earlier in iomap such that we're guaranteed to see either writeback
state or the converted extent.

Given the above, I'm currently putting the prototype shown below through
some regression testing. It lifts the writeback check into iomap and
issues a flush and retry of the current iteration from iomap_zero_iter()
when necessary. In more focused testing thus far, this addresses the XFS
truncate performance problem by making the flush conditional, prevents
the original stale data exposure problem, and also addresses one or two
similar outstanding problems buried in write extending patterns and
whatnot.

Since this all exists in iomap, no fs changes are required to provide a
minimum guarantee of correctness (at the cost of additional, occasional
flushes). If there are scenarios where the performance cost is a
problem, an additional flag akin to the IOMAP_TRUNC_PAGE proposal can
still be added and conditionally set by the fs if/when it wants to make
that tradeoff. For example, XFS could pass some ->private state through
the iomap_truncate_page() variant that ->iomap_begin() could use to
trivially eliminate the flush in the truncate path, or implement a size
or range trimming heuristic to eliminate/mitigate it in others. If/when
iomap has enough information to do selective zeroing properly on its
own without fs intervention, the flush fallback can simply go away.

I have some patches around that demonstrate such things, but given the
current behavior of an unconditional flush and so far only seeing one
user report from an oddball pattern, I don't see much need for
additional hacks at the moment.

Thoughts? IMO, the iter retry and writeback check are both wonky and
this is something that should first go into ->iomap_begin(), and only
lift into iomap after a period of time. That's just my .02. I do plan to
clean up with a retry helper and add comments and whatnot if this
survives testing and/or any functional issues are worked out.

Brian

--- 8< ---

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 91ee0b308e13..649d94ad3808 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -894,17 +894,28 @@ EXPORT_SYMBOL_GPL(iomap_file_unshare);
 static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
+	struct inode *inode = iter->inode;
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	loff_t written = 0;
+	int status;
 
 	/* already zeroed? we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
-		return length;
+	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
+		if (!(srcmap->flags & IOMAP_F_DIRTY_CACHE))
+			return length;
+
+		status = filemap_write_and_wait_range(inode->i_mapping, pos,
+				pos + length - 1);
+		if (status)
+			return status;
+		/* XXX: hacked up iter retry */
+		iter->iomap.length = 0;
+		return 1;
+	}
 
 	do {
 		struct folio *folio;
-		int status;
 		size_t offset;
 		size_t bytes = min_t(u64, SIZE_MAX, length);
 
@@ -916,6 +927,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
 
+		trace_printk("%d: zero ino 0x%lx offset 0x%lx bytes 0x%lx\n",
+			     __LINE__, folio->mapping->host->i_ino, offset, bytes);
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index a1c7592d2ade..ebfcc07ace94 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -5,6 +5,7 @@
  */
 #include <linux/fs.h>
 #include <linux/iomap.h>
+#include <linux/pagemap.h>
 #include "trace.h"
 
 static inline int iomap_iter_advance(struct iomap_iter *iter)
@@ -28,15 +29,18 @@ static inline int iomap_iter_advance(struct iomap_iter *iter)
 	return 1;
 }
 
-static inline void iomap_iter_done(struct iomap_iter *iter)
+static inline void iomap_iter_done(struct iomap_iter *iter, u16 iomap_flags)
 {
 	WARN_ON_ONCE(iter->iomap.offset > iter->pos);
 	WARN_ON_ONCE(iter->iomap.length == 0);
 	WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);
 
+	iter->iomap.flags |= iomap_flags;
 	trace_iomap_iter_dstmap(iter->inode, &iter->iomap);
-	if (iter->srcmap.type != IOMAP_HOLE)
+	if (iter->srcmap.type != IOMAP_HOLE) {
+		iter->srcmap.flags |= iomap_flags;
 		trace_iomap_iter_srcmap(iter->inode, &iter->srcmap);
+	}
 }
 
 /**
@@ -57,6 +61,7 @@ static inline void iomap_iter_done(struct iomap_iter *iter)
 int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
 {
 	int ret;
+	u16 iomap_flags = 0;
 
 	if (iter->iomap.length && ops->iomap_end) {
 		ret = ops->iomap_end(iter->inode, iter->pos, iomap_length(iter),
@@ -71,10 +76,16 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
 	if (ret <= 0)
 		return ret;
 
+	if ((iter->flags & IOMAP_ZERO) &&
+	    filemap_range_needs_writeback(iter->inode->i_mapping, iter->pos,
+			iter->pos + iter->len - 1)) {
+		iomap_flags |= IOMAP_F_DIRTY_CACHE;
+	}
+
 	ret = ops->iomap_begin(iter->inode, iter->pos, iter->len, iter->flags,
 			       &iter->iomap, &iter->srcmap);
 	if (ret < 0)
 		return ret;
-	iomap_iter_done(iter);
+	iomap_iter_done(iter, iomap_flags);
 	return 1;
 }
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2e10e1c66ad6..3c40a81d6da0 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -840,16 +840,6 @@ xfs_setattr_size(
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
 	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-				newsize);
-		if (error)
-			return error;
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 238a03087e17..bbfbda7bc905 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -56,6 +56,7 @@ struct vm_fault;
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
 #define IOMAP_F_ZONE_APPEND	0x20
+#define IOMAP_F_DIRTY_CACHE	0x40
 
 /*
  * Flags set by the core iomap code during operations:
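(The "XXX: hacked up iter retry" above is the piece Brian says he plans to
wrap in a helper. A hypothetical version of that cleanup could look like
the following - the name and placement are editorial, not from the thread:)

	/*
	 * Hypothetical cleanup of the "hacked up iter retry" above (this
	 * helper does not exist in iomap): clearing iomap.length makes
	 * the next iomap_iter() call skip the advance and invoke
	 * ->iomap_begin() again for the same range, while the positive
	 * return value keeps the iteration loop going.
	 */
	static inline loff_t iomap_iter_retry(struct iomap_iter *iter)
	{
		iter->iomap.length = 0;
		return 1;
	}

(iomap_zero_iter() would then end the unwritten/hole branch with
"return iomap_iter_retry(iter);" after the flush.)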
On Thu, Nov 03, 2022 at 10:53:16AM -0400, Brian Foster wrote: > On Sat, Oct 29, 2022 at 08:30:14AM +1100, Dave Chinner wrote: > > On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote: > > > On Fri, Oct 28, 2022 at 09:11:09AM -0400, Brian Foster wrote: > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > > --- > > > > > > > > Here's a quick prototype of "option 3" described in my previous mail. > > > > This has been spot tested and confirmed to prevent the original stale > > > > data exposure problem. More thorough regression testing is still > > > > required. Barring unforeseen issues with that, however, I think this is > > > > tentatively my new preferred option. The primary reason for that is it > > > > avoids looking at extent state and is more in line with what iomap based > > > > zeroing should be doing more generically. > > > > > > > > Because of that, I think this provides a bit more opportunity for follow > > > > on fixes (there are other truncate/zeroing problems I've come across > > > > during this investigation that still need fixing), cleanup and > > > > consolidation of the zeroing code. For example, I think the trajectory > > > > of this could look something like: > > > > > > > > - Genericize a bit more to handle all truncates. > > > > - Repurpose iomap_truncate_page() (currently only used by XFS) into a > > > > unique implementation from zero range that does explicit zeroing > > > > instead of relying on pagecache truncate. > > > > - Refactor XFS ranged zeroing to an abstraction that uses a combination > > > > of iomap_zero_range() and the new iomap_truncate_page(). > > > > > > > > > > After playing with this and thinking a bit more about the above, I think > > > I managed to come up with an iomap_truncate_page() prototype that DTRT > > > based on this. Only spot tested so far, needs to pass iomap_flags to the > > > other bmbt_to_iomap() calls to handle the cow fork, undoubtedly has > > > other bugs/warts, etc. etc. This is just a quick prototype to > > > demonstrate the idea, which is essentially to check dirty state along > > > with extent state while under lock and transfer that state back to iomap > > > so it can decide whether it can shortcut or forcibly perform the zero. > > > > > > In a nutshell, IOMAP_TRUNC_PAGE asks the fs to check dirty state while > > > under lock and implies that the range is sub-block (single page). > > > IOMAP_F_TRUNC_PAGE on the imap informs iomap that the range was in fact > > > dirty, so perform the zero via buffered write regardless of extent > > > state. > > > > I'd much prefer we fix this in the iomap infrastructure - failing to > > zero dirty data in memory over an unwritten extent isn't an XFS bug, > > so we shouldn't be working around it in XFS like we did previously. > > > > I agree, but that was the original goal from the start. It's easier said > than done to just have iomap accurately write/skip the appropriate > ranges.. > > > I don't think this should be call "IOMAP_TRUNC_PAGE", though, > > because that indicates the caller context, not what we are asking > > the internal iomap code to do. What we are really asking is for > > iomap_zero_iter() to do is zero the page cache if it exists in > > memory, otherwise ignore unwritten/hole pages. Hence I think a name > > like IOMAP_ZERO_PAGECACHE is more appropriate, > > > > That was kind of the point for this prototype. The flag isn't just > asking iomap to perform some generic write behavior. 
It also indicates > this is a special snowflake mode with assumptions from the caller (i.e., > unflushed, sub-block/partial range) that facilitate forced zeroing > internally. > > I've since come to the conclusion that this approach is just premature. > It really only does the right thing in this very particular case, > otherwise there is potential for odd/unexpected behavior in sub-page > blocksize scenarios. It could be made to work more appropriately > eventually, but more thought and work is required and there's a jump in > complexity that isn't required to fix the immediate performance problem > and additional stale data exposure problems. > > > > > > > Brian > > > > > > --- 8< --- > > > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > > index 91ee0b308e13..14a9734b2838 100644 > > > --- a/fs/iomap/buffered-io.c > > > +++ b/fs/iomap/buffered-io.c > > > @@ -899,7 +899,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) > > > loff_t written = 0; > > > > > > /* already zeroed? we're done. */ > > > - if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) > > > + if ((srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) && > > > + !(srcmap->flags & IOMAP_F_TRUNC_PAGE)) > > > return length; > > > > Why even involve the filesystem in this? We can do this directly > > in iomap_zero_iter() with: > > > > if ((srcmap->type == IOMAP_HOLE) > > return; > > if (srcmap->type == IOMAP_UNWRITTEN) { > > if (!(iter->flags & IOMAP_ZERO_PAGECACHE)) > > return; > > if (!filemap_range_needs_writeback(inode->i_mapping, > > iomap->offset, iomap->offset + iomap->length)) > > return; > > } > > > > This reintroduces the same stale data exposure race fixed by the > original patch. folio writeback can complete and convert an extent > reported as unwritten such that zeroing will not occur when it should. That sounds the exact data corruption bug that is being fixed in this xfs/iomap series here: https://lore.kernel.org/linux-xfs/20221101003412.3842572-1-david@fromorbit.com/ > The writeback check either needs to be in the fs code where it can > prevent writeback completion from converting extents (under ilock), or > earlier in iomap such that we're guaranteed to see either writeback > state or the converted extent. Or we catch stale iomaps once we've locked the page we are going to modify and remap the extent before continuing the modification operation. > Given the above, I'm currently putting the prototype shown below through > some regression testing. It lifts the writeback check into iomap and > issues a flush and retry of the current iteration from iomap_zero_iter() > when necessary. Hmmm. It looks exactly like you are trying to reinvent the patchset I've been working on for the past few weeks... [...] > I have some patches around that demonstrate such things, but given the > current behavior of an unconditional flush and so far only seeing one > user report from an oddball pattern, I don't see much need for > additional hacks at the moment. > > Thoughts? IMO, the iter retry and writeback check are both wonky and > this is something that should first go into ->iomap_begin(), and only > lift into iomap after a period of time. That's just my .02. I do plan to > clean up with a retry helper and add comments and whatnot if this > survives testing and/or any functional issues are worked out. I think I've already solved all these problems with the above IOMAP_F_STALE infrastructure in the above patchset. 
If that's the case, then it seems like a no-brainer to me to base the
fix for this problem on the IOMAP_F_STALE capability we will soon have
in this code...

> index 91ee0b308e13..649d94ad3808 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -894,17 +894,28 @@ EXPORT_SYMBOL_GPL(iomap_file_unshare);
>  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  {
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> +	struct inode *inode = iter->inode;
>  	loff_t pos = iter->pos;
>  	loff_t length = iomap_length(iter);
>  	loff_t written = 0;
> +	int status;
>  
>  	/* already zeroed? we're done. */
> -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> -		return length;
> +	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> +		if (!(srcmap->flags & IOMAP_F_DIRTY_CACHE))
> +			return length;
> +
> +		status = filemap_write_and_wait_range(inode->i_mapping, pos,
> +				pos + length - 1);
> +		if (status)
> +			return status;
> +		/* XXX: hacked up iter retry */
> +		iter->iomap.length = 0;
> +		return 1;

Yup, the IOMAP_F_STALE detection in iomap_write_begin() will detect
writeback races w/ unwritten extents that change the extent state
without needing to actually do physical writeback here.

IOWs, I think the fix here is not to skip IOMAP_UNWRITTEN extents up
front, but to allow zeroing to detect data in the page cache over
unwritten extents naturally.

i.e. if the iomap type is unwritten and we are zeroing, then
iomap_write_begin() drops the FGP_CREAT flag. This means the folio
lookup will only return a folio if there was already a cached folio
in the page cache. If there is a cached folio, it will do iomap
validation and detect writeback races, marking the iomap stale. If
there is no folio or the folio returned is not dirty, then
iomap_zero_iter just moves on to checking the next folio in the
range...

> @@ -71,10 +76,16 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
>  	if (ret <= 0)
>  		return ret;
>  
> +	if ((iter->flags & IOMAP_ZERO) &&
> +	    filemap_range_needs_writeback(iter->inode->i_mapping, iter->pos,
> +				iter->pos + iter->len - 1)) {
> +		iomap_flags |= IOMAP_F_DIRTY_CACHE;
> +	}

I really don't think we need this - I think we can do the page cache
state check and the iomap stale check directly in the
iomap_zero_iter() loop as per above...

Cheers,

Dave.
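A minimal sketch of the folio lookup change Dave describes above,
assuming a hypothetical iomap_zero_get_folio() helper and the page
cache APIs of that era; this is illustrative only, not the actual
patchset:

	/*
	 * Sketch: zeroing over an unwritten extent should only touch
	 * folios that already exist in the page cache. Dropping
	 * FGP_CREAT means a cache miss returns NULL instead of
	 * instantiating a new folio, so the zeroing iter can simply
	 * skip ahead.
	 */
	static struct folio *
	iomap_zero_get_folio(struct iomap_iter *iter, loff_t pos)
	{
		unsigned int fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT |
				   FGP_STABLE | FGP_NOFS;

		if (iomap_iter_srcmap(iter)->type == IOMAP_UNWRITTEN)
			fgp &= ~FGP_CREAT;

		return __filemap_get_folio(iter->inode->i_mapping,
				pos >> PAGE_SHIFT, fgp,
				mapping_gfp_mask(iter->inode->i_mapping));
	}

A clean folio returned here can still be skipped; a dirty one goes
through the normal zeroing write, and the stale-iomap check in
iomap_write_begin() is what would catch racing writeback completion.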
On Fri, Nov 04, 2022 at 09:25:41AM +1100, Dave Chinner wrote:
> On Thu, Nov 03, 2022 at 10:53:16AM -0400, Brian Foster wrote:
> > On Sat, Oct 29, 2022 at 08:30:14AM +1100, Dave Chinner wrote:
> > > On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote:
> > > > On Fri, Oct 28, 2022 at 09:11:09AM -0400, Brian Foster wrote:
> > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > ---
> > > > >
> > > > > Here's a quick prototype of "option 3" described in my previous mail.
> > > > > This has been spot tested and confirmed to prevent the original stale
> > > > > data exposure problem. More thorough regression testing is still
> > > > > required. Barring unforeseen issues with that, however, I think this is
> > > > > tentatively my new preferred option. The primary reason for that is it
> > > > > avoids looking at extent state and is more in line with what iomap based
> > > > > zeroing should be doing more generically.
> > > > >
> > > > > Because of that, I think this provides a bit more opportunity for follow
> > > > > on fixes (there are other truncate/zeroing problems I've come across
> > > > > during this investigation that still need fixing), cleanup and
> > > > > consolidation of the zeroing code. For example, I think the trajectory
> > > > > of this could look something like:
> > > > >
> > > > > - Genericize a bit more to handle all truncates.
> > > > > - Repurpose iomap_truncate_page() (currently only used by XFS) into a
> > > > >   unique implementation from zero range that does explicit zeroing
> > > > >   instead of relying on pagecache truncate.
> > > > > - Refactor XFS ranged zeroing to an abstraction that uses a combination
> > > > >   of iomap_zero_range() and the new iomap_truncate_page().
> > > > >
> > > >
> > > > After playing with this and thinking a bit more about the above, I think
> > > > I managed to come up with an iomap_truncate_page() prototype that DTRT
> > > > based on this. Only spot tested so far, needs to pass iomap_flags to the
> > > > other bmbt_to_iomap() calls to handle the cow fork, undoubtedly has
> > > > other bugs/warts, etc. etc. This is just a quick prototype to
> > > > demonstrate the idea, which is essentially to check dirty state along
> > > > with extent state while under lock and transfer that state back to iomap
> > > > so it can decide whether it can shortcut or forcibly perform the zero.
> > > >
> > > > In a nutshell, IOMAP_TRUNC_PAGE asks the fs to check dirty state while
> > > > under lock and implies that the range is sub-block (single page).
> > > > IOMAP_F_TRUNC_PAGE on the imap informs iomap that the range was in fact
> > > > dirty, so perform the zero via buffered write regardless of extent
> > > > state.
> > >
> > > I'd much prefer we fix this in the iomap infrastructure - failing to
> > > zero dirty data in memory over an unwritten extent isn't an XFS bug,
> > > so we shouldn't be working around it in XFS like we did previously.
> > >
> >
> > I agree, but that was the original goal from the start. It's easier said
> > than done to just have iomap accurately write/skip the appropriate
> > ranges..
> >
> > > I don't think this should be called "IOMAP_TRUNC_PAGE", though,
> > > because that indicates the caller context, not what we are asking
> > > the internal iomap code to do. What we are really asking
> > > iomap_zero_iter() to do is zero the page cache if it exists in
> > > memory, otherwise ignore unwritten/hole pages.
> > > Hence I think a name
> > > like IOMAP_ZERO_PAGECACHE is more appropriate,
> > >
> >
> > That was kind of the point for this prototype. The flag isn't just
> > asking iomap to perform some generic write behavior. It also indicates
> > this is a special snowflake mode with assumptions from the caller (i.e.,
> > unflushed, sub-block/partial range) that facilitate forced zeroing
> > internally.
> >
> > I've since come to the conclusion that this approach is just premature.
> > It really only does the right thing in this very particular case,
> > otherwise there is potential for odd/unexpected behavior in sub-page
> > blocksize scenarios. It could be made to work more appropriately
> > eventually, but more thought and work is required and there's a jump in
> > complexity that isn't required to fix the immediate performance problem
> > and additional stale data exposure problems.
> >
> > > >
> > > > Brian
> > > >
> > > > --- 8< ---
> > > >
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 91ee0b308e13..14a9734b2838 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -899,7 +899,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> > > >  	loff_t written = 0;
> > > >  
> > > >  	/* already zeroed? we're done. */
> > > > -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > > > +	if ((srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) &&
> > > > +	    !(srcmap->flags & IOMAP_F_TRUNC_PAGE))
> > > >  		return length;
> > >
> > > Why even involve the filesystem in this? We can do this directly
> > > in iomap_zero_iter() with:
> > >
> > > 	if (srcmap->type == IOMAP_HOLE)
> > > 		return;
> > > 	if (srcmap->type == IOMAP_UNWRITTEN) {
> > > 		if (!(iter->flags & IOMAP_ZERO_PAGECACHE))
> > > 			return;
> > > 		if (!filemap_range_needs_writeback(inode->i_mapping,
> > > 				iomap->offset, iomap->offset + iomap->length))
> > > 			return;
> > > 	}
> > >
> >
> > This reintroduces the same stale data exposure race fixed by the
> > original patch. folio writeback can complete and convert an extent
> > reported as unwritten such that zeroing will not occur when it should.
>
> That sounds like the exact data corruption bug that is being fixed in
> this xfs/iomap series here:
>
> https://lore.kernel.org/linux-xfs/20221101003412.3842572-1-david@fromorbit.com/
>
> > The writeback check either needs to be in the fs code where it can
> > prevent writeback completion from converting extents (under ilock), or
> > earlier in iomap such that we're guaranteed to see either writeback
> > state or the converted extent.
>
> Or we catch stale iomaps once we've locked the page we are going to
> modify and remap the extent before continuing the modification
> operation.
>

I wasn't planning to modify pages for reasons mentioned earlier. It's a
non-trivial jump in complexity for the problem I'm trying to address.

> > Given the above, I'm currently putting the prototype shown below through
> > some regression testing. It lifts the writeback check into iomap and
> > issues a flush and retry of the current iteration from iomap_zero_iter()
> > when necessary.
>
> Hmmm. It looks exactly like you are trying to reinvent the patchset
> I've been working on for the past few weeks...
>

Possibly.. I've not looked at it. First and foremost, what I'm really
trying to do here is find a better way to deal with the truncate flush
problem, one that moves things closer to the long term goal of being
handled by iomap.

> [...]
> > I have some patches around that demonstrate such things, but given the
> > current behavior of an unconditional flush and so far only seeing one
> > user report from an oddball pattern, I don't see much need for
> > additional hacks at the moment.
> >
> > Thoughts? IMO, the iter retry and writeback check are both wonky and
> > this is something that should first go into ->iomap_begin(), and only
> > lift into iomap after a period of time. That's just my .02. I do plan to
> > clean up with a retry helper and add comments and whatnot if this
> > survives testing and/or any functional issues are worked out.
>
> I think I've already solved all these problems with the above
> IOMAP_F_STALE infrastructure in the above patchset. If that's the
> case, then it seems like a no-brainer to me to base the fix for this
> problem on the IOMAP_F_STALE capability we will soon have in this
> code...
>

I'd have to take a closer look. If the IOMAP_F_STALE thing is
essentially a superset of what I was trying to do wrt the retry thing
(which is what it sounds like), then I agree it makes sense to reuse
that.

> > index 91ee0b308e13..649d94ad3808 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -894,17 +894,28 @@ EXPORT_SYMBOL_GPL(iomap_file_unshare);
> >  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >  {
> >  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > +	struct inode *inode = iter->inode;
> >  	loff_t pos = iter->pos;
> >  	loff_t length = iomap_length(iter);
> >  	loff_t written = 0;
> > +	int status;
> >  
> >  	/* already zeroed? we're done. */
> > -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > -		return length;
> > +	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
> > +		if (!(srcmap->flags & IOMAP_F_DIRTY_CACHE))
> > +			return length;
> > +
> > +		status = filemap_write_and_wait_range(inode->i_mapping, pos,
> > +				pos + length - 1);
> > +		if (status)
> > +			return status;
> > +		/* XXX: hacked up iter retry */
> > +		iter->iomap.length = 0;
> > +		return 1;
>
> Yup, the IOMAP_F_STALE detection in iomap_write_begin() will detect
> writeback races w/ unwritten extents that change the extent state
> without needing to actually do physical writeback here.
>
> IOWs, I think the fix here is not to skip IOMAP_UNWRITTEN extents up
> front, but to allow zeroing to detect data in the page cache over
> unwritten extents naturally.
>
> i.e. if the iomap type is unwritten and we are zeroing, then
> iomap_write_begin() drops the FGP_CREAT flag. This means the folio
> lookup will only return a folio if there was already a cached folio
> in the page cache. If there is a cached folio, it will do iomap
> validation and detect writeback races, marking the iomap stale. If
> there is no folio or the folio returned is not dirty, then
> iomap_zero_iter just moves on to checking the next folio in the
> range...
>

The problem I have is that I've got three or four additional prototypes
trying to do such things that haven't even been worth posting because
they all produce tricky-to-diagnose issues, including one that does
exactly what you note above regarding dropping FGP_CREAT for mappings
trimmed down based on sub-extent dirty state. That one probably has its
own bugs as well; I suspect aspects of iomap zeroing and of extent
reporting in XFS (related to zeroing) need to change to make this all
function correctly, but I don't have it characterized well enough to say
for sure.
> > @@ -71,10 +76,16 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
> >  	if (ret <= 0)
> >  		return ret;
> >  
> > +	if ((iter->flags & IOMAP_ZERO) &&
> > +	    filemap_range_needs_writeback(iter->inode->i_mapping, iter->pos,
> > +				iter->pos + iter->len - 1)) {
> > +		iomap_flags |= IOMAP_F_DIRTY_CACHE;
> > +	}
>
> I really don't think we need this - I think we can do the page cache
> state check and the iomap stale check directly in the
> iomap_zero_iter() loop as per above...
>

Ok, this may be a reasonable end goal, but there's a clear jump in
complexity, dependency requirements, and test scope between the targeted
fix I proposed and ultimate iomap zeroing perfection. I'm not trying to
achieve the latter. I'm trying to address a performance escalation in a
way that moves things more in that direction to the degree possible.

I agree that if IOMAP_F_STALE overlaps with the approach in this patch,
it makes no sense to try and duplicate that here or create code
conflicts or dependencies. I have another version of this patch that
implements the same dirty check, flush, retry logic within
xfs_buffered_write_iomap_begin() such that it never returns a "dirty"
hole/unwritten mapping. It's a bit more code, but effective from a
functional and performance standpoint and logically simple enough to
backport as far back as needed with minimal risk and dependencies. IMO,
that's a fairly reasonable intermediate step to get from the currently
broken state to a more ideal iomap-based implementation. I'll post it
shortly and run some more testing against it.

Brian
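A rough sketch of the kind of ->iomap_begin() side check Brian describes
here; the helper name and exact predicate are assumptions rather than
the posted patch:

	/*
	 * Hypothetical helper: with the ILOCK held, decide whether a
	 * zeroing request sits over a hole/unwritten mapping that still
	 * has dirty or writeback pagecache above it. If so, the caller
	 * drops the lock, flushes the range, and retries the extent
	 * lookup so it never hands iomap a "dirty" hole/unwritten
	 * mapping.
	 */
	static bool
	xfs_zero_range_needs_flush(
		struct xfs_inode	*ip,
		struct xfs_bmbt_irec	*imap,
		loff_t			offset,
		loff_t			count,
		unsigned int		flags)
	{
		if (!(flags & IOMAP_ZERO))
			return false;
		if (imap->br_startblock != HOLESTARTBLOCK &&
		    imap->br_state != XFS_EXT_UNWRITTEN)
			return false;
		return filemap_range_needs_writeback(VFS_I(ip)->i_mapping,
				offset, offset + count - 1);
	}

Because the check happens under the ILOCK, writeback completion cannot
convert the extent between sampling pagecache state and returning the
mapping, which is exactly the race the iomap-level variant is exposed
to.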
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2e10e1c66ad6..1679fafaec6f 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -784,11 +784,13 @@ xfs_setattr_size(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct inode		*inode = VFS_I(ip);
+	struct folio		*eof_folio = NULL;
 	xfs_off_t		oldsize, newsize;
 	struct xfs_trans	*tp;
 	int			error;
 	uint			lock_flags = 0;
 	bool			did_zeroing = false;
+	bool			eof_dirty = false;
 
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
 	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
@@ -841,20 +843,40 @@ xfs_setattr_size(
 			&did_zeroing);
 	} else {
 		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
+		 * iomap won't detect a dirty folio over an unwritten block (or
+		 * a cow block over a hole) and subsequently skips zeroing the
+		 * newly post-EOF portion of the folio. Doing a flush here (i.e.
+		 * as is done for fallocate ZERO_RANGE) updates extent state for
+		 * iomap, but has too much overhead for the truncate path.
+		 *
+		 * Instead, check whether the new EOF is dirty in pagecache. If
+		 * so, hold a reference across the pagecache truncate and dirty
+		 * the folio. This ensures that partial folio zeroing from the
+		 * truncate makes it to disk in the rare event that iomap skips
+		 * zeroing and writeback happens to complete before the
+		 * pagecache truncate. Note that this really should be handled
+		 * properly by iomap zero range.
 		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-				newsize);
-		if (error)
-			return error;
+		eof_folio = filemap_lock_folio(inode->i_mapping,
+				(newsize - 1) >> PAGE_SHIFT);
+		if (eof_folio) {
+			if (folio_test_dirty(eof_folio) ||
+			    folio_test_writeback(eof_folio))
+				eof_dirty = true;
+			folio_unlock(eof_folio);
+			if (!eof_dirty) {
+				folio_put(eof_folio);
+				eof_folio = NULL;
+			}
+		}
 
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
-	if (error)
+	if (error) {
+		if (eof_folio)
+			folio_put(eof_folio);
 		return error;
+	}
 
 	/*
 	 * We've already locked out new page faults, so now we can safely remove
@@ -878,6 +900,15 @@ xfs_setattr_size(
 	 * guaranteed not to write stale data past the new EOF on truncate down.
 	 */
 	truncate_setsize(inode, newsize);
+	if (eof_folio) {
+		trace_printk("%d: ino 0x%llx newsize 0x%llx folio idx 0x%lx did_zeroing %d\n",
+			__LINE__, ip->i_ino, newsize, folio_index(eof_folio), did_zeroing);
+		if (!did_zeroing) {
+			filemap_dirty_folio(inode->i_mapping, eof_folio);
+			did_zeroing = true;
+		}
+		folio_put(eof_folio);
+	}
 
 	/*
 	 * We are going to log the inode size change in this transaction so
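For context on why re-dirtying the folio is sufficient here: pagecache
truncate already zeroes the in-memory tail of a folio straddling the new
EOF, it just does not mark the folio dirty afterwards. A heavily
abridged sketch of that behavior (the real logic lives in
truncate_inode_partial_folio() in mm/truncate.c; the helper name below
is illustrative):

	/*
	 * Abridged sketch: the post-EOF tail of a partial folio is
	 * zeroed in memory during truncate, but the folio is not
	 * re-dirtied, so the zeroes only reach disk if something else
	 * writes the folio back. Holding a reference and calling
	 * filemap_dirty_folio() afterwards, as the patch above does,
	 * guarantees that writeback happens.
	 */
	static void truncate_zero_eof_tail(struct folio *folio, loff_t newsize)
	{
		size_t offset = offset_in_folio(folio, newsize);

		folio_zero_range(folio, offset, folio_size(folio) - offset);
	}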
Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Here's a quick prototype of "option 3" described in my previous mail.
This has been spot tested and confirmed to prevent the original stale
data exposure problem. More thorough regression testing is still
required. Barring unforeseen issues with that, however, I think this is
tentatively my new preferred option. The primary reason for that is it
avoids looking at extent state and is more in line with what iomap based
zeroing should be doing more generically.

Because of that, I think this provides a bit more opportunity for follow
on fixes (there are other truncate/zeroing problems I've come across
during this investigation that still need fixing), cleanup and
consolidation of the zeroing code. For example, I think the trajectory
of this could look something like:

- Genericize a bit more to handle all truncates.
- Repurpose iomap_truncate_page() (currently only used by XFS) into a
  unique implementation from zero range that does explicit zeroing
  instead of relying on pagecache truncate.
- Refactor XFS ranged zeroing to an abstraction that uses a combination
  of iomap_zero_range() and the new iomap_truncate_page().

From there we'd hopefully have predictable and functionally correct
zeroing in the filesystem. The next step would probably be to see if/how
the truncate page and zero range implementations could combine into a
single zero range implementation. I have vague thoughts on that, but at
this stage I'm not going too deep into how that should look without some
sort of functional implementation to base it on.

Brian

 fs/xfs/xfs_iops.c | 49 ++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 9 deletions(-)
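For reference, the XFS entry point that the refactoring trajectory above
would build on simply delegates to iomap today; roughly as follows,
sketched from the tree of that era (details such as the DAX branch may
differ):

	int
	xfs_truncate_page(
		struct xfs_inode	*ip,
		xfs_off_t		pos,
		bool			*did_zero)
	{
		struct inode		*inode = VFS_I(ip);

		/* sub-block zeroing of the EOF block is delegated to iomap */
		if (IS_DAX(inode))
			return dax_truncate_page(inode, pos, did_zero,
					&xfs_dax_write_iomap_ops);
		return iomap_truncate_page(inode, pos, did_zero,
					&xfs_buffered_write_iomap_ops);
	}

A combined zero range implementation would presumably replace both this
and xfs_zero_range() with a single call that handles the sub-block EOF
case internally.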