
[1/2] iomap: write iomap validity checks

Message ID 20220921082959.1411675-2-david@fromorbit.com (mailing list archive)
State New, archived
Series iomap/xfs: fix data corruption due to stale cached iomaps

Commit Message

Dave Chinner Sept. 21, 2022, 8:29 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

A recent multithreaded write data corruption has been uncovered in
the iomap write code. The core of the problem is that partial folio
writes can be flushed to disk while a new racing write maps the same
range and fills the rest of the page:

writeback			new write

allocate blocks
  blocks are unwritten
submit IO
.....
				map blocks
				iomap indicates UNWRITTEN range
				loop {
				  lock folio
				  copyin data
.....
IO completes
  runs unwritten extent conv
    blocks are marked written
				  <iomap now stale>
				  get next folio
				}

Now add memory pressure such that memory reclaim evicts the
partially written folio that has already been written to disk.

When the new write finally gets to its last partial page, it does not
find that page in the page cache, so it instantiates a new page, sees
that the iomap is unwritten, and zeros the part of the page that it
does not have data for. This overwrites the data on disk that was
originally written.

The full description of the corruption mechanism can be found here:

https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/

To solve this problem, we need to check whether the iomap is still
valid after we lock each folio during the write. We have to do it
after we lock the page so that we don't end up with state changes
occurring while we wait for the folio to be locked.

Hence we need a mechanism to be able to check that the cached iomap
is still valid (similar to what we already do in buffered
writeback), and we need a way for ->begin_write to back out and
tell the high level iomap iterator that we need to remap the
remaining write range.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap/buffered-io.c | 53 +++++++++++++++++++++++++++++++++---------
 fs/iomap/iter.c        | 33 ++++++++++++++++++++++++--
 include/linux/iomap.h  | 17 ++++++++++++++
 3 files changed, 90 insertions(+), 13 deletions(-)
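
As an illustration of what the filesystem side of this could look like,
here is a minimal sketch of an ->iomap_begin/->iomap_valid pair built
around a per-inode extent map sequence counter. None of this is part of
the patch: the example_* names, the extent_seq counter, and the use of
iomap->private to carry the sampled value are purely hypothetical
assumptions about one possible implementation.

#include <linux/fs.h>
#include <linux/iomap.h>

/* Hypothetical per-inode state; a real filesystem embeds this in its inode. */
struct example_inode {
	struct inode	vfs_inode;
	unsigned long	extent_seq;	/* bumped on every extent map change */
};

static inline struct example_inode *EXAMPLE_I(struct inode *inode)
{
	return container_of(inode, struct example_inode, vfs_inode);
}

static int example_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
{
	struct example_inode *ei = EXAMPLE_I(inode);

	/*
	 * Look up or allocate the extent covering pos/length and fill in
	 * *iomap here, under whatever lock serialises extent map changes.
	 * Sample the sequence counter under that same lock so the sampled
	 * value matches the mapping being returned.
	 */
	iomap->private = (void *)READ_ONCE(ei->extent_seq);
	return 0;
}

static bool example_iomap_valid(struct inode *inode, const struct iomap *iomap)
{
	struct example_inode *ei = EXAMPLE_I(inode);

	/* Stale if the extent map has changed since ->iomap_begin sampled it. */
	return (unsigned long)iomap->private == READ_ONCE(ei->extent_seq);
}

static const struct iomap_ops example_iomap_ops = {
	.iomap_begin	= example_iomap_begin,
	.iomap_valid	= example_iomap_valid,
};

The key point the sketch tries to capture is that whatever value
->iomap_valid checks has to be sampled atomically with the extent
lookup in ->iomap_begin; otherwise the check races with the very
extent map modifications it is meant to detect.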

Comments

Darrick J. Wong Sept. 22, 2022, 3:40 a.m. UTC | #1
On Wed, Sep 21, 2022 at 06:29:58PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> A recent multithreaded write data corruption has been uncovered in
> the iomap write code. The core of the problem is partial folio
> writes can be flushed to disk while a new racing write can map it
> and fill the rest of the page:
> 
> writeback			new write
> 
> allocate blocks
>   blocks are unwritten
> submit IO
> .....
> 				map blocks
> 				iomap indicates UNWRITTEN range
> 				loop {
> 				  lock folio
> 				  copyin data
> .....
> IO completes
>   runs unwritten extent conv
>     blocks are marked written
> 				  <iomap now stale>
> 				  get next folio
> 				}
> 
> Now add memory pressure such that memory reclaim evicts the
> partially written folio that has already been written to disk.
> 
> When the new write finally gets to the last partial page of the new
> write, it does not find it in cache, so it instantiates a new page,
> sees the iomap is unwritten, and zeros the part of the page that
> it does not have data from. This overwrites the data on disk that
> was originally written.
> 
> The full description of the corruption mechanism can be found here:
> 
> https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
> 
> To solve this problem, we need to check whether the iomap is still
> valid after we lock each folio during the write. We have to do it
> after we lock the page so that we don't end up with state changes
> occurring while we wait for the folio to be locked.
> 
> Hence we need a mechanism to be able to check that the cached iomap
> is still valid (similar to what we already do in buffered
> writeback), and we need a way for ->begin_write to back out and
> tell the high level iomap iterator that we need to remap the
> remaining write range.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 53 +++++++++++++++++++++++++++++++++---------
>  fs/iomap/iter.c        | 33 ++++++++++++++++++++++++--
>  include/linux/iomap.h  | 17 ++++++++++++++
>  3 files changed, 90 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ca5c62901541..44c806d46be4 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -584,8 +584,9 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
>  	return iomap_read_inline_data(iter, folio);
>  }
>  
> -static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> -		size_t len, struct folio **foliop)
> +static int iomap_write_begin(struct iomap_iter *iter,
> +		const struct iomap_ops *ops, loff_t pos, size_t len,
> +		struct folio **foliop)
>  {
>  	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> @@ -618,6 +619,27 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  		status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
>  		goto out_no_page;
>  	}
> +
> +	/*
> +	 * Now we have a locked folio, before we do anything with it we need to
> +	 * check that the iomap we have cached is not stale. The inode extent
> +	 * mapping can change due to concurrent IO in flight (e.g.
> +	 * IOMAP_UNWRITTEN state can change and memory reclaim could have

Earlier, we were talking about how this specific problem involves zeroing folio
areas around a folio-unaligned write to a !uptodate folio backed by an
unwritten extent.  Is the scope of this corruption still limited to
*just* that case?  Or does it affect any write to a new mapping or
beyond EOF?  Can it affect a write to an extent that somehow gets
switched to unwritten or a hole after the mapping has been sampled?

> +	 * reclaimed a previously partially written page at this index after IO
> +	 * completion before this write reaches this file offset) and hence we
> +	 * could do the wrong thing here (zero a page range incorrectly or fail
> +	 * to zero) and corrupt data.
> +	 */
> +	if (ops->iomap_valid) {
> +		bool iomap_valid = ops->iomap_valid(iter->inode, &iter->iomap);

At this point in the (xfs) buffered write cycle, we've taken the IOLOCK
and folio lock.  Is it possible for ->iomap_valid to detect that the
mapping is no longer valid, trylock the ILOCK, and update the mapping so
that we don't have to drop the folio lock and go all the way back to
->iomap_begin?

If we have to pay the cost of an indirect call, we could try to update
the mapping if one's available, right?  Obviously, if we don't find a
mapping we're probably going to have to drop the folio lock and allocate
more space.

> +
> +		if (!iomap_valid) {
> +			iter->iomap.flags |= IOMAP_F_STALE;
> +			status = 0;
> +			goto out_unlock;
> +		}
> +	}
> +
>  	if (pos + len > folio_pos(folio) + folio_size(folio))
>  		len = folio_pos(folio) + folio_size(folio) - pos;
>  
> @@ -727,7 +749,8 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
>  	return ret;
>  }
>  
> -static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
> +static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
> +		const struct iomap_ops *ops)
>  {
>  	loff_t length = iomap_length(iter);
>  	loff_t pos = iter->pos;
> @@ -770,9 +793,11 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  			break;
>  		}
>  
> -		status = iomap_write_begin(iter, pos, bytes, &folio);
> +		status = iomap_write_begin(iter, ops, pos, bytes, &folio);
>  		if (unlikely(status))
>  			break;
> +		if (iter->iomap.flags & IOMAP_F_STALE)
> +			break;
>  
>  		page = folio_file_page(folio, pos >> PAGE_SHIFT);
>  		if (mapping_writably_mapped(mapping))
> @@ -825,14 +850,15 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
>  		iter.flags |= IOMAP_NOWAIT;
>  
>  	while ((ret = iomap_iter(&iter, ops)) > 0)
> -		iter.processed = iomap_write_iter(&iter, i);
> +		iter.processed = iomap_write_iter(&iter, i, ops);
>  	if (iter.pos == iocb->ki_pos)
>  		return ret;
>  	return iter.pos - iocb->ki_pos;
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
>  
> -static loff_t iomap_unshare_iter(struct iomap_iter *iter)
> +static loff_t iomap_unshare_iter(struct iomap_iter *iter,
> +		const struct iomap_ops *ops)
>  {
>  	struct iomap *iomap = &iter->iomap;
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> @@ -853,9 +879,11 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
>  		unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
>  		struct folio *folio;
>  
> -		status = iomap_write_begin(iter, pos, bytes, &folio);
> +		status = iomap_write_begin(iter, ops, pos, bytes, &folio);
>  		if (unlikely(status))
>  			return status;
> +		if (iter->iomap.flags & IOMAP_F_STALE)
> +			break;
>  
>  		status = iomap_write_end(iter, pos, bytes, bytes, folio);
>  		if (WARN_ON_ONCE(status == 0))
> @@ -886,12 +914,13 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  	int ret;
>  
>  	while ((ret = iomap_iter(&iter, ops)) > 0)
> -		iter.processed = iomap_unshare_iter(&iter);
> +		iter.processed = iomap_unshare_iter(&iter, ops);
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_unshare);
>  
> -static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> +static loff_t iomap_zero_iter(struct iomap_iter *iter,
> +		const struct iomap_ops *ops, bool *did_zero)
>  {
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
>  	loff_t pos = iter->pos;
> @@ -908,9 +937,11 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		size_t offset;
>  		size_t bytes = min_t(u64, SIZE_MAX, length);
>  
> -		status = iomap_write_begin(iter, pos, bytes, &folio);
> +		status = iomap_write_begin(iter, ops, pos, bytes, &folio);
>  		if (status)
>  			return status;
> +		if (iter->iomap.flags & IOMAP_F_STALE)
> +			break;
>  
>  		offset = offset_in_folio(folio, pos);
>  		if (bytes > folio_size(folio) - offset)
> @@ -946,7 +977,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  	int ret;
>  
>  	while ((ret = iomap_iter(&iter, ops)) > 0)
> -		iter.processed = iomap_zero_iter(&iter, did_zero);
> +		iter.processed = iomap_zero_iter(&iter, ops, did_zero);
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(iomap_zero_range);
> diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
> index a1c7592d2ade..370e8a23c1b8 100644
> --- a/fs/iomap/iter.c
> +++ b/fs/iomap/iter.c
> @@ -7,12 +7,36 @@
>  #include <linux/iomap.h>
>  #include "trace.h"
>  
> +/*
> + * Advance to the next range we need to map.
> + *
> + * If the iomap is marked IOMAP_F_STALE, it means the existing map was not fully
> + * processed - it was aborted because the extent the iomap spanned may have been
> + * changed during the operation. In this case, the iteration behaviour is to
> + * remap the unprocessed range of the iter, and that means we may need to remap
> + * even when we've made no progress (i.e. iter->processed = 0). Hence the
> + * "finished iterating" case needs to distinguish between
> + * (processed = 0) meaning we are done and (processed = 0 && stale) meaning we
> + * need to remap the entire remaining range.
> + *
> + * We also need to preserve IOMAP_F_STALE on the iomap so that the next call
> + * to iomap_begin() knows that it is reprocessing a stale map. Similarly, we
> + * need to preserve IOMAP_F_NEW as the filesystem may not realise that it
> + * is remapping a region that it allocated in the previous cycle and we still
> + * need to initialise partially filled pages within the remapped range.

Must the ->iomap_begin implementation also take care to carry over
IOMAP_F_NEW to the mapping if IOMAP_F_STALE is set?  Is it responsible
for dropping IOMAP_F_STALE, or will iomap_iter* do that?

> + *
> + */
>  static inline int iomap_iter_advance(struct iomap_iter *iter)
>  {
> +	bool stale = iter->iomap.flags & IOMAP_F_STALE;
> +	int new = iter->iomap.flags & IOMAP_F_NEW;
> +
>  	/* handle the previous iteration (if any) */
>  	if (iter->iomap.length) {
> -		if (iter->processed <= 0)
> +		if (iter->processed < 0)
>  			return iter->processed;
> +		if (!iter->processed && !stale)
> +			return 0;
>  		if (WARN_ON_ONCE(iter->processed > iomap_length(iter)))
>  			return -EIO;
>  		iter->pos += iter->processed;
> @@ -25,6 +49,8 @@ static inline int iomap_iter_advance(struct iomap_iter *iter)
>  	iter->processed = 0;
>  	memset(&iter->iomap, 0, sizeof(iter->iomap));
>  	memset(&iter->srcmap, 0, sizeof(iter->srcmap));
> +	if (stale)
> +		iter->iomap.flags |= IOMAP_F_STALE | new;
>  	return 1;
>  }
>  
> @@ -33,6 +59,7 @@ static inline void iomap_iter_done(struct iomap_iter *iter)
>  	WARN_ON_ONCE(iter->iomap.offset > iter->pos);
>  	WARN_ON_ONCE(iter->iomap.length == 0);
>  	WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);
> +	WARN_ON_ONCE(iter->iomap.flags & IOMAP_F_STALE);
>  
>  	trace_iomap_iter_dstmap(iter->inode, &iter->iomap);
>  	if (iter->srcmap.type != IOMAP_HOLE)
> @@ -68,8 +95,10 @@ int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
>  
>  	trace_iomap_iter(iter, ops, _RET_IP_);
>  	ret = iomap_iter_advance(iter);
> -	if (ret <= 0)
> +	if (ret < 0)
>  		return ret;
> +	if (!ret && !(iter->iomap.flags & IOMAP_F_STALE))
> +		return 0;
>  
>  	ret = ops->iomap_begin(iter->inode, iter->pos, iter->len, iter->flags,
>  			       &iter->iomap, &iter->srcmap);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 238a03087e17..308931f0840a 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -62,8 +62,13 @@ struct vm_fault;
>   *
>   * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
>   * has changed as the result of this write operation.
> + *
> + * IOMAP_F_STALE indicates that the iomap is not valid any longer and the file
> + * range it covers needs to be remapped by the high level before the operation
> + * can proceed.

I think it might be worth mentioning that /any/ iomap_iter caller can
set this flag in the loop body and kick things back to ->iomap_begin.

>   */
>  #define IOMAP_F_SIZE_CHANGED	0x100
> +#define IOMAP_F_STALE		0x200
>  
>  /*
>   * Flags from 0x1000 up are for file system specific usage:
> @@ -165,6 +170,18 @@ struct iomap_ops {

Hmmmm I wish this ops struct had been named better.  Oh well.

--D

>  	 */
>  	int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
>  			ssize_t written, unsigned flags, struct iomap *iomap);
> +
> +	/*
> +	 * Check that the cached iomap still maps correctly to the filesystem's
> +	 * internal extent map. FS internal extent maps can change while iomap
> +	 * is iterating a cached iomap, so this hook allows iomap to detect that
> +	 * the iomap needs to be refreshed during a long running write
> +	 * operation.
> +	 *
> +	 * This is called with the folio over the specified file position
> +	 * held locked by the iomap code.
> +	 */
> +	bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
>  };
>  
>  /**
> -- 
> 2.37.2
>
Dave Chinner Sept. 22, 2022, 11:16 p.m. UTC | #2
On Wed, Sep 21, 2022 at 08:40:29PM -0700, Darrick J. Wong wrote:
> On Wed, Sep 21, 2022 at 06:29:58PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > A recent multithreaded write data corruption has been uncovered in
> > the iomap write code. The core of the problem is partial folio
> > writes can be flushed to disk while a new racing write can map it
> > and fill the rest of the page:
> > 
> > writeback			new write
> > 
> > allocate blocks
> >   blocks are unwritten
> > submit IO
> > .....
> > 				map blocks
> > 				iomap indicates UNWRITTEN range
> > 				loop {
> > 				  lock folio
> > 				  copyin data
> > .....
> > IO completes
> >   runs unwritten extent conv
> >     blocks are marked written
> > 				  <iomap now stale>
> > 				  get next folio
> > 				}
> > 
> > Now add memory pressure such that memory reclaim evicts the
> > partially written folio that has already been written to disk.
> > 
> > When the new write finally gets to the last partial page of the new
> > write, it does not find it in cache, so it instantiates a new page,
> > sees the iomap is unwritten, and zeros the part of the page that
> > it does not have data from. This overwrites the data on disk that
> > was originally written.
> > 
> > The full description of the corruption mechanism can be found here:
> > 
> > https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
> > 
> > To solve this problem, we need to check whether the iomap is still
> > valid after we lock each folio during the write. We have to do it
> > after we lock the page so that we don't end up with state changes
> > occurring while we wait for the folio to be locked.
> > 
> > Hence we need a mechanism to be able to check that the cached iomap
> > is still valid (similar to what we already do in buffered
> > writeback), and we need a way for ->begin_write to back out and
> > tell the high level iomap iterator that we need to remap the
> > remaining write range.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/iomap/buffered-io.c | 53 +++++++++++++++++++++++++++++++++---------
> >  fs/iomap/iter.c        | 33 ++++++++++++++++++++++++--
> >  include/linux/iomap.h  | 17 ++++++++++++++
> >  3 files changed, 90 insertions(+), 13 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index ca5c62901541..44c806d46be4 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -584,8 +584,9 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
> >  	return iomap_read_inline_data(iter, folio);
> >  }
> >  
> > -static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> > -		size_t len, struct folio **foliop)
> > +static int iomap_write_begin(struct iomap_iter *iter,
> > +		const struct iomap_ops *ops, loff_t pos, size_t len,
> > +		struct folio **foliop)
> >  {
> >  	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> >  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > @@ -618,6 +619,27 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> >  		status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
> >  		goto out_no_page;
> >  	}
> > +
> > +	/*
> > +	 * Now we have a locked folio, before we do anything with it we need to
> > +	 * check that the iomap we have cached is not stale. The inode extent
> > +	 * mapping can change due to concurrent IO in flight (e.g.
> > +	 * IOMAP_UNWRITTEN state can change and memory reclaim could have
> 
> Earlier, we were talking this specific problem involves zeroing folio
> areas around a folio-unaligned write to a !uptodate folio backed by an
> unwritten extent.  Is the scope of this corruption still limited to
> *just* that case?  Or does it affect any write to a new mapping or
> beyond EOF?  Can it affect a write to an extent that somehow gets
> switched to unwritten or a hole after the mapping has been sampled?

We only hold the IOLOCK while we are iterating the cached mapping.
Hence anything that can modify the extent tree without holding the
IOLOCK in a way that would exclude iomap iteration can trigger stale
mappings.

AFAICT, the ways this can happen right now are:

1. page fault path takes mapping->invalidate_lock, not IOLOCK
2. writeback allocating delalloc regions as unwritten
3. IO completion converting unwritten regions to written.
4. DIO writes doing extent allocation under IOLOCK_SHARED while
   other iomap iterators are holding IOLOCK_SHARED, too. (i.e. hole
   -> unwritten).

Note that this affects my recent suggestion about using
IOLOCK_SHARED for buffered writes, too - that would be case #5,
similar to #4.

Stuff that takes the IOLOCK_EXCL, such as truncate, fallocate, etc
for direct extent manipulation cannot trigger stale cached iomaps in
the IO path because they wait for all IO path operations to drain
and for dirty data to flush across the range they are operating on.

> > +	 * reclaimed a previously partially written page at this index after IO
> > +	 * completion before this write reaches this file offset) and hence we
> > +	 * could do the wrong thing here (zero a page range incorrectly or fail
> > +	 * to zero) and corrupt data.
> > +	 */
> > +	if (ops->iomap_valid) {
> > +		bool iomap_valid = ops->iomap_valid(iter->inode, &iter->iomap);
> 
> At this point in the (xfs) buffered write cycle, we've taken the IOLOCK
> and folio lock.  Is it possible for ->iomap_valid to detect that the
> mapping is no longer valid, trylock the ILOCK, and update the mapping so
> that we don't have to drop the folio lock and go all the way back to
> ->iomap_begin?

Possibly. But we can fail to get the ILOCK, and that means we still
have to fall all the way back to being outside the page lock before we
can try again.

The other issue is that the iomap iter still currently points to the
offset at the start of the write, so we aren't guaranteed that the
refetched iomap even covers the range of the page we currently have
locked. IOWs, we have to advance the iterator before we remap the
range to guarantee that the new iomap covers the range of the file
that we are currently operating on....

> If we have to pay the cost of an indirect call, we could try to update
> the mapping if one's available, right?  Obviously, if we don't find a
> mapping we're probably going to have to drop the folio lock and allocate
> more space.

I'd prefer that we get the slow, fall-back-all-the-way path correct
first. We are always going to need this slow path, so let's focus on
making it correct and robust rather than trying to prematurely
optimise it....

> > --- a/fs/iomap/iter.c
> > +++ b/fs/iomap/iter.c
> > @@ -7,12 +7,36 @@
> >  #include <linux/iomap.h>
> >  #include "trace.h"
> >  
> > +/*
> > + * Advance to the next range we need to map.
> > + *
> > + * If the iomap is marked IOMAP_F_STALE, it means the existing map was not fully
> > + * processed - it was aborted because the extent the iomap spanned may have been
> > + * changed during the operation. In this case, the iteration behaviour is to
> > + * remap the unprocessed range of the iter, and that means we may need to remap
> > + * even when we've made no progress (i.e. iter->processed = 0). Hence the
> > + * "finished iterating" case needs to distinguish between
> > + * (processed = 0) meaning we are done and (processed = 0 && stale) meaning we
> > + * need to remap the entire remaining range.
> > + *
> > + * We also need to preserve IOMAP_F_STALE on the iomap so that the next call
> > + * to iomap_begin() knows that it is reprocessing a stale map. Similarly, we
> > + * need to preserve IOMAP_F_NEW as the filesystem may not realise that it
> > + * is remapping a region that it allocated in the previous cycle and we still
> > + * need to initialise partially filled pages within the remapped range.
> 
> Must the ->iomap_begin implementation also take care to carry-over the
> IOMAP_F_NEW to the mapping if IOMAP_F_STALE is set?  Is it responsible
> for dropping IOMAP_F_STALE, or will iomap_iter* do that?

iomap_iter() carries STALE and NEW over to the next
->iomap_begin() call, which is then responsible for clearing STALE
and propagating NEW if required.
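
To make that contract concrete, here is a hypothetical fragment (same
caveats as the example_iomap_begin() sketch under the commit message
above: illustrative names only, not taken from this patch or from XFS)
showing just the flag handling an ->iomap_begin implementation would
need:

static int example_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
		unsigned flags, struct iomap *iomap, struct iomap *srcmap)
{
	/*
	 * iomap_iter_advance() hands IOMAP_F_STALE (plus IOMAP_F_NEW, if the
	 * aborted mapping was a fresh allocation) back in via iomap->flags.
	 * Sample them before rebuilding the mapping.
	 */
	bool remapping_stale = iomap->flags & IOMAP_F_STALE;
	u16 prev_new = iomap->flags & IOMAP_F_NEW;

	/* ... refetch the extent for pos/length and fill in *iomap ... */

	/* The range has now been remapped, so the stale state is resolved. */
	iomap->flags &= ~IOMAP_F_STALE;

	/*
	 * If this is a remap of a range allocated on the previous pass, keep
	 * IOMAP_F_NEW so the write path still initialises the parts of
	 * partially filled folios that the data copy does not cover.
	 */
	if (remapping_stale)
		iomap->flags |= prev_new;
	return 0;
}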


> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index 238a03087e17..308931f0840a 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -62,8 +62,13 @@ struct vm_fault;
> >   *
> >   * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
> >   * has changed as the result of this write operation.
> > + *
> > + * IOMAP_F_STALE indicates that the iomap is not valid any longer and the file
> > + * range it covers needs to be remapped by the high level before the operation
> > + * can proceed.
> 
> I think it might be worth mentioning that /any/ iomap_iter caller can
> set this flag in the loop body and kick things back to ->iomap_begin.

*nod*

-Dave.
Darrick J. Wong Sept. 28, 2022, 4:48 a.m. UTC | #3
On Fri, Sep 23, 2022 at 09:16:55AM +1000, Dave Chinner wrote:
> On Wed, Sep 21, 2022 at 08:40:29PM -0700, Darrick J. Wong wrote:
> > On Wed, Sep 21, 2022 at 06:29:58PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > A recent multithreaded write data corruption has been uncovered in
> > > the iomap write code. The core of the problem is partial folio
> > > writes can be flushed to disk while a new racing write can map it
> > > and fill the rest of the page:
> > > 
> > > writeback			new write
> > > 
> > > allocate blocks
> > >   blocks are unwritten
> > > submit IO
> > > .....
> > > 				map blocks
> > > 				iomap indicates UNWRITTEN range
> > > 				loop {
> > > 				  lock folio
> > > 				  copyin data
> > > .....
> > > IO completes
> > >   runs unwritten extent conv
> > >     blocks are marked written
> > > 				  <iomap now stale>
> > > 				  get next folio
> > > 				}
> > > 
> > > Now add memory pressure such that memory reclaim evicts the
> > > partially written folio that has already been written to disk.
> > > 
> > > When the new write finally gets to the last partial page of the new
> > > write, it does not find it in cache, so it instantiates a new page,
> > > sees the iomap is unwritten, and zeros the part of the page that
> > > it does not have data from. This overwrites the data on disk that
> > > was originally written.
> > > 
> > > The full description of the corruption mechanism can be found here:
> > > 
> > > https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
> > > 
> > > To solve this problem, we need to check whether the iomap is still
> > > valid after we lock each folio during the write. We have to do it
> > > after we lock the page so that we don't end up with state changes
> > > occurring while we wait for the folio to be locked.
> > > 
> > > Hence we need a mechanism to be able to check that the cached iomap
> > > is still valid (similar to what we already do in buffered
> > > writeback), and we need a way for ->begin_write to back out and
> > > tell the high level iomap iterator that we need to remap the
> > > remaining write range.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/iomap/buffered-io.c | 53 +++++++++++++++++++++++++++++++++---------
> > >  fs/iomap/iter.c        | 33 ++++++++++++++++++++++++--
> > >  include/linux/iomap.h  | 17 ++++++++++++++
> > >  3 files changed, 90 insertions(+), 13 deletions(-)
> > > 
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index ca5c62901541..44c806d46be4 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -584,8 +584,9 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
> > >  	return iomap_read_inline_data(iter, folio);
> > >  }
> > >  
> > > -static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> > > -		size_t len, struct folio **foliop)
> > > +static int iomap_write_begin(struct iomap_iter *iter,
> > > +		const struct iomap_ops *ops, loff_t pos, size_t len,
> > > +		struct folio **foliop)
> > >  {
> > >  	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
> > >  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > > @@ -618,6 +619,27 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> > >  		status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
> > >  		goto out_no_page;
> > >  	}
> > > +
> > > +	/*
> > > +	 * Now we have a locked folio, before we do anything with it we need to
> > > +	 * check that the iomap we have cached is not stale. The inode extent
> > > +	 * mapping can change due to concurrent IO in flight (e.g.
> > > +	 * IOMAP_UNWRITTEN state can change and memory reclaim could have
> > 
> > Earlier, we were talking this specific problem involves zeroing folio
> > areas around a folio-unaligned write to a !uptodate folio backed by an
> > unwritten extent.  Is the scope of this corruption still limited to
> > *just* that case?  Or does it affect any write to a new mapping or
> > beyond EOF?  Can it affect a write to an extent that somehow gets
> > switched to unwritten or a hole after the mapping has been sampled?
> 
> We only hold the IOLOCK while we are iterating the cached mapping.
> Hence anything that can modify the extent tree without holding the
> IOLOCK in a way that would exclude iomap iteration can trigger stale
> mappings.
> 
> AFAICT, the ways this can happen right now are:
> 
> 1. page fault path takes mapping->invalidate_lock, not IOLOCK
> 2. writeback allocating delalloc regions as unwritten
> 3. IO completion converting unwritten regions to written.
> 4. DIO writes doing extent allocation under IOLOCK_SHARED while
>    other iomap iterators are holding IOLOCK_SHARED, too. (i.e. hole
>    -> unwritten).

<nod> Ok, I had kinda suspected that this applied to any mapping, since
a lot can go on with IOLOCK_SHARED held.

> Note that this affects my recent suggestion about using
> IOLOCK_SHARED for buffered writes, too - that would be case #5,
> similar to #4.

<nod>

> Stuff that takes the IOLOCK_EXCL, such as truncate, fallocate, etc
> for direct extent manipulation cannot trigger stale cached iomaps in
> the IO path because they wait for all IO path operations to drain
> and for dirty data to flush across the range they are operating on.
> 
> > > +	 * reclaimed a previously partially written page at this index after IO
> > > +	 * completion before this write reaches this file offset) and hence we
> > > +	 * could do the wrong thing here (zero a page range incorrectly or fail
> > > +	 * to zero) and corrupt data.
> > > +	 */
> > > +	if (ops->iomap_valid) {
> > > +		bool iomap_valid = ops->iomap_valid(iter->inode, &iter->iomap);
> > 
> > At this point in the (xfs) buffered write cycle, we've taken the IOLOCK
> > and folio lock.  Is it possible for ->iomap_valid to detect that the
> > mapping is no longer valid, trylock the ILOCK, and update the mapping so
> > that we don't have to drop the folio lock and go all the way back to
> > ->iomap_begin?
> 
> Possibly. But we can fail to get the ILOCK, and that means we still
> have to fall back all to being outside the page lock before we can
> try again.

<nod>

> The other issue is that the iomap iter still currently points to the
> offset at the start of the write, so we aren't guaranteed that the
> refetched iomap even covers the range of the page we currently have
> locked. IOWs, we have to advance the iterator before we remap the
> range to guarantee that the new iomap covers the range of the file
> that we are currently operating on....

Aha!  I hadn't caught on to that subtlety.

> > If we have to pay the cost of an indirect call, we could try to update
> > the mapping if one's available, right?  Obviously, if we don't find a
> > mapping we're probably going to have to drop the folio lock and allocate
> > more space.
> 
> I'd prefer that we get the slow, fall-back-all-the-way path correct
> first. We are always going to need this slow path, so let's focus on
> making it correct and robust rather than trying to prematurely
> optimise it....

Agreed.

> > > --- a/fs/iomap/iter.c
> > > +++ b/fs/iomap/iter.c
> > > @@ -7,12 +7,36 @@
> > >  #include <linux/iomap.h>
> > >  #include "trace.h"
> > >  
> > > +/*
> > > + * Advance to the next range we need to map.
> > > + *
> > > + * If the iomap is marked IOMAP_F_STALE, it means the existing map was not fully
> > > + * processed - it was aborted because the extent the iomap spanned may have been
> > > + * changed during the operation. In this case, the iteration behaviour is to
> > > + * remap the unprocessed range of the iter, and that means we may need to remap
> > > + * even when we've made no progress (i.e. iter->processed = 0). Hence the
> > > + * "finished iterating" case needs to distinguish between
> > > + * (processed = 0) meaning we are done and (processed = 0 && stale) meaning we
> > > + * need to remap the entire remaining range.
> > > + *
> > > + * We also need to preserve IOMAP_F_STALE on the iomap so that the next call
> > > + * to iomap_begin() knows that it is reprocessing a stale map. Similarly, we
> > > + * need to preserve IOMAP_F_NEW as the filesystem may not realise that it
> > > + * is remapping a region that it allocated in the previous cycle and we still
> > > + * need to initialise partially filled pages within the remapped range.
> > 
> > Must the ->iomap_begin implementation also take care to carry-over the
> > IOMAP_F_NEW to the mapping if IOMAP_F_STALE is set?  Is it responsible
> > for dropping IOMAP_F_STALE, or will iomap_iter* do that?
> 
> iomap_iter() carries STALE and NEW over to the the next
> ->iomap_begin() call, which is then responsible for clearing STALE
> and propagating NEW if required.

Ok.  I suspect from the IRC traffic that you've moved on a bit from this
approach, so I'll wait for it to appear.

--D

> 
> > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > index 238a03087e17..308931f0840a 100644
> > > --- a/include/linux/iomap.h
> > > +++ b/include/linux/iomap.h
> > > @@ -62,8 +62,13 @@ struct vm_fault;
> > >   *
> > >   * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
> > >   * has changed as the result of this write operation.
> > > + *
> > > + * IOMAP_F_STALE indicates that the iomap is not valid any longer and the file
> > > + * range it covers needs to be remapped by the high level before the operation
> > > + * can proceed.
> > 
> > I think it might be worth mentioning that /any/ iomap_iter caller can
> > set this flag in the loop body and kick things back to ->iomap_begin.
> 
> *nod*
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ca5c62901541..44c806d46be4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -584,8 +584,9 @@  static int iomap_write_begin_inline(const struct iomap_iter *iter,
 	return iomap_read_inline_data(iter, folio);
 }
 
-static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
-		size_t len, struct folio **foliop)
+static int iomap_write_begin(struct iomap_iter *iter,
+		const struct iomap_ops *ops, loff_t pos, size_t len,
+		struct folio **foliop)
 {
 	const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
@@ -618,6 +619,27 @@  static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 		status = (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
 		goto out_no_page;
 	}
+
+	/*
+	 * Now we have a locked folio, before we do anything with it we need to
+	 * check that the iomap we have cached is not stale. The inode extent
+	 * mapping can change due to concurrent IO in flight (e.g.
+	 * IOMAP_UNWRITTEN state can change and memory reclaim could have
+	 * reclaimed a previously partially written page at this index after IO
+	 * completion before this write reaches this file offset) and hence we
+	 * could do the wrong thing here (zero a page range incorrectly or fail
+	 * to zero) and corrupt data.
+	 */
+	if (ops->iomap_valid) {
+		bool iomap_valid = ops->iomap_valid(iter->inode, &iter->iomap);
+
+		if (!iomap_valid) {
+			iter->iomap.flags |= IOMAP_F_STALE;
+			status = 0;
+			goto out_unlock;
+		}
+	}
+
 	if (pos + len > folio_pos(folio) + folio_size(folio))
 		len = folio_pos(folio) + folio_size(folio) - pos;
 
@@ -727,7 +749,8 @@  static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 	return ret;
 }
 
-static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
+static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i,
+		const struct iomap_ops *ops)
 {
 	loff_t length = iomap_length(iter);
 	loff_t pos = iter->pos;
@@ -770,9 +793,11 @@  static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 			break;
 		}
 
-		status = iomap_write_begin(iter, pos, bytes, &folio);
+		status = iomap_write_begin(iter, ops, pos, bytes, &folio);
 		if (unlikely(status))
 			break;
+		if (iter->iomap.flags & IOMAP_F_STALE)
+			break;
 
 		page = folio_file_page(folio, pos >> PAGE_SHIFT);
 		if (mapping_writably_mapped(mapping))
@@ -825,14 +850,15 @@  iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 		iter.flags |= IOMAP_NOWAIT;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_write_iter(&iter, i);
+		iter.processed = iomap_write_iter(&iter, i, ops);
 	if (iter.pos == iocb->ki_pos)
 		return ret;
 	return iter.pos - iocb->ki_pos;
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
-static loff_t iomap_unshare_iter(struct iomap_iter *iter)
+static loff_t iomap_unshare_iter(struct iomap_iter *iter,
+		const struct iomap_ops *ops)
 {
 	struct iomap *iomap = &iter->iomap;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
@@ -853,9 +879,11 @@  static loff_t iomap_unshare_iter(struct iomap_iter *iter)
 		unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
 		struct folio *folio;
 
-		status = iomap_write_begin(iter, pos, bytes, &folio);
+		status = iomap_write_begin(iter, ops, pos, bytes, &folio);
 		if (unlikely(status))
 			return status;
+		if (iter->iomap.flags & IOMAP_F_STALE)
+			break;
 
 		status = iomap_write_end(iter, pos, bytes, bytes, folio);
 		if (WARN_ON_ONCE(status == 0))
@@ -886,12 +914,13 @@  iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 	int ret;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_unshare_iter(&iter);
+		iter.processed = iomap_unshare_iter(&iter, ops);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_file_unshare);
 
-static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
+static loff_t iomap_zero_iter(struct iomap_iter *iter,
+		const struct iomap_ops *ops, bool *did_zero)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	loff_t pos = iter->pos;
@@ -908,9 +937,11 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		size_t offset;
 		size_t bytes = min_t(u64, SIZE_MAX, length);
 
-		status = iomap_write_begin(iter, pos, bytes, &folio);
+		status = iomap_write_begin(iter, ops, pos, bytes, &folio);
 		if (status)
 			return status;
+		if (iter->iomap.flags & IOMAP_F_STALE)
+			break;
 
 		offset = offset_in_folio(folio, pos);
 		if (bytes > folio_size(folio) - offset)
@@ -946,7 +977,7 @@  iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 	int ret;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_zero_iter(&iter, did_zero);
+		iter.processed = iomap_zero_iter(&iter, ops, did_zero);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
diff --git a/fs/iomap/iter.c b/fs/iomap/iter.c
index a1c7592d2ade..370e8a23c1b8 100644
--- a/fs/iomap/iter.c
+++ b/fs/iomap/iter.c
@@ -7,12 +7,36 @@ 
 #include <linux/iomap.h>
 #include "trace.h"
 
+/*
+ * Advance to the next range we need to map.
+ *
+ * If the iomap is marked IOMAP_F_STALE, it means the existing map was not fully
+ * processed - it was aborted because the extent the iomap spanned may have been
+ * changed during the operation. In this case, the iteration behaviour is to
+ * remap the unprocessed range of the iter, and that means we may need to remap
+ * even when we've made no progress (i.e. iter->processed = 0). Hence the
+ * "finished iterating" case needs to distinguish between
+ * (processed = 0) meaning we are done and (processed = 0 && stale) meaning we
+ * need to remap the entire remaining range.
+ *
+ * We also need to preserve IOMAP_F_STALE on the iomap so that the next call
+ * to iomap_begin() knows that it is reprocessing a stale map. Similarly, we
+ * need to preserve IOMAP_F_NEW as the filesystem may not realise that it
+ * is remapping a region that it allocated in the previous cycle and we still
+ * need to initialise partially filled pages within the remapped range.
+ *
+ */
 static inline int iomap_iter_advance(struct iomap_iter *iter)
 {
+	bool stale = iter->iomap.flags & IOMAP_F_STALE;
+	int new = iter->iomap.flags & IOMAP_F_NEW;
+
 	/* handle the previous iteration (if any) */
 	if (iter->iomap.length) {
-		if (iter->processed <= 0)
+		if (iter->processed < 0)
 			return iter->processed;
+		if (!iter->processed && !stale)
+			return 0;
 		if (WARN_ON_ONCE(iter->processed > iomap_length(iter)))
 			return -EIO;
 		iter->pos += iter->processed;
@@ -25,6 +49,8 @@  static inline int iomap_iter_advance(struct iomap_iter *iter)
 	iter->processed = 0;
 	memset(&iter->iomap, 0, sizeof(iter->iomap));
 	memset(&iter->srcmap, 0, sizeof(iter->srcmap));
+	if (stale)
+		iter->iomap.flags |= IOMAP_F_STALE | new;
 	return 1;
 }
 
@@ -33,6 +59,7 @@  static inline void iomap_iter_done(struct iomap_iter *iter)
 	WARN_ON_ONCE(iter->iomap.offset > iter->pos);
 	WARN_ON_ONCE(iter->iomap.length == 0);
 	WARN_ON_ONCE(iter->iomap.offset + iter->iomap.length <= iter->pos);
+	WARN_ON_ONCE(iter->iomap.flags & IOMAP_F_STALE);
 
 	trace_iomap_iter_dstmap(iter->inode, &iter->iomap);
 	if (iter->srcmap.type != IOMAP_HOLE)
@@ -68,8 +95,10 @@  int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops)
 
 	trace_iomap_iter(iter, ops, _RET_IP_);
 	ret = iomap_iter_advance(iter);
-	if (ret <= 0)
+	if (ret < 0)
 		return ret;
+	if (!ret && !(iter->iomap.flags & IOMAP_F_STALE))
+		return 0;
 
 	ret = ops->iomap_begin(iter->inode, iter->pos, iter->len, iter->flags,
 			       &iter->iomap, &iter->srcmap);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 238a03087e17..308931f0840a 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -62,8 +62,13 @@  struct vm_fault;
  *
  * IOMAP_F_SIZE_CHANGED indicates to the iomap_end method that the file size
  * has changed as the result of this write operation.
+ *
+ * IOMAP_F_STALE indicates that the iomap is not valid any longer and the file
+ * range it covers needs to be remapped by the high level before the operation
+ * can proceed.
  */
 #define IOMAP_F_SIZE_CHANGED	0x100
+#define IOMAP_F_STALE		0x200
 
 /*
  * Flags from 0x1000 up are for file system specific usage:
@@ -165,6 +170,18 @@  struct iomap_ops {
 	 */
 	int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
 			ssize_t written, unsigned flags, struct iomap *iomap);
+
+	/*
+	 * Check that the cached iomap still maps correctly to the filesystem's
+	 * internal extent map. FS internal extent maps can change while iomap
+	 * is iterating a cached iomap, so this hook allows iomap to detect that
+	 * the iomap needs to be refreshed during a long running write
+	 * operation.
+	 *
+	 * This is called with the folio over the specified file position
+	 * held locked by the iomap code.
+	 */
+	bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
 };
 
 /**