[RFC] iomap: invalidate pages past eof in iomap_do_writepage()

Message ID 20220601011116.495988-1-clm@fb.com (mailing list archive)
State New, archived
Series [RFC] iomap: invalidate pages past eof in iomap_do_writepage()

Commit Message

Chris Mason June 1, 2022, 1:11 a.m. UTC
iomap_do_writepage() sends pages past i_size through
folio_redirty_for_writepage(), which normally isn't a problem because
truncate and friends clean them very quickly.

When the system a variety of cgroups, we can end up in situations where
one cgroup has almost no dirty pages at all.  This is especially common
in our XFS workloads in production because they tend to use O_DIRECT for
everything.

We've hit storms where the redirty path hits millions of times in a few
seconds, all on a single file that's only ~40 pages long.  This ends up
leading to long tail latencies for file writes because the page reclaim
workers are hogging the CPU from some kworkers bound to the same CPU.

That's the theory anyway.  We know the storms exist, but the tail
latencies are so sporadic that it's hard to have any certainty about the
cause without patching a large number of boxes.

There are a few different problems here.  First is just that I don't
understand how invalidating the page instead of redirtying might upset
the rest of the iomap/xfs world.  Btrfs invalidates in this case, which
seems like the right thing to me, but we all have slightly different
sharp corners in the truncate path so I thought I'd ask for comments.

Second is the VM should take wbc->pages_skipped into account, or use
some other way to avoid looping over and over.  I think we actually want
both but I wanted to understand the page invalidation path first.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Domas Mituzas <domas@fb.com>
---
 fs/iomap/buffered-io.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

Comments

Christoph Hellwig June 1, 2022, 12:18 p.m. UTC | #1
This does look sane to me.  How much testing did this get?  Especially
for the block size < page size case?  Also adding Dave as he has spent
a lot of time on this code.

On Tue, May 31, 2022 at 06:11:17PM -0700, Chris Mason wrote:
> iomap_do_writepage() sends pages past i_size through
> folio_redirty_for_writepage(), which normally isn't a problem because
> truncate and friends clean them very quickly.
> 
> When the system a variety of cgroups,

^^^ This sentence does not parse ^^^

> we can end up in situations where
> one cgroup has almost no dirty pages at all.  This is especially common
> in our XFS workloads in production because they tend to use O_DIRECT for
> everything.
> 
> We've hit storms where the redirty path hits millions of times in a few
> seconds, on all a single file that's only ~40 pages long.  This ends up
> leading to long tail latencies for file writes because the page reclaim
> workers are hogging the CPU from some kworkers bound to the same CPU.
> 
> That's the theory anyway.  We know the storms exist, but the tail
> latencies are so sporadic that it's hard to have any certainty about the
> cause without patching a large number of boxes.
> 
> There are a few different problems here.  First is just that I don't
> understand how invalidating the page instead of redirtying might upset
> the rest of the iomap/xfs world.  Btrfs invalidates in this case, which
> seems like the right thing to me, but we all have slightly different
> sharp corners in the truncate path so I thought I'd ask for comments.
> 
> Second is the VM should take wbc->pages_skipped into account, or use
> some other way to avoid looping over and over.  I think we actually want
> both but I wanted to understand the page invalidation path first.
> 
> Signed-off-by: Chris Mason <clm@fb.com>
> Reported-by: Domas Mituzas <domas@fb.com>
> ---
>  fs/iomap/buffered-io.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8ce8720093b9..4a687a2a9ed9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1482,10 +1482,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  		pgoff_t end_index = isize >> PAGE_SHIFT;
>  
>  		/*
> -		 * Skip the page if it's fully outside i_size, e.g. due to a
> -		 * truncate operation that's in progress. We must redirty the
> -		 * page so that reclaim stops reclaiming it. Otherwise
> -		 * iomap_vm_releasepage() is called on it and gets confused.
> +		 * invalidate the page if it's fully outside i_size, e.g.
> +		 * due to a truncate operation that's in progress.
>  		 *
>  		 * Note that the end_index is unsigned long.  If the given
>  		 * offset is greater than 16TB on a 32-bit system then if we
> @@ -1499,8 +1497,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  		 * offset is just equal to the EOF.
>  		 */
>  		if (folio->index > end_index ||
> -		    (folio->index == end_index && poff == 0))
> -			goto redirty;
> +		    (folio->index == end_index && poff == 0)) {
> +			folio_invalidate(folio, 0, folio_size(folio));
> +			goto unlock;
> +		}
>  
>  		/*
>  		 * The page straddles i_size.  It must be zeroed out on each
> @@ -1518,6 +1518,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  
>  redirty:
>  	folio_redirty_for_writepage(wbc, folio);
> +unlock:
>  	folio_unlock(folio);
>  	return 0;
>  }
> -- 
> 2.30.2
> 
---end quoted text---
Chris Mason June 1, 2022, 2:13 p.m. UTC | #2
> On Jun 1, 2022, at 8:18 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> This does look sane to me. How much testing did this get?

Almost none at all, I made sure the invalidates were triggering and bashed on it with fsx, but haven’t even done xfstests yet.  The first rule about truncate is that v1 patches are always broken, so I’m expecting explosions.

> Especially
> for the block size < page size case? Also adding Dave as he has spent
> a lot of time on this code.
> 

Sorry Dave, I thought I had you in here already.

> On Tue, May 31, 2022 at 06:11:17PM -0700, Chris Mason wrote:
>> iomap_do_writepage() sends pages past i_size through
>> folio_redirty_for_writepage(), which normally isn't a problem because
>> truncate and friends clean them very quickly.
>> 
>> When the system a variety of cgroups,
> 
> ^^^ This sentence does not parse ^^^
> 

Most of production is set up with one cgroup tree for the workloads we love and care about, and a few cgroup trees for everything else.  We tend to crank down memory or IO limits on the unloved cgroups and prioritize the workload cgroups.

This problem is hitting our mysql workloads, which are mostly O_DIRECT on a relatively small number of files.  From a kernel point of view it’s a lot of IO and not much actual resource management.  What’s happening in prod (on an older 5.6 kernel) is the non-mysql cgroups are blowing past the background dirty threshold, which kicks off the async writeback workers.

The actual call path is: wb_workfn()->wb_do_writeback()->wb_check_background_flush()->wb_writeback()->__writeback_inodes_sb()

Johannes explained to me that wb_over_bg_thresh(wb) ends up returning true on the mysql cgroups because the global background limit has been reached, even though mysql didn’t really contribute much of the dirty.  So we call down into wb_writeback(), which will loop as long as __writeback_inodes_wb() returns that it’s making progress and we’re still globally over the bg threshold.

In prod, bpftrace showed looping on a single inode inside a mysql cgroup.  That inode was usually in the middle of being deleted, i_size set to zero, but it still had 40-90 pages sitting in the xarray waiting for truncation.  We’d loop through the whole call path above over and over again, mostly because writepages() was returning progress had been made on this one inode.  The redirty_page_for_writepage() path does drop wbc->nr_to_write, so the rest of the writepages machinery believes real work is being done.  nr_to_write is LONG_MAX, so we’ve got a while to loop.

I had dreams of posting a trivial reproduction with two cgroups, dd, and a single file being written and truncated in a loop, which works pretty well on 5.6 and refuses to be useful upstream.   Johannes and I talked it over and we still think this patch makes sense, since the redirty path feels suboptimal.  I’ll try to make a better reproduction as well.

To give an idea of how rare this is, I’d run bpftrace for 300 seconds at a time on 10K machines and usually find a single machine in the loop.

-chris
Dave Chinner June 2, 2022, 6:52 a.m. UTC | #3
On Wed, Jun 01, 2022 at 02:13:42PM +0000, Chris Mason wrote:
> 
> > On Jun 1, 2022, at 8:18 AM, Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > This does look sane to me. How much testing did this get?
> 
> Almost none at all, I made sure the invalidates were triggering
> and bashed on it with fsx, but haven’t even done xfstests yet.
> The first rule about truncate is that v1 patches are always
> broken, so I’m expecting explosions.

If there are going to be explosions, it will be on block size < page
size configs, I think. Otherwise the folio_invalidate() call is
largely a no-op.

> 
> > Especially
> > for the block size < page sie case? Also adding Dave as he has spent
> > a lot of time on this code.
> > 
> 
> Sorry Dave, I thought I had you in here already.

You did - you got me with linux-xfs@...

> > On Tue, May 31, 2022 at 06:11:17PM -0700, Chris Mason wrote:
> >> iomap_do_writepage() sends pages past i_size through
> >> folio_redirty_for_writepage(), which normally isn't a problem because
> >> truncate and friends clean them very quickly.
> >> 
> >> When the system a variety of cgroups,
> > 
> > ^^^ This sentence does not parse ^^^
>
> Most of production is setup with one cgroup tree for the workloads
> we love and care about, and a few cgroup trees for everything
> else.  We tend to crank down memory or IO limits on the unloved
> cgroups and prioritize the workload cgroups.
> 
> This problem is hitting our mysql workloads, which are mostly
> O_DIRECT on a relatively small number of files.  From a kernel
> point of view it’s a lot of IO and not much actual resource
> management.  What’s happening in prod (on an older 5.6 kernel) is
> the non-mysql cgroups are blowing past the background dirty
> threshold, which kicks off the async writeback workers.
> 
> The actual call path is:
> wb_workfn()->wb_do_writeback()->wb_check_background_flush()->wb_writeback()->__writeback_inodes_sb()
> 
> Johannes explained to me that wb_over_bg_thresh(wb) ends up
> returning true on the mysql cgroups because the global background
> limit has been reached, even though mysql didn’t really contribute
> much of the dirty.  So we call down into wb_writeback(), which
> will loop as long as __writeback_inodes_wb() returns that it’s
> making progress and we’re still globally over the bg threshold.

*nod*

> In prod, bpftrace showed looping on a single inode inside a mysql
> cgroup.  That inode was usually in the middle of being deleted,
> i_size set to zero, but it still had 40-90 pages sitting in the
> xarray waiting for truncation.  We’d loop through the whole call
> path above over and over again, mostly because writepages() was
> returning progress had been made on this one inode.  The
> redirty_page_for_writepage() path does drop wbc->nr_to_write, so
> the rest of the writepages machinery believes real work is being
> done.  nr_to_write is LONG_MAX, so we’ve got a while to loop.

Yup, this code relies on truncate making progress to avoid looping
forever. Truncate should only block on the page while it locks it
and waits for writeback to complete, then it gets forcibly
invalidated and removed from the page cache.

Ok, so assuming this is the case, why is truncate apparently not
making progress? I can understand that it might lockstep with
writeback trying to write and then redirtying folios, but that
lockstep pass of ascending folio index should only happen once
because once truncate locks the folio the page gets removed from
the mapping regardless of whether it is clean or dirty.

Oh.

Aren't there starvation problems with folio_lock() because the
wakeup is not atomic with granting the lock bit? Hence if something
is locking and unlocking a page in quick succession, a waiter can be
woken but by the time it has been scheduled and running the lock bit
has been taken by someone else and so it goes back to sleep?

And the patch fixes this problem by breaking the lock/unlock cycling
in writeback() by not redirtying the folio, and so writeback
doesn't iterate straight back to it and lock it again?

Regardless of the cause, why do we redirty the page there?

/me trolls through git history

commit ff9a28f6c25d18a635abcab1f49db68108203dfb
Author: Jan Kara <jack@suse.cz>
Date:   Thu Mar 14 14:30:54 2013 +0100

    xfs: Fix WARN_ON(delalloc) in xfs_vm_releasepage()
    
    When a dirty page is truncated from a file but reclaim gets to it before
    truncate_inode_pages(), we hit WARN_ON(delalloc) in
    xfs_vm_releasepage(). This is because reclaim tries to write the page,
    xfs_vm_writepage() just bails out (leaving page clean) and thus reclaim
    thinks it can continue and calls xfs_vm_releasepage() on page with dirty
    buffers.
    
    Fix the issue by redirtying the page in xfs_vm_writepage(). This makes
    reclaim stop reclaiming the page and also logically it keeps page in a
    more consistent state where page with dirty buffers has PageDirty set.
    
    Signed-off-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    Signed-off-by: Ben Myers <bpm@sgi.com>

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5f707e537171..3244c988d379 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -953,13 +953,13 @@ xfs_vm_writepage(
                unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1);
 
                /*
-                * Just skip the page if it is fully outside i_size, e.g. due
-                * to a truncate operation that is in progress.
+                * Skip the page if it is fully outside i_size, e.g. due to a
+                * truncate operation that is in progress. We must redirty the
+                * page so that reclaim stops reclaiming it. Otherwise
+                * xfs_vm_releasepage() is called on it and gets confused.
                 */
-               if (page->index >= end_index + 1 || offset_into_page == 0) {
-                       unlock_page(page);
-                       return 0;
-               }
+               if (page->index >= end_index + 1 || offset_into_page == 0)
+                       goto redirty;
 
                /*
                 * The page straddles i_size.  It must be zeroed out on each

Ah, I remember those problems. Random WARN_ON()s during workloads
under heavy memory pressure and lots of page cache dirtying.
Essentially, this fix was working around the old "writepage to clean
dirty pages at the end of the LRU during memory reclaim"
anti-pattern.  Well, that problem doesn't exist anymore - XFS hasn't
had a ->writepage method for some time now, so memory reclaim can't
clean dirty pages on XFS filesystems directly and hence can't try to
clean and release dirty pages while a truncate is in progress.

Further, the filesystem extent state coherency problems with
bufferhead caching that triggered in ->releasepage don't exist
anymore, either. That was one of the exact problems we invented
iomap to solve....

Hence I think we can remove the redirtying completely - it's not
needed and hasn't been for some time.

Further, I don't think we need to invalidate the folio, either. If
it's beyond EOF, then it is because a truncate is in progress that
means it is somebody else's problem to clean up. Hence we should
leave it to the truncate to deal with, just like the pre-2013 code
did....

Cheers,

Dave.
Johannes Weiner June 2, 2022, 3:32 p.m. UTC | #4
On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote:
> On Wed, Jun 01, 2022 at 02:13:42PM +0000, Chris Mason wrote:
> > In prod, bpftrace showed looping on a single inode inside a mysql
> > cgroup.  That inode was usually in the middle of being deleted,
> > i_size set to zero, but it still had 40-90 pages sitting in the
> > xarray waiting for truncation.  We’d loop through the whole call
> > path above over and over again, mostly because writepages() was
> > returning progress had been made on this one inode.  The
> > redirty_page_for_writepage() path does drop wbc->nr_to_write, so
> > the rest of the writepages machinery believes real work is being
> > done.  nr_to_write is LONG_MAX, so we’ve got a while to loop.
> 
> Yup, this code relies on truncate making progress to avoid looping
> forever. Truncate should only block on the page while it locks it
> and waits for writeback to complete, then it gets forcibly
> invalidated and removed from the page cache.

It's not looping forever, truncate can just take a relatively long
time during which the flusher is busy-spinning full bore on a
relatively small number of unflushable pages (range_cyclic).

But you raise a good point asking "why is truncate stuck?". I first
thought they might be cannibalizing each other over the page locks,
but that wasn't it (and wouldn't explain the clear asymmetry between
truncate and flusher). That leaves the waiting for writeback. I just
confirmed with tracing that that's exactly where truncate sits while
the flusher goes bananas on the same inode. So the race must be this:

truncate:                flusher
                         put a subset of pages under writeback
i_size_write(0)
wait_on_page_writeback()
                         loop with range_cyclic over remaining dirty >EOF pages

> Hence I think we can remove the redirtying completely - it's not
> needed and hasn't been for some time.
> 
> Further, I don't think we need to invalidate the folio, either. If
> it's beyond EOF, then it is because a truncate is in progress that
> means it is somebody else's problem to clean up. Hence we should
> leave it to the truncate to deal with, just like the pre-2013 code
> did....

Perfect, that works.
Chris Mason June 2, 2022, 7:41 p.m. UTC | #5
> On Jun 2, 2022, at 11:32 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote:
>> 
>> Further, I don't think we need to invalidate the folio, either. If
>> it's beyond EOF, then it is because a truncate is in progress that
>> means it is somebody else's problem to clean up. Hence we should
>> leave it to the truncate to deal with, just like the pre-2013 code
>> did....
> 
> Perfect, that works.

Ok, I’ll run it through xfstests and resend a v2.  Will this cover all of the subpage blocksize concerns?

-chris
Matthew Wilcox June 2, 2022, 7:59 p.m. UTC | #6
On Thu, Jun 02, 2022 at 07:41:55PM +0000, Chris Mason wrote:
> 
> > On Jun 2, 2022, at 11:32 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > 
> > On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote:
> >> 
> >> Further, I don't think we need to invalidate the folio, either. If
> >> it's beyond EOF, then it is because a truncate is in progress that
> >> means it is somebody else's problem to clean up. Hence we should
> >> leave it to the truncate to deal with, just like the pre-2013 code
> >> did....
> > 
> > Perfect, that works.
> 
> Ok, I’ll run it through xfstests and resend a v2.  Will this cover all of the subpage blocksize concerns?

I run with:
export MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"

Dave may have other options he'd like to see.
Dave Chinner June 2, 2022, 10:06 p.m. UTC | #7
On Thu, Jun 02, 2022 at 11:32:30AM -0400, Johannes Weiner wrote:
> On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote:
> > On Wed, Jun 01, 2022 at 02:13:42PM +0000, Chris Mason wrote:
> > > In prod, bpftrace showed looping on a single inode inside a mysql
> > > cgroup.  That inode was usually in the middle of being deleted,
> > > i_size set to zero, but it still had 40-90 pages sitting in the
> > > xarray waiting for truncation.  We’d loop through the whole call
> > > path above over and over again, mostly because writepages() was
> > > returning progress had been made on this one inode.  The
> > > redirty_page_for_writepage() path does drop wbc->nr_to_write, so
> > > the rest of the writepages machinery believes real work is being
> > > done.  nr_to_write is LONG_MAX, so we’ve got a while to loop.
> > 
> > Yup, this code relies on truncate making progress to avoid looping
> > forever. Truncate should only block on the page while it locks it
> > and waits for writeback to complete, then it gets forcibly
> > invalidated and removed from the page cache.
> 
> It's not looping forever, truncate can just take a relatively long
> time during which the flusher is busy-spinning full bore on a
> relatively small number of unflushable pages (range_cyclic).
> 
> But you raise a good point asking "why is truncate stuck?". I first
> thought they might be cannibalizing each other over the page locks,
> but that wasn't it (and wouldn't explain the clear asymmetry between
> truncate and flusher). That leaves the waiting for writeback. I just
> confirmed with tracing that that's exactly where truncate sits while
> the flusher goes bananas on the same inode. So the race must be this:
> 
> truncate:                flusher
>                          put a subset of pages under writeback
> i_size_write(0)
> wait_on_page_writeback()
>                          loop with range_cyclic over remaining dirty >EOF pages

But write_cache_pages() doesn't repeatedly loop over the pages.

The flusher is

->writepages
  iomap_writepages
    write_cache_pages()
      loop over mapping tree
        lock page
	iomap_do_writepage
	  set_page_writeback()
	  add page to ioend
     <end of mapping reached>
  iomap_submit_ioend()
    <pages under writeback get sent for IO>
return to high level writeback

And eventually IO completion will clear page writeback state.

i.e. write_cache_pages() should not be hard looping over the pages
beyond EOF even if range_cyclic is set - it's skipping those pages,
submitting any that are under writeback, and the, going back to high
level code for it to make a decision about continuation of
writeback. It may call back down and we loop over dirty pages beyond
EOF again, but the flusher should not be holding on to pages under
writeback for any signification length of time before they are
submitted for IO.

IOWs, if truncate is getting stuck waiting on writeback, then that
implies something is holding up IO completions for a long time, not
that there's a problem in writeback submission. i.e. you might
actually be looking at a workqueue backlog or scheduling starvation
problem here preventing IO completion from clearing writeback
state....

> > Hence I think we can remove the redirtying completely - it's not
> > needed and hasn't been for some time.
> > 
> > Further, I don't think we need to invalidate the folio, either. If
> > it's beyond EOF, then it is because a truncate is in progress that
> > means it is somebody else's problem to clean up. Hence we should
> > leave it to the truncate to deal with, just like the pre-2013 code
> > did....
> 
> Perfect, that works.

If there's actually an IO completion latency problem, this will not
fix it - it'll just hide the soft-lockup symptom that the pages
stuck in writeback are manifested through this path.

Cheers,

Dave.
Dave Chinner June 2, 2022, 10:07 p.m. UTC | #8
On Thu, Jun 02, 2022 at 08:59:01PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 02, 2022 at 07:41:55PM +0000, Chris Mason wrote:
> > 
> > > On Jun 2, 2022, at 11:32 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > 
> > > On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote:
> > >> 
> > >> Further, I don't think we need to invalidate the folio, either. If
> > >> it's beyond EOF, then it is because a truncate is in progress that
> > >> means it is somebody else's problem to clean up. Hence we should
> > >> leave it to the truncate to deal with, just like the pre-2013 code
> > >> did....
> > > 
> > > Perfect, that works.
> > 
> > Ok, I’ll run it through xfstests and resend a v2.  Will this cover all of the subpage blocksize concerns?
> 
> I run with:
> export MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"
> 
> Dave may have other options he'd like to see.

No, that above is what I'd suggest.

Cheers,

Dave.
Chris Mason June 3, 2022, 1:29 a.m. UTC | #9
> On Jun 2, 2022, at 6:06 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Thu, Jun 02, 2022 at 11:32:30AM -0400, Johannes Weiner wrote:
>> On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote:
>>> On Wed, Jun 01, 2022 at 02:13:42PM +0000, Chris Mason wrote:
>>>> In prod, bpftrace showed looping on a single inode inside a mysql
>>>> cgroup. That inode was usually in the middle of being deleted,
>>>> i_size set to zero, but it still had 40-90 pages sitting in the
>>>> xarray waiting for truncation. We’d loop through the whole call
>>>> path above over and over again, mostly because writepages() was
>>>> returning progress had been made on this one inode. The
>>>> redirty_page_for_writepage() path does drop wbc->nr_to_write, so
>>>> the rest of the writepages machinery believes real work is being
>>>> done. nr_to_write is LONG_MAX, so we’ve got a while to loop.
>>> 
>>> Yup, this code relies on truncate making progress to avoid looping
>>> forever. Truncate should only block on the page while it locks it
>>> and waits for writeback to complete, then it gets forcibly
>>> invalidated and removed from the page cache.
>> 
>> It's not looping forever, truncate can just take a relatively long
>> time during which the flusher is busy-spinning full bore on a
>> relatively small number of unflushable pages (range_cyclic).
>> 
>> But you raise a good point asking "why is truncate stuck?". I first
>> thought they might be cannibalizing each other over the page locks,
>> but that wasn't it (and wouldn't explain the clear asymmetry between
>> truncate and flusher). That leaves the waiting for writeback. I just
>> confirmed with tracing that that's exactly where truncate sits while
>> the flusher goes bananas on the same inode. So the race must be this:
>> 
>> truncate:                flusher
>>                          put a subset of pages under writeback
>> i_size_write(0)
>> wait_on_page_writeback()
>>                          loop with range_cyclic over remaining dirty >EOF pages
> 
> But write_cache_pages() doesn't repeatedly loop over the pages.
> 
> The flusher is
> 
> ->writepages
>   iomap_writepages
>     write_cache_pages()
>       loop over mapping tree
>         lock page
>         iomap_do_writepage
>           set_page_writeback()
>           add page to ioend
>       <end of mapping reached>
>   iomap_submit_ioend()
>     <pages under writeback get sent for IO>
> return to high level writeback
> 
> And eventually IO completion will clear page writeback state.
> 

Yes, this is actually happening before the truncate starts.  The truncate finds these writeback pages and waits for them to finish IO, and while it’s waiting wb_check_background_flush() goes wild on the redirty path.

> i.e. write_cache_pages() should not be hard looping over the pages
> beyond EOF even if range_cyclic is set - it's skipping those pages,
> submitting any that are under writeback, and the, going back to high
> level code for it to make a decision about continuation of
> writeback. It may call back down and we loop over dirty pages beyond
> EOF again, but the flusher should not be holding on to pages under
> writeback for any signification length of time before they are
> submitted for IO.
> 

I spent a while trying to blame write_cache_pages() for looping repeatedly, and ended up making a series of bpftrace scripts that collected call stack frequency counters for all the different ways to wander into write_cache_pages().  I eventually ran it across 100K systems to figure out exactly how we were getting into trouble.  It was (thankfully) really consistent.

As you describe above, the loops are definitely coming from higher in the stack.  wb_writeback() will loop as long as __writeback_inodes_wb() returns that it’s making progress and we’re still globally over the bg threshold, so write_cache_pages() is just being called over and over again.  We’re coming from wb_check_background_flush(), so:

                struct wb_writeback_work work = {
                        .nr_pages       = LONG_MAX,
                        .sync_mode      = WB_SYNC_NONE,
                        .for_background = 1,
                        .range_cyclic   = 1,
                        .reason         = WB_REASON_BACKGROUND,
                };

> IOWs, if truncate is getting stuck waiting on writeback, then that
> implies something is holding up IO completions for a long time,

From Johannes’s tracing today, that’s about 17us.  Our victim cgroup has just a handful of files, so we can burn through a lot of write_cache_pages loops in the time truncate is waiting for a single IO on a relatively fast ssd.

> not
> that there's a problem in writeback submission. i.e. you might
> actually be looking at a workqueue backlog

I actually think the workqueue backlog is coming from the flusher thread hogging the CPU.  The investigation started by looking for long tail latencies in write() system calls, and Domas’s original finding was that IO completion workers were waiting for CPUs.  That’s how he ended up finding the redirty calls consuming high levels of CPU.  We honestly won’t be sure until we live patch a lot of boxes and look for long tail latency improvements, but I’m pretty optimistic.

-chris
Dave Chinner June 3, 2022, 5:20 a.m. UTC | #10
[ Chris, can you line wrap your emails at 72 columns, please? ]

On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote:
> > On Jun 2, 2022, at 6:06 PM, Dave Chinner <david@fromorbit.com>
> > wrote: On Thu, Jun 02, 2022 at 11:32:30AM -0400, Johannes Weiner
> > wrote:
> >> On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote:
> >>> On Wed, Jun 01, 2022 at 02:13:42PM +0000, Chris Mason wrote:
> >>>> In prod, bpftrace showed looping on a single inode inside a
> >>>> mysql cgroup. That inode was usually in the middle of being
> >>>> deleted, i_size set to zero, but it still had 40-90 pages
> >>>> sitting in the xarray waiting for truncation. We’d loop
> >>>> through the whole call path above over and over again, mostly
> >>>> because writepages() was returning progress had been made on
> >>>> this one inode. The redirty_page_for_writepage() path does
> >>>> drop wbc->nr_to_write, so the rest of the writepages
> >>>> machinery believes real work is being done. nr_to_write is
> >>>> LONG_MAX, so we’ve got a while to loop.
> >>> 
> >>> Yup, this code relies on truncate making progress to avoid
> >>> looping forever. Truncate should only block on the page while
> >>> it locks it and waits for writeback to complete, then it gets
> >>> forcibly invalidated and removed from the page cache.
> >> 
> >> It's not looping forever, truncate can just take a relatively
> >> long time during which the flusher is busy-spinning full bore
> >> on a relatively small number of unflushable pages
> >> (range_cyclic).
> >> 
> >> But you raise a good point asking "why is truncate stuck?". I
> >> first thought they might be cannibalizing each other over the
> >> page locks, but that wasn't it (and wouldn't explain the clear
> >> asymmetry between truncate and flusher). That leaves the
> >> waiting for writeback. I just confirmed with tracing that
> >> that's exactly where truncate sits while the flusher goes
> >> bananas on the same inode. So the race must be this:
> >> 
> >> truncate:                flusher
> >>                          put a subset of pages under writeback
> >> i_size_write(0)
> >> wait_on_page_writeback()
> >>                          loop with range_cyclic over
> >>                          remaining dirty >EOF pages
> > 
> > But write_cache_pages() doesn't repeatedly loop over the
> > pages.
> > 
> > The flusher is
> > 
> > ->writepages
> >   iomap_writepages
> >     write_cache_pages()
> >       loop over mapping tree
> >         lock page
> >         iomap_do_writepage
> >           set_page_writeback()
> >           add page to ioend
> >       <end of mapping reached>
> >     iomap_submit_ioend()
> >     <pages under writeback get sent for IO>
> > return to high level writeback
> > 
> > And eventually IO completion will clear page writeback state.
> > 
> 
> Yes, this is actually happening before the truncate starts.  The
> truncate finds these writeback pages and waits for them to finish
> IO, and while it’s waiting wb_check_background_flush() goes wild
> on the redirty path.

I still don't think the redirty path is the underlying problem. Yes,
it triggers it, but it looks to be triggering an existing behaviour
that the writeback path should not have...

> > i.e. write_cache_pages() should not be hard looping over the
> > pages beyond EOF even if range_cyclic is set - it's skipping
> > those pages, submitting any that are under writeback, and then
> > going back to high level code for it to make a decision about
> > continuation of writeback. It may call back down and we loop
> > over dirty pages beyond EOF again, but the flusher should not be
> > holding on to pages under writeback for any significant length
> > of time before they are submitted for IO.
> > 
> 
> I spent a while trying to blame write_cache_pages() for looping
> repeatedly, and ended up making a series of bpftrace scripts that
> collected call stack frequency counters for all the different ways
> to wander into write_cache_pages().  I eventually ran it across
> 100K systems to figure out exactly how we were getting into
> trouble.  It was (thankfully) really consistent.
> 
> As you describe above, the loops are definitely coming from higher
> in the stack.  wb_writeback() will loop as long as
> __writeback_inodes_wb() returns that it’s making progress and
> we’re still globally over the bg threshold, so write_cache_pages()
> is just being called over and over again.  We’re coming from
> wb_check_background_flush(), so:
> 
>                 struct wb_writeback_work work = {
>                         .nr_pages       = LONG_MAX,
>                         .sync_mode      = WB_SYNC_NONE,
>                         .for_background = 1,
>                         .range_cyclic   = 1,
>                         .reason         = WB_REASON_BACKGROUND,
>                 };

Sure, but we end up in writeback_sb_inodes() which does this after
the __writeback_single_inode()->do_writepages() call that iterates
the dirty pages:

               if (need_resched()) {
                        /*
                         * We're trying to balance between building up a nice
                         * long list of IOs to improve our merge rate, and
                         * getting those IOs out quickly for anyone throttling
                         * in balance_dirty_pages().  cond_resched() doesn't
                         * unplug, so get our IOs out the door before we
                         * give up the CPU.
                         */
                        blk_flush_plug(current->plug, false);
                        cond_resched();
                }

So if there is a pending IO completion on this CPU on a work queue
here, we'll reschedule to it because the work queue kworkers are
bound to CPUs and they take priority over user threads.

Also, this then requeues the inode on the b_more_io queue, and
wb_check_background_flush() won't come back to it until all other
inodes on all other superblocks on the bdi have had writeback
attempted. So if the system truly is over the background dirty
threshold, why is writeback getting stuck on this one inode in this
way?

> > IOWs, if truncate is getting stuck waiting on writeback, then that
> > implies something is holding up IO completions for a long time,
> 
> From Johannes’s tracing today, that’s about 17us.

Which means there should only be a delay of 17us between IO completion
being queued on this CPU and the flusher thread being preempted and
completion being run, right?

> Our victim cgroup has just a handful of files, so we can burn
> through a lot of write_cache_pages loops in the time truncate is
> waiting for a single IO on a relatively fast ssd.

Yes, I'm not denying that, but background writeback when over global
dirty thresholds should not be spinning on a single inode with
50-100 pages attached to it. It should visit it, try to flush it,
and then move on to the next.

> > not
> > that there's a problem in writeback submission. i.e. you might
> > actually be looking at a workqueue backlog
> 
> I actually think a workqueue backlog is coming from the flusher
> thread hogging the CPU.

I can't see how that happens with that need_resched() check in
writeback_sb_inodes().

> The investigation started by looking for long tail latencies in
> write() system calls, and Domas’s original finding was IO
> completion workers were waiting for CPUs.  That’s how he ended up
> finding the redirty calls using high levels of CPU.  We honestly
> won’t be sure until we live patch a lot of boxes and look for long
> tail latency improvements, but I’m pretty optimistic.

Something still doesn't add up. I don't see how submission spinning
is holding off completion because it plays nice with resched
checks. You're now saying that this started because someone found
long tail completion latencies, and that explains truncate getting
stuck. But AFAICT the writeback behaviour isn't responsible for
completion latencies as writeback will give up the CPU to any other
thread scheduled to run on that CPU pretty quickly.

It feels like there's still some unknown behaviour here, and it
still smells to me like IO completions are getting backed up by
something. I wonder - are there large files getting written using
buffered IO on these machines? Commit ebb7fb1557b1 ("xfs, iomap:
limit individual ioend chain lengths in writeback") might be
relevant if that is the case....

Cheers,

Dave.
Johannes Weiner June 3, 2022, 3:06 p.m. UTC | #11
Hello Dave,

On Fri, Jun 03, 2022 at 03:20:47PM +1000, Dave Chinner wrote:
> On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote:
> > As you describe above, the loops are definitely coming from higher
> > in the stack.  wb_writeback() will loop as long as
> > __writeback_inodes_wb() returns that it’s making progress and
> > we’re still globally over the bg threshold, so write_cache_pages()
> > is just being called over and over again.  We’re coming from
> > wb_check_background_flush(), so:
> > 
> >                 struct wb_writeback_work work = {
> >                         .nr_pages       = LONG_MAX,
> >                         .sync_mode      = WB_SYNC_NONE,
> >                         .for_background = 1,
> >                         .range_cyclic   = 1,
> >                         .reason         = WB_REASON_BACKGROUND,
> >                 };
> 
> Sure, but we end up in writeback_sb_inodes() which does this after
> the __writeback_single_inode()->do_writepages() call that iterates
> the dirty pages:
> 
>                if (need_resched()) {
>                         /*
>                          * We're trying to balance between building up a nice
>                          * long list of IOs to improve our merge rate, and
>                          * getting those IOs out quickly for anyone throttling
>                          * in balance_dirty_pages().  cond_resched() doesn't
>                          * unplug, so get our IOs out the door before we
>                          * give up the CPU.
>                          */
>                         blk_flush_plug(current->plug, false);
>                         cond_resched();
>                 }
> 
> So if there is a pending IO completion on this CPU on a work queue
> here, we'll reschedule to it because the work queue kworkers are
> bound to CPUs and they take priority over user threads.

The flusher thread is also a kworker, though. So it may hit this
cond_resched(), but it doesn't yield until the timeslice expires.

> Also, this then requeues the inode on the b_more_io queue, and
> wb_check_background_flush() won't come back to it until all other
> inodes on all other superblocks on the bdi have had writeback
> attempted. So if the system truly is over the background dirty
> threshold, why is writeback getting stuck on this one inode in this
> way?

The explanation for this part at least is that the bdi/flush domain is
split per cgroup. The cgroup in question is over its proportional bg
thresh. It has very few dirty pages, but it also has very few
*dirtyable* pages, which makes for a high dirty ratio. And those
handful of dirty pages are the unflushable ones past EOF.

There is no next inode to move onto on subsequent loops.
Chris Mason June 3, 2022, 4:09 p.m. UTC | #12
[ From a different message, Dave asks wtf my email client was doing. 
Thanks Dave, apparently exchange is being exchangey with base64 in 
unpredictable ways.  This was better in my test reply, let's see. ]

On 6/3/22 11:06 AM, Johannes Weiner wrote:
> Hello Dave,
> 
> On Fri, Jun 03, 2022 at 03:20:47PM +1000, Dave Chinner wrote:
>> On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote:
>>> As you describe above, the loops are definitely coming from higher
>>> in the stack.  wb_writeback() will loop as long as
>>> __writeback_inodes_wb() returns that it’s making progress and
>>> we’re still globally over the bg threshold, so write_cache_pages()
>>> is just being called over and over again.  We’re coming from
>>> wb_check_background_flush(), so:
>>>
>>>                  struct wb_writeback_work work = {
>>>                          .nr_pages       = LONG_MAX,
>>>                          .sync_mode      = WB_SYNC_NONE,
>>>                          .for_background = 1,
>>>                          .range_cyclic   = 1,
>>>                          .reason         = WB_REASON_BACKGROUND,
>>>                  };
>>
>> Sure, but we end up in writeback_sb_inodes() which does this after
>> the __writeback_single_inode()->do_writepages() call that iterates
>> the dirty pages:
>>
>>                 if (need_resched()) {
>>                          /*
>>                           * We're trying to balance between building up a nice
>>                           * long list of IOs to improve our merge rate, and
>>                           * getting those IOs out quickly for anyone throttling
>>                           * in balance_dirty_pages().  cond_resched() doesn't
>>                           * unplug, so get our IOs out the door before we
>>                           * give up the CPU.
>>                           */
>>                          blk_flush_plug(current->plug, false);
>>                          cond_resched();
>>                  }
>>
>> So if there is a pending IO completion on this CPU on a work queue
>> here, we'll reschedule to it because the work queue kworkers are
>> bound to CPUs and they take priority over user threads.
> 
> The flusher thread is also a kworker, though. So it may hit this
> cond_resched(), but it doesn't yield until the timeslice expires.
>

Just to underline this, the long tail latencies aren't softlockups or 
major explosions.  It's just suboptimal enough that different metrics 
and dashboards noticed it.

-chris
Dave Chinner June 5, 2022, 11:32 p.m. UTC | #13
On Fri, Jun 03, 2022 at 12:09:06PM -0400, Chris Mason wrote:
> [ From a different message, Dave asks wtf my email client was doing. Thanks
> Dave, apparently exchange is being exchangey with base64 in unpredictable
> ways.  This was better in my test reply, lets see. ]
> 
> On 6/3/22 11:06 AM, Johannes Weiner wrote:
> > Hello Dave,
> > 
> > On Fri, Jun 03, 2022 at 03:20:47PM +1000, Dave Chinner wrote:
> > > On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote:
> > > > As you describe above, the loops are definitely coming from higher
> > > > in the stack.  wb_writeback() will loop as long as
> > > > __writeback_inodes_wb() returns that it’s making progress and
> > > > we’re still globally over the bg threshold, so write_cache_pages()
> > > > is just being called over and over again.  We’re coming from
> > > > wb_check_background_flush(), so:
> > > > 
> > > >                  struct wb_writeback_work work = {
> > > >                          .nr_pages       = LONG_MAX,
> > > >                          .sync_mode      = WB_SYNC_NONE,
> > > >                          .for_background = 1,
> > > >                          .range_cyclic   = 1,
> > > >                          .reason         = WB_REASON_BACKGROUND,
> > > >                  };
> > > 
> > > Sure, but we end up in writeback_sb_inodes() which does this after
> > > the __writeback_single_inode()->do_writepages() call that iterates
> > > the dirty pages:
> > > 
> > >                 if (need_resched()) {
> > >                          /*
> > >                           * We're trying to balance between building up a nice
> > >                           * long list of IOs to improve our merge rate, and
> > >                           * getting those IOs out quickly for anyone throttling
> > >                           * in balance_dirty_pages().  cond_resched() doesn't
> > >                           * unplug, so get our IOs out the door before we
> > >                           * give up the CPU.
> > >                           */
> > >                          blk_flush_plug(current->plug, false);
> > >                          cond_resched();
> > >                  }
> > > 
> > > So if there is a pending IO completion on this CPU on a work queue
> > > here, we'll reschedule to it because the work queue kworkers are
> > > bound to CPUs and they take priority over user threads.
> > 
> > The flusher thread is also a kworker, though. So it may hit this
> > cond_resched(), but it doesn't yield until the timeslice expires.

17us or 10ms, it doesn't matter. The fact is the writeback thread
will give up the CPU long before the latency durations (seconds)
that were reported upthread are seen. Writeback spinning can
not explain why truncate is not making progress - everything points
to it being a downstream symptom, not a cause.

Also important to note, as we are talking about kworker scheduling
hold-offs, the writeback flusher work is unbound (can run on any
CPU), whilst the IO completion workers in XFS are per-CPU and bound
to individual CPUs. Bound kernel tasks usually take run queue
priority on a CPU over unbound and/or user tasks that can be punted
to a different CPU. So, again, with this taken into account I don't
see how the flusher thread consuming CPU would cause long duration
hold-offs of IO completion work....

> Just to underline this, the long tail latencies aren't softlockups or major
> explosions.  It's just suboptimal enough that different metrics and
> dashboards noticed it.

Sure, but you've brought a problem we don't understand the root
cause of to my attention. I want to know what the root cause is so
that I can determine that there are no other unknown underlying
issues that are contributing to this issue.

Nothing found so far explains why truncate is not making progress
and that's what *I* need to understand here. Sure, go ahead and
patch your systems to get rid of the bad writeback looping
behaviour, but that does not explain or solve the underlying long IO
completion tail latency issue that originally lead to finding
writeback spinning.

Cheers,

Dave.
Johannes Weiner June 6, 2022, 2:46 p.m. UTC | #14
Hello,

On Mon, Jun 06, 2022 at 09:32:13AM +1000, Dave Chinner wrote:
> On Fri, Jun 03, 2022 at 12:09:06PM -0400, Chris Mason wrote:
> > On 6/3/22 11:06 AM, Johannes Weiner wrote:
> > > On Fri, Jun 03, 2022 at 03:20:47PM +1000, Dave Chinner wrote:
> > > > On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote:
> > > > > As you describe above, the loops are definitely coming from higher
> > > > > in the stack.  wb_writeback() will loop as long as
> > > > > __writeback_inodes_wb() returns that it’s making progress and
> > > > > we’re still globally over the bg threshold, so write_cache_pages()
> > > > > is just being called over and over again.  We’re coming from
> > > > > wb_check_background_flush(), so:
> > > > > 
> > > > >                  struct wb_writeback_work work = {
> > > > >                          .nr_pages       = LONG_MAX,
> > > > >                          .sync_mode      = WB_SYNC_NONE,
> > > > >                          .for_background = 1,
> > > > >                          .range_cyclic   = 1,
> > > > >                          .reason         = WB_REASON_BACKGROUND,
> > > > >                  };
> > > > 
> > > > Sure, but we end up in writeback_sb_inodes() which does this after
> > > > the __writeback_single_inode()->do_writepages() call that iterates
> > > > the dirty pages:
> > > > 
> > > >                 if (need_resched()) {
> > > >                          /*
> > > >                           * We're trying to balance between building up a nice
> > > >                           * long list of IOs to improve our merge rate, and
> > > >                           * getting those IOs out quickly for anyone throttling
> > > >                           * in balance_dirty_pages().  cond_resched() doesn't
> > > >                           * unplug, so get our IOs out the door before we
> > > >                           * give up the CPU.
> > > >                           */
> > > >                          blk_flush_plug(current->plug, false);
> > > >                          cond_resched();
> > > >                  }
> > > > 
> > > > So if there is a pending IO completion on this CPU on a work queue
> > > > here, we'll reschedule to it because the work queue kworkers are
> > > > bound to CPUs and they take priority over user threads.
> > > 
> > > The flusher thread is also a kworker, though. So it may hit this
> > > cond_resched(), but it doesn't yield until the timeslice expires.
> 
> 17us or 10ms, it doesn't matter. The fact is the writeback thread
> will give up the CPU long before the latency durations (seconds)
> that were reported upthread are seen. Writeback spinning can
> not explain why truncate is not making progress - everything points
> to it being a downstream symptom, not a cause.

Chris can clarify, but I don't remember second-long latencies being
mentioned. Rather sampling periods of multiple seconds during which
the spin bursts occur multiple times.

> Also important to note, as we are talking about kworker scheduling
> hold-offs, the writeback flusher work is unbound (can run on any
> CPU), whilst the IO completion workers in XFS are per-CPU and bound
> to individual CPUs. Bound kernel tasks usually take run queue
> priority on a CPU over unbound and/or user tasks that can be punted
> to a different CPU.

Is that actually true? I'm having trouble finding the corresponding
code in the scheduler.

That said, I'm not sure it matters that much. Even if you take CPU
contention out of the equation entirely, I think we agree it's not a
good idea (from a climate POV) to have CPUs busywait on IO. Even if
that IO is just an ordinary wait_on_page_writeback() on a fast drive.

So if we can get rid of the redirtying, and it sounds like we can, IMO
we should just go ahead and do so.

> > Just to underline this, the long tail latencies aren't softlockups or major
> > explosions.  It's just suboptimal enough that different metrics and
> > dashboards noticed it.
> 
> Sure, but you've brought a problem we don't understand the root
> cause of to my attention. I want to know what the root cause is so
> that I can determine that there are no other unknown underlying
> issues that are contributing to this issue.

It seems to me we're just not on the same page on what the reported
bug is. From my POV, there currently isn't a missing piece in this
puzzle. But Chris worked closer with the prod folks on this, so I'll
leave it to him :)
Chris Mason June 6, 2022, 3:13 p.m. UTC | #15
On 6/6/22 10:46 AM, Johannes Weiner wrote:
> Hello,
> 
> On Mon, Jun 06, 2022 at 09:32:13AM +1000, Dave Chinner wrote:

>> Sure, but you've brought a problem we don't understand the root
>> cause of to my attention. I want to know what the root cause is so
>> that I can determine that there are no other unknown underlying
>> issues that are contributing to this issue.
> 
> It seems to me we're just not on the same page on what the reported
> bug is. From my POV, there currently isn't a missing piece in this
> puzzle. But Chris worked closer with the prod folks on this, so I'll
> leave it to him :)

The basic description of the investigation:

* Multiple hits per hour per 100K machines, but almost impossible to 
catch on a single box.
* The debugging information from the long tail detector showed high IO 
and high CPU time.  (high CPU is relative here, these machines tend to 
be IO bound).
* Kernel stack analysis showed IO completion threads waiting for CPU.
* CPU profiling showed redirty_page_for_writepage() dominating.

From here we made a relatively simple reproduction of the 
redirty_page_for_writepage() part of the problem.  It's a good fix in 
isolation, but we'll have to circle back to see how much of the long 
tail latency issue it solves.

We can livepatch it quickly, but filtering out the long tail latency 
hits for just this one bug is labor intensive, so it'll take a little 
bit of time to get good data.

I've got a v2 of the patch that drops the invalidate, doing a load test 
with fsx this morning and then getting a second xfstests baseline run to 
see if I've added new failures.

-chris

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8ce8720093b9..4a687a2a9ed9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1482,10 +1482,8 @@  iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		pgoff_t end_index = isize >> PAGE_SHIFT;
 
 		/*
-		 * Skip the page if it's fully outside i_size, e.g. due to a
-		 * truncate operation that's in progress. We must redirty the
-		 * page so that reclaim stops reclaiming it. Otherwise
-		 * iomap_vm_releasepage() is called on it and gets confused.
+		 * invalidate the page if it's fully outside i_size, e.g.
+		 * due to a truncate operation that's in progress.
 		 *
 		 * Note that the end_index is unsigned long.  If the given
 		 * offset is greater than 16TB on a 32-bit system then if we
@@ -1499,8 +1497,10 @@  iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		 * offset is just equal to the EOF.
 		 */
 		if (folio->index > end_index ||
-		    (folio->index == end_index && poff == 0))
-			goto redirty;
+		    (folio->index == end_index && poff == 0)) {
+			folio_invalidate(folio, 0, folio_size(folio));
+			goto unlock;
+		}
 
 		/*
 		 * The page straddles i_size.  It must be zeroed out on each
@@ -1518,6 +1518,7 @@  iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 
 redirty:
 	folio_redirty_for_writepage(wbc, folio);
+unlock:
 	folio_unlock(folio);
 	return 0;
 }