Message ID | 20220601011116.495988-1-clm@fb.com (mailing list archive)
---|---
State | New, archived
Series | [RFC] iomap: invalidate pages past eof in iomap_do_writepage()
This does look sane to me.  How much testing did this get?  Especially
for the block size < page size case?  Also adding Dave as he has spent
a lot of time on this code.

On Tue, May 31, 2022 at 06:11:17PM -0700, Chris Mason wrote:
> iomap_do_writepage() sends pages past i_size through
> folio_redirty_for_writepage(), which normally isn't a problem because
> truncate and friends clean them very quickly.
>
> When the system a variety of cgroups,

^^^ This sentence does not parse ^^^

> we can end up in situations where
> one cgroup has almost no dirty pages at all.  This is especially common
> in our XFS workloads in production because they tend to use O_DIRECT
> for everything.
>
> We've hit storms where the redirty path hits millions of times in a few
> seconds, all on a single file that's only ~40 pages long.  This ends up
> leading to long tail latencies for file writes because the page reclaim
> workers are hogging the CPU from some kworkers bound to the same CPU.
>
> That's the theory anyway.  We know the storms exist, but the tail
> latencies are so sporadic that it's hard to have any certainty about
> the cause without patching a large number of boxes.
>
> There are a few different problems here.  First is just that I don't
> understand how invalidating the page instead of redirtying might upset
> the rest of the iomap/xfs world.  Btrfs invalidates in this case, which
> seems like the right thing to me, but we all have slightly different
> sharp corners in the truncate path so I thought I'd ask for comments.
>
> Second is the VM should take wbc->pages_skipped into account, or use
> some other way to avoid looping over and over.  I think we actually
> want both but I wanted to understand the page invalidation path first.
> Signed-off-by: Chris Mason <clm@fb.com>
> Reported-by: Domas Mituzas <domas@fb.com>
> ---
>  fs/iomap/buffered-io.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8ce8720093b9..4a687a2a9ed9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1482,10 +1482,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  	pgoff_t end_index = isize >> PAGE_SHIFT;
>
>  	/*
> -	 * Skip the page if it's fully outside i_size, e.g. due to a
> -	 * truncate operation that's in progress. We must redirty the
> -	 * page so that reclaim stops reclaiming it. Otherwise
> -	 * iomap_vm_releasepage() is called on it and gets confused.
> +	 * invalidate the page if it's fully outside i_size, e.g.
> +	 * due to a truncate operation that's in progress.
>  	 *
>  	 * Note that the end_index is unsigned long. If the given
>  	 * offset is greater than 16TB on a 32-bit system then if we
> @@ -1499,8 +1497,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>  	 * offset is just equal to the EOF.
>  	 */
>  	if (folio->index > end_index ||
> -	    (folio->index == end_index && poff == 0))
> -		goto redirty;
> +	    (folio->index == end_index && poff == 0)) {
> +		folio_invalidate(folio, 0, folio_size(folio));
> +		goto unlock;
> +	}
>
>  	/*
>  	 * The page straddles i_size. It must be zeroed out on each
> @@ -1518,6 +1518,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
>
>  redirty:
>  	folio_redirty_for_writepage(wbc, folio);
> +unlock:
>  	folio_unlock(folio);
>  	return 0;
>  }
> --
> 2.30.2
---end quoted text---
> On Jun 1, 2022, at 8:18 AM, Christoph Hellwig <hch@infradead.org> wrote:
>
> This does look sane to me.  How much testing did this get?

Almost none at all, I made sure the invalidates were triggering and
bashed on it with fsx, but haven't even done xfstests yet.  The first
rule about truncate is that v1 patches are always broken, so I'm
expecting explosions.

> Especially for the block size < page size case?  Also adding Dave as
> he has spent a lot of time on this code.

Sorry Dave, I thought I had you in here already.

> On Tue, May 31, 2022 at 06:11:17PM -0700, Chris Mason wrote:
>> iomap_do_writepage() sends pages past i_size through
>> folio_redirty_for_writepage(), which normally isn't a problem because
>> truncate and friends clean them very quickly.
>>
>> When the system a variety of cgroups,
>
> ^^^ This sentence does not parse ^^^

Most of production is set up with one cgroup tree for the workloads we
love and care about, and a few cgroup trees for everything else.  We
tend to crank down memory or IO limits on the unloved cgroups and
prioritize the workload cgroups.

This problem is hitting our mysql workloads, which are mostly O_DIRECT
on a relatively small number of files.  From a kernel point of view
it's a lot of IO and not much actual resource management.  What's
happening in prod (on an older 5.6 kernel) is the non-mysql cgroups
are blowing past the background dirty threshold, which kicks off the
async writeback workers.

The actual call path is:
wb_workfn()->wb_do_writeback()->wb_check_background_flush()->wb_writeback()->__writeback_inodes_sb()

Johannes explained to me that wb_over_bg_thresh(wb) ends up returning
true on the mysql cgroups because the global background limit has been
reached, even though mysql didn't really contribute much of the dirty.
So we call down into wb_writeback(), which will loop as long as
__writeback_inodes_wb() returns that it's making progress and we're
still globally over the bg threshold.
In prod, bpftrace showed looping on a single inode inside a mysql
cgroup.  That inode was usually in the middle of being deleted, i_size
set to zero, but it still had 40-90 pages sitting in the xarray
waiting for truncation.  We'd loop through the whole call path above
over and over again, mostly because writepages() was returning that
progress had been made on this one inode.  The
redirty_page_for_writepage() path does drop wbc->nr_to_write, so the
rest of the writepages machinery believes real work is being done.
nr_to_write is LONG_MAX, so we've got a while to loop.

I had dreams of posting a trivial reproduction with two cgroups, dd,
and a single file being written and truncated in a loop, which works
pretty well on 5.6 and refuses to be useful upstream.  Johannes and I
talked it over and we still think this patch makes sense, since the
redirty path feels suboptimal.  I'll try to make a better reproduction
as well.

To give an idea of how rare this is, I'd run bpftrace for 300 seconds
at a time on 10K machines and usually find a single machine in the
loop.

-chris
On Wed, Jun 01, 2022 at 02:13:42PM +0000, Chris Mason wrote:
> > On Jun 1, 2022, at 8:18 AM, Christoph Hellwig <hch@infradead.org> wrote:
> >
> > This does look sane to me.  How much testing did this get?
>
> Almost none at all, I made sure the invalidates were triggering
> and bashed on it with fsx, but haven't even done xfstests yet.
> The first rule about truncate is that v1 patches are always
> broken, so I'm expecting explosions.

If there are going to be explosions, it will be on block size < page
size configs, I think.  Otherwise the folio_invalidate() call is
largely a no-op.

> > Especially for the block size < page size case?  Also adding Dave
> > as he has spent a lot of time on this code.
>
> Sorry Dave, I thought I had you in here already.

You did - you got me with linux-xfs@...

> > On Tue, May 31, 2022 at 06:11:17PM -0700, Chris Mason wrote:
> >> iomap_do_writepage() sends pages past i_size through
> >> folio_redirty_for_writepage(), which normally isn't a problem
> >> because truncate and friends clean them very quickly.
> >>
> >> When the system a variety of cgroups,
> >
> > ^^^ This sentence does not parse ^^^
>
> Most of production is set up with one cgroup tree for the workloads
> we love and care about, and a few cgroup trees for everything
> else.  We tend to crank down memory or IO limits on the unloved
> cgroups and prioritize the workload cgroups.
>
> This problem is hitting our mysql workloads, which are mostly
> O_DIRECT on a relatively small number of files.  From a kernel
> point of view it's a lot of IO and not much actual resource
> management.  What's happening in prod (on an older 5.6 kernel) is
> the non-mysql cgroups are blowing past the background dirty
> threshold, which kicks off the async writeback workers.
>
> The actual call path is:
> wb_workfn()->wb_do_writeback()->wb_check_background_flush()->wb_writeback()->__writeback_inodes_sb()
>
> Johannes explained to me that wb_over_bg_thresh(wb) ends up
> returning true on the mysql cgroups because the global background
> limit has been reached, even though mysql didn't really contribute
> much of the dirty.  So we call down into wb_writeback(), which
> will loop as long as __writeback_inodes_wb() returns that it's
> making progress and we're still globally over the bg threshold.

*nod*

> In prod, bpftrace showed looping on a single inode inside a mysql
> cgroup.  That inode was usually in the middle of being deleted,
> i_size set to zero, but it still had 40-90 pages sitting in the
> xarray waiting for truncation.  We'd loop through the whole call
> path above over and over again, mostly because writepages() was
> returning progress had been made on this one inode.  The
> redirty_page_for_writepage() path does drop wbc->nr_to_write, so
> the rest of the writepages machinery believes real work is being
> done.  nr_to_write is LONG_MAX, so we've got a while to loop.

Yup, this code relies on truncate making progress to avoid looping
forever.  Truncate should only block on the page while it locks it and
waits for writeback to complete, then it gets forcibly invalidated and
removed from the page cache.

Ok, so assuming this is the case, why is truncate apparently not
making progress?  I can understand that it might lockstep with
writeback trying to write and then redirtying folios, but that
lockstep pass of ascending folio index should only happen once,
because once truncate locks the folio the page gets removed from the
mapping regardless of whether it is clean or dirty.

Oh.  Aren't there starvation problems with folio_lock() because the
wakeup is not atomic with granting the lock bit?
Hence if something is locking and unlocking a page in quick
succession, a waiter can be woken but by the time it has been
scheduled and running the lock bit has been taken by someone else and
so it goes back to sleep?

And the patch fixes this problem by breaking the lock/unlock cycling
in writeback by not redirtying the folio, and so writeback doesn't
iterate straight back to it and lock it again?

Regardless of the cause, why do we redirty the page there?

/me trolls through git history

commit ff9a28f6c25d18a635abcab1f49db68108203dfb
Author: Jan Kara <jack@suse.cz>
Date:   Thu Mar 14 14:30:54 2013 +0100

    xfs: Fix WARN_ON(delalloc) in xfs_vm_releasepage()

    When a dirty page is truncated from a file but reclaim gets to it
    before truncate_inode_pages(), we hit WARN_ON(delalloc) in
    xfs_vm_releasepage().  This is because reclaim tries to write the
    page, xfs_vm_writepage() just bails out (leaving page clean) and
    thus reclaim thinks it can continue and calls xfs_vm_releasepage()
    on page with dirty buffers.

    Fix the issue by redirtying the page in xfs_vm_writepage().  This
    makes reclaim stop reclaiming the page and also logically it keeps
    page in a more consistent state where page with dirty buffers has
    PageDirty set.

    Signed-off-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    Signed-off-by: Ben Myers <bpm@sgi.com>

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 5f707e537171..3244c988d379 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -953,13 +953,13 @@ xfs_vm_writepage(
 		unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1);

 		/*
-		 * Just skip the page if it is fully outside i_size, e.g. due
-		 * to a truncate operation that is in progress.
+		 * Skip the page if it is fully outside i_size, e.g. due to a
+		 * truncate operation that is in progress. We must redirty the
+		 * page so that reclaim stops reclaiming it. Otherwise
+		 * xfs_vm_releasepage() is called on it and gets confused.
 		 */
-		if (page->index >= end_index + 1 || offset_into_page == 0) {
-			unlock_page(page);
-			return 0;
-		}
+		if (page->index >= end_index + 1 || offset_into_page == 0)
+			goto redirty;

 		/*
 		 * The page straddles i_size. It must be zeroed out on each

Ah, I remember those problems.  Random WARN_ON()s during workloads
under heavy memory pressure and lots of page cache dirtying.

Essentially, this fix was working around the old "writepage to clean
dirty pages at the end of the LRU during memory reclaim" anti-pattern.
Well, that problem doesn't exist anymore - XFS hasn't had a
->writepage method for some time now, so memory reclaim can't clean
dirty pages on XFS filesystems directly and hence can't try to clean
and release dirty pages while a truncate is in progress.

Further, the filesystem extent state coherency problems with
bufferhead caching that triggered in ->releasepage don't exist
anymore, either.  That was one of the exact problems we invented iomap
to solve....

Hence I think we can remove the redirtying completely - it's not
needed and hasn't been for some time.

Further, I don't think we need to invalidate the folio, either.  If
it's beyond EOF, then it is because a truncate is in progress that
means it is somebody else's problem to clean up.  Hence we should
leave it to the truncate to deal with, just like the pre-2013 code
did....

Cheers,

Dave.
On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote:
> On Wed, Jun 01, 2022 at 02:13:42PM +0000, Chris Mason wrote:
> > In prod, bpftrace showed looping on a single inode inside a mysql
> > cgroup.  That inode was usually in the middle of being deleted,
> > i_size set to zero, but it still had 40-90 pages sitting in the
> > xarray waiting for truncation.  We'd loop through the whole call
> > path above over and over again, mostly because writepages() was
> > returning progress had been made on this one inode.  The
> > redirty_page_for_writepage() path does drop wbc->nr_to_write, so
> > the rest of the writepages machinery believes real work is being
> > done.  nr_to_write is LONG_MAX, so we've got a while to loop.
>
> Yup, this code relies on truncate making progress to avoid looping
> forever.  Truncate should only block on the page while it locks it
> and waits for writeback to complete, then it gets forcibly
> invalidated and removed from the page cache.

It's not looping forever, truncate can just take a relatively long
time during which the flusher is busy-spinning full bore on a
relatively small number of unflushable pages (range_cyclic).

But you raise a good point asking "why is truncate stuck?".  I first
thought they might be cannibalizing each other over the page locks,
but that wasn't it (and wouldn't explain the clear asymmetry between
truncate and flusher).  That leaves the waiting for writeback.  I just
confirmed with tracing that that's exactly where truncate sits while
the flusher goes bananas on the same inode.  So the race must be this:

truncate:                    flusher:
                             put a subset of pages under writeback
i_size_write(0)
wait_on_page_writeback()
                             loop with range_cyclic over remaining
                             dirty >EOF pages

> Hence I think we can remove the redirtying completely - it's not
> needed and hasn't been for some time.
>
> Further, I don't think we need to invalidate the folio,
> either.  If it's beyond EOF, then it is because a truncate is in
> progress that means it is somebody else's problem to clean up.
> Hence we should leave it to the truncate to deal with, just like
> the pre-2013 code did....

Perfect, that works.
> On Jun 2, 2022, at 11:32 AM, Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote: >> >> Further, I don't think we need to invalidate the folio, either. If >> it's beyond EOF, then it is because a truncate is in progress that >> means it is somebody else's problem to clean up. Hence we should >> leave it to the truncate to deal with, just like the pre-2013 code >> did.... > > Perfect, that works. Ok, I’ll run it through xfstests and resend a v2. Will this cover all of the subpage blocksize concerns? -chris
On Thu, Jun 02, 2022 at 07:41:55PM +0000, Chris Mason wrote: > > > On Jun 2, 2022, at 11:32 AM, Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote: > >> > >> Further, I don't think we need to invalidate the folio, either. If > >> it's beyond EOF, then it is because a truncate is in progress that > >> means it is somebody else's problem to clean up. Hence we should > >> leave it to the truncate to deal with, just like the pre-2013 code > >> did.... > > > > Perfect, that works. > > Ok, I’ll run it through xfstests and resend a v2. Will this cover all of the subpage blocksize concerns? I run with: export MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024" Dave may have other options he'd like to see.
On Thu, Jun 02, 2022 at 11:32:30AM -0400, Johannes Weiner wrote: > On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote: > > On Wed, Jun 01, 2022 at 02:13:42PM +0000, Chris Mason wrote: > > > In prod, bpftrace showed looping on a single inode inside a mysql > > > cgroup. That inode was usually in the middle of being deleted, > > > i_size set to zero, but it still had 40-90 pages sitting in the > > > xarray waiting for truncation. We’d loop through the whole call > > > path above over and over again, mostly because writepages() was > > > returning progress had been made on this one inode. The > > > redirty_page_for_writepage() path does drop wbc->nr_to_write, so > > > the rest of the writepages machinery believes real work is being > > > done. nr_to_write is LONG_MAX, so we’ve got a while to loop. > > > > Yup, this code relies on truncate making progress to avoid looping > > forever. Truncate should only block on the page while it locks it > > and waits for writeback to complete, then it gets forcibly > > invalidated and removed from the page cache. > > It's not looping forever, truncate can just take a relatively long > time during which the flusher is busy-spinning full bore on a > relatively small number of unflushable pages (range_cyclic). > > But you raise a good point asking "why is truncate stuck?". I first > thought they might be cannibalizing each other over the page locks, > but that wasn't it (and wouldn't explain the clear asymmetry between > truncate and flusher). That leaves the waiting for writeback. I just > confirmed with tracing that that's exactly where truncate sits while > the flusher goes bananas on the same inode. So the race must be this: > > truncate: flusher > put a subset of pages under writeback > i_size_write(0) > wait_on_page_writeback() > loop with range_cyclic over remaining dirty >EOF pages But write_cache_pages() doesn't repeatedly loop over the pages. 
The flusher is

->writepages
  iomap_writepages
    write_cache_pages()
      loop over mapping tree
        lock page
        iomap_do_writepage
          set_page_writeback()
          add page to ioend
      <end of mapping reached>
    iomap_submit_ioend()
    <pages under writeback get sent for IO>
return to high level writeback

And eventually IO completion will clear page writeback state.

i.e. write_cache_pages() should not be hard looping over the pages
beyond EOF even if range_cyclic is set - it's skipping those pages,
submitting any that are under writeback, and then going back to high
level code for it to make a decision about continuation of writeback.
It may call back down and we loop over dirty pages beyond EOF again,
but the flusher should not be holding on to pages under writeback for
any significant length of time before they are submitted for IO.

IOWs, if truncate is getting stuck waiting on writeback, then that
implies something is holding up IO completions for a long time, not
that there's a problem in writeback submission.  i.e. you might
actually be looking at a workqueue backlog or scheduling starvation
problem here preventing IO completion from clearing writeback
state....

> > Hence I think we can remove the redirtying completely - it's not
> > needed and hasn't been for some time.
> >
> > Further, I don't think we need to invalidate the folio, either.
> > If it's beyond EOF, then it is because a truncate is in progress
> > that means it is somebody else's problem to clean up.  Hence we
> > should leave it to the truncate to deal with, just like the
> > pre-2013 code did....
>
> Perfect, that works.

If there's actually an IO completion latency problem, this will not
fix it - it'll just hide the soft-lockup symptom that the pages stuck
in writeback are manifested through this path.

Cheers,

Dave.
On Thu, Jun 02, 2022 at 08:59:01PM +0100, Matthew Wilcox wrote: > On Thu, Jun 02, 2022 at 07:41:55PM +0000, Chris Mason wrote: > > > > > On Jun 2, 2022, at 11:32 AM, Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote: > > >> > > >> Further, I don't think we need to invalidate the folio, either. If > > >> it's beyond EOF, then it is because a truncate is in progress that > > >> means it is somebody else's problem to clean up. Hence we should > > >> leave it to the truncate to deal with, just like the pre-2013 code > > >> did.... > > > > > > Perfect, that works. > > > > Ok, I’ll run it through xfstests and resend a v2. Will this cover all of the subpage blocksize concerns? > > I run with: > export MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024" > > Dave may have other options he'd like to see. No, that above is what I'd suggest. Cheers, Dave.
> On Jun 2, 2022, at 6:06 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Jun 02, 2022 at 11:32:30AM -0400, Johannes Weiner wrote: >> On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote: >>> On Wed, Jun 01, 2022 at 02:13:42PM +0000, Chris Mason wrote: >>>> In prod, bpftrace showed looping on a single inode inside a mysql >>>> cgroup. That inode was usually in the middle of being deleted, >>>> i_size set to zero, but it still had 40-90 pages sitting in the >>>> xarray waiting for truncation. We’d loop through the whole call >>>> path above over and over again, mostly because writepages() was >>>> returning progress had been made on this one inode. The >>>> redirty_page_for_writepage() path does drop wbc->nr_to_write, so >>>> the rest of the writepages machinery believes real work is being >>>> done. nr_to_write is LONG_MAX, so we’ve got a while to loop. >>> >>> Yup, this code relies on truncate making progress to avoid looping >>> forever. Truncate should only block on the page while it locks it >>> and waits for writeback to complete, then it gets forcibly >>> invalidated and removed from the page cache. >> >> It's not looping forever, truncate can just take a relatively long >> time during which the flusher is busy-spinning full bore on a >> relatively small number of unflushable pages (range_cyclic). >> >> But you raise a good point asking "why is truncate stuck?". I first >> thought they might be cannibalizing each other over the page locks, >> but that wasn't it (and wouldn't explain the clear asymmetry between >> truncate and flusher). That leaves the waiting for writeback. I just >> confirmed with tracing that that's exactly where truncate sits while >> the flusher goes bananas on the same inode. 
>> So the race must be this:
>>
>> truncate:                    flusher:
>>                              put a subset of pages under writeback
>> i_size_write(0)
>> wait_on_page_writeback()
>>                              loop with range_cyclic over remaining
>>                              dirty >EOF pages
>
> But write_cache_pages() doesn't repeatedly loop over the pages.
>
> The flusher is
>
> ->writepages
>   iomap_writepages
>     write_cache_pages()
>       loop over mapping tree
>         lock page
>         iomap_do_writepage
>           set_page_writeback()
>           add page to ioend
>       <end of mapping reached>
>     iomap_submit_ioend()
>     <pages under writeback get sent for IO>
> return to high level writeback
>
> And eventually IO completion will clear page writeback state.

Yes, this is actually happening before the truncate starts.  The
truncate finds these writeback pages and waits for them to finish IO,
and while it's waiting wb_check_background_flush() goes wild on the
redirty path.

> i.e. write_cache_pages() should not be hard looping over the pages
> beyond EOF even if range_cyclic is set - it's skipping those pages,
> submitting any that are under writeback, and then going back to
> high level code for it to make a decision about continuation of
> writeback.  It may call back down and we loop over dirty pages
> beyond EOF again, but the flusher should not be holding on to pages
> under writeback for any significant length of time before they are
> submitted for IO.

I spent a while trying to blame write_cache_pages() for looping
repeatedly, and ended up making a series of bpftrace scripts that
collected call stack frequency counters for all the different ways to
wander into write_cache_pages().  I eventually ran it across 100K
systems to figure out exactly how we were getting into trouble.  It
was (thankfully) really consistent.  As you describe above, the loops
are definitely coming from higher in the stack.
wb_writeback() will loop as long as __writeback_inodes_wb() returns
that it's making progress and we're still globally over the bg
threshold, so write_cache_pages() is just being called over and over
again.  We're coming from wb_check_background_flush(), so:

	struct wb_writeback_work work = {
		.nr_pages	= LONG_MAX,
		.sync_mode	= WB_SYNC_NONE,
		.for_background	= 1,
		.range_cyclic	= 1,
		.reason		= WB_REASON_BACKGROUND,
	};

> IOWs, if truncate is getting stuck waiting on writeback, then that
> implies something is holding up IO completions for a long time,

From Johannes's tracing today, that's about 17us.  Our victim cgroup
has just a handful of files, so we can burn through a lot of
write_cache_pages loops in the time truncate is waiting for a single
IO on a relatively fast ssd.

> not that there's a problem in writeback submission.  i.e. you might
> actually be looking at a workqueue backlog

I actually think a workqueue backlog is coming from the flusher thread
hogging the CPU.  The investigation started by looking for long tail
latencies in write() system calls, and Domas's original finding was IO
completion workers were waiting for CPUs.  That's how he ended up
finding the redirty calls using high levels of CPU.  We honestly won't
be sure until we live patch a lot of boxes and look for long tail
latency improvements, but I'm pretty optimistic.

-chris
[ Chris, can you line wrap your emails at 72 columns, please? ]

On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote:
> > On Jun 2, 2022, at 6:06 PM, Dave Chinner <david@fromorbit.com>
> > wrote: On Thu, Jun 02, 2022 at 11:32:30AM -0400, Johannes Weiner
> > wrote:
> >> On Thu, Jun 02, 2022 at 04:52:52PM +1000, Dave Chinner wrote:
> >>> On Wed, Jun 01, 2022 at 02:13:42PM +0000, Chris Mason wrote:
> >>>> In prod, bpftrace showed looping on a single inode inside a
> >>>> mysql cgroup.  That inode was usually in the middle of being
> >>>> deleted, i_size set to zero, but it still had 40-90 pages
> >>>> sitting in the xarray waiting for truncation.  We'd loop
> >>>> through the whole call path above over and over again, mostly
> >>>> because writepages() was returning progress had been made on
> >>>> this one inode.  The redirty_page_for_writepage() path does
> >>>> drop wbc->nr_to_write, so the rest of the writepages
> >>>> machinery believes real work is being done.  nr_to_write is
> >>>> LONG_MAX, so we've got a while to loop.
> >>>
> >>> Yup, this code relies on truncate making progress to avoid
> >>> looping forever.  Truncate should only block on the page while
> >>> it locks it and waits for writeback to complete, then it gets
> >>> forcibly invalidated and removed from the page cache.
> >>
> >> It's not looping forever, truncate can just take a relatively
> >> long time during which the flusher is busy-spinning full bore
> >> on a relatively small number of unflushable pages
> >> (range_cyclic).
> >>
> >> But you raise a good point asking "why is truncate stuck?".  I
> >> first thought they might be cannibalizing each other over the
> >> page locks, but that wasn't it (and wouldn't explain the clear
> >> asymmetry between truncate and flusher).  That leaves the
> >> waiting for writeback.  I just confirmed with tracing that
> >> that's exactly where truncate sits while the flusher goes
> >> bananas on the same inode.
> >> So the race must be this:
> >>
> >> truncate:                    flusher:
> >>                              put a subset of pages under writeback
> >> i_size_write(0)
> >> wait_on_page_writeback()
> >>                              loop with range_cyclic over remaining
> >>                              dirty >EOF pages
> >
> > But write_cache_pages() doesn't repeatedly loop over the pages.
> >
> > The flusher is
> >
> > ->writepages
> >   iomap_writepages
> >     write_cache_pages()
> >       loop over mapping tree
> >         lock page
> >         iomap_do_writepage
> >           set_page_writeback()
> >           add page to ioend
> >       <end of mapping reached>
> >     iomap_submit_ioend()
> >     <pages under writeback get sent for IO>
> > return to high level writeback
> >
> > And eventually IO completion will clear page writeback state.
>
> Yes, this is actually happening before the truncate starts.  The
> truncate finds these writeback pages and waits for them to finish
> IO, and while it's waiting wb_check_background_flush() goes wild
> on the redirty path.

I still don't think the redirty path is the underlying problem.  Yes,
it triggers it, but it looks to be triggering an existing behaviour
that the writeback path should not have...

> > i.e. write_cache_pages() should not be hard looping over the
> > pages beyond EOF even if range_cyclic is set - it's skipping
> > those pages, submitting any that are under writeback, and then
> > going back to high level code for it to make a decision about
> > continuation of writeback.  It may call back down and we loop
> > over dirty pages beyond EOF again, but the flusher should not be
> > holding on to pages under writeback for any significant length
> > of time before they are submitted for IO.
>
> I spent a while trying to blame write_cache_pages() for looping
> repeatedly, and ended up making a series of bpftrace scripts that
> collected call stack frequency counters for all the different ways
> to wander into write_cache_pages().  I eventually ran it across
> 100K systems to figure out exactly how we were getting into
> trouble.  It was (thankfully) really consistent.
>
> As you describe above, the loops are definitely coming from higher
> in the stack.  wb_writeback() will loop as long as
> __writeback_inodes_wb() returns that it's making progress and
> we're still globally over the bg threshold, so write_cache_pages()
> is just being called over and over again.  We're coming from
> wb_check_background_flush(), so:
>
> struct wb_writeback_work work = {
> 	.nr_pages	= LONG_MAX,
> 	.sync_mode	= WB_SYNC_NONE,
> 	.for_background	= 1,
> 	.range_cyclic	= 1,
> 	.reason		= WB_REASON_BACKGROUND,
> };

Sure, but we end up in writeback_sb_inodes() which does this after
the __writeback_single_inode()->do_writepages() call that iterates
the dirty pages:

	if (need_resched()) {
		/*
		 * We're trying to balance between building up a nice
		 * long list of IOs to improve our merge rate, and
		 * getting those IOs out quickly for anyone throttling
		 * in balance_dirty_pages().  cond_resched() doesn't
		 * unplug, so get our IOs out the door before we
		 * give up the CPU.
		 */
		blk_flush_plug(current->plug, false);
		cond_resched();
	}

So if there is a pending IO completion on this CPU on a work queue
here, we'll reschedule to it because the work queue kworkers are
bound to CPUs and they take priority over user threads.

Also, this then requeues the inode on the b_more_io queue, and
wb_check_background_flush() won't come back to it until all other
inodes on all other superblocks on the bdi have had writeback
attempted.  So if the system truly is over the background dirty
threshold, why is writeback getting stuck on this one inode in this
way?

> > IOWs, if truncate is getting stuck waiting on writeback, then
> > that implies something is holding up IO completions for a long
> > time,
>
> From Johannes's tracing today, that's about 17us.

Which means there should only be a delay of 17us between IO
completion being queued on this CPU and the flusher thread being
preempted and completion being run, right?
> Our victim cgroup has just a handful of files, so we can burn
> through a lot of write_cache_pages loops in the time truncate is
> waiting for a single IO on a relatively fast ssd.

Yes, I'm not denying that, but background writeback when over global
dirty thresholds should not be spinning on a single inode with
50-100 pages attached to it. It should visit it, try to flush it,
and then move on to the next.

> > not
> > that there's a problem in writeback submission. i.e. you might
> > actually be looking at a workqueue backlog
>
> I actually think a workqueue backlog is coming from the flusher
> thread hogging the CPU.

I can't see how that happens with that need_resched() check in
writeback_sb_inodes().

> The investigation started by looking for long tail latencies in
> write() system calls, and Domas’s original finding was IO
> completion workers were waiting for CPUs. That’s how he ended up
> finding the redirty calls using high levels of CPU. We honestly
> won’t be sure until we live patch a lot of boxes and look for long
> tail latency improvements, but I’m pretty optimistic.

Something still doesn't add up. I don't see how submission spinning
is holding off completion because it plays nice with resched checks.

You're now saying that this started because someone found long tail
completion latencies, and that explains truncate getting stuck. But
AFAICT the writeback behaviour isn't responsible for completion
latencies as writeback will give up the CPU to any other thread
scheduled to run on that CPU pretty quickly.

It feels like there is still some unknown behaviour here, and it
still smells to me like IO completions are getting backed up by
something. I wonder - are there large files getting written using
buffered IO on these machines? Commit ebb7fb1557b1 ("xfs, iomap:
limit individual ioend chain lengths in writeback") might be
relevant if that is the case....

Cheers,

Dave.
Hello Dave,

On Fri, Jun 03, 2022 at 03:20:47PM +1000, Dave Chinner wrote:
> On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote:
> > As you describe above, the loops are definitely coming from higher
> > in the stack. wb_writeback() will loop as long as
> > __writeback_inodes_wb() returns that it’s making progress and
> > we’re still globally over the bg threshold, so write_cache_pages()
> > is just being called over and over again. We’re coming from
> > wb_check_background_flush(), so:
> >
> >	struct wb_writeback_work work = {
> >		.nr_pages	= LONG_MAX,
> >		.sync_mode	= WB_SYNC_NONE,
> >		.for_background	= 1,
> >		.range_cyclic	= 1,
> >		.reason		= WB_REASON_BACKGROUND,
> >	};
>
> Sure, but we end up in writeback_sb_inodes() which does this after
> the __writeback_single_inode()->do_writepages() call that iterates
> the dirty pages:
>
>	if (need_resched()) {
>		/*
>		 * We're trying to balance between building up a nice
>		 * long list of IOs to improve our merge rate, and
>		 * getting those IOs out quickly for anyone throttling
>		 * in balance_dirty_pages(). cond_resched() doesn't
>		 * unplug, so get our IOs out the door before we
>		 * give up the CPU.
>		 */
>		blk_flush_plug(current->plug, false);
>		cond_resched();
>	}
>
> So if there is a pending IO completion on this CPU on a work queue
> here, we'll reschedule to it because the work queue kworkers are
> bound to CPUs and they take priority over user threads.

The flusher thread is also a kworker, though. So it may hit this
cond_resched(), but it doesn't yield until the timeslice expires.

> Also, this then requeues the inode on the b_more_io queue, and
> wb_check_background_flush() won't come back to it until all other
> inodes on all other superblocks on the bdi have had writeback
> attempted. So if the system truly is over the background dirty
> threshold, why is writeback getting stuck on this one inode in this
> way?

The explanation for this part at least is that the bdi/flush domain
is split per cgroup.
The cgroup in question is over its proportional bg thresh. It has very few dirty pages, but it also has very few *dirtyable* pages, which makes for a high dirty ratio. And those handful of dirty pages are the unflushable ones past EOF. There is no next inode to move onto on subsequent loops.
[ From a different message, Dave asks wtf my email client was doing. Thanks Dave, apparently exchange is being exchangey with base64 in unpredictable ways. This was better in my test reply, lets see. ] On 6/3/22 11:06 AM, Johannes Weiner wrote: > Hello Dave, > > On Fri, Jun 03, 2022 at 03:20:47PM +1000, Dave Chinner wrote: >> On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote: >>> As you describe above, the loops are definitely coming from higher >>> in the stack. wb_writeback() will loop as long as >>> __writeback_inodes_wb() returns that it’s making progress and >>> we’re still globally over the bg threshold, so write_cache_pages() >>> is just being called over and over again. We’re coming from >>> wb_check_background_flush(), so: >>> >>> struct wb_writeback_work work = { >>> .nr_pages = LONG_MAX, >>> .sync_mode = WB_SYNC_NONE, >>> .for_background = 1, >>> .range_cyclic = 1, >>> .reason = WB_REASON_BACKGROUND, >>> }; >> >> Sure, but we end up in writeback_sb_inodes() which does this after >> the __writeback_single_inode()->do_writepages() call that iterates >> the dirty pages: >> >> if (need_resched()) { >> /* >> * We're trying to balance between building up a nice >> * long list of IOs to improve our merge rate, and >> * getting those IOs out quickly for anyone throttling >> * in balance_dirty_pages(). cond_resched() doesn't >> * unplug, so get our IOs out the door before we >> * give up the CPU. >> */ >> blk_flush_plug(current->plug, false); >> cond_resched(); >> } >> >> So if there is a pending IO completion on this CPU on a work queue >> here, we'll reschedule to it because the work queue kworkers are >> bound to CPUs and they take priority over user threads. > > The flusher thread is also a kworker, though. So it may hit this > cond_resched(), but it doesn't yield until the timeslice expires. > Just to underline this, the long tail latencies aren't softlockups or major explosions. 
It's just suboptimal enough that different metrics and dashboards noticed it. -chris
On Fri, Jun 03, 2022 at 12:09:06PM -0400, Chris Mason wrote: > [ From a different message, Dave asks wtf my email client was doing. Thanks > Dave, apparently exchange is being exchangey with base64 in unpredictable > ways. This was better in my test reply, lets see. ] > > On 6/3/22 11:06 AM, Johannes Weiner wrote: > > Hello Dave, > > > > On Fri, Jun 03, 2022 at 03:20:47PM +1000, Dave Chinner wrote: > > > On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote: > > > > As you describe above, the loops are definitely coming from higher > > > > in the stack. wb_writeback() will loop as long as > > > > __writeback_inodes_wb() returns that it’s making progress and > > > > we’re still globally over the bg threshold, so write_cache_pages() > > > > is just being called over and over again. We’re coming from > > > > wb_check_background_flush(), so: > > > > > > > > struct wb_writeback_work work = { > > > > .nr_pages = LONG_MAX, > > > > .sync_mode = WB_SYNC_NONE, > > > > .for_background = 1, > > > > .range_cyclic = 1, > > > > .reason = WB_REASON_BACKGROUND, > > > > }; > > > > > > Sure, but we end up in writeback_sb_inodes() which does this after > > > the __writeback_single_inode()->do_writepages() call that iterates > > > the dirty pages: > > > > > > if (need_resched()) { > > > /* > > > * We're trying to balance between building up a nice > > > * long list of IOs to improve our merge rate, and > > > * getting those IOs out quickly for anyone throttling > > > * in balance_dirty_pages(). cond_resched() doesn't > > > * unplug, so get our IOs out the door before we > > > * give up the CPU. > > > */ > > > blk_flush_plug(current->plug, false); > > > cond_resched(); > > > } > > > > > > So if there is a pending IO completion on this CPU on a work queue > > > here, we'll reschedule to it because the work queue kworkers are > > > bound to CPUs and they take priority over user threads. > > > > The flusher thread is also a kworker, though. 
> > So it may hit this
> > cond_resched(), but it doesn't yield until the timeslice expires.

17us or 10ms, it doesn't matter. The fact is the writeback thread
will give up the CPU long before the latency durations (seconds)
that were reported upthread are seen. Writeback spinning can not
explain why truncate is not making progress - everything points to
it being a downstream symptom, not a cause.

Also important to note, as we are talking about kworker scheduling
hold-offs, the writeback flusher work is unbound (can run on any
CPU), whilst the IO completion workers in XFS are per-CPU and bound
to individual CPUs. Bound kernel tasks usually take run queue
priority on a CPU over unbound and/or user tasks that can be punted
to a different CPU. So, again, with this taken into account, I don't
see how the flusher thread consuming CPU would cause long duration
hold-offs of IO completion work....

> Just to underline this, the long tail latencies aren't softlockups
> or major explosions. It's just suboptimal enough that different
> metrics and dashboards noticed it.

Sure, but you've brought a problem we don't understand the root
cause of to my attention. I want to know what the root cause is so
that I can determine that there are no other unknown underlying
issues that are contributing to this issue. Nothing found so far
explains why truncate is not making progress and that's what *I*
need to understand here.

Sure, go ahead and patch your systems to get rid of the bad
writeback looping behaviour, but that does not explain or solve the
underlying long IO completion tail latency issue that originally
led to finding writeback spinning.

Cheers,

Dave.
Hello, On Mon, Jun 06, 2022 at 09:32:13AM +1000, Dave Chinner wrote: > On Fri, Jun 03, 2022 at 12:09:06PM -0400, Chris Mason wrote: > > On 6/3/22 11:06 AM, Johannes Weiner wrote: > > > On Fri, Jun 03, 2022 at 03:20:47PM +1000, Dave Chinner wrote: > > > > On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote: > > > > > As you describe above, the loops are definitely coming from higher > > > > > in the stack. wb_writeback() will loop as long as > > > > > __writeback_inodes_wb() returns that it’s making progress and > > > > > we’re still globally over the bg threshold, so write_cache_pages() > > > > > is just being called over and over again. We’re coming from > > > > > wb_check_background_flush(), so: > > > > > > > > > > struct wb_writeback_work work = { > > > > > .nr_pages = LONG_MAX, > > > > > .sync_mode = WB_SYNC_NONE, > > > > > .for_background = 1, > > > > > .range_cyclic = 1, > > > > > .reason = WB_REASON_BACKGROUND, > > > > > }; > > > > > > > > Sure, but we end up in writeback_sb_inodes() which does this after > > > > the __writeback_single_inode()->do_writepages() call that iterates > > > > the dirty pages: > > > > > > > > if (need_resched()) { > > > > /* > > > > * We're trying to balance between building up a nice > > > > * long list of IOs to improve our merge rate, and > > > > * getting those IOs out quickly for anyone throttling > > > > * in balance_dirty_pages(). cond_resched() doesn't > > > > * unplug, so get our IOs out the door before we > > > > * give up the CPU. > > > > */ > > > > blk_flush_plug(current->plug, false); > > > > cond_resched(); > > > > } > > > > > > > > So if there is a pending IO completion on this CPU on a work queue > > > > here, we'll reschedule to it because the work queue kworkers are > > > > bound to CPUs and they take priority over user threads. > > > > > > The flusher thread is also a kworker, though. So it may hit this > > > cond_resched(), but it doesn't yield until the timeslice expires. 
> 17us or 10ms, it doesn't matter. The fact is the writeback thread
> will give up the CPU long before the latency durations (seconds)
> that were reported upthread are seen. Writeback spinning can not
> explain why truncate is not making progress - everything points to
> it being a downstream symptom, not a cause.

Chris can clarify, but I don't remember second-long latencies being
mentioned. Rather sampling periods of multiple seconds during which
the spin bursts occur multiple times.

> Also important to note, as we are talking about kworker scheduling
> hold-offs, the writeback flusher work is unbound (can run on any
> CPU), whilst the IO completion workers in XFS are per-CPU and bound
> to individual CPUs. Bound kernel tasks usually take run queue
> priority on a CPU over unbound and/or user tasks that can be punted
> to a different CPU.

Is that actually true? I'm having trouble finding the corresponding
code in the scheduler.

That said, I'm not sure it matters that much. Even if you take CPU
contention out of the equation entirely, I think we agree it's not a
good idea (from a climate POV) to have CPUs busywait on IO. Even if
that IO is just an ordinary wait_on_page_writeback() on a fast
drive. So if we can get rid of the redirtying, and it sounds like we
can, IMO we should just go ahead and do so.

> > Just to underline this, the long tail latencies aren't softlockups
> > or major explosions. It's just suboptimal enough that different
> > metrics and dashboards noticed it.
>
> Sure, but you've brought a problem we don't understand the root
> cause of to my attention. I want to know what the root cause is so
> that I can determine that there are no other unknown underlying
> issues that are contributing to this issue.

It seems to me we're just not on the same page on what the reported
bug is. From my POV, there currently isn't a missing piece in this
puzzle. But Chris worked closer with the prod folks on this, so I'll
leave it to him :)
On 6/6/22 10:46 AM, Johannes Weiner wrote:
> Hello,
>
> On Mon, Jun 06, 2022 at 09:32:13AM +1000, Dave Chinner wrote:
> > Sure, but you've brought a problem we don't understand the root
> > cause of to my attention. I want to know what the root cause is so
> > that I can determine that there are no other unknown underlying
> > issues that are contributing to this issue.
>
> It seems to me we're just not on the same page on what the reported
> bug is. From my POV, there currently isn't a missing piece in this
> puzzle. But Chris worked closer with the prod folks on this, so I'll
> leave it to him :)

The basic description of the investigation:

* Multiple hits per hour per 100K machines, but almost impossible to
  catch on a single box.
* The debugging information from the long tail detector showed high
  IO and high CPU time. (High CPU is relative here; these machines
  tend to be IO bound.)
* Kernel stack analysis showed IO completion threads waiting for CPU.
* CPU profiling showed redirty_page_for_writepage() dominating.

From here we made a relatively simple reproduction of the
redirty_page_for_writepage() part of the problem. It's a good fix in
isolation, but we'll have to circle back to see how much of the long
tail latency issue it solves. We can livepatch it quickly, but
filtering out the long tail latency hits for just this one bug is
labor intensive, so it'll take a little bit of time to get good
data.

I've got a v2 of the patch that drops the invalidate; I'm doing a
load test with fsx this morning and then getting a second xfstests
baseline run to see if I've added new failures.

-chris
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8ce8720093b9..4a687a2a9ed9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1482,10 +1482,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		pgoff_t end_index = isize >> PAGE_SHIFT;
 
 		/*
-		 * Skip the page if it's fully outside i_size, e.g. due to a
-		 * truncate operation that's in progress. We must redirty the
-		 * page so that reclaim stops reclaiming it. Otherwise
-		 * iomap_vm_releasepage() is called on it and gets confused.
+		 * invalidate the page if it's fully outside i_size, e.g.
+		 * due to a truncate operation that's in progress.
 		 *
 		 * Note that the end_index is unsigned long. If the given
 		 * offset is greater than 16TB on a 32-bit system then if we
@@ -1499,8 +1497,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		 * offset is just equal to the EOF.
 		 */
 		if (folio->index > end_index ||
-		    (folio->index == end_index && poff == 0))
-			goto redirty;
+		    (folio->index == end_index && poff == 0)) {
+			folio_invalidate(folio, 0, folio_size(folio));
+			goto unlock;
+		}
 
 		/*
 		 * The page straddles i_size. It must be zeroed out on each
@@ -1518,6 +1518,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 
 redirty:
 	folio_redirty_for_writepage(wbc, folio);
+unlock:
 	folio_unlock(folio);
 	return 0;
 }
iomap_do_writepage() sends pages past i_size through
folio_redirty_for_writepage(), which normally isn't a problem because
truncate and friends clean them very quickly.

When the system runs a variety of cgroups, we can end up in
situations where one cgroup has almost no dirty pages at all. This is
especially common in our XFS workloads in production because they
tend to use O_DIRECT for everything.

We've hit storms where the redirty path hits millions of times in a
few seconds, all on a single file that's only ~40 pages long. This
ends up leading to long tail latencies for file writes because the
page reclaim workers are hogging the CPU from some kworkers bound to
the same CPU.

That's the theory anyway. We know the storms exist, but the tail
latencies are so sporadic that it's hard to have any certainty about
the cause without patching a large number of boxes.

There are a few different problems here. First is just that I don't
understand how invalidating the page instead of redirtying might
upset the rest of the iomap/xfs world. Btrfs invalidates in this
case, which seems like the right thing to me, but we all have
slightly different sharp corners in the truncate path so I thought
I'd ask for comments.

Second is that the VM should take wbc->pages_skipped into account, or
use some other way to avoid looping over and over. I think we
actually want both, but I wanted to understand the page invalidation
path first.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Domas Mituzas <domas@fb.com>
---
 fs/iomap/buffered-io.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)