
[v4,1/1] mm: vmscan: Reduce throttling due to a failure to make progress

Message ID 20211202150614.22440-1-mgorman@techsingularity.net (mailing list archive)
State New, archived

Commit Message

Mel Gorman Dec. 2, 2021, 3:06 p.m. UTC
Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
problems due to reclaim throttling for excessive lengths of time.
In Alexey's case, a memory hog that should go OOM quickly stalls for
several minutes before the OOM killer fires. In Mike and Darrick's cases, a small
memcg environment stalled excessively even though the system had enough
memory overall.

Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being
made") introduced the problem although commit a19594ca4a8b ("mm/vmscan:
increase the timeout if page reclaim is not making progress") made it
worse. Systems at or near an OOM state that cannot be recovered must
reach OOM quickly and memcg should kill tasks if a memcg is near OOM.

To address this, only stall for the first zone in the zonelist, reduce
the timeout to 1 tick for VMSCAN_THROTTLE_NOPROGRESS and only stall if
the scan control nr_reclaimed is 0, kswapd is still active and there were
excessive pages pending for writeback. If kswapd has stopped reclaiming due
to excessive failures, do not stall at all so that OOM triggers relatively
quickly. Similarly, if an LRU is simply congested, only lightly throttle
similar to NOPROGRESS.

Alexey's original case was the most straightforward:

	for i in {1..3}; do tail /dev/zero; done

On vanilla 5.16-rc1, this test stalled heavily; after the patch, the test
completes in a few seconds, similar to 5.15.

Alexey's second test case added watching a youtube video while tail runs
10 times. On 5.15, playback only jitters slightly; 5.16-rc1 stalls a lot,
with many frames missing and numerous audio glitches. With this patch
applied, the video plays similarly to 5.15.

Link: https://lore.kernel.org/r/99e779783d6c7fce96448a3402061b9dc1b3b602.camel@gmx.de
Link: https://lore.kernel.org/r/20211124011954.7cab9bb4@mail.inbox.lv
Link: https://lore.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net

[lkp@intel.com: Fix W=1 build warning]
Reported-and-tested-by: Alexey Avramov <hakavlad@inbox.lv>
Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
Reported-and-tested-by: Darrick J. Wong <djwong@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being made")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h        |  1 +
 include/trace/events/vmscan.h |  4 ++-
 mm/vmscan.c                   | 64 ++++++++++++++++++++++++++++++-----
 3 files changed, 59 insertions(+), 10 deletions(-)

Comments

Shakeel Butt Dec. 2, 2021, 4:30 p.m. UTC | #1
Hi Mel,

On Thu, Dec 2, 2021 at 7:07 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
> problems due to reclaim throttling for excessive lengths of time.
> In Alexey's case, a memory hog that should go OOM quickly stalls for
> several minutes before stalling. In Mike and Darrick's cases, a small
> memcg environment stalled excessively even though the system had enough
> memory overall.
>
> Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being
> made") introduced the problem although commit a19594ca4a8b ("mm/vmscan:
> increase the timeout if page reclaim is not making progress") made it
> worse. Systems at or near an OOM state that cannot be recovered must
> reach OOM quickly and memcg should kill tasks if a memcg is near OOM.
>

Is there a reason we can't simply revert 69392a403f49 instead of adding
more code/heuristics? Looking more into 69392a403f49, I don't think the
code and commit message are in sync.

For the memcg reclaim, instead of just removing congestion_wait or
replacing it with schedule_timeout in mem_cgroup_force_empty(), why
change the behavior of all memcg reclaim. Also this patch effectively
reverts that behavior of 69392a403f49.

For direct reclaimers under global pressure, why is page allocator a bad
place for stalling on no progress reclaim? IMHO the callers of the
reclaim should decide what to do if reclaim is not making progress.

thanks,
Shakeel
Mel Gorman Dec. 2, 2021, 4:52 p.m. UTC | #2
On Thu, Dec 02, 2021 at 08:30:51AM -0800, Shakeel Butt wrote:
> Hi Mel,
> 
> On Thu, Dec 2, 2021 at 7:07 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
> > problems due to reclaim throttling for excessive lengths of time.
> > In Alexey's case, a memory hog that should go OOM quickly stalls for
> > several minutes before stalling. In Mike and Darrick's cases, a small
> > memcg environment stalled excessively even though the system had enough
> > memory overall.
> >
> > Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being
> > made") introduced the problem although commit a19594ca4a8b ("mm/vmscan:
> > increase the timeout if page reclaim is not making progress") made it
> > worse. Systems at or near an OOM state that cannot be recovered must
> > reach OOM quickly and memcg should kill tasks if a memcg is near OOM.
> >
> 
> Is there a reason we can't simply revert 69392a403f49 instead of adding
> more code/heuristics? Looking more into 69392a403f49, I don't think the
> code and commit message are in sync.
> 
> For the memcg reclaim, instead of just removing congestion_wait or
> replacing it with schedule_timeout in mem_cgroup_force_empty(), why
> change the behavior of all memcg reclaim. Also this patch effectively
> reverts that behavior of 69392a403f49.
> 

It doesn't fully revert it but I did consider reverting it. The reason
I preserved it is that the original intent was to throttle somewhat
when progress is not being made to avoid a premature OOM and I wanted to
preserve that characteristic. Right now, this is the least harmful way
of doing it.

As for memcg, I removed the NOTHROTTLE because the primary reason why a
memcg might fail to make progress is excessive writeback and that should
still throttle. Completely failing to make progress in a memcg is most
likely due to a memcg-OOM.

> For direct reclaimers under global pressure, why is page allocator a bad
> place for stalling on no progress reclaim? IMHO the callers of the
> reclaim should decide what to do if reclaim is not making progress.

Because it's a layering violation and the caller has little direct control
over the reclaim retry logic. The page allocator has no visibility into
why reclaim failed, only that it did fail.
Shakeel Butt Dec. 2, 2021, 5:41 p.m. UTC | #3
On Thu, Dec 2, 2021 at 8:52 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, Dec 02, 2021 at 08:30:51AM -0800, Shakeel Butt wrote:
> > Hi Mel,
> >
> > On Thu, Dec 2, 2021 at 7:07 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
> > > problems due to reclaim throttling for excessive lengths of time.
> > > In Alexey's case, a memory hog that should go OOM quickly stalls for
> > > several minutes before stalling. In Mike and Darrick's cases, a small
> > > memcg environment stalled excessively even though the system had enough
> > > memory overall.
> > >
> > > Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being
> > > made") introduced the problem although commit a19594ca4a8b ("mm/vmscan:
> > > increase the timeout if page reclaim is not making progress") made it
> > > worse. Systems at or near an OOM state that cannot be recovered must
> > > reach OOM quickly and memcg should kill tasks if a memcg is near OOM.
> > >
> >
> > Is there a reason we can't simply revert 69392a403f49 instead of adding
> > more code/heuristics? Looking more into 69392a403f49, I don't think the
> > code and commit message are in sync.
> >
> > For the memcg reclaim, instead of just removing congestion_wait or
> > replacing it with schedule_timeout in mem_cgroup_force_empty(), why
> > change the behavior of all memcg reclaim. Also this patch effectively
> > reverts that behavior of 69392a403f49.
> >
>
> It doesn't fully revert it but I did consider reverting it. The reason
> why I preserved it because the intent originally was to throttle somewhat
> when progress is not being made to avoid a premature OOM and I wanted to
> preserve that charactersistic. Right now, this is the least harmful way
> of doing it.

If I understand correctly, the original intent of 69392a403f49 which
you want to preserve is "avoid premature OOMs when reclaim is not
making progress". Were there any complaints or bug reports on these
premature OOMs?

>
> As more memcg, I removed the NOTHROTTLE because the primary reason why a
> memcg might fail to make progress is excessive writeback and that should
> still throttle. Completely failing to make progress in a memcg is most
> likely due to a memcg-OOM.
>
> > For direct reclaimers under global pressure, why is page allocator a bad
> > place for stalling on no progress reclaim? IMHO the callers of the
> > reclaim should decide what to do if reclaim is not making progress.
>
> Because it's a layering violation and the caller has little direct control
> over the reclaim retry logic. The page allocator has no visibility on
> why reclaim failed only that it did fail.
>

Isn't it better that the reclaim returns why it is failing instead of
littering the reclaim code with 'is this global reclaim', 'is this
memcg reclaim', 'am I kswapd' which is also a layering violation. IMO
this is the direction we should be going towards though not asking to
do this now.

Regarding this patch and 69392a403f49, I am still confused on the main
motivation behind 69392a403f49 to change the behavior of 'direct
reclaimers from page allocator'.

thanks,
Shakeel
Mel Gorman Dec. 3, 2021, 9:01 a.m. UTC | #4
On Thu, Dec 02, 2021 at 09:41:04AM -0800, Shakeel Butt wrote:
> On Thu, Dec 2, 2021 at 8:52 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Thu, Dec 02, 2021 at 08:30:51AM -0800, Shakeel Butt wrote:
> > > Hi Mel,
> > >
> > > On Thu, Dec 2, 2021 at 7:07 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> > > >
> > > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
> > > > problems due to reclaim throttling for excessive lengths of time.
> > > > In Alexey's case, a memory hog that should go OOM quickly stalls for
> > > > several minutes before stalling. In Mike and Darrick's cases, a small
> > > > memcg environment stalled excessively even though the system had enough
> > > > memory overall.
> > > >
> > > > Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being
> > > > made") introduced the problem although commit a19594ca4a8b ("mm/vmscan:
> > > > increase the timeout if page reclaim is not making progress") made it
> > > > worse. Systems at or near an OOM state that cannot be recovered must
> > > > reach OOM quickly and memcg should kill tasks if a memcg is near OOM.
> > > >
> > >
> > > Is there a reason we can't simply revert 69392a403f49 instead of adding
> > > more code/heuristics? Looking more into 69392a403f49, I don't think the
> > > code and commit message are in sync.
> > >
> > > For the memcg reclaim, instead of just removing congestion_wait or
> > > replacing it with schedule_timeout in mem_cgroup_force_empty(), why
> > > change the behavior of all memcg reclaim. Also this patch effectively
> > > reverts that behavior of 69392a403f49.
> > >
> >
> > It doesn't fully revert it but I did consider reverting it. The reason
> > why I preserved it because the intent originally was to throttle somewhat
> > when progress is not being made to avoid a premature OOM and I wanted to
> > preserve that charactersistic. Right now, this is the least harmful way
> > of doing it.
> 
> If I understand correctly, the original intent of 69392a403f49 which
> you want to preserve is "avoid premature OOMs when reclaim is not
> making progress". Were there any complaints or bug reports on these
> premature OOMs?
> 

Not recently that I'm aware of but historically reclaim has been plagued by
at least two classes of problems -- premature OOM and excessive CPU usage
churning through the LRU. Going back, the solution was basically to sleep
or apply heuristics like "disable kswapd if it fails to make progress for too long".
Commit 69392a403f49 addressed a case where calling congestion_wait might as
well have been schedule_timeout_uninterruptible(HZ/10) because congestion
is no longer tracked by the block layer.

Hence 69392a403f49 allows reclaim to throttle on NOPROGRESS but if
another task makes progress, the throttled tasks can be woken before the
timeout. The flaw was throttling too easily or for too long delaying OOM
being properly detected.

> >
> > As more memcg, I removed the NOTHROTTLE because the primary reason why a
> > memcg might fail to make progress is excessive writeback and that should
> > still throttle. Completely failing to make progress in a memcg is most
> > likely due to a memcg-OOM.
> >
> > > For direct reclaimers under global pressure, why is page allocator a bad
> > > place for stalling on no progress reclaim? IMHO the callers of the
> > > reclaim should decide what to do if reclaim is not making progress.
> >
> > Because it's a layering violation and the caller has little direct control
> > over the reclaim retry logic. The page allocator has no visibility on
> > why reclaim failed only that it did fail.
> >
> 
> Isn't it better that the reclaim returns why it is failing instead of
> littering the reclaim code with 'is this global reclaim', 'is this
> memcg reclaim', 'am I kswapd' which is also a layering violation. IMO
> this is the direction we should be going towards though not asking to
> do this now.
> 

It's not clear why you think the page allocator can make better decisions
about reclaim than reclaim can. It might make sense if callers were
returned enough information to make a decision but even if they could,
it would not be popular as the API would be difficult to use properly.

Is your primary objection the cgroup_reclaim(sc) check? If so, I can
remove it. While there is a mild risk that OOM would be delayed, it's very
unlikely because a memcg failing to make progress in the local case will
probably call cond_resched() if there are not lots of pages pending
writes globally.

> Regarding this patch and 69392a403f49, I am still confused on the main
> motivation behind 69392a403f49 to change the behavior of 'direct
> reclaimers from page allocator'.
> 

The main motivation of the series overall was to remove the reliance on
congestion_wait and wait_iff_congested because both are fundamentally
broken when congestion is not tracked by the block layer. Replacing with
schedule_timeout_uninterruptible() would be silly because where possible
decisions on whether to pause or throttle should be based on events,
not time. For example, if there are too many pages waiting on writeback
then throttle but if writeback completes, wake the throttled tasks
instead of "sleep some time and hope for the best".
Shakeel Butt Dec. 3, 2021, 5:50 p.m. UTC | #5
On Fri, Dec 3, 2021 at 1:01 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
[...]
>
> Not recently that I'm aware of but historically reclaim has been plagued by
> at least two classes of problems -- premature OOM and excessive CPU usage
> churning through the LRU. Going back, the solution was basically to sleep
> something like "disable kswapd if it fails to make progress for too long".
> Commit 69392a403f49 addressed a case where calling congestion_wait might as
> well have been schedule_timeout_uninterruptible(HZ/10) because congestion
> is no longer tracked by the block layer.
>
> Hence 69392a403f49 allows reclaim to throttle on NOPROGRESS but if
> another task makes progress, the throttled tasks can be woken before the
> timeout. The flaw was throttling too easily or for too long delaying OOM
> being properly detected.
>

To remove congestion_wait of mem_cgroup_force_empty_write(), the
commit 69392a403f49 has changed the behavior of all memcg reclaim
codepaths as well as direct global reclaimers. Were there other
congestion_wait() instances that commit 69392a403f49 was targeting
but that were replaced or removed by different
commits?

[...]

> >
> > Isn't it better that the reclaim returns why it is failing instead of
> > littering the reclaim code with 'is this global reclaim', 'is this
> > memcg reclaim', 'am I kswapd' which is also a layering violation. IMO
> > this is the direction we should be going towards though not asking to
> > do this now.
> >
>
> It's not clear why you think the page allocator can make better decisions
> about reclaim than reclaim can. It might make sense if callers were
> returned enough information to make a decision but even if they could,
> it would not be popular as the API would be difficult to use properly.
>

The above is a separate discussion for later.

> Is your primary objection the cgroup_reclaim(sc) check?

No, I am of the opinion that we should revert 69392a403f49 and we
should have just replaced congestion_wait in
mem_cgroup_force_empty_write with a simple
schedule_timeout_interruptible. memory.force_empty is a cgroup v1
interface (to be deprecated) and it is very normal to expect that the
user will trigger that interface multiple times. We should not change
the behavior of all the memcg reclaimers and direct global reclaimers
so that we can remove congestion_wait from
mem_cgroup_force_empty_write.

> If so, I can
> remove it. While there is a mild risk that OOM would be delayed, it's very
> unlikely because a memcg failing to make progress in the local case will
> probably call cond_resched() if there are not lots of of pages pending
> writes globally.
>
> > Regarding this patch and 69392a403f49, I am still confused on the main
> > motivation behind 69392a403f49 to change the behavior of 'direct
> > reclaimers from page allocator'.
> >
>
> The main motivation of the series overall was to remove the reliance on
> congestion_wait and wait_iff_congested because both are fundamentally
> broken when congestion is not tracked by the block layer. Replacing with
> schedule_timeout_uninterruptible() would be silly because where possible
> decisions on whether to pause or throttle should be based on events,
> not time. For example, if there are too many pages waiting on writeback
> then throttle but if writeback completes, wake the throttled tasks
> instead of "sleep some time and hope for the best".
>

I am in agreement with the motivation of the whole series. I am just
making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based
throttle is more than just the congestion_wait of
mem_cgroup_force_empty_write.

thanks,
Shakeel
Mel Gorman Dec. 3, 2021, 7:08 p.m. UTC | #6
On Fri, Dec 03, 2021 at 09:50:51AM -0800, Shakeel Butt wrote:
> On Fri, Dec 3, 2021 at 1:01 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> [...]
> >
> > Not recently that I'm aware of but historically reclaim has been plagued by
> > at least two classes of problems -- premature OOM and excessive CPU usage
> > churning through the LRU. Going back, the solution was basically to sleep
> > something like "disable kswapd if it fails to make progress for too long".
> > Commit 69392a403f49 addressed a case where calling congestion_wait might as
> > well have been schedule_timeout_uninterruptible(HZ/10) because congestion
> > is no longer tracked by the block layer.
> >
> > Hence 69392a403f49 allows reclaim to throttle on NOPROGRESS but if
> > another task makes progress, the throttled tasks can be woken before the
> > timeout. The flaw was throttling too easily or for too long delaying OOM
> > being properly detected.
> >
> 
> To remove congestion_wait of mem_cgroup_force_empty_write(), the
> commit 69392a403f49 has changed the behavior of all memcg reclaim
> codepaths as well as direct global reclaimers.

Well, yes, it moved it to stalling on writeback if it's in progress
and waking up if writeback makes forward progress instead of a
schedule_timeout_interruptible().

> Were there other
> congestion_wait() instances which commit 69392a403f49 was targeting
> but those congestion_wait() were replaced/removed by different
> commits?
> 

Yes, the series removed congestion_wait from other places because the
interface is broken and has been for a long time.

> [...]
> 
> > >
> > > Isn't it better that the reclaim returns why it is failing instead of
> > > littering the reclaim code with 'is this global reclaim', 'is this
> > > memcg reclaim', 'am I kswapd' which is also a layering violation. IMO
> > > this is the direction we should be going towards though not asking to
> > > do this now.
> > >
> >
> > It's not clear why you think the page allocator can make better decisions
> > about reclaim than reclaim can. It might make sense if callers were
> > returned enough information to make a decision but even if they could,
> > it would not be popular as the API would be difficult to use properly.
> >
> 
> The above is a separate discussion for later.
> 
> > Is your primary objection the cgroup_reclaim(sc) check?
> 
> No, I am of the opinion that we should revert 69392a403f49 and we
> should have just replaced congestion_wait in
> mem_cgroup_force_empty_write with a simple
> schedule_timeout_interruptible.

That is a bit weak. Depending on the type of storage, writeback may
complete in microseconds or seconds. The event used to be "sleep until
congestion clears" which is no longer an event that can be waited upon
in the vast majority of cases (NFS being an obvious exception). Now,
it may throttle writeback on a waitqueue and if enough writeback
completes, the task will be woken before the timeout to minimise the
stall. schedule_timeout_interruptible() always waits for a fixed duration
regardless of what else happens in the meantime.

> The memory.force_empty is a cgroup v1
> interface (to be deprecated) and it is very normal to expect that the
> user will trigger that interface multiple times. We should not change
> the behavior of all the memcg reclaimers and direct global reclaimers
> so that we can remove congestion_wait from
> mem_cgroup_force_empty_write.
> 

The mem_cgroup_force_empty_write() path will throttle on writeback in
the same way that global reclaim does at this point:

                /*
                 * If kswapd scans pages marked for immediate
                 * reclaim and under writeback (nr_immediate), it
                 * implies that pages are cycling through the LRU
                 * faster than they are written so forcibly stall
                 * until some pages complete writeback.
                 */
                if (sc->nr.immediate)
                        reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);

With this patch, memcg does not stall on NOPROGRESS.

> > If so, I can
> > remove it. While there is a mild risk that OOM would be delayed, it's very
> > unlikely because a memcg failing to make progress in the local case will
> > probably call cond_resched() if there are not lots of of pages pending
> > writes globally.
> >
> > > Regarding this patch and 69392a403f49, I am still confused on the main
> > > motivation behind 69392a403f49 to change the behavior of 'direct
> > > reclaimers from page allocator'.
> > >
> >
> > The main motivation of the series overall was to remove the reliance on
> > congestion_wait and wait_iff_congested because both are fundamentally
> > broken when congestion is not tracked by the block layer. Replacing with
> > schedule_timeout_uninterruptible() would be silly because where possible
> > decisions on whether to pause or throttle should be based on events,
> > not time. For example, if there are too many pages waiting on writeback
> > then throttle but if writeback completes, wake the throttled tasks
> > instead of "sleep some time and hope for the best".
> >
> 
> I am in agreement with the motivation of the whole series. I am just
> making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based
> throttle is more than just the congestion_wait of
> mem_cgroup_force_empty_write.
> 

The commit that primarily targets congestion_wait is 8cd7c588decf
("mm/vmscan: throttle reclaim until some writeback completes if
congested"). The series recognises that there are other reasons why
reclaim can fail to make progress that are not directly writeback related.
Shakeel Butt Dec. 6, 2021, 6:06 a.m. UTC | #7
On Fri, Dec 3, 2021 at 11:08 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
[...]
> > I am in agreement with the motivation of the whole series. I am just
> > making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based
> > throttle is more than just the congestion_wait of
> > mem_cgroup_force_empty_write.
> >
>
> The commit that primarily targets congestion_wait is 8cd7c588decf
> ("mm/vmscan: throttle reclaim until some writeback completes if
> congested"). The series recognises that there are other reasons why
> reclaim can fail to make progress that is not directly writeback related.
>

I agree with throttling for VMSCAN_THROTTLE_[WRITEBACK|ISOLATED]
reasons. Please explain why we should throttle for
VMSCAN_THROTTLE_NOPROGRESS? Also 69392a403f49 claims "Direct reclaim
primarily is throttled in the page allocator if it is failing to make
progress.", can you please explain how?
Mel Gorman Dec. 6, 2021, 11:25 a.m. UTC | #8
On Sun, Dec 05, 2021 at 10:06:27PM -0800, Shakeel Butt wrote:
> On Fri, Dec 3, 2021 at 11:08 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> [...]
> > > I am in agreement with the motivation of the whole series. I am just
> > > making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based
> > > throttle is more than just the congestion_wait of
> > > mem_cgroup_force_empty_write.
> > >
> >
> > The commit that primarily targets congestion_wait is 8cd7c588decf
> > ("mm/vmscan: throttle reclaim until some writeback completes if
> > congested"). The series recognises that there are other reasons why
> > reclaim can fail to make progress that is not directly writeback related.
> >
> 
> I agree with throttling for VMSCAN_THROTTLE_[WRITEBACK|ISOLATED]
> reasons. Please explain why we should throttle for
> VMSCAN_THROTTLE_NOPROGRESS? Also 69392a403f49 claims "Direct reclaim
> primarily is throttled in the page allocator if it is failing to make
> progress.", can you please explain how?

It could happen if the pages on the LRU are being reactivated continually
or are holding an elevated reference count for some reason (e.g. gup,
page migration etc). The event is probably transient, hence the short
throttling.
Shakeel Butt Dec. 7, 2021, 7:14 a.m. UTC | #9
On Mon, Dec 6, 2021 at 3:25 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Sun, Dec 05, 2021 at 10:06:27PM -0800, Shakeel Butt wrote:
> > On Fri, Dec 3, 2021 at 11:08 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > [...]
> > > > I am in agreement with the motivation of the whole series. I am just
> > > > making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based
> > > > throttle is more than just the congestion_wait of
> > > > mem_cgroup_force_empty_write.
> > > >
> > >
> > > The commit that primarily targets congestion_wait is 8cd7c588decf
> > > ("mm/vmscan: throttle reclaim until some writeback completes if
> > > congested"). The series recognises that there are other reasons why
> > > reclaim can fail to make progress that is not directly writeback related.
> > >
> >
> > I agree with throttling for VMSCAN_THROTTLE_[WRITEBACK|ISOLATED]
> > reasons. Please explain why we should throttle for
> > VMSCAN_THROTTLE_NOPROGRESS? Also 69392a403f49 claims "Direct reclaim
> > primarily is throttled in the page allocator if it is failing to make
> > progress.", can you please explain how?
>
> It could happen if the pages on the LRU are being reactivated continually
> or holding an elevated reference count for some reason (e.g. gup,
> page migration etc). The event is probably transient, hence the short
> throttling.
>

What's the worst that can happen if the kernel doesn't throttle at all
for these transient scenarios? Premature oom-kills? The kernel already
has some protection against such situations with retries, i.e. 16
consecutive unsuccessful reclaim tries are needed before it gives up on
reclaim.

Anyways, I have shared my view which is 'no need to throttle at all
for no-progress reclaims for now and course correct if there are
complaints in future' but will not block the patch.

thanks,
Shakeel
Mel Gorman Dec. 7, 2021, 9:28 a.m. UTC | #10
On Mon, Dec 06, 2021 at 11:14:58PM -0800, Shakeel Butt wrote:
> On Mon, Dec 6, 2021 at 3:25 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Sun, Dec 05, 2021 at 10:06:27PM -0800, Shakeel Butt wrote:
> > > On Fri, Dec 3, 2021 at 11:08 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> > > >
> > > [...]
> > > > > I am in agreement with the motivation of the whole series. I am just
> > > > > making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based
> > > > > throttle is more than just the congestion_wait of
> > > > > mem_cgroup_force_empty_write.
> > > > >
> > > >
> > > > The commit that primarily targets congestion_wait is 8cd7c588decf
> > > > ("mm/vmscan: throttle reclaim until some writeback completes if
> > > > congested"). The series recognises that there are other reasons why
> > > > reclaim can fail to make progress that is not directly writeback related.
> > > >
> > >
> > > I agree with throttling for VMSCAN_THROTTLE_[WRITEBACK|ISOLATED]
> > > reasons. Please explain why we should throttle for
> > > VMSCAN_THROTTLE_NOPROGRESS? Also 69392a403f49 claims "Direct reclaim
> > > primarily is throttled in the page allocator if it is failing to make
> > > progress.", can you please explain how?
> >
> > It could happen if the pages on the LRU are being reactivated continually
> > or holding an elevated reference count for some reason (e.g. gup,
> > page migration etc). The event is probably transient, hence the short
> > throttling.
> >
> 
> What's the worst that can happen if the kernel doesn't throttle at all
> for these transient scenarios? Premature oom-kills?

Excessive CPU usage in reclaim, potential premature OOM kills.

> The kernel already
> has some protection against such situations with retries i.e.
> consecutive 16 unsuccessful reclaim tries have to fail to give up the
> reclaim.
> 

The retries mitigate the premature OOM kills but not the excessive
CPU usage.

> Anyways, I have shared my view which is 'no need to throttle at all
> for no-progress reclaims for now and course correct if there are
> complaints in future' but will not block the patch.
> 

We've gone through periods of bugs that had either direct reclaim or
kswapd pegged at 100% CPU usage. While kswapd now just stops, the patch
still minimises the risk of excessive CPU usage bugs due to direct reclaim.
Hugh Dickins Dec. 9, 2021, 6:20 a.m. UTC | #11
On Thu, 2 Dec 2021, Mel Gorman wrote:
...
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
...
> @@ -3478,14 +3520,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  			/* need some check for avoid more shrink_zone() */
>  		}
>  
> +		if (!first_pgdat)
> +			first_pgdat = zone->zone_pgdat;
> +
>  		/* See comment about same check for global reclaim above */
>  		if (zone->zone_pgdat == last_pgdat)
>  			continue;
>  		last_pgdat = zone->zone_pgdat;
>  		shrink_node(zone->zone_pgdat, sc);
> -		consider_reclaim_throttle(zone->zone_pgdat, sc);
>  	}
>  
> +	consider_reclaim_throttle(first_pgdat, sc);

My tmpfs swapping load (tweaked to use huge pages more heavily than
in real life) is far from being a realistic load: but it was notably
slowed down by your throttling mods in 5.16-rc, and this patch makes
it well again - thanks.

But: it very quickly hit a NULL pointer until I changed that last line to

	if (first_pgdat)
		consider_reclaim_throttle(first_pgdat, sc);

I've given no thought as to whether that is the correct fix,
or if first_pgdat should be set earlier in the loop above.

Hugh
Mel Gorman Dec. 9, 2021, 9:53 a.m. UTC | #12
On Wed, Dec 08, 2021 at 10:20:47PM -0800, Hugh Dickins wrote:
> On Thu, 2 Dec 2021, Mel Gorman wrote:
> ...
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> ...
> > @@ -3478,14 +3520,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> >  			/* need some check for avoid more shrink_zone() */
> >  		}
> >  
> > +		if (!first_pgdat)
> > +			first_pgdat = zone->zone_pgdat;
> > +
> >  		/* See comment about same check for global reclaim above */
> >  		if (zone->zone_pgdat == last_pgdat)
> >  			continue;
> >  		last_pgdat = zone->zone_pgdat;
> >  		shrink_node(zone->zone_pgdat, sc);
> > -		consider_reclaim_throttle(zone->zone_pgdat, sc);
> >  	}
> >  
> > +	consider_reclaim_throttle(first_pgdat, sc);
> 
> My tmpfs swapping load (tweaked to use huge pages more heavily than
> in real life) is far from being a realistic load: but it was notably
> slowed down by your throttling mods in 5.16-rc, and this patch makes
> it well again - thanks.
> 
> But: it very quickly hit a NULL pointer until I changed that last line to
> 
> 	if (first_pgdat)
> 		consider_reclaim_throttle(first_pgdat, sc);
> 
> I've given no thought as to whether that is the correct fix,
> or if first_pgdat should be set earlier in the loop above.
> 

It's the right fix; first_pgdat may be NULL if compaction can run for
every zone in the zonelist, which could be the case for a tmpfs swapping
load that is huge-page intensive.

Thanks Hugh.
Thorsten Leemhuis Dec. 28, 2021, 10:04 a.m. UTC | #13
Hi, this is your Linux kernel regression tracker speaking.

On 02.12.21 16:06, Mel Gorman wrote:
> Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
> problems due to reclaim throttling for excessive lengths of time.
> In Alexey's case, a memory hog that should go OOM quickly stalls for
> several minutes before the OOM kill. In Mike and Darrick's cases, a small
> memcg environment stalled excessively even though the system had enough
> memory overall.

Just wondering: this patch afaics is now in -mm and  Linux next for
nearly two weeks. Is that intentional? I had expected it to be mainlined
with the batch of patches Andrew mailed to Linus last week, but it
wasn't among them.

Or am I missing something?

Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them. Unfortunately
therefore I sometimes will get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply, that's in everyone's interest.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CC on
all further activities wrt to this regression.


> Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being
> made") introduced the problem although commit a19594ca4a8b ("mm/vmscan:
> increase the timeout if page reclaim is not making progress") made it
> worse. Systems at or near an OOM state that cannot be recovered must
> reach OOM quickly and memcg should kill tasks if a memcg is near OOM.
> 
> To address this, only stall for the first zone in the zonelist, reduce
> the timeout to 1 tick for VMSCAN_THROTTLE_NOPROGRESS and only stall if
> the scan control nr_reclaimed is 0, kswapd is still active and there were
> excessive pages pending for writeback. If kswapd has stopped reclaiming due
> to excessive failures, do not stall at all so that OOM triggers relatively
> quickly. Similarly, if an LRU is simply congested, only lightly throttle
> similar to NOPROGRESS.
> 
> Alexey's original case was the most straightforward
> 
> 	for i in {1..3}; do tail /dev/zero; done
> 
> On vanilla 5.16-rc1, this test stalled heavily, after the patch the test
> completes in a few seconds similar to 5.15.
> 
> Alexey's second test case added watching a youtube video while tail runs
> 10 times. On 5.15, playback only jitters slightly, 5.16-rc1 stalls a lot
> with lots of frames missing and numerous audio glitches. With this patch
> applied, the video plays similarly to 5.15.
> 
> Link: https://lore.kernel.org/r/99e779783d6c7fce96448a3402061b9dc1b3b602.camel@gmx.de
> Link: https://lore.kernel.org/r/20211124011954.7cab9bb4@mail.inbox.lv
> Link: https://lore.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
> 
> [lkp@intel.com: Fix W=1 build warning]
> Reported-and-tested-by: Alexey Avramov <hakavlad@inbox.lv>
> Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
> Reported-and-tested-by: Darrick J. Wong <djwong@kernel.org>
> Reported-by: kernel test robot <lkp@intel.com>
> Fixes: 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being made")
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/mmzone.h        |  1 +
>  include/trace/events/vmscan.h |  4 ++-
>  mm/vmscan.c                   | 64 ++++++++++++++++++++++++++++++-----
>  3 files changed, 59 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 58e744b78c2c..936dc0b6c226 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -277,6 +277,7 @@ enum vmscan_throttle_state {
>  	VMSCAN_THROTTLE_WRITEBACK,
>  	VMSCAN_THROTTLE_ISOLATED,
>  	VMSCAN_THROTTLE_NOPROGRESS,
> +	VMSCAN_THROTTLE_CONGESTED,
>  	NR_VMSCAN_THROTTLE,
>  };
>  
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index f25a6149d3ba..ca2e9009a651 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -30,12 +30,14 @@
>  #define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
>  #define _VMSCAN_THROTTLE_ISOLATED	(1 << VMSCAN_THROTTLE_ISOLATED)
>  #define _VMSCAN_THROTTLE_NOPROGRESS	(1 << VMSCAN_THROTTLE_NOPROGRESS)
> +#define _VMSCAN_THROTTLE_CONGESTED	(1 << VMSCAN_THROTTLE_CONGESTED)
>  
>  #define show_throttle_flags(flags)						\
>  	(flags) ? __print_flags(flags, "|",					\
>  		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"},	\
>  		{_VMSCAN_THROTTLE_ISOLATED,	"VMSCAN_THROTTLE_ISOLATED"},	\
> -		{_VMSCAN_THROTTLE_NOPROGRESS,	"VMSCAN_THROTTLE_NOPROGRESS"}	\
> +		{_VMSCAN_THROTTLE_NOPROGRESS,	"VMSCAN_THROTTLE_NOPROGRESS"},	\
> +		{_VMSCAN_THROTTLE_CONGESTED,	"VMSCAN_THROTTLE_CONGESTED"}	\
>  		) : "VMSCAN_THROTTLE_NONE"
>  
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fb9584641ac7..4c4d5f6cd8a3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1021,6 +1021,39 @@ static void handle_write_error(struct address_space *mapping,
>  	unlock_page(page);
>  }
>  
> +static bool skip_throttle_noprogress(pg_data_t *pgdat)
> +{
> +	int reclaimable = 0, write_pending = 0;
> +	int i;
> +
> +	/*
> +	 * If kswapd is disabled, reschedule if necessary but do not
> +	 * throttle as the system is likely near OOM.
> +	 */
> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +		return true;
> +
> +	/*
> +	 * If there are a lot of dirty/writeback pages then do not
> +	 * throttle as throttling will occur when the pages cycle
> +	 * towards the end of the LRU if still under writeback.
> +	 */
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +		struct zone *zone = pgdat->node_zones + i;
> +
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		reclaimable += zone_reclaimable_pages(zone);
> +		write_pending += zone_page_state_snapshot(zone,
> +						  NR_ZONE_WRITE_PENDING);
> +	}
> +	if (2 * write_pending <= reclaimable)
> +		return true;
> +
> +	return false;
> +}
> +
>  void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
>  {
>  	wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
> @@ -1056,8 +1089,16 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
>  		}
>  
>  		break;
> +	case VMSCAN_THROTTLE_CONGESTED:
> +		fallthrough;
>  	case VMSCAN_THROTTLE_NOPROGRESS:
> -		timeout = HZ/2;
> +		if (skip_throttle_noprogress(pgdat)) {
> +			cond_resched();
> +			return;
> +		}
> +
> +		timeout = 1;
> +
>  		break;
>  	case VMSCAN_THROTTLE_ISOLATED:
>  		timeout = HZ/50;
> @@ -3321,7 +3362,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	if (!current_is_kswapd() && current_may_throttle() &&
>  	    !sc->hibernation_mode &&
>  	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
> -		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> +		reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
>  
>  	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>  				    sc))
> @@ -3386,16 +3427,16 @@ static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
>  	}
>  
>  	/*
> -	 * Do not throttle kswapd on NOPROGRESS as it will throttle on
> -	 * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under
> -	 * writeback and marked for immediate reclaim at the tail of
> -	 * the LRU.
> +	 * Do not throttle kswapd or cgroup reclaim on NOPROGRESS as it will
> +	 * throttle on VMSCAN_THROTTLE_WRITEBACK if there are too many pages
> +	 * under writeback and marked for immediate reclaim at the tail of the
> +	 * LRU.
>  	 */
> -	if (current_is_kswapd())
> +	if (current_is_kswapd() || cgroup_reclaim(sc))
>  		return;
>  
> +	/* Throttle if making no progress at high priorities. */
> -	if (sc->priority < DEF_PRIORITY - 2)
> +	if (sc->priority == 1 && !sc->nr_reclaimed)
>  		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS);
>  }
>  
> @@ -3415,6 +3456,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  	unsigned long nr_soft_scanned;
>  	gfp_t orig_mask;
>  	pg_data_t *last_pgdat = NULL;
> +	pg_data_t *first_pgdat = NULL;
>  
>  	/*
>  	 * If the number of buffer_heads in the machine exceeds the maximum
> @@ -3478,14 +3520,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  			/* need some check for avoid more shrink_zone() */
>  		}
>  
> +		if (!first_pgdat)
> +			first_pgdat = zone->zone_pgdat;
> +
>  		/* See comment about same check for global reclaim above */
>  		if (zone->zone_pgdat == last_pgdat)
>  			continue;
>  		last_pgdat = zone->zone_pgdat;
>  		shrink_node(zone->zone_pgdat, sc);
> -		consider_reclaim_throttle(zone->zone_pgdat, sc);
>  	}
>  
> +	consider_reclaim_throttle(first_pgdat, sc);
> +
>  	/*
>  	 * Restore to original mask to avoid the impact on the caller if we
>  	 * promoted it to __GFP_HIGHMEM.
Andrew Morton Dec. 29, 2021, 11:45 p.m. UTC | #14
On Tue, 28 Dec 2021 11:04:18 +0100 Thorsten Leemhuis <regressions@leemhuis.info> wrote:

> Hi, this is your Linux kernel regression tracker speaking.
> 
> On 02.12.21 16:06, Mel Gorman wrote:
> > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
> > problems due to reclaim throttling for excessive lengths of time.
> > In Alexey's case, a memory hog that should go OOM quickly stalls for
> > several minutes before the OOM kill. In Mike and Darrick's cases, a small
> > memcg environment stalled excessively even though the system had enough
> > memory overall.
> 
> Just wondering: this patch afaics is now in -mm and  Linux next for
> nearly two weeks. Is that intentional? I had expected it to be mainlined
> with the batch of patches Andrew mailed to Linus last week, but it
> wasn't among them.

I have it queued for 5.17-rc1.

There is still time to squeeze it into 5.16, just, with a cc:stable. 

Alternatively we could merge it into 5.17-rc1 with a cc:stable, so it
will trickle back with less risk to the 5.17 release.

What do people think?
Thorsten Leemhuis Dec. 31, 2021, 2:24 p.m. UTC | #15
On 30.12.21 00:45, Andrew Morton wrote:
> On Tue, 28 Dec 2021 11:04:18 +0100 Thorsten Leemhuis <regressions@leemhuis.info> wrote:
> 
>> Hi, this is your Linux kernel regression tracker speaking.
>>
>> On 02.12.21 16:06, Mel Gorman wrote:
>>> Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
>>> problems due to reclaim throttling for excessive lengths of time.
>>> In Alexey's case, a memory hog that should go OOM quickly stalls for
>>> several minutes before the OOM kill. In Mike and Darrick's cases, a small
>>> memcg environment stalled excessively even though the system had enough
>>> memory overall.
>>
>> Just wondering: this patch afaics is now in -mm and  Linux next for
>> nearly two weeks. Is that intentional? I had expected it to be mainlined
>> with the batch of patches Andrew mailed to Linus last week, but it
>> wasn't among them.
> 
> I have it queued for 5.17-rc1.
> 
> There is still time to squeeze it into 5.16, just, with a cc:stable. 
> 
> Alternatively we could merge it into 5.17-rc1 with a cc:stable, so it
> will trickle back with less risk to the 5.17 release.
> 
> What do people think?

CCing Linus, to make sure he's aware of this.

Maybe I'm totally missing something, but I'm a bit confused by what you
wrote, as the regression afaik was introduced between v5.15..v5.16-rc1.
So I assume this is what you meant:

```
I have it queued for 5.17-rc1.

There is still time to squeeze it into 5.16.

Alternatively we could merge it into 5.17-rc1 with a cc:stable, so it
will trickle back with less risk to the 5.16 release.

What do people think?
```

I'll leave the individual risk evaluation of the patch to others. If the
fix is risky, waiting for 5.17 is fine for me.

But hmmm, regarding the "could merge it into 5.17-rc1 with a cc:stable"
idea, a remark: is that really "less risk", as you stated?

If we get it into rc8 (which is still possible, even if a bit hard due
to the new year festivities), it will get at least one week of testing.

If the fix waits for the next merge window, it all depends on how
the timing works out. But it's easy to picture a worst case: the fix is
only merged on the Friday evening before Linus releases 5.17-rc1; right
after it's out, it makes it into a stable-rc (say a day or two after
5.17-rc1 is out) and from there into a 5.16.y release on Thursday. That
IMHO would mean fewer days of testing in the end (and there is a weekend
in this period as well).

Waiting obviously will also mean that users of 5.16 and 5.16.y will
likely have to face this regression for at least two and a half weeks,
unless you send the fix early and Greg backports it before rc1 (which he
afaics does if there are good reasons). Yes, it's `just` a performance
regression, so it might not stop anyone from running Linux 5.16 -- but
it's one that three people separately reported in the 5.16 devel cycle,
so others will likely encounter it as well if we leave it unfixed in
5.16. This will likely annoy some people, especially if they invest time
in bisecting it, only to find out that the fourth iteration of the fix
for the regression has been available since December 2nd.

Ciao, Thorsten
Hugh Dickins Dec. 31, 2021, 6:33 p.m. UTC | #16
On Fri, 31 Dec 2021, Thorsten Leemhuis wrote:
> On 30.12.21 00:45, Andrew Morton wrote:
> > On Tue, 28 Dec 2021 11:04:18 +0100 Thorsten Leemhuis <regressions@leemhuis.info> wrote:
> > 
> >> Hi, this is your Linux kernel regression tracker speaking.
> >>
> >> On 02.12.21 16:06, Mel Gorman wrote:
> >>> Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
> >>> problems due to reclaim throttling for excessive lengths of time.
> >>> In Alexey's case, a memory hog that should go OOM quickly stalls for
> >>> several minutes before the OOM kill. In Mike and Darrick's cases, a small
> >>> memcg environment stalled excessively even though the system had enough
> >>> memory overall.
> >>
> >> Just wondering: this patch afaics is now in -mm and  Linux next for
> >> nearly two weeks. Is that intentional? I had expected it to be mainlined
> >> with the batch of patches Andrew mailed to Linus last week, but it
> >> wasn't among them.
> > 
> > I have it queued for 5.17-rc1.
> > 
> > There is still time to squeeze it into 5.16, just, with a cc:stable. 
> > 
> > Alternatively we could merge it into 5.17-rc1 with a cc:stable, so it
> > will trickle back with less risk to the 5.17 release.
> > 
> > What do people think?
> 
> CCing Linus, to make sure he's aware of this.
> 
> Maybe I'm totally missing something, but I'm a bit confused by what you
> wrote, as the regression afaik was introduced between v5.15..v5.16-rc1.
> So I assume this is what you meant:
> 
> ```
> I have it queued for 5.17-rc1.
> 
> There is still time to squeeze it into 5.16.
> 
> Alternatively we could merge it into 5.17-rc1 with a cc:stable, so it
> will trickle back with less risk to the 5.16 release.
> 
> What do people think?
> ```
> 
> I'll leave the individual risk evaluation of the patch to others. If the
> fix is risky, waiting for 5.17 is fine for me.
> 
> But hmmm, regarding the "could merge it into 5.17-rc1 with a cc:stable"
> idea a remark: is that really "less risk", as your stated?
> 
> If we get it into rc8 (which is still possible, even if a bit hard due
> to the new year festivities), it will get at least one week of testing.

My vote is for it to go into rc8: for me, 5.16-rc reclaim behaves too
oddly without it, so I've simply added it into whatever testing I do
ever since Mel posted - no regressions noticed with it in (aside from
needing the -fix.patch you already added a few weeks ago).

Hugh

> 
> If the fix waits for the next merge window, it all depends on how
> the timing works out. But it's easy to picture a worst case: the fix is
> only merged on the Friday evening before Linus releases 5.17-rc1 and
> right after it's out makes it into a stable-rc (say a day or two after
> 5.17-rc1 is out) and from there into a 5.16.y release on Thursday. That
> IMHO would mean fewer days of testing in the end (and there is a weekend
> in this period as well).
> 
> Waiting obviously will also mean that users of 5.16 and 5.16.y will
> likely have to face this regression for at least two and a half weeks,
> unless you send the fix early and Greg backports it before rc1 (which he
> afaics does if there are good reasons). Yes, it's `just` a performance
> regression, so it might not stop anyone from running Linux 5.16 -- but
> it's one that three people separately reported in the 5.16 devel cycle,
> so others will likely encounter it as well if we leave it unfixed in
> 5.16. This will likely annoy some people, especially if they invest time
> in bisecting it, only to find out that the fourth iteration of the fix
> for the regression has been available since December 2nd.
> 
> Ciao, Thorsten
Linus Torvalds Dec. 31, 2021, 7:14 p.m. UTC | #17
On Fri, Dec 31, 2021 at 6:24 AM Thorsten Leemhuis
<regressions@leemhuis.info> wrote:
>
> If we get it into rc8 (which is still possible, even if a bit hard due
> to the new year festivities), it will get at least one week of testing.

I took it with Hugh's ack from his reply to this, so it should be in rc8.

I'm not sure how much testing a week in rc8 will actually get, but
next week is likely to be at least a _bit_ more active than the
current week has been. And it's been tested by the people reporting it
(and apparently Hugh too).

                   Linus
Linus Torvalds Dec. 31, 2021, 7:21 p.m. UTC | #18
On Fri, Dec 31, 2021 at 11:14 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Dec 31, 2021 at 6:24 AM Thorsten Leemhuis
> <regressions@leemhuis.info> wrote:
> >
> > If we get it into rc8 (which is still possible, even if a bit hard due
> > to the new year festivities), it will get at least one week of testing.
>
> I took it with Hugh's ack from his reply to this, so it should be in rc8.

Pushed out as 1b4e3f26f9f7 ("mm: vmscan: Reduce throttling due to a
failure to make progress")

             Linus
Linus Torvalds Dec. 31, 2021, 7:22 p.m. UTC | #19
On Fri, Dec 31, 2021 at 11:21 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Pushed out as 1b4e3f26f9f7 ("mm: vmscan: Reduce throttling due to a
> failure to make progress")

.. and I _think_ this empties the regzbot queue for this release, Thorsten. No?

                   Linus
Andrew Morton Dec. 31, 2021, 9:04 p.m. UTC | #20
On Fri, 31 Dec 2021 11:21:14 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Dec 31, 2021 at 11:14 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Fri, Dec 31, 2021 at 6:24 AM Thorsten Leemhuis
> > <regressions@leemhuis.info> wrote:
> > >
> > > If we get it into rc8 (which is still possible, even if a bit hard due
> > > to the new year festivities), it will get at least one week of testing.
> >
> > I took it with Hugh's ack from his reply to this, so it should be in rc8.
> 
> Pushed out as 1b4e3f26f9f7 ("mm: vmscan: Reduce throttling due to a
> failure to make progress")

Needs this fixup, which I shall tweak a bit then send formally
in a few minutes.


From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm: vmscan: reduce throttling due to a failure to make progress -fix

Hugh Dickins reported the following

	My tmpfs swapping load (tweaked to use huge pages more heavily
	than in real life) is far from being a realistic load: but it was
	notably slowed down by your throttling mods in 5.16-rc, and this
	patch makes it well again - thanks.

	But: it very quickly hit NULL pointer until I changed that last
	line to

        if (first_pgdat)
                consider_reclaim_throttle(first_pgdat, sc);

The likely issue is that huge pages are a major component of the test
workload. When this is the case, first_pgdat may never get set if
compaction is ready to continue due to this check

        if (IS_ENABLED(CONFIG_COMPACTION) &&
            sc->order > PAGE_ALLOC_COSTLY_ORDER &&
            compaction_ready(zone, sc)) {
                sc->compaction_ready = true;
                continue;
        }

If this were true for every zone in the zonelist, first_pgdat would never
get set, resulting in a NULL pointer dereference.

This is a fix to the mmotm patch
mm-vmscan-reduce-throttling-due-to-a-failure-to-make-progress.patch

Link: https://lkml.kernel.org/r/20211209095453.GM3366@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/vmscan.c~mm-vmscan-reduce-throttling-due-to-a-failure-to-make-progress-fix
+++ a/mm/vmscan.c
@@ -3530,7 +3530,8 @@ static void shrink_zones(struct zonelist
 		shrink_node(zone->zone_pgdat, sc);
 	}
 
-	consider_reclaim_throttle(first_pgdat, sc);
+	if (first_pgdat)
+		consider_reclaim_throttle(first_pgdat, sc);
 
 	/*
 	 * Restore to original mask to avoid the impact on the caller if we
Linus Torvalds Dec. 31, 2021, 9:18 p.m. UTC | #21
On Fri, Dec 31, 2021 at 1:04 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Needs this fixup, which I shall tweak a bit then send formally
> in a few minutes.

Thanks, applied.

           Linus
Thorsten Leemhuis Jan. 1, 2022, 10:52 a.m. UTC | #22
On 31.12.21 20:22, Linus Torvalds wrote:
> On Fri, Dec 31, 2021 at 11:21 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Pushed out as 1b4e3f26f9f7 ("mm: vmscan: Reduce throttling due to a
>> failure to make progress")

Thx.

> .. and I _think_ this empties the regzbot queue for this release, Thorsten. No?

Well, it was the regression that bothered me most. But there are still a
few out there that got introduced this cycle.

There is a regression in RDMA/mlx5 introduced in v5.16-rc5:
https://lore.kernel.org/lkml/f298db4ec5fdf7a2d1d166ca2f66020fd9397e5c.1640079962.git.leonro@nvidia.com/
https://lore.kernel.org/all/EEBA2D1C-F29C-4237-901C-587B60CEE113@oracle.com/
A fix is available, but got stuck afaics:
https://lore.kernel.org/lkml/f298db4ec5fdf7a2d1d166ca2f66020fd9397e5c.1640079962.git.leonro@nvidia.com/
And I only noticed just now: a revert was also discussed, but not performed:
https://lore.kernel.org/all/20211222101312.1358616-1-maorg@nvidia.com/
Will let Greg know, seems the commit got backported to 5.15.

s0ix suspend broke for some AMD machines; Alex and Mario are busy
looking into it:
https://gitlab.freedesktop.org/drm/amd/-/issues/1821
https://bugzilla.kernel.org/show_bug.cgi?id=215436

Alex is also dealing with another issue where the screen contents now
get restored after some input events:
https://bugzilla.kernel.org/show_bug.cgi?id=215203

There still seems to be a performance regression that Josef and Valentin
have been trying hard to pin down, without much success, for weeks now:
https://lore.kernel.org/lkml/87tuf07hdk.mognet@arm.com/

And there is one more report, but it might be a follow-up error due to
another regression:
https://lore.kernel.org/linux-pm/52933493.dBzk7ret6Y@geek500/
Shuang Zhai Feb. 14, 2022, 9:10 p.m. UTC | #23
Hi Mel,

Mel Gorman wrote:
>
> Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
> problems due to reclaim throttling for excessive lengths of time.
> In Alexey's case, a memory hog that should go OOM quickly stalls for
> several minutes before the OOM kill. In Mike and Darrick's cases, a small
> memcg environment stalled excessively even though the system had enough
> memory overall.
>

I recently found a regression when I tested MGLRU with fio on Linux
5.16-rc6 [1]. After this patch was applied, I re-ran the test with Linux
5.16, but the regression has not been fixed yet. 

The workload is to let fio perform random access on files with buffered
IO. The total file size is 2x the memory size. Files are stored on pmem.
For each configuration, I ran fio 10 times and reported the average and
the standard deviation.

Fio command
===========

$ numactl --cpubind=0 --membind=0 fio --name=randread \
  --directory=/mnt/pmem/ --size={10G, 5G} --io_size=1000TB \
  --time_based --numjobs={40, 80} --ioengine=io_uring \
  --ramp_time=20m --runtime=10m --iodepth=128 \
  --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
  --rw=randread --random_distribution=random \
  --direct=0 --norandommap --group_reporting

Results in throughput (MB/s):
=============================

+------------+------+-------+------+-------+----------+-------+
| Jobs / CPU | 5.15 | stdev | 5.16 | stdev | 5.17-rc3 | stdev |
+------------+------+-------+------+-------+----------+-------+
| 1          | 8411 | 75    | 7459 | 38    | 7331     | 36    |
+------------+------+-------+------+-------+----------+-------+
| 2          | 8417 | 54    | 7491 | 41    | 7383     | 15    |
+------------+------+-------+------+-------+----------+-------+

[1] https://lore.kernel.org/linux-mm/20220105024423.26409-1-szhai2@cs.rochester.edu/

Thanks!

Shuang
Mel Gorman Feb. 15, 2022, 2:49 p.m. UTC | #24
On Mon, Feb 14, 2022 at 04:10:50PM -0500, Shuang Zhai wrote:
> Hi Mel,
> 
> Mel Gorman wrote:
> >
> > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
> > problems due to reclaim throttling for excessive lengths of time.
> > In Alexey's case, a memory hog that should go OOM quickly stalls for
> > several minutes before the OOM kill. In Mike and Darrick's cases, a small
> > memcg environment stalled excessively even though the system had enough
> > memory overall.
> >
> 
> I recently found a regression when I tested MGLRU with fio on Linux
> 5.16-rc6 [1]. After this patch was applied, I re-ran the test with Linux
> 5.16, but the regression has not been fixed yet. 
> 

Am I correct in thinking that this only happens with MGLRU?
Shuang Zhai Feb. 22, 2022, 5:27 p.m. UTC | #25
Mel Gorman wrote:
> On Mon, Feb 14, 2022 at 04:10:50PM -0500, Shuang Zhai wrote:
> > Hi Mel,
> > 
> > Mel Gorman wrote:
> > >
> > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
> > > problems due to reclaim throttling for excessive lengths of time.
> > > In Alexey's case, a memory hog that should go OOM quickly stalls for
> > > several minutes before the OOM kill. In Mike and Darrick's cases, a small
> > > memcg environment stalled excessively even though the system had enough
> > > memory overall.
> > >
> > 
> > I recently found a regression when I tested MGLRU with fio on Linux
> > 5.16-rc6 [1]. After this patch was applied, I re-ran the test with Linux
> > 5.16, but the regression has not been fixed yet. 
> > 
> 
> Am I correct in thinking that this only happens with MGLRU?

Sorry about the confusion; let me clarify: the regression happens
on upstream Linux with the default page replacement mechanism.
Mel Gorman Feb. 23, 2022, 12:50 p.m. UTC | #26
On Tue, Feb 22, 2022 at 12:27:31PM -0500, Shuang Zhai wrote:
> Mel Gorman wrote:
> > On Mon, Feb 14, 2022 at 04:10:50PM -0500, Shuang Zhai wrote:
> > > Hi Mel,
> > > 
> > > Mel Gorman wrote:
> > > >
> > > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
> > > > problems due to reclaim throttling for excessive lengths of time.
> > > > In Alexey's case, a memory hog that should go OOM quickly stalls for
> > > > several minutes before the OOM kill. In Mike and Darrick's cases, a small
> > > > memcg environment stalled excessively even though the system had enough
> > > > memory overall.
> > > >
> > > 
> > > I recently found a regression when I tested MGLRU with fio on Linux
> > > 5.16-rc6 [1]. After this patch was applied, I re-ran the test with Linux
> > > 5.16, but the regression has not been fixed yet. 
> > > 
> > 
> > Am I correct in thinking that this only happens with MGLRU?
> 
> Sorry about the confusion; let me clarify: the regression happens
> on upstream Linux with the default page replacement mechanism.

Ok, the fio command simply exits with an error for me and, even if it
didn't, the test machine I have with persistent memory does not have enough
pmem to trigger memory reclaim issues with fio. Can you do the following please?

# echo 1 > vmscan/mm_vmscan_throttled/enable
# cat /sys/kernel/debug/tracing/trace_pipe > trace.out

and run the test? Compress trace.out with xz and send it to me by mail.
If the trace is too large, send as much as you can.
diff mbox series

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 58e744b78c2c..936dc0b6c226 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -277,6 +277,7 @@  enum vmscan_throttle_state {
 	VMSCAN_THROTTLE_WRITEBACK,
 	VMSCAN_THROTTLE_ISOLATED,
 	VMSCAN_THROTTLE_NOPROGRESS,
+	VMSCAN_THROTTLE_CONGESTED,
 	NR_VMSCAN_THROTTLE,
 };
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f25a6149d3ba..ca2e9009a651 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -30,12 +30,14 @@ 
 #define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
 #define _VMSCAN_THROTTLE_ISOLATED	(1 << VMSCAN_THROTTLE_ISOLATED)
 #define _VMSCAN_THROTTLE_NOPROGRESS	(1 << VMSCAN_THROTTLE_NOPROGRESS)
+#define _VMSCAN_THROTTLE_CONGESTED	(1 << VMSCAN_THROTTLE_CONGESTED)
 
 #define show_throttle_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",					\
 		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"},	\
 		{_VMSCAN_THROTTLE_ISOLATED,	"VMSCAN_THROTTLE_ISOLATED"},	\
-		{_VMSCAN_THROTTLE_NOPROGRESS,	"VMSCAN_THROTTLE_NOPROGRESS"}	\
+		{_VMSCAN_THROTTLE_NOPROGRESS,	"VMSCAN_THROTTLE_NOPROGRESS"},	\
+		{_VMSCAN_THROTTLE_CONGESTED,	"VMSCAN_THROTTLE_CONGESTED"}	\
 		) : "VMSCAN_THROTTLE_NONE"
 
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..4c4d5f6cd8a3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1021,6 +1021,39 @@  static void handle_write_error(struct address_space *mapping,
 	unlock_page(page);
 }
 
+static bool skip_throttle_noprogress(pg_data_t *pgdat)
+{
+	int reclaimable = 0, write_pending = 0;
+	int i;
+
+	/*
+	 * If kswapd is disabled, reschedule if necessary but do not
+	 * throttle as the system is likely near OOM.
+	 */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return true;
+
+	/*
+	 * If there are a lot of dirty/writeback pages then do not
+	 * throttle as throttling will occur when the pages cycle
+	 * towards the end of the LRU if still under writeback.
+	 */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		reclaimable += zone_reclaimable_pages(zone);
+		write_pending += zone_page_state_snapshot(zone,
+						  NR_ZONE_WRITE_PENDING);
+	}
+	if (2 * write_pending <= reclaimable)
+		return true;
+
+	return false;
+}
+
 void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 {
 	wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
@@ -1056,8 +1089,16 @@  void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 		}
 
 		break;
+	case VMSCAN_THROTTLE_CONGESTED:
+		fallthrough;
 	case VMSCAN_THROTTLE_NOPROGRESS:
-		timeout = HZ/2;
+		if (skip_throttle_noprogress(pgdat)) {
+			cond_resched();
+			return;
+		}
+
+		timeout = 1;
+
 		break;
 	case VMSCAN_THROTTLE_ISOLATED:
 		timeout = HZ/50;
@@ -3321,7 +3362,7 @@  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
 	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
-		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 				    sc))
@@ -3386,16 +3427,16 @@  static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
 	}
 
 	/*
-	 * Do not throttle kswapd on NOPROGRESS as it will throttle on
-	 * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under
-	 * writeback and marked for immediate reclaim at the tail of
-	 * the LRU.
+	 * Do not throttle kswapd or cgroup reclaim on NOPROGRESS as it will
+	 * throttle on VMSCAN_THROTTLE_WRITEBACK if there are too many pages
+	 * under writeback and marked for immediate reclaim at the tail of the
+	 * LRU.
 	 */
-	if (current_is_kswapd())
+	if (current_is_kswapd() || cgroup_reclaim(sc))
 		return;
 
 	/* Throttle if making no progress at high prioities. */
-	if (sc->priority < DEF_PRIORITY - 2)
+	if (sc->priority == 1 && !sc->nr_reclaimed)
 		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS);
 }
 
@@ -3415,6 +3456,7 @@  static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
 	pg_data_t *last_pgdat = NULL;
+	pg_data_t *first_pgdat = NULL;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -3478,14 +3520,18 @@  static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			/* need some check for avoid more shrink_zone() */
 		}
 
+		if (!first_pgdat)
+			first_pgdat = zone->zone_pgdat;
+
 		/* See comment about same check for global reclaim above */
 		if (zone->zone_pgdat == last_pgdat)
 			continue;
 		last_pgdat = zone->zone_pgdat;
 		shrink_node(zone->zone_pgdat, sc);
-		consider_reclaim_throttle(zone->zone_pgdat, sc);
 	}
 
+	consider_reclaim_throttle(first_pgdat, sc);
+
 	/*
 	 * Restore to original mask to avoid the impact on the caller if we
 	 * promoted it to __GFP_HIGHMEM.