mm: vmscan: fix extreme overreclaim and swap floods

Message ID	20220802162811.39216-1-hannes@cmpxchg.org (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Johannes Weiner <hannes@cmpxchg.org> To: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net>, Hugh Dickins <hughd@google.com>, Joonsoo Kim <iamjoonsoo.kim@lge.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: [PATCH] mm: vmscan: fix extreme overreclaim and swap floods Date: Tue, 2 Aug 2022 12:28:11 -0400 Message-Id: <20220802162811.39216-1-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: vmscan: fix extreme overreclaim and swap floods \| expand mm: vmscan: fix extreme overreclaim and swap floods

Johannes Weiner Aug. 2, 2022, 4:28 p.m. UTC

During proactive reclaim, we sometimes observe severe overreclaim,
with several thousand times more pages reclaimed than requested.

This trace was obtained from shrink_lruvec() during such an instance:

    prio:0 anon_cost:1141521 file_cost:7767
    nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
    nr=[7161123 345 578 1111]

While he reclaimer requested 4M, vmscan reclaimed close to 16G, most
of it by swapping. These requests take over a minute, during which the
write() to memory.reclaim is unkillably stuck inside the kernel.

Digging into the source, this is caused by the proportional reclaim
bailout logic. This code tries to resolve a fundamental conflict: to
reclaim roughly what was requested, while also aging all LRUs fairly
and in accordance to their size, swappiness, refault rates etc. The
way it attempts fairness is that once the reclaim goal has been
reached, it stops scanning the LRUs with the smaller remaining scan
targets, and adjusts the remainder of the bigger LRUs according to how
much of the smaller LRUs was scanned. It then finishes scanning that
remainder regardless of the reclaim goal.

This works fine if priority levels are low and the LRU lists are
comparable in size. However, in this instance, the cgroup that is
targeted by proactive reclaim has almost no files left - they've
already been squeezed out by proactive reclaim earlier - and the
remaining anon pages are hot. Anon rotations cause the priority level
to drop to 0, which results in reclaim targeting all of anon (a lot)
and all of file (almost nothing). By the time reclaim decides to bail,
it has scanned most or all of the file target, and therefor must also
scan most or all of the enormous anon target. This target is thousands
of times larger than the reclaim goal, thus causing the overreclaim.

The bailout code hasn't changed in years, why is this failing now?
The most likely explanations are two other recent changes in anon
reclaim:

1. Before the series starting with commit 5df741963d52 ("mm: fix LRU
   balancing effect of new transparent huge pages"), the VM was
   overall relatively reluctant to swap at all, even if swap was
   configured. This means the LRU balancing code didn't come into play
   as often as it does now, and mostly in high pressure situations
   where pronounced swap activity wouldn't be as surprising.

2. For historic reasons, shrink_lruvec() loops on the scan targets of
   all LRU lists except the active anon one, meaning it would bail if
   the only remaining pages to scan were active anon - even if there
   were a lot of them.

   Before the series starting with commit ccc5dc67340c ("mm/vmscan:
   make active/inactive ratio as 1:1 for anon lru"), most anon pages
   would live on the active LRU; the inactive one would contain only a
   handful of preselected reclaim candidates. After the series, anon
   gets aged similarly to file, and the inactive list is the default
   for new anon pages as well, making it often the much bigger list.

   As a result, the VM is now more likely to actually finish large
   anon targets than before.

Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward
the larger LRU lists is made before bailing out on a met reclaim goal.

This fixes the extreme overreclaim problem.

Fairness is more subtle and harder to evaluate. No obvious misbehavior
was observed on the test workload, in any case. Conceptually, fairness
should primarily be a cumulative effect from regular, lower priority
scans. Once the VM is in trouble and needs to escalate scan targets to
make forward progress, fairness needs to take a backseat. This is also
acknowledged by the myriad exceptions in get_scan_count(). This patch
makes fairness decrease gradually, as it keeps fairness work static
over increasing priority levels with growing scan targets. This should
make more sense - although we may have to re-visit the exact values.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

Andrew Morton Aug. 3, 2022, 12:06 a.m. UTC | #1

On Tue,  2 Aug 2022 12:28:11 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward
> the larger LRU lists is made before bailing out on a met reclaim goal.

It seems rash to jam this into 5.20-rc1 at this stage.  I'm thinking
5.21-rc1 with a cc:stable?

Johannes Weiner Aug. 3, 2022, 1:28 p.m. UTC | #2

On Tue, Aug 02, 2022 at 05:06:19PM -0700, Andrew Morton wrote:
> On Tue,  2 Aug 2022 12:28:11 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward
> > the larger LRU lists is made before bailing out on a met reclaim goal.
> 
> It seems rash to jam this into 5.20-rc1 at this stage.  I'm thinking
> 5.21-rc1 with a cc:stable?

Yeah, 5.20-rc1 sounds fast. Let's wait for reviews first and see how
much confidence we get on that change. I can't help but feel, reading
logs and comments (commit 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813),
that my fix is how the code was intended to work from the start.

5.21 does sound a biiiit on the long side for fixing such extreme
misbehavior, IMO, once it's proven to affect production workloads in
the wild. I was hoping -rc2 or so...

Rik van Riel Aug. 3, 2022, 5:27 p.m. UTC | #3

On Tue, 2022-08-02 at 12:28 -0400, Johannes Weiner wrote:
> 
> Change the code such that only one SWAP_CLUSTER_MAX-sized nudge
> toward
> the larger LRU lists is made before bailing out on a met reclaim
> goal.
> 
> This fixes the extreme overreclaim problem.

Reviewed-by: Rik van Riel <riel@surriel.com>

Mel Gorman Aug. 4, 2022, 9:38 a.m. UTC | #4

On Tue, Aug 02, 2022 at 12:28:11PM -0400, Johannes Weiner wrote:
> During proactive reclaim, we sometimes observe severe overreclaim,
> with several thousand times more pages reclaimed than requested.
> 
> This trace was obtained from shrink_lruvec() during such an instance:
> 
>     prio:0 anon_cost:1141521 file_cost:7767
>     nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
>     nr=[7161123 345 578 1111]
> 
> While he reclaimer requested 4M, vmscan reclaimed close to 16G, most
> of it by swapping. These requests take over a minute, during which the
> write() to memory.reclaim is unkillably stuck inside the kernel.
> 
> Digging into the source, this is caused by the proportional reclaim
> bailout logic. This code tries to resolve a fundamental conflict: to
> reclaim roughly what was requested, while also aging all LRUs fairly
> and in accordance to their size, swappiness, refault rates etc. The
> way it attempts fairness is that once the reclaim goal has been
> reached, it stops scanning the LRUs with the smaller remaining scan
> targets, and adjusts the remainder of the bigger LRUs according to how
> much of the smaller LRUs was scanned. It then finishes scanning that
> remainder regardless of the reclaim goal.
> 
> This works fine if priority levels are low and the LRU lists are
> comparable in size. However, in this instance, the cgroup that is
> targeted by proactive reclaim has almost no files left - they've
> already been squeezed out by proactive reclaim earlier - and the
> remaining anon pages are hot. Anon rotations cause the priority level
> to drop to 0, which results in reclaim targeting all of anon (a lot)
> and all of file (almost nothing). By the time reclaim decides to bail,
> it has scanned most or all of the file target, and therefor must also
> scan most or all of the enormous anon target. This target is thousands
> of times larger than the reclaim goal, thus causing the overreclaim.
> 
> The bailout code hasn't changed in years, why is this failing now?
> The most likely explanations are two other recent changes in anon
> reclaim:
> 
> 1. Before the series starting with commit 5df741963d52 ("mm: fix LRU
>    balancing effect of new transparent huge pages"), the VM was
>    overall relatively reluctant to swap at all, even if swap was
>    configured. This means the LRU balancing code didn't come into play
>    as often as it does now, and mostly in high pressure situations
>    where pronounced swap activity wouldn't be as surprising.
> 
> 2. For historic reasons, shrink_lruvec() loops on the scan targets of
>    all LRU lists except the active anon one, meaning it would bail if
>    the only remaining pages to scan were active anon - even if there
>    were a lot of them.
> 
>    Before the series starting with commit ccc5dc67340c ("mm/vmscan:
>    make active/inactive ratio as 1:1 for anon lru"), most anon pages
>    would live on the active LRU; the inactive one would contain only a
>    handful of preselected reclaim candidates. After the series, anon
>    gets aged similarly to file, and the inactive list is the default
>    for new anon pages as well, making it often the much bigger list.
> 
>    As a result, the VM is now more likely to actually finish large
>    anon targets than before.
> 
> Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward
> the larger LRU lists is made before bailing out on a met reclaim goal.
> 
> This fixes the extreme overreclaim problem.
> 
> Fairness is more subtle and harder to evaluate. No obvious misbehavior
> was observed on the test workload, in any case. Conceptually, fairness
> should primarily be a cumulative effect from regular, lower priority
> scans. Once the VM is in trouble and needs to escalate scan targets to
> make forward progress, fairness needs to take a backseat. This is also
> acknowledged by the myriad exceptions in get_scan_count(). This patch
> makes fairness decrease gradually, as it keeps fairness work static
> over increasing priority levels with growing scan targets. This should
> make more sense - although we may have to re-visit the exact values.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Yosry Ahmed Aug. 8, 2022, 1:54 p.m. UTC | #5

On Tue, Aug 2, 2022 at 9:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> During proactive reclaim, we sometimes observe severe overreclaim,
> with several thousand times more pages reclaimed than requested.
>
> This trace was obtained from shrink_lruvec() during such an instance:
>
>     prio:0 anon_cost:1141521 file_cost:7767
>     nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
>     nr=[7161123 345 578 1111]
>
> While he reclaimer requested 4M, vmscan reclaimed close to 16G, most
> of it by swapping. These requests take over a minute, during which the
> write() to memory.reclaim is unkillably stuck inside the kernel.
>
> Digging into the source, this is caused by the proportional reclaim
> bailout logic. This code tries to resolve a fundamental conflict: to
> reclaim roughly what was requested, while also aging all LRUs fairly
> and in accordance to their size, swappiness, refault rates etc. The
> way it attempts fairness is that once the reclaim goal has been
> reached, it stops scanning the LRUs with the smaller remaining scan
> targets, and adjusts the remainder of the bigger LRUs according to how
> much of the smaller LRUs was scanned. It then finishes scanning that
> remainder regardless of the reclaim goal.
>
> This works fine if priority levels are low and the LRU lists are
> comparable in size. However, in this instance, the cgroup that is
> targeted by proactive reclaim has almost no files left - they've
> already been squeezed out by proactive reclaim earlier - and the
> remaining anon pages are hot. Anon rotations cause the priority level
> to drop to 0, which results in reclaim targeting all of anon (a lot)
> and all of file (almost nothing). By the time reclaim decides to bail,
> it has scanned most or all of the file target, and therefor must also
> scan most or all of the enormous anon target. This target is thousands
> of times larger than the reclaim goal, thus causing the overreclaim.
>
> The bailout code hasn't changed in years, why is this failing now?
> The most likely explanations are two other recent changes in anon
> reclaim:
>
> 1. Before the series starting with commit 5df741963d52 ("mm: fix LRU
>    balancing effect of new transparent huge pages"), the VM was
>    overall relatively reluctant to swap at all, even if swap was
>    configured. This means the LRU balancing code didn't come into play
>    as often as it does now, and mostly in high pressure situations
>    where pronounced swap activity wouldn't be as surprising.
>
> 2. For historic reasons, shrink_lruvec() loops on the scan targets of
>    all LRU lists except the active anon one, meaning it would bail if
>    the only remaining pages to scan were active anon - even if there
>    were a lot of them.
>
>    Before the series starting with commit ccc5dc67340c ("mm/vmscan:
>    make active/inactive ratio as 1:1 for anon lru"), most anon pages
>    would live on the active LRU; the inactive one would contain only a
>    handful of preselected reclaim candidates. After the series, anon
>    gets aged similarly to file, and the inactive list is the default
>    for new anon pages as well, making it often the much bigger list.
>
>    As a result, the VM is now more likely to actually finish large
>    anon targets than before.
>
> Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward
> the larger LRU lists is made before bailing out on a met reclaim goal.
>
> This fixes the extreme overreclaim problem.
>
> Fairness is more subtle and harder to evaluate. No obvious misbehavior
> was observed on the test workload, in any case. Conceptually, fairness
> should primarily be a cumulative effect from regular, lower priority
> scans. Once the VM is in trouble and needs to escalate scan targets to
> make forward progress, fairness needs to take a backseat. This is also
> acknowledged by the myriad exceptions in get_scan_count(). This patch
> makes fairness decrease gradually, as it keeps fairness work static
> over increasing priority levels with growing scan targets. This should
> make more sense - although we may have to re-visit the exact values.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f7d9a683e3a7..1cc0c6666787 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2897,8 +2897,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>         enum lru_list lru;
>         unsigned long nr_reclaimed = 0;
>         unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> +       bool proportional_reclaim;
>         struct blk_plug plug;
> -       bool scan_adjusted;
>
>         get_scan_count(lruvec, sc, nr);
>
> @@ -2916,8 +2916,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>          * abort proportional reclaim if either the file or anon lru has already
>          * dropped to zero at the first pass.
>          */
> -       scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
> -                        sc->priority == DEF_PRIORITY);
> +       proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
> +                               sc->priority == DEF_PRIORITY);
>
>         blk_start_plug(&plug);
>         while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> @@ -2937,7 +2937,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>
>                 cond_resched();
>
> -               if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
> +               if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
>                         continue;
>
>                 /*
> @@ -2988,8 +2988,6 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>                 nr_scanned = targets[lru] - nr[lru];
>                 nr[lru] = targets[lru] * (100 - percentage) / 100;
>                 nr[lru] -= min(nr[lru], nr_scanned);
> -
> -               scan_adjusted = true;

Thanks for the great analysis of the problem!

I have a question here. This fixes the overreclaim problem for
proactive reclaim (and most other scenarios), but what about the case
where proportional_reclaim (aka scan_adjusted) is set before we start
shrinking lrus: global direct reclaim on DEF_PRIORITY? If we hit a
memcg that has very few file pages and a ton of anon pages in this
scenario (or vice versa), wouldn't we still overreclaim and possibly
stall unnecessarily? or am I missing something here?

>         }
>         blk_finish_plug(&plug);
>         sc->nr_reclaimed += nr_reclaimed;
> --
> 2.37.1
>
>

Yosry Ahmed Aug. 8, 2022, 2:07 p.m. UTC | #6

On Mon, Aug 8, 2022 at 6:54 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Aug 2, 2022 at 9:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > During proactive reclaim, we sometimes observe severe overreclaim,
> > with several thousand times more pages reclaimed than requested.
> >
> > This trace was obtained from shrink_lruvec() during such an instance:
> >
> >     prio:0 anon_cost:1141521 file_cost:7767
> >     nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
> >     nr=[7161123 345 578 1111]
> >
> > While he reclaimer requested 4M, vmscan reclaimed close to 16G, most
> > of it by swapping. These requests take over a minute, during which the
> > write() to memory.reclaim is unkillably stuck inside the kernel.
> >
> > Digging into the source, this is caused by the proportional reclaim
> > bailout logic. This code tries to resolve a fundamental conflict: to
> > reclaim roughly what was requested, while also aging all LRUs fairly
> > and in accordance to their size, swappiness, refault rates etc. The
> > way it attempts fairness is that once the reclaim goal has been
> > reached, it stops scanning the LRUs with the smaller remaining scan
> > targets, and adjusts the remainder of the bigger LRUs according to how
> > much of the smaller LRUs was scanned. It then finishes scanning that
> > remainder regardless of the reclaim goal.
> >
> > This works fine if priority levels are low and the LRU lists are
> > comparable in size. However, in this instance, the cgroup that is
> > targeted by proactive reclaim has almost no files left - they've
> > already been squeezed out by proactive reclaim earlier - and the
> > remaining anon pages are hot. Anon rotations cause the priority level
> > to drop to 0, which results in reclaim targeting all of anon (a lot)
> > and all of file (almost nothing). By the time reclaim decides to bail,
> > it has scanned most or all of the file target, and therefor must also
> > scan most or all of the enormous anon target. This target is thousands
> > of times larger than the reclaim goal, thus causing the overreclaim.
> >
> > The bailout code hasn't changed in years, why is this failing now?
> > The most likely explanations are two other recent changes in anon
> > reclaim:
> >
> > 1. Before the series starting with commit 5df741963d52 ("mm: fix LRU
> >    balancing effect of new transparent huge pages"), the VM was
> >    overall relatively reluctant to swap at all, even if swap was
> >    configured. This means the LRU balancing code didn't come into play
> >    as often as it does now, and mostly in high pressure situations
> >    where pronounced swap activity wouldn't be as surprising.
> >
> > 2. For historic reasons, shrink_lruvec() loops on the scan targets of
> >    all LRU lists except the active anon one, meaning it would bail if
> >    the only remaining pages to scan were active anon - even if there
> >    were a lot of them.
> >
> >    Before the series starting with commit ccc5dc67340c ("mm/vmscan:
> >    make active/inactive ratio as 1:1 for anon lru"), most anon pages
> >    would live on the active LRU; the inactive one would contain only a
> >    handful of preselected reclaim candidates. After the series, anon
> >    gets aged similarly to file, and the inactive list is the default
> >    for new anon pages as well, making it often the much bigger list.
> >
> >    As a result, the VM is now more likely to actually finish large
> >    anon targets than before.
> >
> > Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward
> > the larger LRU lists is made before bailing out on a met reclaim goal.
> >
> > This fixes the extreme overreclaim problem.
> >
> > Fairness is more subtle and harder to evaluate. No obvious misbehavior
> > was observed on the test workload, in any case. Conceptually, fairness
> > should primarily be a cumulative effect from regular, lower priority
> > scans. Once the VM is in trouble and needs to escalate scan targets to
> > make forward progress, fairness needs to take a backseat. This is also
> > acknowledged by the myriad exceptions in get_scan_count(). This patch
> > makes fairness decrease gradually, as it keeps fairness work static
> > over increasing priority levels with growing scan targets. This should
> > make more sense - although we may have to re-visit the exact values.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/vmscan.c | 10 ++++------
> >  1 file changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f7d9a683e3a7..1cc0c6666787 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2897,8 +2897,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >         enum lru_list lru;
> >         unsigned long nr_reclaimed = 0;
> >         unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> > +       bool proportional_reclaim;
> >         struct blk_plug plug;
> > -       bool scan_adjusted;
> >
> >         get_scan_count(lruvec, sc, nr);
> >
> > @@ -2916,8 +2916,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >          * abort proportional reclaim if either the file or anon lru has already
> >          * dropped to zero at the first pass.
> >          */
> > -       scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
> > -                        sc->priority == DEF_PRIORITY);
> > +       proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
> > +                               sc->priority == DEF_PRIORITY);
> >
> >         blk_start_plug(&plug);
> >         while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> > @@ -2937,7 +2937,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >
> >                 cond_resched();
> >
> > -               if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
> > +               if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
> >                         continue;
> >
> >                 /*
> > @@ -2988,8 +2988,6 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >                 nr_scanned = targets[lru] - nr[lru];
> >                 nr[lru] = targets[lru] * (100 - percentage) / 100;
> >                 nr[lru] -= min(nr[lru], nr_scanned);
> > -
> > -               scan_adjusted = true;
>
> Thanks for the great analysis of the problem!
>
> I have a question here. This fixes the overreclaim problem for
> proactive reclaim (and most other scenarios), but what about the case
> where proportional_reclaim (aka scan_adjusted) is set before we start
> shrinking lrus: global direct reclaim on DEF_PRIORITY? If we hit a
> memcg that has very few file pages and a ton of anon pages in this
> scenario (or vice versa), wouldn't we still overreclaim and possibly
> stall unnecessarily? or am I missing something here?

Never mind :) In this scenario we will keep iterating the LRUs anyway,
we don't attempt to make the scanning proportional. I guess the new
name (proportional_reclaim) confused me :)

>
> >         }
> >         blk_finish_plug(&plug);
> >         sc->nr_reclaimed += nr_reclaimed;
> > --
> > 2.37.1
> >
> >

Joonsoo Kim Aug. 12, 2022, 1:59 a.m. UTC | #7

2022년 8월 3일 (수) 오전 1:28, Johannes Weiner <hannes@cmpxchg.org>님이 작성:
>
> During proactive reclaim, we sometimes observe severe overreclaim,
> with several thousand times more pages reclaimed than requested.
>
> This trace was obtained from shrink_lruvec() during such an instance:
>
>     prio:0 anon_cost:1141521 file_cost:7767
>     nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
>     nr=[7161123 345 578 1111]
>
> While he reclaimer requested 4M, vmscan reclaimed close to 16G, most
> of it by swapping. These requests take over a minute, during which the
> write() to memory.reclaim is unkillably stuck inside the kernel.
>
> Digging into the source, this is caused by the proportional reclaim
> bailout logic. This code tries to resolve a fundamental conflict: to
> reclaim roughly what was requested, while also aging all LRUs fairly
> and in accordance to their size, swappiness, refault rates etc. The
> way it attempts fairness is that once the reclaim goal has been
> reached, it stops scanning the LRUs with the smaller remaining scan
> targets, and adjusts the remainder of the bigger LRUs according to how
> much of the smaller LRUs was scanned. It then finishes scanning that
> remainder regardless of the reclaim goal.
>
> This works fine if priority levels are low and the LRU lists are
> comparable in size. However, in this instance, the cgroup that is
> targeted by proactive reclaim has almost no files left - they've
> already been squeezed out by proactive reclaim earlier - and the
> remaining anon pages are hot. Anon rotations cause the priority level
> to drop to 0, which results in reclaim targeting all of anon (a lot)
> and all of file (almost nothing). By the time reclaim decides to bail,
> it has scanned most or all of the file target, and therefor must also
> scan most or all of the enormous anon target. This target is thousands
> of times larger than the reclaim goal, thus causing the overreclaim.
>
> The bailout code hasn't changed in years, why is this failing now?
> The most likely explanations are two other recent changes in anon
> reclaim:
>
> 1. Before the series starting with commit 5df741963d52 ("mm: fix LRU
>    balancing effect of new transparent huge pages"), the VM was
>    overall relatively reluctant to swap at all, even if swap was
>    configured. This means the LRU balancing code didn't come into play
>    as often as it does now, and mostly in high pressure situations
>    where pronounced swap activity wouldn't be as surprising.
>
> 2. For historic reasons, shrink_lruvec() loops on the scan targets of
>    all LRU lists except the active anon one, meaning it would bail if
>    the only remaining pages to scan were active anon - even if there
>    were a lot of them.
>
>    Before the series starting with commit ccc5dc67340c ("mm/vmscan:
>    make active/inactive ratio as 1:1 for anon lru"), most anon pages
>    would live on the active LRU; the inactive one would contain only a
>    handful of preselected reclaim candidates. After the series, anon
>    gets aged similarly to file, and the inactive list is the default
>    for new anon pages as well, making it often the much bigger list.
>
>    As a result, the VM is now more likely to actually finish large
>    anon targets than before.
>
> Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward
> the larger LRU lists is made before bailing out on a met reclaim goal.
>
> This fixes the extreme overreclaim problem.

I think that we can fix the issue without breaking the fairness.
Key idea is that doing scan based on the lru having max scan count.
(aka max-lru)
As scan is doing on max-lru, do scan the proportional number of
pages on other lru.

Pseudo code is here.

1. find the lru having max scan count
2. calculate nr_to_scan_max for max-lru
3. prop = (scanned[max-lru] + nr_to_scan_max) / targets[max-lru]
3. for_each_lru()
3-1. nr_to_scan = (targets[lru] * prop) - scanned[lru]
3-2. shrink_list(nr_to_scan)

With this approach, we can minimize reclaim without breaking the
fairness.

Note that actual code needs to handle some corner cases, one of it is
a low-nr_to_scan case to improve performance.

Thanks.

Hongchen Zhang Sept. 20, 2022, 6:12 a.m. UTC | #8

Hi Johannes,

On 2022/8/3 am 12:28, Johannes Weiner wrote:
> During proactive reclaim, we sometimes observe severe overreclaim,
> with several thousand times more pages reclaimed than requested.
> 
> This trace was obtained from shrink_lruvec() during such an instance:
> 
>      prio:0 anon_cost:1141521 file_cost:7767
>      nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
>      nr=[7161123 345 578 1111]
> 
> While he reclaimer requested 4M, vmscan reclaimed close to 16G, most
> of it by swapping. These requests take over a minute, during which the
> write() to memory.reclaim is unkillably stuck inside the kernel.
> 
> Digging into the source, this is caused by the proportional reclaim
> bailout logic. This code tries to resolve a fundamental conflict: to
> reclaim roughly what was requested, while also aging all LRUs fairly
> and in accordance to their size, swappiness, refault rates etc. The
> way it attempts fairness is that once the reclaim goal has been
> reached, it stops scanning the LRUs with the smaller remaining scan
> targets, and adjusts the remainder of the bigger LRUs according to how
> much of the smaller LRUs was scanned. It then finishes scanning that
> remainder regardless of the reclaim goal.
> 
> This works fine if priority levels are low and the LRU lists are
> comparable in size. However, in this instance, the cgroup that is
> targeted by proactive reclaim has almost no files left - they've
> already been squeezed out by proactive reclaim earlier - and the
> remaining anon pages are hot. Anon rotations cause the priority level
> to drop to 0, which results in reclaim targeting all of anon (a lot)
> and all of file (almost nothing). By the time reclaim decides to bail,
> it has scanned most or all of the file target, and therefor must also
> scan most or all of the enormous anon target. This target is thousands
> of times larger than the reclaim goal, thus causing the overreclaim.
> 
> The bailout code hasn't changed in years, why is this failing now?
> The most likely explanations are two other recent changes in anon
> reclaim:
> 
> 1. Before the series starting with commit 5df741963d52 ("mm: fix LRU
>     balancing effect of new transparent huge pages"), the VM was
>     overall relatively reluctant to swap at all, even if swap was
>     configured. This means the LRU balancing code didn't come into play
>     as often as it does now, and mostly in high pressure situations
>     where pronounced swap activity wouldn't be as surprising.
> 
> 2. For historic reasons, shrink_lruvec() loops on the scan targets of
>     all LRU lists except the active anon one, meaning it would bail if
>     the only remaining pages to scan were active anon - even if there
>     were a lot of them.
> 
>     Before the series starting with commit ccc5dc67340c ("mm/vmscan:
>     make active/inactive ratio as 1:1 for anon lru"), most anon pages
>     would live on the active LRU; the inactive one would contain only a
>     handful of preselected reclaim candidates. After the series, anon
>     gets aged similarly to file, and the inactive list is the default
>     for new anon pages as well, making it often the much bigger list.
> 
>     As a result, the VM is now more likely to actually finish large
>     anon targets than before.
> 
> Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward
> the larger LRU lists is made before bailing out on a met reclaim goal.
> 
> This fixes the extreme overreclaim problem.
> 
> Fairness is more subtle and harder to evaluate. No obvious misbehavior
> was observed on the test workload, in any case. Conceptually, fairness
> should primarily be a cumulative effect from regular, lower priority
> scans. Once the VM is in trouble and needs to escalate scan targets to
> make forward progress, fairness needs to take a backseat. This is also
> acknowledged by the myriad exceptions in get_scan_count(). This patch
> makes fairness decrease gradually, as it keeps fairness work static
> over increasing priority levels with growing scan targets. This should
> make more sense - although we may have to re-visit the exact values.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>   mm/vmscan.c | 10 ++++------
>   1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f7d9a683e3a7..1cc0c6666787 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2897,8 +2897,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>   	enum lru_list lru;
>   	unsigned long nr_reclaimed = 0;
>   	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> +	bool proportional_reclaim;
>   	struct blk_plug plug;
> -	bool scan_adjusted;
>   
>   	get_scan_count(lruvec, sc, nr);
>   
> @@ -2916,8 +2916,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>   	 * abort proportional reclaim if either the file or anon lru has already
>   	 * dropped to zero at the first pass.
>   	 */
> -	scan_adjusted = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
> -			 sc->priority == DEF_PRIORITY);
> +	proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
> +				sc->priority == DEF_PRIORITY);
>   
>   	blk_start_plug(&plug);
>   	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> @@ -2937,7 +2937,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>   
>   		cond_resched();
>   
> -		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
> +		if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
>   			continue;
>   
>   		/*
> @@ -2988,8 +2988,6 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>   		nr_scanned = targets[lru] - nr[lru];
>   		nr[lru] = targets[lru] * (100 - percentage) / 100;
>   		nr[lru] -= min(nr[lru], nr_scanned);
We should not just remove the following line because kswapd may also 
call this function and there is no side effect to do scan adjust for 
kswapd,so it may be better to change like this:
+    if (current_is_kswapd())
		scan_adjusted = true;
> -
> -		scan_adjusted = true
>   	}
>   	blk_finish_plug(&plug);
>   	sc->nr_reclaimed += nr_reclaimed;
> 

Thanks
Hongchen Zhang

Johannes Weiner Nov. 12, 2022, 2:23 p.m. UTC | #9

On Tue, Sep 20, 2022 at 02:12:17PM +0800, Hongchen Zhang wrote:
> On 2022/8/3 am 12:28, Johannes Weiner wrote:
> > @@ -2988,8 +2988,6 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >   		nr_scanned = targets[lru] - nr[lru];
> >   		nr[lru] = targets[lru] * (100 - percentage) / 100;
> >   		nr[lru] -= min(nr[lru], nr_scanned);
> We should not just remove the following line because kswapd may also call
> this function and there is no side effect to do scan adjust for kswapd,so it
> may be better to change like this:
> +    if (current_is_kswapd())
> 		scan_adjusted = true;
> > -
> > -		scan_adjusted = true

There is no scan_adjusted after this patch.

If you're saying that kswapd should set proportional_reclaim
unconditionally, then no, it should not. Proportional reclaim is not
safe at lower priority levels, as the changelog outlines in detail.

Johannes Weiner Nov. 12, 2022, 10:48 p.m. UTC | #10

On Fri, Aug 12, 2022 at 10:59:12AM +0900, Joonsoo Kim wrote:
> I think that we can fix the issue without breaking the fairness.
> Key idea is that doing scan based on the lru having max scan count.
> (aka max-lru)
> As scan is doing on max-lru, do scan the proportional number of
> pages on other lru.
> 
> Pseudo code is here.
> 
> 1. find the lru having max scan count
> 2. calculate nr_to_scan_max for max-lru
> 3. prop = (scanned[max-lru] + nr_to_scan_max) / targets[max-lru]

What's nr_to_scan_max?

AFAICS, prop would round down to 0 pretty quickly for imbalanced LRUs,
at which point it would stop reclaiming the smaller list altogether.

> 3. for_each_lru()
> 3-1. nr_to_scan = (targets[lru] * prop) - scanned[lru]
> 3-2. shrink_list(nr_to_scan)
> 
> With this approach, we can minimize reclaim without breaking the
> fairness.
> 
> Note that actual code needs to handle some corner cases, one of it is
> a low-nr_to_scan case to improve performance.

Right.

The main problem I see is that the discrepancies between LRU sizes can
be many orders bigger than common reclaim goals. Even when one LRU is
just 10x bigger, it'll be difficult to be both fair and still have
efficient batch sizes when the goal is only 32 pages total.

I think a proper way to do fairness would have to track scan history
over multiple cycles. (I think mglru does that, but I can't be sure.)

mm: vmscan: fix extreme overreclaim and swap floods

Commit Message

Comments

Patch