Message ID | 20231228073055.4046430-1-shakeelb@google.com (mailing list archive)
---|---
State | New
Series | mm: ratelimit stat flush from workingset shrinker
On Thu, Dec 28, 2023 at 12:31 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> One of our internal workload regressed on newer upstream kernel

Not really internal -- it's Postgres 14 + sysbench OLTP.

> and on
> further investigation, it seems like the cause is the always synchronous
> rstat flush in the count_shadow_nodes() added by the commit f82e6bf9bb9b
> ("mm: memcg: use rstat for non-hierarchical stats"). On further
> inspection it seems like we don't really need accurate stats in this
> function as it was already approximating the amount of appropriate
> shadow entried to keep for maintaining the refault information. Since
> there is already 2 sec periodic rstat flush, we don't need exact stats
> here. Let's ratelimit the rstat flush in this code path.
>
> Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/workingset.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 2a2a34234df9..226012974328 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -680,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  	struct lruvec *lruvec;
>  	int i;
>
> -	mem_cgroup_flush_stats(sc->memcg);
> +	mem_cgroup_flush_stats_ratelimited(sc->memcg);
>  	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>  	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
>  		pages += lruvec_page_state_local(lruvec,

LGTM.
On Wed, Dec 27, 2023 at 11:31 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> One of our internal workload regressed on newer upstream kernel and on
> further investigation, it seems like the cause is the always synchronous
> rstat flush in the count_shadow_nodes() added by the commit f82e6bf9bb9b
> ("mm: memcg: use rstat for non-hierarchical stats"). On further
> inspection it seems like we don't really need accurate stats in this
> function as it was already approximating the amount of appropriate
> shadow entried to keep for maintaining the refault information. Since

s/entried/entries

> there is already 2 sec periodic rstat flush, we don't need exact stats
> here. Let's ratelimit the rstat flush in this code path.

Is the regression observed even with commit 7d7ef0a4686a ("mm: memcg:
restore subtree stats flushing")? I think the answer is yes based on
internal discussions, but this really surprises me.

Commit f82e6bf9bb9b removed the percpu loop in
lruvec_page_state_local(), and added a flush call. With 7d7ef0a4686a,
the flush call is only effective if there are pending updates in the
cgroup subtree exceeding MEMCG_CHARGE_BATCH * num_online_cpus(). IOW,
we should only be doing work when actually needed, whereas before we
used to have multiple percpu loops in count_shadow_nodes() regardless
of pending updates.

It seems like the cgroup subtree is very active that we continuously
need to flush in count_shadow_nodes()? If that's the case, do we still
think it's okay not to flush when we know there are pending updates? I
don't have enough background about the workingset heuristics to judge
this.

I am not objecting to this change, I am just trying to understand
what's happening. Thanks!

>
> Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/workingset.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 2a2a34234df9..226012974328 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -680,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  	struct lruvec *lruvec;
>  	int i;
>
> -	mem_cgroup_flush_stats(sc->memcg);
> +	mem_cgroup_flush_stats_ratelimited(sc->memcg);
>  	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>  	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
>  		pages += lruvec_page_state_local(lruvec,
> --
> 2.43.0.472.g3155946c3a-goog
>
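To make the gating Yosry describes concrete, here is a minimal userspace C sketch of the threshold-gated flush idea from commit 7d7ef0a4686a: the expensive flush is skipped unless the subtree has accumulated more pending stat updates than a per-CPU batch allows. The structure, function names, and the CHARGE_BATCH value are illustrative stand-ins, not the kernel's actual memcg code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the kernel's MEMCG_CHARGE_BATCH. */
#define CHARGE_BATCH 64

struct memcg_stats {
	atomic_long pending_updates;	/* stat updates since the last flush */
	int online_cpus;		/* stand-in for num_online_cpus() */
};

/* Only flush when pending updates exceed CHARGE_BATCH * online CPUs. */
static bool flush_worthwhile(struct memcg_stats *s)
{
	return atomic_load(&s->pending_updates) >
	       (long)CHARGE_BATCH * s->online_cpus;
}

static void flush_stats(struct memcg_stats *s)
{
	if (!flush_worthwhile(s))
		return;		/* cheap early exit, nothing worth flushing */
	/* ... walk the subtree and fold per-CPU counters into totals ... */
	atomic_store(&s->pending_updates, 0);
}
```

If the cgroup subtree is as busy as the thread suggests, this threshold is crossed almost every time, so count_shadow_nodes() keeps paying for a synchronous flush despite the gate.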
On Thu, Dec 28, 2023 at 07:13:23AM -0800, Yosry Ahmed wrote:
> On Wed, Dec 27, 2023 at 11:31 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > One of our internal workload regressed on newer upstream kernel and on
> > further investigation, it seems like the cause is the always synchronous
> > rstat flush in the count_shadow_nodes() added by the commit f82e6bf9bb9b
> > ("mm: memcg: use rstat for non-hierarchical stats"). On further
> > inspection it seems like we don't really need accurate stats in this
> > function as it was already approximating the amount of appropriate
> > shadow entried to keep for maintaining the refault information. Since
>
> s/entried/entries
>
> > there is already 2 sec periodic rstat flush, we don't need exact stats
> > here. Let's ratelimit the rstat flush in this code path.
>
> Is the regression observed even with commit 7d7ef0a4686a ("mm: memcg:
> restore subtree stats flushing")? I think the answer is yes based on
> internal discussions, but this really surprises me.
>

Yes, the regression was on latest mm-stable branch of Andrew's mm tree.

> Commit f82e6bf9bb9b removed the percpu loop in
> lruvec_page_state_local(), and added a flush call. With 7d7ef0a4686a,
> the flush call is only effective if there are pending updates in the
> cgroup subtree exceeding MEMCG_CHARGE_BATCH * num_online_cpus(). IOW,
> we should only be doing work when actually needed, whereas before we
> used to have multiple percpu loops in count_shadow_nodes() regardless
> of pending updates.
>
> It seems like the cgroup subtree is very active that we continuously
> need to flush in count_shadow_nodes()? If that's the case, do we still
> think it's okay not to flush when we know there are pending updates? I
> don't have enough background about the workingset heuristics to judge
> this.

Not all updates might be related to the stats being read here. Also the
read value is further divided by 8 and manipulated more in
do_shrink_slab(). So, I don't think we need less than 2 seconds accuracy
for these stats here.
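For contrast, here is a similarly hedged userspace C sketch of the ratelimited path the patch switches to: if a flush happened within the last couple of seconds, skip it and tolerate slightly stale counters, which is all the shadow-node estimate needs. The names, time source, and constant are placeholders, not the actual mem_cgroup_flush_stats_ratelimited() implementation.

```c
#include <stdatomic.h>
#include <time.h>

/* Illustrative window matching the ~2 sec periodic rstat flush. */
#define FLUSH_PERIOD_SEC 2

static atomic_long last_flush_sec;	/* wall-clock second of the most recent flush */

/* Placeholder for the real threshold-gated subtree flush. */
static void flush_stats(void)
{
	/* ... fold per-CPU counters into the cgroup totals ... */
}

/*
 * Ratelimited variant: skip entirely if we flushed recently and accept
 * up to FLUSH_PERIOD_SEC of staleness; count_shadow_nodes() only needs
 * a rough page count for its shadow-node estimate.
 */
static void flush_stats_ratelimited(void)
{
	long now = (long)time(NULL);

	if (now - atomic_load(&last_flush_sec) < FLUSH_PERIOD_SEC)
		return;		/* flushed recently, reuse cached stats */

	atomic_store(&last_flush_sec, now);
	flush_stats();
}
```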
diff --git a/mm/workingset.c b/mm/workingset.c
index 2a2a34234df9..226012974328 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -680,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	struct lruvec *lruvec;
 	int i;
 
-	mem_cgroup_flush_stats(sc->memcg);
+	mem_cgroup_flush_stats_ratelimited(sc->memcg);
 	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 		pages += lruvec_page_state_local(lruvec,
One of our internal workloads regressed on a newer upstream kernel, and on
further investigation the cause appears to be the always-synchronous
rstat flush in count_shadow_nodes() added by commit f82e6bf9bb9b
("mm: memcg: use rstat for non-hierarchical stats"). On closer
inspection, we don't really need accurate stats in this function: it was
already approximating the number of shadow entries to keep for
maintaining the refault information. Since there is already a 2 sec
periodic rstat flush, we don't need exact stats here. Let's ratelimit
the rstat flush in this code path.

Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 mm/workingset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)