Message ID | 20231228073055.4046430-1-shakeelb@google.com (mailing list archive)
---|---
State | New
Series | mm: ratelimit stat flush from workingset shrinker
On Thu, Dec 28, 2023 at 12:31 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> One of our internal workload regressed on newer upstream kernel

Not really internal -- it's Postgres 14 + sysbench OLTP.

> and on
> further investigation, it seems like the cause is the always synchronous
> rstat flush in the count_shadow_nodes() added by the commit f82e6bf9bb9b
> ("mm: memcg: use rstat for non-hierarchical stats"). On further
> inspection it seems like we don't really need accurate stats in this
> function as it was already approximating the amount of appropriate
> shadow entried to keep for maintaining the refault information. Since
> there is already 2 sec periodic rstat flush, we don't need exact stats
> here. Let's ratelimit the rstat flush in this code path.
>
> Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/workingset.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 2a2a34234df9..226012974328 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -680,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  	struct lruvec *lruvec;
>  	int i;
>
> -	mem_cgroup_flush_stats(sc->memcg);
> +	mem_cgroup_flush_stats_ratelimited(sc->memcg);
>  	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>  	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
>  		pages += lruvec_page_state_local(lruvec,

LGTM.
On Wed, Dec 27, 2023 at 11:31 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> One of our internal workload regressed on newer upstream kernel and on
> further investigation, it seems like the cause is the always synchronous
> rstat flush in the count_shadow_nodes() added by the commit f82e6bf9bb9b
> ("mm: memcg: use rstat for non-hierarchical stats"). On further
> inspection it seems like we don't really need accurate stats in this
> function as it was already approximating the amount of appropriate
> shadow entried to keep for maintaining the refault information. Since

s/entried/entries

> there is already 2 sec periodic rstat flush, we don't need exact stats
> here. Let's ratelimit the rstat flush in this code path.

Is the regression observed even with commit 7d7ef0a4686a ("mm: memcg:
restore subtree stats flushing")? I think the answer is yes based on
internal discussions, but this really surprises me.

Commit f82e6bf9bb9b removed the percpu loop in
lruvec_page_state_local(), and added a flush call. With 7d7ef0a4686a,
the flush call is only effective if there are pending updates in the
cgroup subtree exceeding MEMCG_CHARGE_BATCH * num_online_cpus(). IOW,
we should only be doing work when actually needed, whereas before we
used to have multiple percpu loops in count_shadow_nodes() regardless
of pending updates.

It seems like the cgroup subtree is very active that we continuously
need to flush in count_shadow_nodes()? If that's the case, do we still
think it's okay not to flush when we know there are pending updates? I
don't have enough background about the workingset heuristics to judge
this.

I am not objecting to this change, I am just trying to understand
what's happening. Thanks!

>
> Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/workingset.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 2a2a34234df9..226012974328 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -680,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  	struct lruvec *lruvec;
>  	int i;
>
> -	mem_cgroup_flush_stats(sc->memcg);
> +	mem_cgroup_flush_stats_ratelimited(sc->memcg);
>  	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>  	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
>  		pages += lruvec_page_state_local(lruvec,
> --
> 2.43.0.472.g3155946c3a-goog
>
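To make the gating Yosry describes concrete, here is a minimal userspace C sketch of the threshold-gated flush idea from commit 7d7ef0a4686a: the expensive flush is skipped unless the subtree has accumulated more pending stat updates than a per-CPU batch allows. The structure, function names, and the CHARGE_BATCH value are illustrative stand-ins, not the kernel's actual memcg code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the kernel's MEMCG_CHARGE_BATCH. */
#define CHARGE_BATCH 64

struct memcg_stats {
	atomic_long pending_updates;	/* stat updates since the last flush */
	int online_cpus;		/* stand-in for num_online_cpus() */
};

/* Only flush when pending updates exceed CHARGE_BATCH * online CPUs. */
static bool flush_worthwhile(struct memcg_stats *s)
{
	return atomic_load(&s->pending_updates) >
	       (long)CHARGE_BATCH * s->online_cpus;
}

static void flush_stats(struct memcg_stats *s)
{
	if (!flush_worthwhile(s))
		return;		/* cheap early exit, nothing worth flushing */
	/* ... walk the subtree and fold per-CPU counters into totals ... */
	atomic_store(&s->pending_updates, 0);
}
```

If the cgroup subtree is as busy as the thread suggests, this threshold is crossed almost every time, so count_shadow_nodes() keeps paying for a synchronous flush despite the gate.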
On Thu, Dec 28, 2023 at 07:13:23AM -0800, Yosry Ahmed wrote:
> On Wed, Dec 27, 2023 at 11:31 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > One of our internal workload regressed on newer upstream kernel and on
> > further investigation, it seems like the cause is the always synchronous
> > rstat flush in the count_shadow_nodes() added by the commit f82e6bf9bb9b
> > ("mm: memcg: use rstat for non-hierarchical stats"). On further
> > inspection it seems like we don't really need accurate stats in this
> > function as it was already approximating the amount of appropriate
> > shadow entried to keep for maintaining the refault information. Since
>
> s/entried/entries
>
> > there is already 2 sec periodic rstat flush, we don't need exact stats
> > here. Let's ratelimit the rstat flush in this code path.
>
> Is the regression observed even with commit 7d7ef0a4686a ("mm: memcg:
> restore subtree stats flushing")? I think the answer is yes based on
> internal discussions, but this really surprises me.
>

Yes, the regression was on latest mm-stable branch of Andrew's mm tree.

> Commit f82e6bf9bb9b removed the percpu loop in
> lruvec_page_state_local(), and added a flush call. With 7d7ef0a4686a,
> the flush call is only effective if there are pending updates in the
> cgroup subtree exceeding MEMCG_CHARGE_BATCH * num_online_cpus(). IOW,
> we should only be doing work when actually needed, whereas before we
> used to have multiple percpu loops in count_shadow_nodes() regardless
> of pending updates.
>
> It seems like the cgroup subtree is very active that we continuously
> need to flush in count_shadow_nodes()? If that's the case, do we still
> think it's okay not to flush when we know there are pending updates? I
> don't have enough background about the workingset heuristics to judge
> this.

Not all updates might be related to the stats being read here. Also the
read value is further divided by 8 and manipulated more in
do_shrink_slab(). So, I don't think we need less than 2 seconds accuracy
for these stats here.
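For contrast, here is a similarly hedged userspace C sketch of the ratelimited path the patch switches to: if a flush happened within the last couple of seconds, skip it and tolerate slightly stale counters, which is all the shadow-node estimate needs. The names, time source, and constant are placeholders, not the actual mem_cgroup_flush_stats_ratelimited() implementation.

```c
#include <stdatomic.h>
#include <time.h>

/* Illustrative window matching the ~2 sec periodic rstat flush. */
#define FLUSH_PERIOD_SEC 2

static atomic_long last_flush_sec;	/* wall-clock second of the most recent flush */

/* Placeholder for the real threshold-gated subtree flush. */
static void flush_stats(void)
{
	/* ... fold per-CPU counters into the cgroup totals ... */
}

/*
 * Ratelimited variant: skip entirely if we flushed recently and accept
 * up to FLUSH_PERIOD_SEC of staleness; count_shadow_nodes() only needs
 * a rough page count for its shadow-node estimate.
 */
static void flush_stats_ratelimited(void)
{
	long now = (long)time(NULL);

	if (now - atomic_load(&last_flush_sec) < FLUSH_PERIOD_SEC)
		return;		/* flushed recently, reuse cached stats */

	atomic_store(&last_flush_sec, now);
	flush_stats();
}
```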
diff --git a/mm/workingset.c b/mm/workingset.c
index 2a2a34234df9..226012974328 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -680,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	struct lruvec *lruvec;
 	int i;
 
-	mem_cgroup_flush_stats(sc->memcg);
+	mem_cgroup_flush_stats_ratelimited(sc->memcg);
 	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 		pages += lruvec_page_state_local(lruvec,
One of our internal workloads regressed on a newer upstream kernel, and on
further investigation the cause appears to be the always-synchronous
rstat flush in count_shadow_nodes() added by commit f82e6bf9bb9b
("mm: memcg: use rstat for non-hierarchical stats"). On closer
inspection, we don't really need accurate stats in this function: it was
already approximating the number of shadow entries to keep for
maintaining the refault information. Since there is already a 2 sec
periodic rstat flush, we don't need exact stats here. Let's ratelimit
the rstat flush in this code path.

Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 mm/workingset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)