| Message ID | 20210202184746.119084-2-hannes@cmpxchg.org (mailing list archive) |
|---|---|
| State | New, archived |
| Series | mm: memcontrol: switch to rstat |
On Tue, Feb 2, 2021 at 12:18 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> The memcg hotunplug callback erroneously flushes counts on the local
> CPU, not the counts of the CPU going away; those counts will be lost.
>
> Flush the CPU that is actually going away.
>
> Also simplify the code a bit by using mod_memcg_state() and
> count_memcg_events() instead of open-coding the upward flush - this is
> comparable to how vmstat.c handles hotunplug flushing.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I think we need

Fixes: a983b5ebee572 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")

Reviewed-by: Shakeel Butt <shakeelb@google.com>
On Tue, Feb 02, 2021 at 01:47:40PM -0500, Johannes Weiner wrote:
> The memcg hotunplug callback erroneously flushes counts on the local
> CPU, not the counts of the CPU going away; those counts will be lost.
>
> Flush the CPU that is actually going away.
>
> Also simplify the code a bit by using mod_memcg_state() and
> count_memcg_events() instead of open-coding the upward flush - this is
> comparable to how vmstat.c handles hotunplug flushing.

To the whole series: it's really nice to have accurate stats at
non-leaf levels. Just as an illustration: if there are 32 CPUs and
1000 sub-cgroups (which is an absolutely realistic number, because
often there are many dying generations of each cgroup), the error
margin is 3.9GB. It makes all numbers pretty much random and all
possible tests extremely flaky.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

To this patch:

Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks!
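(For readers wondering where the 3.9GB figure comes from: it follows from
the per-CPU batching the memcg stats used at the time. Assuming a per-CPU
flush threshold of MEMCG_CHARGE_BATCH = 32 pages and 4KiB pages, each CPU
can sit on up to 32 unflushed pages per counter per cgroup, so the worst
case is roughly

    32 CPUs * 1000 cgroups * 32 pages * 4096 bytes/page ~= 4.2e9 bytes,

i.e. the quoted 3.9GB of possible error for a single stat item.)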
On Tue, Feb 02, 2021 at 03:07:47PM -0800, Roman Gushchin wrote:
> On Tue, Feb 02, 2021 at 01:47:40PM -0500, Johannes Weiner wrote:
> > The memcg hotunplug callback erroneously flushes counts on the local
> > CPU, not the counts of the CPU going away; those counts will be lost.
> >
> > Flush the CPU that is actually going away.
> >
> > Also simplify the code a bit by using mod_memcg_state() and
> > count_memcg_events() instead of open-coding the upward flush - this is
> > comparable to how vmstat.c handles hotunplug flushing.
>
> To the whole series: it's really nice to have accurate stats at
> non-leaf levels. Just as an illustration: if there are 32 CPUs and
> 1000 sub-cgroups (which is an absolutely realistic number, because
> often there are many dying generations of each cgroup), the error
> margin is 3.9GB. It makes all numbers pretty much random and all
> possible tests extremely flaky.

Btw, I was just looking into kmem kselftests failures/flakiness,
which are caused by exactly this problem: without waiting for dying
cgroups to finish reclaim, we can't make any reliable assumptions
about what to expect from memcg stats.

So looking forward to having this patchset merged!
On Tue 02-02-21 13:47:40, Johannes Weiner wrote:
> The memcg hotunplug callback erroneously flushes counts on the local
> CPU, not the counts of the CPU going away; those counts will be lost.
>
> Flush the CPU that is actually going away.
>
> Also simplify the code a bit by using mod_memcg_state() and
> count_memcg_events() instead of open-coding the upward flush - this is
> comparable to how vmstat.c handles hotunplug flushing.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

Shakeel has already pointed out Fixes.

> ---
>  mm/memcontrol.c | 35 +++++++++++++++++++++--------------
>  1 file changed, 21 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ed5cc78a8dbf..8120d565dd79 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2411,45 +2411,52 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
>  static int memcg_hotplug_cpu_dead(unsigned int cpu)
>  {
>  	struct memcg_stock_pcp *stock;
> -	struct mem_cgroup *memcg, *mi;
> +	struct mem_cgroup *memcg;
>
>  	stock = &per_cpu(memcg_stock, cpu);
>  	drain_stock(stock);
>
>  	for_each_mem_cgroup(memcg) {
> +		struct memcg_vmstats_percpu *statc;
>  		int i;
>
> +		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
> +
>  		for (i = 0; i < MEMCG_NR_STAT; i++) {
>  			int nid;
> -			long x;
>
> -			x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
> -			if (x)
> -				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -					atomic_long_add(x, &memcg->vmstats[i]);
> +			if (statc->stat[i]) {
> +				mod_memcg_state(memcg, i, statc->stat[i]);
> +				statc->stat[i] = 0;
> +			}
>
>  			if (i >= NR_VM_NODE_STAT_ITEMS)
>  				continue;
>
>  			for_each_node(nid) {
> +				struct batched_lruvec_stat *lstatc;
>  				struct mem_cgroup_per_node *pn;
> +				long x;
>
>  				pn = mem_cgroup_nodeinfo(memcg, nid);
> -				x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
> -				if (x)
> +				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
> +
> +				x = lstatc->count[i];
> +				lstatc->count[i] = 0;
> +
> +				if (x) {
>  					do {
>  						atomic_long_add(x, &pn->lruvec_stat[i]);
>  					} while ((pn = parent_nodeinfo(pn, nid)));
> +				}
>  			}
>  		}
>
>  		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
> -			long x;
> -
> -			x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
> -			if (x)
> -				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
> -					atomic_long_add(x, &memcg->vmevents[i]);
> +			if (statc->events[i]) {
> +				count_memcg_events(memcg, i, statc->events[i]);
> +				statc->events[i] = 0;
> +			}
>  		}
>  	}
>
> --
> 2.30.0
>
On Tue, Feb 02, 2021 at 06:28:53PM -0800, Roman Gushchin wrote:
> On Tue, Feb 02, 2021 at 03:07:47PM -0800, Roman Gushchin wrote:
> > On Tue, Feb 02, 2021 at 01:47:40PM -0500, Johannes Weiner wrote:
> > > The memcg hotunplug callback erroneously flushes counts on the local
> > > CPU, not the counts of the CPU going away; those counts will be lost.
> > >
> > > Flush the CPU that is actually going away.
> > >
> > > Also simplify the code a bit by using mod_memcg_state() and
> > > count_memcg_events() instead of open-coding the upward flush - this is
> > > comparable to how vmstat.c handles hotunplug flushing.
> >
> > To the whole series: it's really nice to have accurate stats at
> > non-leaf levels. Just as an illustration: if there are 32 CPUs and
> > 1000 sub-cgroups (which is an absolutely realistic number, because
> > often there are many dying generations of each cgroup), the error
> > margin is 3.9GB. It makes all numbers pretty much random and all
> > possible tests extremely flaky.
>
> Btw, I was just looking into kmem kselftests failures/flakiness,
> which are caused by exactly this problem: without waiting for dying
> cgroups to finish reclaim, we can't make any reliable assumptions
> about what to expect from memcg stats.

Good point about the selftests. I gave them a shot, and indeed this
series makes test_kmem work again:

vanilla:

ok 1 test_kmem_basic
memory.current = 8810496
slab + anon + file + kernel_stack = 17074568
slab = 6101384
anon = 946176
file = 0
kernel_stack = 10027008
not ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic

patched:

ok 1 test_kmem_basic
ok 2 test_kmem_memcg_deletion
ok 3 test_kmem_proc_kpagecgroup
ok 4 test_kmem_kernel_stacks
ok 5 test_kmem_dead_cgroups
ok 6 test_percpu_basic

It even passes with a reduced margin in the patched kernel, since the
percpu drift - which this test already tried to account for - is now
only on the page_counter side (whereas memory.stat is always precise).

I'm going to include that data in the v2 changelog, as well as a patch
to update test_kmem.c to the more stringent error tolerances.

> So looking forward to having this patchset merged!

Thanks
On Thu, Feb 04, 2021 at 02:29:57PM -0500, Johannes Weiner wrote:
> On Tue, Feb 02, 2021 at 06:28:53PM -0800, Roman Gushchin wrote:
> > On Tue, Feb 02, 2021 at 03:07:47PM -0800, Roman Gushchin wrote:
> > > On Tue, Feb 02, 2021 at 01:47:40PM -0500, Johannes Weiner wrote:
> > > > The memcg hotunplug callback erroneously flushes counts on the local
> > > > CPU, not the counts of the CPU going away; those counts will be lost.
> > > >
> > > > Flush the CPU that is actually going away.
> > > >
> > > > Also simplify the code a bit by using mod_memcg_state() and
> > > > count_memcg_events() instead of open-coding the upward flush - this is
> > > > comparable to how vmstat.c handles hotunplug flushing.
> > >
> > > To the whole series: it's really nice to have accurate stats at
> > > non-leaf levels. Just as an illustration: if there are 32 CPUs and
> > > 1000 sub-cgroups (which is an absolutely realistic number, because
> > > often there are many dying generations of each cgroup), the error
> > > margin is 3.9GB. It makes all numbers pretty much random and all
> > > possible tests extremely flaky.
> >
> > Btw, I was just looking into kmem kselftests failures/flakiness,
> > which are caused by exactly this problem: without waiting for dying
> > cgroups to finish reclaim, we can't make any reliable assumptions
> > about what to expect from memcg stats.
>
> Good point about the selftests. I gave them a shot, and indeed this
> series makes test_kmem work again:
>
> vanilla:
>
> ok 1 test_kmem_basic
> memory.current = 8810496
> slab + anon + file + kernel_stack = 17074568
> slab = 6101384
> anon = 946176
> file = 0
> kernel_stack = 10027008
> not ok 2 test_kmem_memcg_deletion
> ok 3 test_kmem_proc_kpagecgroup
> ok 4 test_kmem_kernel_stacks
> ok 5 test_kmem_dead_cgroups
> ok 6 test_percpu_basic
>
> patched:
>
> ok 1 test_kmem_basic
> ok 2 test_kmem_memcg_deletion
> ok 3 test_kmem_proc_kpagecgroup
> ok 4 test_kmem_kernel_stacks
> ok 5 test_kmem_dead_cgroups
> ok 6 test_percpu_basic

Nice! Thanks for checking.

>
> It even passes with a reduced margin in the patched kernel, since the
> percpu drift - which this test already tried to account for - is now
> only on the page_counter side (whereas memory.stat is always precise).
>
> I'm going to include that data in the v2 changelog, as well as a patch
> to update test_kmem.c to the more stringent error tolerances.

Hm, I'm not sure it's a good idea to unconditionally lower the error
tolerance: it's convenient to be able to run the same test on older
kernels.
On Thu, Feb 04, 2021 at 11:34:46AM -0800, Roman Gushchin wrote:
> On Thu, Feb 04, 2021 at 02:29:57PM -0500, Johannes Weiner wrote:
> > It even passes with a reduced margin in the patched kernel, since the
> > percpu drift - which this test already tried to account for - is now
> > only on the page_counter side (whereas memory.stat is always precise).
> >
> > I'm going to include that data in the v2 changelog, as well as a patch
> > to update test_kmem.c to the more stringent error tolerances.
>
> Hm, I'm not sure it's a good idea to unconditionally lower the error
> tolerance: it's convenient to be able to run the same test on older
> kernels.

Well, an older version of the kernel will have an older version of the
test that is tailored towards that kernel's specific behavior. That's
sort of the point of tracking code and tests in the same git tree: to
have meaningful, effective and precise tests of an ever-changing
implementation.

Trying to be backward compatible will lower the test signal and miss
regressions, when a backward compatible version is at most one git
checkout away.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed5cc78a8dbf..8120d565dd79 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2411,45 +2411,52 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 static int memcg_hotplug_cpu_dead(unsigned int cpu)
 {
 	struct memcg_stock_pcp *stock;
-	struct mem_cgroup *memcg, *mi;
+	struct mem_cgroup *memcg;

 	stock = &per_cpu(memcg_stock, cpu);
 	drain_stock(stock);

 	for_each_mem_cgroup(memcg) {
+		struct memcg_vmstats_percpu *statc;
 		int i;

+		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+
 		for (i = 0; i < MEMCG_NR_STAT; i++) {
 			int nid;
-			long x;

-			x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0);
-			if (x)
-				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-					atomic_long_add(x, &memcg->vmstats[i]);
+			if (statc->stat[i]) {
+				mod_memcg_state(memcg, i, statc->stat[i]);
+				statc->stat[i] = 0;
+			}

 			if (i >= NR_VM_NODE_STAT_ITEMS)
 				continue;

 			for_each_node(nid) {
+				struct batched_lruvec_stat *lstatc;
 				struct mem_cgroup_per_node *pn;
+				long x;

 				pn = mem_cgroup_nodeinfo(memcg, nid);
-				x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0);
-				if (x)
+				lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
+
+				x = lstatc->count[i];
+				lstatc->count[i] = 0;
+
+				if (x) {
 					do {
 						atomic_long_add(x, &pn->lruvec_stat[i]);
 					} while ((pn = parent_nodeinfo(pn, nid)));
+				}
 			}
 		}

 		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
-			long x;
-
-			x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0);
-			if (x)
-				for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-					atomic_long_add(x, &memcg->vmevents[i]);
+			if (statc->events[i]) {
+				count_memcg_events(memcg, i, statc->events[i]);
+				statc->events[i] = 0;
+			}
 		}
 	}
The memcg hotunplug callback erroneously flushes counts on the local
CPU, not the counts of the CPU going away; those counts will be lost.

Flush the CPU that is actually going away.

Also simplify the code a bit by using mod_memcg_state() and
count_memcg_events() instead of open-coding the upward flush - this is
comparable to how vmstat.c handles hotunplug flushing.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)
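The crux of the fix is which CPU's per-cpu data gets flushed in the
CPU-dead callback. Below is a minimal, self-contained sketch of that
distinction; "pcp_stat", "global_stat" and "example_cpu_dead" are made-up
names for illustration only, not the real memcontrol.c structures:

	#include <linux/atomic.h>
	#include <linux/percpu.h>

	static atomic_long_t global_stat;
	static DEFINE_PER_CPU(long, pcp_stat);

	static int example_cpu_dead(unsigned int cpu)
	{
		long x;

		/*
		 * Buggy pattern (what the old callback effectively did):
		 *
		 *	x = this_cpu_xchg(pcp_stat, 0);
		 *
		 * this_cpu_*() operates on whichever CPU happens to run the
		 * hotplug callback, so the counts of the CPU that is actually
		 * going offline are never flushed and are lost.
		 */

		/* Fixed pattern: address the per-CPU data of the dead CPU. */
		x = per_cpu(pcp_stat, cpu);
		per_cpu(pcp_stat, cpu) = 0;

		if (x)
			atomic_long_add(x, &global_stat);

		return 0;
	}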