Message ID | 20200915171801.39761-1-songmuchun@bytedance.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v5] mm: memcontrol: Add the missing numa_stat interface for cgroup v2 |
On Wed, Sep 16, 2020 at 01:18:01AM +0800, Muchun Song wrote:
> In the cgroup v1, we have a numa_stat interface. This is useful for
> providing visibility into the numa locality information within an
> memcg since the pages are allowed to be allocated from any physical
> node. One of the use cases is evaluating application performance by
> combining this information with the application's CPU allocation.
> But the cgroup v2 does not. So this patch adds the missing information.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Suggested-by: Shakeel Butt <shakeelb@google.com>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Yup, that would be useful information to have. Just a few comments on
the patch below:

> @@ -1368,6 +1368,78 @@ PAGE_SIZE multiple when read back.
>  		collapsing an existing range of pages. This counter is not
>  		present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
>
> +  memory.numa_stat
> +	A read-only flat-keyed file which exists on non-root cgroups.

It's a nested key file, not flat.

> +	This breaks down the cgroup's memory footprint into different
> +	types of memory, type-specific details, and other information
> +	per node on the state of the memory management system.
> +
> +	This is useful for providing visibility into the NUMA locality
> +	information within an memcg since the pages are allowed to be
> +	allocated from any physical node. One of the use case is evaluating
> +	application performance by combining this information with the
> +	application's CPU allocation.
> +
> +	All memory amounts are in bytes.
> +
> +	The output format of memory.numa_stat is::
> +
> +	  type N0=<bytes in node 0> N1=<bytes in node 1> ...
> +
> +	The entries are ordered to be human readable, and new entries
> +	can show up in the middle. Don't rely on items remaining in a
> +	fixed position; use the keys to look up specific values!
> +
> +	  anon
> +		Amount of memory per node used in anonymous mappings such
> +		as brk(), sbrk(), and mmap(MAP_ANONYMOUS).
> +
> +	  file
> +		Amount of memory per node used to cache filesystem data,
> +		including tmpfs and shared memory.
> +
> +	  kernel_stack
> +		Amount of memory per node allocated to kernel stacks.
> +
> +	  shmem
> +		Amount of cached filesystem data per node that is swap-backed,
> +		such as tmpfs, shm segments, shared anonymous mmap()s.
> +
> +	  file_mapped
> +		Amount of cached filesystem data per node mapped with mmap().
> +
> +	  file_dirty
> +		Amount of cached filesystem data per node that was modified but
> +		not yet written back to disk.
> +
> +	  file_writeback
> +		Amount of cached filesystem data per node that was modified and
> +		is currently being written back to disk.
> +
> +	  anon_thp
> +		Amount of memory per node used in anonymous mappings backed by
> +		transparent hugepages.
> +
> +	  inactive_anon, active_anon, inactive_file, active_file, unevictable
> +		Amount of memory, swap-backed and filesystem-backed,
> +		per node on the internal memory management lists used
> +		by the page reclaim algorithm.
> +
> +		As these represent internal list state (e.g. shmem pages are on
> +		anon memory management lists), inactive_foo + active_foo may not
> +		be equal to the value for the foo counter, since the foo counter
> +		is type-based, not list-based.
> +
> +	  slab_reclaimable
> +		Amount of memory per node used for storing in-kernel data
> +		structures which might be reclaimed, such as dentries and
> +		inodes.
> +
> +	  slab_unreclaimable
> +		Amount of memory per node used for storing in-kernel data
> +		structures which cannot be reclaimed on memory pressure.
> +
>    memory.swap.current
>  	A read-only single value file which exists on non-root
>  	cgroups.
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75cd1a1e66c8..ff919ef3b57b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6425,6 +6425,86 @@ static int memory_stat_show(struct seq_file *m, void *v)
>  	return 0;
>  }
>
> +#ifdef CONFIG_NUMA
> +struct numa_stat {
> +	const char *name;
> +	unsigned int ratio;
> +	enum node_stat_item idx;
> +};
> +
> +static struct numa_stat numa_stats[] = {
> +	{ "anon", PAGE_SIZE, NR_ANON_MAPPED },
> +	{ "file", PAGE_SIZE, NR_FILE_PAGES },
> +	{ "kernel_stack", 1024, NR_KERNEL_STACK_KB },
> +	{ "shmem", PAGE_SIZE, NR_SHMEM },
> +	{ "file_mapped", PAGE_SIZE, NR_FILE_MAPPED },
> +	{ "file_dirty", PAGE_SIZE, NR_FILE_DIRTY },
> +	{ "file_writeback", PAGE_SIZE, NR_WRITEBACK },
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/*
> +	 * The ratio will be initialized in numa_stats_init(). Because
> +	 * on some architectures, the macro of HPAGE_PMD_SIZE is not
> +	 * constant(e.g. powerpc).
> +	 */
> +	{ "anon_thp", 0, NR_ANON_THPS },
> +#endif
> +	{ "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON },
> +	{ "active_anon", PAGE_SIZE, NR_ACTIVE_ANON },
> +	{ "inactive_file", PAGE_SIZE, NR_INACTIVE_FILE },
> +	{ "active_file", PAGE_SIZE, NR_ACTIVE_FILE },
> +	{ "unevictable", PAGE_SIZE, NR_UNEVICTABLE },
> +	{ "slab_reclaimable", 1, NR_SLAB_RECLAIMABLE_B },
> +	{ "slab_unreclaimable", 1, NR_SLAB_UNRECLAIMABLE_B },
> +};

This is a bit duplicative with memory_stat_format(), and the
collections will easily go out of sync as we add/change stat items.

Can you please convert memory_stat_format() to use the same shared table?

You may have to add another flag for the MEMCG_* items for which we
don't have per-node counters.

The same applies to the documentation. Please don't duplicate the list
of items, but have memory.numa_stat refer to the list for memory.stat.
You can add (not in memory.numa_stat) or something to percpu and sock.

> +static unsigned long memcg_node_page_state(struct mem_cgroup *memcg,
> +					   unsigned int nid,
> +					   enum node_stat_item idx)
> +{
> +	VM_BUG_ON(nid >= nr_node_ids);
> +	return lruvec_page_state(mem_cgroup_lruvec(memcg, NODE_DATA(nid)), idx);
> +}

Please drop this wrapper and use lruvec_page_state directly below.

Otherwise, this looks reasonable to me.
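The shared-table suggestion in the review above could look roughly like the following userspace sketch. It is only an illustration under stated assumptions: `struct stat_desc`, `has_node_counter`, `count_entries()` and `format_numa_line()` are hypothetical names, the page size is assumed to be 4096, and none of this is taken from the kernel or from a later revision of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical descriptor shared by the memory.stat and memory.numa_stat paths. */
struct stat_desc {
	const char *name;
	unsigned int ratio;	/* multiplier that converts the raw counter to bytes */
	int has_node_counter;	/* 0 for items without per-node counters, e.g. sock */
};

/* Illustrative subset only; 4096 stands in for PAGE_SIZE (an assumption). */
static const struct stat_desc stat_descs[] = {
	{ "anon",         4096, 1 },
	{ "file",         4096, 1 },
	{ "kernel_stack", 1024, 1 },
	{ "sock",         4096, 0 },
};

#define NR_DESCS (sizeof(stat_descs) / sizeof(stat_descs[0]))

/* Count the entries a file would print; numa_only skips items lacking per-node data. */
static size_t count_entries(int numa_only)
{
	size_t i, n = 0;

	for (i = 0; i < NR_DESCS; i++)
		if (!numa_only || stat_descs[i].has_node_counter)
			n++;
	return n;
}

/* Format one nested-keyed line: "name N0=<bytes> N1=<bytes> ...". */
static int format_numa_line(char *buf, size_t len, const struct stat_desc *d,
			    const unsigned long *raw, int nr_nodes)
{
	int off, nid;

	off = snprintf(buf, len, "%s", d->name);
	for (nid = 0; nid < nr_nodes && off > 0 && (size_t)off < len; nid++)
		off += snprintf(buf + off, len - off, " N%d=%llu", nid,
				(unsigned long long)raw[nid] * d->ratio);
	return off;
}
```

With one table, adding a stat item means touching a single array, and the per-node file simply skips entries whose flag is clear.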
On Wed, Sep 16, 2020 at 5:50 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Sep 16, 2020 at 01:18:01AM +0800, Muchun Song wrote:
> > In the cgroup v1, we have a numa_stat interface. This is useful for
> > providing visibility into the numa locality information within an
> > memcg since the pages are allowed to be allocated from any physical
> > node. One of the use cases is evaluating application performance by
> > combining this information with the application's CPU allocation.
> > But the cgroup v2 does not. So this patch adds the missing information.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > Suggested-by: Shakeel Butt <shakeelb@google.com>
> > Reviewed-by: Shakeel Butt <shakeelb@google.com>
>
> Yup, that would be useful information to have. Just a few comments on
> the patch below:
>
> > @@ -1368,6 +1368,78 @@ PAGE_SIZE multiple when read back.
> >  		collapsing an existing range of pages. This counter is not
> >  		present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
> >
> > +  memory.numa_stat
> > +	A read-only flat-keyed file which exists on non-root cgroups.
>
> It's a nested key file, not flat.

This is just copied from memory.stat documentation. Is the memory.stat
also a nested key file?

>
> > +	This breaks down the cgroup's memory footprint into different
> > +	types of memory, type-specific details, and other information
> > +	per node on the state of the memory management system.
> > +
> > +	This is useful for providing visibility into the NUMA locality
> > +	information within an memcg since the pages are allowed to be
> > +	allocated from any physical node. One of the use case is evaluating
> > +	application performance by combining this information with the
> > +	application's CPU allocation.
> > +
> > +	All memory amounts are in bytes.
> > +
> > +	The output format of memory.numa_stat is::
> > +
> > +	  type N0=<bytes in node 0> N1=<bytes in node 1> ...
> > +
> > +	The entries are ordered to be human readable, and new entries
> > +	can show up in the middle. Don't rely on items remaining in a
> > +	fixed position; use the keys to look up specific values!
> > +
> > +	  anon
> > +		Amount of memory per node used in anonymous mappings such
> > +		as brk(), sbrk(), and mmap(MAP_ANONYMOUS).
> > +
> > +	  file
> > +		Amount of memory per node used to cache filesystem data,
> > +		including tmpfs and shared memory.
> > +
> > +	  kernel_stack
> > +		Amount of memory per node allocated to kernel stacks.
> > +
> > +	  shmem
> > +		Amount of cached filesystem data per node that is swap-backed,
> > +		such as tmpfs, shm segments, shared anonymous mmap()s.
> > +
> > +	  file_mapped
> > +		Amount of cached filesystem data per node mapped with mmap().
> > +
> > +	  file_dirty
> > +		Amount of cached filesystem data per node that was modified but
> > +		not yet written back to disk.
> > +
> > +	  file_writeback
> > +		Amount of cached filesystem data per node that was modified and
> > +		is currently being written back to disk.
> > +
> > +	  anon_thp
> > +		Amount of memory per node used in anonymous mappings backed by
> > +		transparent hugepages.
> > +
> > +	  inactive_anon, active_anon, inactive_file, active_file, unevictable
> > +		Amount of memory, swap-backed and filesystem-backed,
> > +		per node on the internal memory management lists used
> > +		by the page reclaim algorithm.
> > +
> > +		As these represent internal list state (e.g. shmem pages are on
> > +		anon memory management lists), inactive_foo + active_foo may not
> > +		be equal to the value for the foo counter, since the foo counter
> > +		is type-based, not list-based.
> > +
> > +	  slab_reclaimable
> > +		Amount of memory per node used for storing in-kernel data
> > +		structures which might be reclaimed, such as dentries and
> > +		inodes.
> > +
> > +	  slab_unreclaimable
> > +		Amount of memory per node used for storing in-kernel data
> > +		structures which cannot be reclaimed on memory pressure.
> > +
> >    memory.swap.current
> >  	A read-only single value file which exists on non-root
> >  	cgroups.
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 75cd1a1e66c8..ff919ef3b57b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6425,6 +6425,86 @@ static int memory_stat_show(struct seq_file *m, void *v)
> >  	return 0;
> >  }
> >
> > +#ifdef CONFIG_NUMA
> > +struct numa_stat {
> > +	const char *name;
> > +	unsigned int ratio;
> > +	enum node_stat_item idx;
> > +};
> > +
> > +static struct numa_stat numa_stats[] = {
> > +	{ "anon", PAGE_SIZE, NR_ANON_MAPPED },
> > +	{ "file", PAGE_SIZE, NR_FILE_PAGES },
> > +	{ "kernel_stack", 1024, NR_KERNEL_STACK_KB },
> > +	{ "shmem", PAGE_SIZE, NR_SHMEM },
> > +	{ "file_mapped", PAGE_SIZE, NR_FILE_MAPPED },
> > +	{ "file_dirty", PAGE_SIZE, NR_FILE_DIRTY },
> > +	{ "file_writeback", PAGE_SIZE, NR_WRITEBACK },
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +	/*
> > +	 * The ratio will be initialized in numa_stats_init(). Because
> > +	 * on some architectures, the macro of HPAGE_PMD_SIZE is not
> > +	 * constant(e.g. powerpc).
> > +	 */
> > +	{ "anon_thp", 0, NR_ANON_THPS },
> > +#endif
> > +	{ "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON },
> > +	{ "active_anon", PAGE_SIZE, NR_ACTIVE_ANON },
> > +	{ "inactive_file", PAGE_SIZE, NR_INACTIVE_FILE },
> > +	{ "active_file", PAGE_SIZE, NR_ACTIVE_FILE },
> > +	{ "unevictable", PAGE_SIZE, NR_UNEVICTABLE },
> > +	{ "slab_reclaimable", 1, NR_SLAB_RECLAIMABLE_B },
> > +	{ "slab_unreclaimable", 1, NR_SLAB_UNRECLAIMABLE_B },
> > +};
>
> This is a bit duplicative with memory_stat_format(), and the
> collections will easily go out of sync as we add/change stat items.
>
> Can you please convert memory_stat_format() to use the same shared table?
>
> You may have to add another flag for the MEMCG_* items for which we
> don't have per-node counters.
>
> The same applies to the documentation. Please don't duplicate the list
> of items, but have memory.numa_stat refer to the list for memory.stat.
> You can add (not in memory.numa_stat) or something to percpu and sock.

Thanks for your suggestions.

>
> > +static unsigned long memcg_node_page_state(struct mem_cgroup *memcg,
> > +					   unsigned int nid,
> > +					   enum node_stat_item idx)
> > +{
> > +	VM_BUG_ON(nid >= nr_node_ids);
> > +	return lruvec_page_state(mem_cgroup_lruvec(memcg, NODE_DATA(nid)), idx);
> > +}
>
> Please drop this wrapper and use lruvec_page_state directly below.
>
> Otherwise, this looks reasonable to me.

OK. Will do that.
On Wed, Sep 16, 2020 at 12:14:49PM +0800, Muchun Song wrote:
> On Wed, Sep 16, 2020 at 5:50 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Sep 16, 2020 at 01:18:01AM +0800, Muchun Song wrote:
> > > In the cgroup v1, we have a numa_stat interface. This is useful for
> > > providing visibility into the numa locality information within an
> > > memcg since the pages are allowed to be allocated from any physical
> > > node. One of the use cases is evaluating application performance by
> > > combining this information with the application's CPU allocation.
> > > But the cgroup v2 does not. So this patch adds the missing information.
> > >
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > Suggested-by: Shakeel Butt <shakeelb@google.com>
> > > Reviewed-by: Shakeel Butt <shakeelb@google.com>
> >
> > Yup, that would be useful information to have. Just a few comments on
> > the patch below:
> >
> > > @@ -1368,6 +1368,78 @@ PAGE_SIZE multiple when read back.
> > >  		collapsing an existing range of pages. This counter is not
> > >  		present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
> > >
> > > +  memory.numa_stat
> > > +	A read-only flat-keyed file which exists on non-root cgroups.
> >
> > It's a nested key file, not flat.
>
> This is just copied from memory.stat documentation. Is the memory.stat
> also a nested key file?

No, memory.stat is a different format. From higher up in the document:

  Flat keyed

    KEY0 VAL0\n
    KEY1 VAL1\n
    ...

  Nested keyed

    KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
    KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
    ...

> > Otherwise, this looks reasonable to me.
>
> OK. Will do that.

Thanks!
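To make the nested-keyed format quoted above concrete, here is a minimal userspace sketch that parses one line of the proposed `type N0=<bytes> N1=<bytes> ...` output. The function name `parse_nested_line()` and the `MAX_NODES` and `KEY_LEN` limits are hypothetical choices for this example; only standard `sscanf()` conversions are used.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES 8	/* arbitrary node limit for this sketch */
#define KEY_LEN   32	/* key buffer size; must match the %31s width below */

/*
 * Parse one nested-keyed line such as "anon N0=4096 N1=8192".
 * The key is stored in 'key' and each N<id>=<bytes> value in vals[id].
 * Returns the number of sub-keys parsed, or -1 if the key is missing
 * or a node id falls outside [0, MAX_NODES).
 */
static int parse_nested_line(const char *line, char key[KEY_LEN],
			     unsigned long long vals[MAX_NODES])
{
	int n = 0, consumed = 0;

	if (sscanf(line, "%31s%n", key, &consumed) != 1)
		return -1;
	line += consumed;

	for (;;) {
		int nid;
		unsigned long long v;

		/* Stop at end of line or at the first token that is not N<id>=<val>. */
		if (sscanf(line, " N%d=%llu%n", &nid, &v, &consumed) != 2)
			break;
		if (nid < 0 || nid >= MAX_NODES)
			return -1;
		vals[nid] = v;
		line += consumed;
		n++;
	}
	return n;
}
```

A flat-keyed line such as `anon 4096` would come back from this parser with zero sub-keys, which is the practical difference between the two formats: consumers of a nested-keyed file look values up by key and then by sub-key.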
On Wed, Sep 16, 2020 at 10:42 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Sep 16, 2020 at 12:14:49PM +0800, Muchun Song wrote:
> > On Wed, Sep 16, 2020 at 5:50 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Sep 16, 2020 at 01:18:01AM +0800, Muchun Song wrote:
> > > > In the cgroup v1, we have a numa_stat interface. This is useful for
> > > > providing visibility into the numa locality information within an
> > > > memcg since the pages are allowed to be allocated from any physical
> > > > node. One of the use cases is evaluating application performance by
> > > > combining this information with the application's CPU allocation.
> > > > But the cgroup v2 does not. So this patch adds the missing information.
> > > >
> > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > Suggested-by: Shakeel Butt <shakeelb@google.com>
> > > > Reviewed-by: Shakeel Butt <shakeelb@google.com>
> > >
> > > Yup, that would be useful information to have. Just a few comments on
> > > the patch below:
> > >
> > > > @@ -1368,6 +1368,78 @@ PAGE_SIZE multiple when read back.
> > > >  		collapsing an existing range of pages. This counter is not
> > > >  		present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
> > > >
> > > > +  memory.numa_stat
> > > > +	A read-only flat-keyed file which exists on non-root cgroups.
> > >
> > > It's a nested key file, not flat.
> >
> > This is just copied from memory.stat documentation. Is the memory.stat
> > also a nested key file?
>
> No, memory.stat is a different format. From higher up in the document:
>
>   Flat keyed
>
>     KEY0 VAL0\n
>     KEY1 VAL1\n
>     ...
>
>   Nested keyed
>
>     KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
>     KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
>     ...

Got it. Thanks for your explanation.

>
> > > Otherwise, this looks reasonable to me.
> >
> > OK. Will do that.
>
> Thanks!
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6be43781ec7f..48bb12fc7622 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1368,6 +1368,78 @@ PAGE_SIZE multiple when read back.
 		collapsing an existing range of pages. This counter is not
 		present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
 
+  memory.numa_stat
+	A read-only flat-keyed file which exists on non-root cgroups.
+
+	This breaks down the cgroup's memory footprint into different
+	types of memory, type-specific details, and other information
+	per node on the state of the memory management system.
+
+	This is useful for providing visibility into the NUMA locality
+	information within an memcg since the pages are allowed to be
+	allocated from any physical node. One of the use case is evaluating
+	application performance by combining this information with the
+	application's CPU allocation.
+
+	All memory amounts are in bytes.
+
+	The output format of memory.numa_stat is::
+
+	  type N0=<bytes in node 0> N1=<bytes in node 1> ...
+
+	The entries are ordered to be human readable, and new entries
+	can show up in the middle. Don't rely on items remaining in a
+	fixed position; use the keys to look up specific values!
+
+	  anon
+		Amount of memory per node used in anonymous mappings such
+		as brk(), sbrk(), and mmap(MAP_ANONYMOUS).
+
+	  file
+		Amount of memory per node used to cache filesystem data,
+		including tmpfs and shared memory.
+
+	  kernel_stack
+		Amount of memory per node allocated to kernel stacks.
+
+	  shmem
+		Amount of cached filesystem data per node that is swap-backed,
+		such as tmpfs, shm segments, shared anonymous mmap()s.
+
+	  file_mapped
+		Amount of cached filesystem data per node mapped with mmap().
+
+	  file_dirty
+		Amount of cached filesystem data per node that was modified but
+		not yet written back to disk.
+
+	  file_writeback
+		Amount of cached filesystem data per node that was modified and
+		is currently being written back to disk.
+
+	  anon_thp
+		Amount of memory per node used in anonymous mappings backed by
+		transparent hugepages.
+
+	  inactive_anon, active_anon, inactive_file, active_file, unevictable
+		Amount of memory, swap-backed and filesystem-backed,
+		per node on the internal memory management lists used
+		by the page reclaim algorithm.
+
+		As these represent internal list state (e.g. shmem pages are on
+		anon memory management lists), inactive_foo + active_foo may not
+		be equal to the value for the foo counter, since the foo counter
+		is type-based, not list-based.
+
+	  slab_reclaimable
+		Amount of memory per node used for storing in-kernel data
+		structures which might be reclaimed, such as dentries and
+		inodes.
+
+	  slab_unreclaimable
+		Amount of memory per node used for storing in-kernel data
+		structures which cannot be reclaimed on memory pressure.
+
   memory.swap.current
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75cd1a1e66c8..ff919ef3b57b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6425,6 +6425,86 @@ static int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+#ifdef CONFIG_NUMA
+struct numa_stat {
+	const char *name;
+	unsigned int ratio;
+	enum node_stat_item idx;
+};
+
+static struct numa_stat numa_stats[] = {
+	{ "anon", PAGE_SIZE, NR_ANON_MAPPED },
+	{ "file", PAGE_SIZE, NR_FILE_PAGES },
+	{ "kernel_stack", 1024, NR_KERNEL_STACK_KB },
+	{ "shmem", PAGE_SIZE, NR_SHMEM },
+	{ "file_mapped", PAGE_SIZE, NR_FILE_MAPPED },
+	{ "file_dirty", PAGE_SIZE, NR_FILE_DIRTY },
+	{ "file_writeback", PAGE_SIZE, NR_WRITEBACK },
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/*
+	 * The ratio will be initialized in numa_stats_init(). Because
+	 * on some architectures, the macro of HPAGE_PMD_SIZE is not
+	 * constant(e.g. powerpc).
+	 */
+	{ "anon_thp", 0, NR_ANON_THPS },
+#endif
+	{ "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON },
+	{ "active_anon", PAGE_SIZE, NR_ACTIVE_ANON },
+	{ "inactive_file", PAGE_SIZE, NR_INACTIVE_FILE },
+	{ "active_file", PAGE_SIZE, NR_ACTIVE_FILE },
+	{ "unevictable", PAGE_SIZE, NR_UNEVICTABLE },
+	{ "slab_reclaimable", 1, NR_SLAB_RECLAIMABLE_B },
+	{ "slab_unreclaimable", 1, NR_SLAB_UNRECLAIMABLE_B },
+};
+
+static int __init numa_stats_init(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(numa_stats); i++) {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (numa_stats[i].idx == NR_ANON_THPS)
+			numa_stats[i].ratio = HPAGE_PMD_SIZE;
+#endif
+	}
+
+	return 0;
+}
+pure_initcall(numa_stats_init);
+
+static unsigned long memcg_node_page_state(struct mem_cgroup *memcg,
+					   unsigned int nid,
+					   enum node_stat_item idx)
+{
+	VM_BUG_ON(nid >= nr_node_ids);
+	return lruvec_page_state(mem_cgroup_lruvec(memcg, NODE_DATA(nid)), idx);
+}
+
+static int memory_numa_stat_show(struct seq_file *m, void *v)
+{
+	int i;
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	for (i = 0; i < ARRAY_SIZE(numa_stats); i++) {
+		int nid;
+
+		seq_printf(m, "%s", numa_stats[i].name);
+		for_each_node_state(nid, N_MEMORY) {
+			u64 size;
+
+			size = memcg_node_page_state(memcg, nid,
+						     numa_stats[i].idx);
+			VM_WARN_ON_ONCE(!numa_stats[i].ratio);
+			size *= numa_stats[i].ratio;
+			seq_printf(m, " N%d=%llu", nid, size);
+		}
+		seq_putc(m, '\n');
+	}
+
+	return 0;
+}
+#endif
+
 static int memory_oom_group_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
@@ -6502,6 +6582,12 @@ static struct cftype memory_files[] = {
 		.name = "stat",
 		.seq_show = memory_stat_show,
 	},
+#ifdef CONFIG_NUMA
+	{
+		.name = "numa_stat",
+		.seq_show = memory_numa_stat_show,
+	},
+#endif
 	{
 		.name = "oom.group",
 		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,