Message ID | 20241101204402.1885383-1-joshua.hahnjy@gmail.com
---|---
State | New
Series | [v4,1/1] memcg/hugetlb: Add hugeTLB counters to memcg
On Fri, 1 Nov 2024, Joshua Hahn wrote:

> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 69af2173555f..bd7e81c2aa2b 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1646,6 +1646,11 @@ The following nested keys are defined.
>        pgdemote_khugepaged
>          Number of pages demoted by khugepaged.
>
> +      hugetlb
> +        Amount of memory used by hugetlb pages. This metric only shows
> +        up if hugetlb usage is accounted for in memory.current (i.e.
> +        cgroup is mounted with the memory_hugetlb_accounting option).
> +
>    memory.numa_stat
>      A read-only nested-keyed file which exists on non-root cgroups.

Definitely makes sense to include this.

Any reason to not account different hugetlb page sizes separately in this
stat, however? IOW, should there be separate hugetlb_2048kB and
hugetlb_1048576kB stats on x86?
On Sat, Nov 9, 2024 at 9:19 PM David Rientjes <rientjes@google.com> wrote:
>
> On Fri, 1 Nov 2024, Joshua Hahn wrote:
>
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 69af2173555f..bd7e81c2aa2b 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1646,6 +1646,11 @@ The following nested keys are defined.
> >        pgdemote_khugepaged
> >          Number of pages demoted by khugepaged.
> >
> > +      hugetlb
> > +        Amount of memory used by hugetlb pages. This metric only shows
> > +        up if hugetlb usage is accounted for in memory.current (i.e.
> > +        cgroup is mounted with the memory_hugetlb_accounting option).
> > +
> >    memory.numa_stat
> >      A read-only nested-keyed file which exists on non-root cgroups.
>
> Definitely makes sense to include this.
>
> Any reason to not account different hugetlb page sizes separately in this
> stat, however? IOW, should there be separate hugetlb_2048kB and
> hugetlb_1048576kB stats on x86?

Hello David, Thank you for reviewing my patch!

The reason that I opted not to include a breakdown of each hugetlb
size in memory.stat is only because I wanted to keep the addition that
this patch makes as minimal as possible, while still addressing
the goal of bridging the gap between memory.stat and memory.current.
Users who are curious about this breakdown can see how much memory
is used by each hugetlb size by enabling the hugetlb controller as well.

It's true that this is the case as well for total hugetlb usage, but
I felt that not including hugetlb memory usage in memory.stat when it
is accounted by memory.current would cause confusion for the users
not being able to see that memory.current = sum of memory.stat. On the
other hand, seeing the breakdown of how much each hugetlb size uses felt
more like an optimization, and not a solution that bridges a confusion.

I have not had a scenario where I had to look at the breakdown of the
hugetlb sizes (without the hugetlb controller), or a scenario where not
knowing this causes some sort of confusion. If others have had this
problem, I would love to hear about it, and perhaps work on a solution
that can address this point as well!

I hope you have a great day!
Joshua
On Mon, 11 Nov 2024, Joshua Hahn wrote:

> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > index 69af2173555f..bd7e81c2aa2b 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1646,6 +1646,11 @@ The following nested keys are defined.
> > >        pgdemote_khugepaged
> > >          Number of pages demoted by khugepaged.
> > >
> > > +      hugetlb
> > > +        Amount of memory used by hugetlb pages. This metric only shows
> > > +        up if hugetlb usage is accounted for in memory.current (i.e.
> > > +        cgroup is mounted with the memory_hugetlb_accounting option).
> > > +
> > >    memory.numa_stat
> > >      A read-only nested-keyed file which exists on non-root cgroups.
> >
> > Definitely makes sense to include this.
> >
> > Any reason to not account different hugetlb page sizes separately in this
> > stat, however? IOW, should there be separate hugetlb_2048kB and
> > hugetlb_1048576kB stats on x86?
>
> Hello David, Thank you for reviewing my patch!
>
> The reason that I opted not to include a breakdown of each hugetlb
> size in memory.stat is only because I wanted to keep the addition that
> this patch makes as minimal as possible, while still addressing
> the goal of bridging the gap between memory.stat and memory.current.
> Users who are curious about this breakdown can see how much memory
> is used by each hugetlb size by enabling the hugetlb controller as well.
>

While the patch may be minimal, this is solidifying a kernel API that
users will start to count on. Users who may be interested in their
hugetlb usage may not have control over the configuration of their kernel?

Does it make sense to provide a breakdown in memory.stat so that users can
differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb
pages, which are different global resources?

> It's true that this is the case as well for total hugetlb usage, but
> I felt that not including hugetlb memory usage in memory.stat when it
> is accounted by memory.current would cause confusion for the users
> not being able to see that memory.current = sum of memory.stat. On the
> other hand, seeing the breakdown of how much each hugetlb size uses felt
> more like an optimization, and not a solution that bridges a confusion.
>

If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for
example, users could still do sum of memory.stat, no?

> I have not had a scenario where I had to look at the breakdown of the
> hugetlb sizes (without the hugetlb controller), or a scenario where not
> knowing this causes some sort of confusion. If others have had this
> problem, I would love to hear about it, and perhaps work on a solution
> that can address this point as well!
>
> I hope you have a great day!

You too!
On Mon, 11 Nov 2024, David Rientjes wrote:

> > The reason that I opted not to include a breakdown of each hugetlb
> > size in memory.stat is only because I wanted to keep the addition that
> > this patch makes as minimal as possible, while still addressing
> > the goal of bridging the gap between memory.stat and memory.current.
> > Users who are curious about this breakdown can see how much memory
> > is used by each hugetlb size by enabling the hugetlb controller as well.
> >
>
> While the patch may be minimal, this is solidifying a kernel API that
> users will start to count on. Users who may be interested in their
> hugetlb usage may not have control over the configuration of their kernel?
>
> Does it make sense to provide a breakdown in memory.stat so that users can
> differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb
> pages, which are different global resources?
>
> > It's true that this is the case as well for total hugetlb usage, but
> > I felt that not including hugetlb memory usage in memory.stat when it
> > is accounted by memory.current would cause confusion for the users
> > not being able to see that memory.current = sum of memory.stat. On the
> > other hand, seeing the breakdown of how much each hugetlb size uses felt
> > more like an optimization, and not a solution that bridges a confusion.
> >
>
> If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for
> example, users could still do sum of memory.stat, no?
>

Friendly ping on this, would there be any objections to splitting the
memory.stat metrics out to be per hugepage size?
On Wed, Nov 13, 2024 at 2:42 PM David Rientjes <rientjes@google.com> wrote:
>
> On Mon, 11 Nov 2024, David Rientjes wrote:
>
> > If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for
> > example, users could still do sum of memory.stat, no?
> >
>
> Friendly ping on this, would there be any objections to splitting the
> memory.stat metrics out to be per hugepage size?

My 2c is that it's extra complexity + overhead (IIUC these stats are
hierarchical and hence would contribute to flushing overhead). So we
should justify them with some concrete use cases if we are to add them.

From my end, I need hugetlb usage when I want to reason about memory
dynamics. This is because hugetlb is a bit special/weird - it cannot be
swapped out, for example. So I have to subtract hugetlb usage from the
overall cgroup's memory usage before making any analysis that involves
swapping.

For this use case, I just need to know how much memory is consumed by
hugetlb, regardless of the type (2M or 1G). I assume many use cases are
similar in that sense. Do you or Google have a concrete use case in mind
that requires hugetlb categorization? :)
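A minimal userspace sketch of the accounting Nhat describes above: read
memory.current and the hugetlb key from memory.stat and subtract one from
the other before reasoning about swappable memory. The cgroup path below is
a placeholder, and the hugetlb key only appears when the hierarchy is
mounted with memory_hugetlb_accounting; an absent key is treated as zero.

/*
 * Sketch only: compute a cgroup's non-hugetlb (i.e. potentially swappable)
 * usage as memory.current minus the "hugetlb" entry of memory.stat.
 * Assumes cgroup v2 mounted at /sys/fs/cgroup and a hypothetical cgroup
 * named "mygroup".
 */
#include <stdio.h>
#include <string.h>

static long long read_current(const char *cg)
{
	char path[512];
	long long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.current", cg);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%lld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

static long long read_stat_key(const char *cg, const char *key)
{
	char path[512], name[128];
	long long val;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.stat", cg);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fscanf(f, "%127s %lld", name, &val) == 2) {
		if (!strcmp(name, key)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return 0;	/* key absent, e.g. accounting option not enabled */
}

int main(void)
{
	const char *cg = "/sys/fs/cgroup/mygroup";	/* hypothetical path */
	long long cur = read_current(cg);
	long long htlb = read_stat_key(cg, "hugetlb");

	if (cur < 0 || htlb < 0) {
		perror("read");
		return 1;
	}
	printf("total: %lld, hugetlb: %lld, non-hugetlb: %lld\n",
	       cur, htlb, cur - htlb);
	return 0;
}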
On Wed, Nov 13, 2024 at 02:42:29PM -0800, David Rientjes wrote:
> On Mon, 11 Nov 2024, David Rientjes wrote:
>
> > > The reason that I opted not to include a breakdown of each hugetlb
> > > size in memory.stat is only because I wanted to keep the addition that
> > > this patch makes as minimal as possible, while still addressing
> > > the goal of bridging the gap between memory.stat and memory.current.
> > > Users who are curious about this breakdown can see how much memory
> > > is used by each hugetlb size by enabling the hugetlb controller as well.
> > >
> >
> > While the patch may be minimal, this is solidifying a kernel API that
> > users will start to count on. Users who may be interested in their
> > hugetlb usage may not have control over the configuration of their kernel?
> >
> > Does it make sense to provide a breakdown in memory.stat so that users can
> > differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb
> > pages, which are different global resources?
> >
> > > It's true that this is the case as well for total hugetlb usage, but
> > > I felt that not including hugetlb memory usage in memory.stat when it
> > > is accounted by memory.current would cause confusion for the users
> > > not being able to see that memory.current = sum of memory.stat. On the
> > > other hand, seeing the breakdown of how much each hugetlb size uses felt
> > > more like an optimization, and not a solution that bridges a confusion.
> > >
> >
> > If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for
> > example, users could still do sum of memory.stat, no?
> >
>
> Friendly ping on this, would there be any objections to splitting the
> memory.stat metrics out to be per hugepage size?

I don't think it has to be either/or. We can add the total here, and a
per-size breakdown in a separate patch (with its own changelog)?

That said, a per-size breakdown might make more sense in the hugetlb
cgroup controller. You're mentioning separate global resources, which
suggests this is about more explicitly controlled hugetlb use.

From a memcg POV, all hugetlb is the same. It's just (non-swappable)
memory consumed by the cgroup.
On Thu 14-11-24 00:26:24, Johannes Weiner wrote:
> On Wed, Nov 13, 2024 at 02:42:29PM -0800, David Rientjes wrote:
> > On Mon, 11 Nov 2024, David Rientjes wrote:
> >
> > > > The reason that I opted not to include a breakdown of each hugetlb
> > > > size in memory.stat is only because I wanted to keep the addition that
> > > > this patch makes as minimal as possible, while still addressing
> > > > the goal of bridging the gap between memory.stat and memory.current.
> > > > Users who are curious about this breakdown can see how much memory
> > > > is used by each hugetlb size by enabling the hugetlb controller as well.
> > > >
> > >
> > > While the patch may be minimal, this is solidifying a kernel API that
> > > users will start to count on. Users who may be interested in their
> > > hugetlb usage may not have control over the configuration of their kernel?
> > >
> > > Does it make sense to provide a breakdown in memory.stat so that users can
> > > differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb
> > > pages, which are different global resources?
> > >
> > > > It's true that this is the case as well for total hugetlb usage, but
> > > > I felt that not including hugetlb memory usage in memory.stat when it
> > > > is accounted by memory.current would cause confusion for the users
> > > > not being able to see that memory.current = sum of memory.stat. On the
> > > > other hand, seeing the breakdown of how much each hugetlb size uses felt
> > > > more like an optimization, and not a solution that bridges a confusion.
> > > >
> > >
> > > If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for
> > > example, users could still do sum of memory.stat, no?
> > >
> >
> > Friendly ping on this, would there be any objections to splitting the
> > memory.stat metrics out to be per hugepage size?
>
> I don't think it has to be either/or. We can add the total here, and a
> per-size breakdown in a separate patch (with its own changelog)?
>
> That said, a per-size breakdown might make more sense in the hugetlb
> cgroup controller. You're mentioning separate global resources, which
> suggests this is about more explicitly controlled hugetlb use.
>
> From a memcg POV, all hugetlb is the same. It's just (non-swappable)
> memory consumed by the cgroup.

Completely agreed. From the memcg POV there is no way to control hugetlb
pages or manage per-size charging/pools. In a sense this is not much
different from slab accounting. We do print overall SLAB accounted memory
and do not break down each slab consumer in the stat file.
On Thu, Nov 14, 2024 at 12:26:24AM -0500, Johannes Weiner wrote:
> On Wed, Nov 13, 2024 at 02:42:29PM -0800, David Rientjes wrote:
> > On Mon, 11 Nov 2024, David Rientjes wrote:
> >
> > > > The reason that I opted not to include a breakdown of each hugetlb
> > > > size in memory.stat is only because I wanted to keep the addition that
> > > > this patch makes as minimal as possible, while still addressing
> > > > the goal of bridging the gap between memory.stat and memory.current.
> > > > Users who are curious about this breakdown can see how much memory
> > > > is used by each hugetlb size by enabling the hugetlb controller as well.
> > > >
> > >
> > > While the patch may be minimal, this is solidifying a kernel API that
> > > users will start to count on. Users who may be interested in their
> > > hugetlb usage may not have control over the configuration of their kernel?
> > >
> > > Does it make sense to provide a breakdown in memory.stat so that users can
> > > differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb
> > > pages, which are different global resources?
> > >
> > > > It's true that this is the case as well for total hugetlb usage, but
> > > > I felt that not including hugetlb memory usage in memory.stat when it
> > > > is accounted by memory.current would cause confusion for the users
> > > > not being able to see that memory.current = sum of memory.stat. On the
> > > > other hand, seeing the breakdown of how much each hugetlb size uses felt
> > > > more like an optimization, and not a solution that bridges a confusion.
> > > >
> > >
> > > If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for
> > > example, users could still do sum of memory.stat, no?
> > >
> >
> > Friendly ping on this, would there be any objections to splitting the
> > memory.stat metrics out to be per hugepage size?
>
> I don't think it has to be either/or. We can add the total here, and a
> per-size breakdown in a separate patch (with its own changelog)?
>
> That said, a per-size breakdown might make more sense in the hugetlb
> cgroup controller. You're mentioning separate global resources, which
> suggests this is about more explicitly controlled hugetlb use.

+1

> From a memcg POV, all hugetlb is the same. It's just (non-swappable)
> memory consumed by the cgroup.

And here too. Thanks!
On Wed, 13 Nov 2024 14:42:29 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

Hi David,

Sorry for the late response on this thread. To be completely transparent
with you, I am not someone who inspects hugetlb usage on a regular basis,
so I may not have the most relevant insights when it comes to how much
utility there would be from breaking down the usage by size. With that
said, I believe that over the past couple of days, there have been some
responses on this thread regarding how others use hugetlb. As you know, I
share Johannes's opinion that if there are people who would benefit from
splitting up the hugetlb usage across different page sizes, it should
happen in the hugetlb controller.

> On Mon, 11 Nov 2024, David Rientjes wrote:
> > While the patch may be minimal, this is solidifying a kernel API that
> > users will start to count on. Users who may be interested in their
> > hugetlb usage may not have control over the configuration of their kernel?

This is a good point. With that said, I believe that this is an instance
of a feature where both of our proposed ideas can co-exist; we can have
the total hugetlb usage reported in memcg for now, and if there is a
consensus / majority that would like to see the breakdown as well, we can
introduce it in a future patch without breaking the utility of this patch.

To quickly address a potential concern of bloating the already large
memcg stats: including both the total and the breakdown wouldn't be the
first time a stat and its breakdown are both reported; there is precedent
for this with slab_(un)reclaimable and slab.

> > Does it make sense to provide a breakdown in memory.stat so that users can
> > differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb
> > pages, which are different global resources?
> >
> > > It's true that this is the case as well for total hugetlb usage, but
> > > I felt that not including hugetlb memory usage in memory.stat when it
> > > is accounted by memory.current would cause confusion for the users
> > > not being able to see that memory.current = sum of memory.stat. On the
> > > other hand, seeing the breakdown of how much each hugetlb size uses felt
> > > more like an optimization, and not a solution that bridges a confusion.
> > >
> >
> > If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for
> > example, users could still do sum of memory.stat, no?

This is true! I still think it would be nice to include the total anyways,
since for a lot of people who use this statistic (Nhat's response in this
thread and Shakeel's response to v3 of this patch), all they want is a
quick check to see how much memory is being used by hugetlb so they can
reason about memory dynamics. Again, I think that if we are to include a
breakdown in a future patch, it can coexist with this one.

> Friendly ping on this, would there be any objections to splitting the
> memory.stat metrics out to be per hugepage size?

Sorry for the late response again. I think that if you had examples of use
cases where having the different page sizes broken out is useful, it would
help me better understand the motivation for including the breakdown (and
I would be happy to write the patch for the breakdown as well if there is
a consensus!)

Thank you for your thoughts, have a great day!
Joshua
On Thu, 14 Nov 2024, Johannes Weiner wrote:

> > > > The reason that I opted not to include a breakdown of each hugetlb
> > > > size in memory.stat is only because I wanted to keep the addition that
> > > > this patch makes as minimal as possible, while still addressing
> > > > the goal of bridging the gap between memory.stat and memory.current.
> > > > Users who are curious about this breakdown can see how much memory
> > > > is used by each hugetlb size by enabling the hugetlb controller as well.
> > > >
> > >
> > > While the patch may be minimal, this is solidifying a kernel API that
> > > users will start to count on. Users who may be interested in their
> > > hugetlb usage may not have control over the configuration of their kernel?
> > >
> > > Does it make sense to provide a breakdown in memory.stat so that users can
> > > differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb
> > > pages, which are different global resources?
> > >
> > > > It's true that this is the case as well for total hugetlb usage, but
> > > > I felt that not including hugetlb memory usage in memory.stat when it
> > > > is accounted by memory.current would cause confusion for the users
> > > > not being able to see that memory.current = sum of memory.stat. On the
> > > > other hand, seeing the breakdown of how much each hugetlb size uses felt
> > > > more like an optimization, and not a solution that bridges a confusion.
> > > >
> > >
> > > If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for
> > > example, users could still do sum of memory.stat, no?
> > >
> >
> > Friendly ping on this, would there be any objections to splitting the
> > memory.stat metrics out to be per hugepage size?
>
> I don't think it has to be either/or. We can add the total here, and a
> per-size breakdown in a separate patch (with its own changelog)?
>
> That said, a per-size breakdown might make more sense in the hugetlb
> cgroup controller. You're mentioning separate global resources, which
> suggests this is about more explicitly controlled hugetlb use.
>
> From a memcg POV, all hugetlb is the same. It's just (non-swappable)
> memory consumed by the cgroup.
>

Ok, that's fair. We have a local patch that tracks hugetlb usage,
admittedly for all hugetlb sizes, in struct mem_cgroup_per_node so that we
can provide a breakdown in memory.numa_stat because we can't get the
per-node breakdown from hugetlb_cgroup. If there is interest in that
breakdown, we could easily propose the patch.
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 69af2173555f..bd7e81c2aa2b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1646,6 +1646,11 @@ The following nested keys are defined.
       pgdemote_khugepaged
         Number of pages demoted by khugepaged.
 
+      hugetlb
+        Amount of memory used by hugetlb pages. This metric only shows
+        up if hugetlb usage is accounted for in memory.current (i.e.
+        cgroup is mounted with the memory_hugetlb_accounting option).
+
   memory.numa_stat
     A read-only nested-keyed file which exists on non-root cgroups.
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..972795ae5946 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -220,6 +220,9 @@ enum node_stat_item {
 	PGDEMOTE_KSWAPD,
 	PGDEMOTE_DIRECT,
 	PGDEMOTE_KHUGEPAGED,
+#ifdef CONFIG_HUGETLB_PAGE
+	NR_HUGETLB,
+#endif
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 190fa05635f4..fbb10e52d7ea 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1925,6 +1925,7 @@ void free_huge_folio(struct folio *folio)
 				     pages_per_huge_page(h), folio);
 	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 					  pages_per_huge_page(h), folio);
+	lruvec_stat_mod_folio(folio, NR_HUGETLB, -pages_per_huge_page(h));
 	mem_cgroup_uncharge(folio);
 	if (restore_reserve)
 		h->resv_huge_pages++;
@@ -3093,6 +3094,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	if (!memcg_charge_ret)
 		mem_cgroup_commit_charge(folio, memcg);
+	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
 	mem_cgroup_put(memcg);
 
 	return folio;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c57..5444d0e7bb64 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -310,6 +310,9 @@ static const unsigned int memcg_node_stat_items[] = {
 	PGDEMOTE_KSWAPD,
 	PGDEMOTE_DIRECT,
 	PGDEMOTE_KHUGEPAGED,
+#ifdef CONFIG_HUGETLB_PAGE
+	NR_HUGETLB,
+#endif
 };
 
 static const unsigned int memcg_stat_items[] = {
@@ -1346,6 +1349,9 @@ static const struct memory_stat memory_stats[] = {
 	{ "unevictable",		NR_UNEVICTABLE },
 	{ "slab_reclaimable",		NR_SLAB_RECLAIMABLE_B },
 	{ "slab_unreclaimable",		NR_SLAB_UNRECLAIMABLE_B },
+#ifdef CONFIG_HUGETLB_PAGE
+	{ "hugetlb",			NR_HUGETLB },
+#endif
 
 	/* The memory events */
 	{ "workingset_refault_anon",	WORKINGSET_REFAULT_ANON },
@@ -1441,6 +1447,11 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
 
+#ifdef CONFIG_HUGETLB_PAGE
+		if (unlikely(memory_stats[i].idx == NR_HUGETLB) &&
+		    !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
+			continue;
+#endif
 		size = memcg_page_state_output(memcg, memory_stats[i].idx);
 		seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b5a4cea423e1..871566b04b79 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1273,6 +1273,9 @@ const char * const vmstat_text[] = {
 	"pgdemote_kswapd",
 	"pgdemote_direct",
 	"pgdemote_khugepaged",
+#ifdef CONFIG_HUGETLB_PAGE
+	"nr_hugetlb",
+#endif
 	/* system-wide enum vm_stat_item counters */
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",
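To see the new counter move, a hedged userspace sketch: fault in one 2MB
hugetlb page and print the cgroup's hugetlb line from memory.stat before
and after. It assumes the process already runs in a cgroup v2 hierarchy
mounted with memory_hugetlb_accounting, that 2MB is the default hugetlb
page size and pages have been reserved via vm.nr_hugepages, and that the
memory.stat path below is a placeholder; memory.stat is flushed lazily, so
the second read may lag briefly.

/*
 * Sketch only: exercise the hugetlb entry added to memory.stat by the
 * patch above. Not the patch's selftest; paths and sizes are assumptions.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGETLB_SZ (2UL * 1024 * 1024)	/* assumes 2MB default hugetlb size */

static void dump_hugetlb_stat(const char *path)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "hugetlb ", 8))
			fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	const char *stat = "/sys/fs/cgroup/mygroup/memory.stat";	/* placeholder */
	void *p;

	dump_hugetlb_stat(stat);		/* before mapping */

	p = mmap(NULL, HUGETLB_SZ, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, HUGETLB_SZ);		/* fault the hugetlb page in */

	dump_hugetlb_stat(stat);		/* after: should grow by ~2MB */

	munmap(p, HUGETLB_SZ);
	return 0;
}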