[v3,1/1] memcg/hugetlb: Adding hugeTLB counters to memcg

Message ID 20241028210505.1950884-1-joshua.hahnjy@gmail.com (mailing list archive)
State New
Series [v3,1/1] memcg/hugetlb: Adding hugeTLB counters to memcg

Commit Message

Joshua Hahn Oct. 28, 2024, 9:05 p.m. UTC
This patch introduces a new counter to memory.stat that tracks hugeTLB
usage, only if hugeTLB accounting is done to memory.current. This
feature is enabled the same way hugeTLB accounting is enabled, via
the memory_hugetlb_accounting mount flag for cgroupsv2.
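
As a quick illustration, the opt-in and the new key can be exercised
from userspace roughly like this (a sketch: the mount point and the
"test" cgroup are assumptions for the example, not part of this patch):

#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

int main(void)
{
	char line[256];
	FILE *f;

	/* cgroup2 mount options can be toggled with a remount */
	if (mount(NULL, "/sys/fs/cgroup", "cgroup2", MS_REMOUNT,
		  "memory_hugetlb_accounting"))
		perror("remount cgroup2");

	/* the "hugetlb" key only shows up when the flag is in effect */
	f = fopen("/sys/fs/cgroup/test/memory.stat", "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "hugetlb ", 8))
			fputs(line, stdout);
	fclose(f);
	return 0;
}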

1. Why is this patch necessary?
Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
hugeTLB usage to memory.current. However, the metric is not reported in
memory.stat. Given that users often interpret memory.stat as a breakdown
of the value reported in memory.current, the disparity between the two
reports can be confusing. This patch solves this problem by including
the metric in memory.stat as well, but only if it is also reported in
memory.current (it would also be confusing if the value was reported in
memory.stat, but not in memory.current).

Aside from the consistency between the two files, we also see benefits
in observability. Userspace might be interested in the hugeTLB footprint
of cgroups for many reasons. For instance, system admins might want to
verify that hugeTLB usage is distributed as expected across tasks: i.e.
memory-intensive tasks are using more hugeTLB pages than tasks that
don't consume a lot of memory, or are seen to fault frequently. Note that
this is separate from wanting to inspect the distribution for limiting
purposes (in which case, hugeTLB controller makes more sense).

2. We already have a hugeTLB controller. Why not use that?
It is true that hugeTLB tracks the exact value that we want. In fact, by
enabling the hugeTLB controller, we get all of the observability
benefits that I mentioned above, and users can check the total hugeTLB
usage, verify if it is distributed as expected, etc.

With this said, there are 2 problems:
(a) They are still not reported in memory.stat, which means the
    disparity between the memcg reports is still there.
(b) We cannot reasonably expect users to enable the hugeTLB controller
    just for the sake of hugeTLB usage reporting, especially since
    they don't have any use for hugeTLB usage enforcement [2].

[1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
[2] Of course, we can't make a new patch for every feature that can be
    duplicated. However, since the existing solution of enabling the
    hugeTLB controller is an imperfect solution that still leaves a
    discrepancy between memory.stat and memory.current, I think that it
    is reasonable to isolate the feature in this case.
 
Suggested-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

---
Changelog
v3:
  * Removed check for whether CGRP_ROOT_HUGETLB_ACCOUNTING is on, since
    this check is already handled by lruvec_stat_mod (and doing the
    check in hugetlb.c actually breaks the build if MEMCG is not
    enabled).
  * Because there is now only one check for the flags, I've opted to
    do all of the cleanup in a separate patch series.
  * Added hugetlb information in cgroup-v2.rst
  * Added Suggested-by: Shakeel Butt
v2:
  * Enables the feature only if memcg accounts for hugeTLB usage
  * Moves the counter from memcg_stat_item to node_stat_item
  * Expands on motivation & justification in the commit log
  * Added Suggested-by: Nhat Pham

 Documentation/admin-guide/cgroup-v2.rst |  5 +++++
 include/linux/mmzone.h                  |  3 +++
 mm/hugetlb.c                            |  2 ++
 mm/memcontrol.c                         | 11 +++++++++++
 mm/vmstat.c                             |  3 +++
 5 files changed, 24 insertions(+)

Comments

Shakeel Butt Oct. 28, 2024, 10:38 p.m. UTC | #1
On Mon, Oct 28, 2024 at 02:05:05PM GMT, Joshua Hahn wrote:
[...]
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Roman Gushchin Oct. 29, 2024, 8:28 p.m. UTC | #2
On Mon, Oct 28, 2024 at 02:05:05PM -0700, Joshua Hahn wrote:
[...]
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks!
Johannes Weiner Oct. 29, 2024, 8:48 p.m. UTC | #3
On Mon, Oct 28, 2024 at 02:05:05PM -0700, Joshua Hahn wrote:
[...]
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Nhat Pham Oct. 29, 2024, 9:04 p.m. UTC | #4
On Mon, Oct 28, 2024 at 2:05 PM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
[...]
> Suggested-by: Nhat Pham <nphamcs@gmail.com>

I'll take it ;)

> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Michal Hocko Oct. 30, 2024, 11:35 a.m. UTC | #5
On Mon 28-10-24 14:05:05, Joshua Hahn wrote:
[...]
> Changelog
> v3:
>   * Removed check for whether CGRP_ROOT_HUGETLB_ACCOUNTING is on, since
>     this check is already handled by lruvec_stat_mod (and doing the
>     check in hugetlb.c actually breaks the build if MEMCG is not
>     enabled).
[...]
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 190fa05635f4..fbb10e52d7ea 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1925,6 +1925,7 @@ void free_huge_folio(struct folio *folio)
>  				     pages_per_huge_page(h), folio);
>  	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>  					  pages_per_huge_page(h), folio);
> +	lruvec_stat_mod_folio(folio, NR_HUGETLB, -pages_per_huge_page(h));
>  	mem_cgroup_uncharge(folio);
>  	if (restore_reserve)
>  		h->resv_huge_pages++;
> @@ -3093,6 +3094,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  
>  	if (!memcg_charge_ret)
>  		mem_cgroup_commit_charge(folio, memcg);
> +	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
>  	mem_cgroup_put(memcg);
>  
>  	return folio;

I do not see any specific checks for CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING
in these paths. I guess you wanted to say that you rely on
mem_cgroup_commit_charge setting the memcg pointer, which
__lruvec_stat_mod_folio then relies on when updating stats.

I suspect this all is done because you want a global counter to be
updated as well, right? Changelog doesn't say anything about that
though. Why is this needed when /proc/meminfo already describes the
global hugetlb usage?
Chris Down Oct. 30, 2024, 2:52 p.m. UTC | #6
Thanks for the detailed changelog; it answered a bunch of questions I had
about the general semantics.

Joshua Hahn writes:
>Suggested-by: Nhat Pham <nphamcs@gmail.com>
>Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
>Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Acked-by: Chris Down <chris@chrisdown.name>
Johannes Weiner Oct. 30, 2024, 3:01 p.m. UTC | #7
On Wed, Oct 30, 2024 at 12:35:25PM +0100, Michal Hocko wrote:
> On Mon 28-10-24 14:05:05, Joshua Hahn wrote:
> [...]
> > Changelog
> > v3:
> >   * Removed check for whether CGRP_ROOT_HUGETLB_ACCOUNTING is on, since
> >     this check is already handled by lruvec_stat_mod (and doing the
> >     check in hugetlb.c actually breaks the build if MEMCG is not
> >     enabled).
> [...]
> 
> I do not see any specific checks for CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING
> in these paths. I guess you wanted to say that you rely on
> mem_cgroup_commit_charge setting the memcg pointer, which
> __lruvec_stat_mod_folio then relies on when updating stats.

Yes, this is what Shakeel pointed out here:

https://lore.kernel.org/lkml/il346o3nahawquum3t5rzcuuntkdpyahidpm2ctmdibj3td7pm@2aqirlm5hrdh/

> I suspect this all is done because you want a global counter to be
> updated as well, right? Changelog doesn't say anything about that
> though. Why is this needed when /proc/meminfo already describes the
> global hugetlb usage?

Sigh.

vmstats is the preferred framework for cgroup stats. It makes stat
items consistent between global and cgroup. It provides a per-node
breakdown as well which is useful. It avoids proliferating
cgroup-specific hooks in generic MM code.

It was a ton of work to integrate cgroup stats into vmstats and get
rid of all the memcg special casing everywhere. You were there for all
of it. We're not adding cgroup-specific stats unless unavoidable.

Duplication doesn't matter, either. We have plenty of overlap between
vmstat and meminfo. By all means, send a follow-up patch to have the
meminfo one sourced from global_node_page_state().
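
(Purely as a sketch of that follow-up, not part of this series -- the
helper name below is made up; only global_node_page_state() is an
existing API:)

/* Derive the meminfo total from the new vmstat item instead of
 * iterating hstates; NR_HUGETLB is in pages, meminfo wants kB.
 */
static unsigned long hugetlb_total_kb(void)
{
	return global_node_page_state(NR_HUGETLB) << (PAGE_SHIFT - 10);
}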

But you know all this. I'm having a hard time seeing the way you are,
and have been, engaging with this patch as good-faithed.
Michal Hocko Oct. 30, 2024, 3:27 p.m. UTC | #8
On Wed 30-10-24 11:01:02, Johannes Weiner wrote:
> On Wed, Oct 30, 2024 at 12:35:25PM +0100, Michal Hocko wrote:
> > On Mon 28-10-24 14:05:05, Joshua Hahn wrote:
> > [...]
> > > Changelog
> > > v3:
> > >   * Removed check for whether CGRP_ROOT_HUGETLB_ACCOUNTING is on, since
> > >     this check is already handled by lruvec_stat_mod (and doing the
> > >     check in hugetlb.c actually breaks the build if MEMCG is not
> > >     enabled).
> > [...]
> > 
> > I do not see any specific checks for CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING
> > in these paths. I guess you wanted to say that you rely on
> > mem_cgroup_commit_charge setting the memcg pointer, which
> > __lruvec_stat_mod_folio then relies on when updating stats.
> 
> Yes, this is what Shakeel pointed out here:
> 
> https://lore.kernel.org/lkml/il346o3nahawquum3t5rzcuuntkdpyahidpm2ctmdibj3td7pm@2aqirlm5hrdh/

It belongs to the changelog.

> > I suspect this all is done because you want a global counter to be
> > updated as well, right? Changelog doesn't say anything about that
> > though. Why is this needed when /proc/meminfo already describes the
> > global hugetlb usage?
> 
> Sigh.
> 
> vmstats is the preferred framework for cgroup stats. It makes stat
> items consistent between global and cgroup. It provides a per-node
> breakdown as well which is useful. It avoids proliferating
> cgroup-specific hooks in generic MM code.
> 
> It was a ton of work to integrate cgroup stats into vmstats and get
> rid of all the memcg special casing everywhere. You were there for all
> of it. We're not adding cgroup-specific stats unless unavoidable.
> 
> Duplication doesn't matter, either. We have plenty of overlap between
> vmstat and meminfo. By all means, send a follow-up patch to have the
> meminfo one sourced from global_node_page_state().
> 
> But you know all this.

It is not really important what _I_ do know. The commit log is not
written for me.

Joshua has greatly improved the motivation part. Yet the implementation
specifics are still lagging behind, mostly left in the Changelog part of
the commit message, which will be dropped along with the diffstat.

> I'm having a hard time seeing the way you are,
> and have been, engaging with this patch as good-faithed.

Sorry to hear that, but this hasn't been my intention. My main focus has
been to get the implicit assumptions into the patch description. I really
do not see a reason for such pushback, TBH.
Johannes Weiner Oct. 30, 2024, 6:30 p.m. UTC | #9
On Wed, Oct 30, 2024 at 04:27:37PM +0100, Michal Hocko wrote:
> On Wed 30-10-24 11:01:02, Johannes Weiner wrote:
> > On Wed, Oct 30, 2024 at 12:35:25PM +0100, Michal Hocko wrote:
> > > On Mon 28-10-24 14:05:05, Joshua Hahn wrote:
> > > [...]
> > > > Changelog
> > > > v3:
> > > >   * Removed check for whether CGRP_ROOT_HUGETLB_ACCOUNTING is on, since
> > > >     this check is already handled by lruvec_stat_mod (and doing the
> > > >     check in hugetlb.c actually breaks the build if MEMCG is not
> > > >     enabled).
> > > [...]
> > > 
> > > I do not see any specific checks for CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING
> > > in these paths. I guess you wanted to say that you rely on
> > > mem_cgroup_commit_charge setting the memcg pointer, which
> > > __lruvec_stat_mod_folio then relies on when updating stats.
> > 
> > Yes, this is what Shakeel pointed out here:
> > 
> > https://lore.kernel.org/lkml/il346o3nahawquum3t5rzcuuntkdpyahidpm2ctmdibj3td7pm@2aqirlm5hrdh/
> 
> It belongs to the changelog.

Joshua, can you please include something like this at the end:

lruvec_stat_mod_folio() keys off of folio->memcg linkage, which is
only set up if CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is switched
on. This ensures that memory.stat::hugetlb is in sync with the hugetlb
share of memory.current.
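
For reference, the mechanism looks roughly like this (a simplified
sketch of __lruvec_stat_mod_folio() with the RCU locking elided, not
the verbatim kernel code):

void __lruvec_stat_mod_folio(struct folio *folio,
			     enum node_stat_item idx, int val)
{
	pg_data_t *pgdat = folio_pgdat(folio);
	struct mem_cgroup *memcg = folio_memcg(folio);

	/* No memcg linkage (e.g. hugetlb mounted without the
	 * accounting flag): only the global node counter moves,
	 * so /proc/vmstat still sees the update.
	 */
	if (!memcg) {
		__mod_node_page_state(pgdat, idx, val);
		return;
	}

	/* Otherwise update node, memcg and lruvec stats together */
	__mod_lruvec_state(mem_cgroup_lruvec(memcg, pgdat), idx, val);
}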
Joshua Hahn Oct. 30, 2024, 8:43 p.m. UTC | #10
On Wed, Oct 30, 2024 at 2:30 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Joshua, can you please include something like this at the end:
>
> lruvec_stat_mod_folio() keys off of folio->memcg linkage, which is
> only set up if CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is switched
> on. This ensures that memory.stat::hugetlb is in sync with the hugetlb
> share of memory.current.

Hello Andrew,

I saw that it was merged into mm-unstable earlier yesterday. Would it be
possible to add this block of text to the patch description right before
the footnotes?

3. Implementation Details:
In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio
regardless of whether memcg accounts hugetlb. lruvec_stat_mod_folio
keys off of folio->memcg which is only set up if the
CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is used, so
it will not try to accumulate hugetlb unless the flag is set.
This also ensures that memory.stat::hugetlb is the same as
the share of memory.current that is used by hugetlb pages.

And could you also update the list of signatures to reflect the
responses on this version?
Suggested-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Thank you so much, I hope you have a great rest of your day.
Joshua
Andrew Morton Oct. 30, 2024, 11:26 p.m. UTC | #11
On Wed, 30 Oct 2024 16:43:42 -0400 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> I saw that it was merged into mm-unstable earlier yesterday. Would it be
> possible to add this block of text to the patch description right before
> the footnotes?
> 
> 3. Implementation Details:
> In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio
> regardless of whether memcg accounts hugetlb. lruvec_stat_mod_folio
> keys off of folio->memcg which is only set up if the
> CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is used, so
> it will not try to accumulate hugetlb unless the flag is set.
> This also ensures that memory.stat::hugetlb is the same as
> the share of memory.current that is used by hugetlb pages.

Thanks, done.

> And could you also update the list of signatures to reflect the
> responses on this version?
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Chris Down <chris@chrisdown.name>
> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Done2.  I already had all that, plus an ack from Chris Down.
Michal Hocko Oct. 31, 2024, 8:12 a.m. UTC | #12
On Wed 30-10-24 16:43:42, Joshua Hahn wrote:
> On Wed, Oct 30, 2024 at 2:30 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Joshua, can you please include something like this at the end:
> >
> > lruvec_stat_mod_folio() keys off of folio->memcg linkage, which is
> > only set up if CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is switched
> > on. This ensures that memory.stat::hugetlb is in sync with the hugetlb
> > share of memory.current.
> 
> Hello Andrew,
> 
> I saw that it was merged into mm-unstable earlier yesterday. Would it
> be possible
> to add this block of text to the patch description right before the footnotes?
> 
> 3. Implementation Details:
> In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio
> regardless of whether memcg accounts hugetlb. lruvec_stat_mod_folio
> keys off of folio->memcg which is only set up if the
> CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is used, so
> it will not try to accumulate hugetlb unless the flag is set.

Thanks for the update, and sorry for being a pitbull here, but this is
a bit confusing. Let me try to reformulate as per my understanding:

In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio
regardless of whether memcg accounts hugetlb. mem_cgroup_commit_charge,
called from alloc_hugetlb_folio, will set memcg for the folio only if
CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is enabled, so lruvec_stat_mod_folio
accounts the per-memcg hugetlb counter only if the feature is enabled.
Regardless of the memcg accounting, though, the newly added global
counter is updated and shown in /proc/vmstat.

I would also add the following

The global counter is added because vmstats is the preferred framework
for cgroup stats. It makes stat items consistent between global and
cgroup. It provides a per-node breakdown as well which is useful. It
avoids proliferating cgroup-specific hooks in generic MM code.

I will leave it up to you whether to add the above paragraphs, but I
believe they clarify the intention and the implementation.

Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
Johannes Weiner Oct. 31, 2024, 4:09 p.m. UTC | #13
On Thu, Oct 31, 2024 at 09:12:46AM +0100, Michal Hocko wrote:
> I will leave it up to you whether to add the above paragraphs, but I
> believe they clarify the intention and the implementation.

Both seem reasonable to me to add to the changelog as well.

> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
Joshua Hahn Oct. 31, 2024, 7:03 p.m. UTC | #14
Hi Michal,

On Thu, Oct 31, 2024 at 4:12 AM Michal Hocko <mhocko@suse.com> wrote:
>
> I would also add the following
>
> The global counter is added because vmstats is the preferred framework
> for cgroup stats. It makes stat items consistent between global and
> cgroup. It provides a per-node breakdown as well which is useful. It
> avoids proliferating cgroup-specific hooks in generic MM code.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

Thank you for your feedback and review. I think this makes sense;
I will add a new paragraph to the implementation details section!

Andrew -- I am sorry to ask again, but do you think you can replace
the 3rd section in the patch (3. Implementation Details) with the
following paragraphs? Thank you so much!

In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio
regardless of whether memcg accounts hugetlb. mem_cgroup_commit_charge
which is called from alloc_hugetlb_folio will set memcg for the folio
only if the CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is
used, so lruvec_stat_mod_folio accounts per-memcg hugetlb counters
only if the feature is enabled. Regardless of whether memcg accounts
for hugetlb, the newly added global counter is updated and shown
in /proc/vmstat.

The global counter is added because vmstats is the preferred framework
for cgroup stats. It makes stat items consistent between global and
cgroups. It also provides a per-node breakdown, which is useful.
Because it does not use cgroup-specific hooks, we also keep generic
MM code separate from memcg code.
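
To make the global / per-node point concrete, a small reader could look
like the sketch below (the node0 sysfs path is illustrative of where
the per-node breakdown surfaces; the program itself is hypothetical):

#include <stdio.h>
#include <string.h>

/* print the "nr_hugetlb" item from a vmstat-style file */
static void show_nr_hugetlb(const char *path)
{
	char line[128];
	FILE *f = fopen(path, "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "nr_hugetlb ", 11))
			printf("%s: %s", path, line);
	fclose(f);
}

int main(void)
{
	show_nr_hugetlb("/proc/vmstat");                          /* global */
	show_nr_hugetlb("/sys/devices/system/node/node0/vmstat"); /* per-node */
	return 0;
}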

Thank you Johannes & Michal for your continued feedback and interest
in my work, and thank you Andrew for reviewing and allowing me to
fix the patch messages.

I hope you all have a great rest of your day!
Joshua
Andrew Morton Nov. 1, 2024, 1:34 a.m. UTC | #15
On Thu, 31 Oct 2024 15:03:34 -0400 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Andrew -- I am sorry to ask again, but do you think you can replace
> the 3rd section in the patch (3. Implementation Details) with the
> following paragraphs?

No problem.

: This patch introduces a new counter to memory.stat that tracks hugeTLB
: usage, only if hugeTLB accounting is done to memory.current.  This feature
: is enabled the same way hugeTLB accounting is enabled, via the
: memory_hugetlb_accounting mount flag for cgroupsv2.
: 
: 1. Why is this patch necessary?
: Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
: hugeTLB usage to memory.current.  However, the metric is not reported in
: memory.stat.  Given that users often interpret memory.stat as a breakdown
: of the value reported in memory.current, the disparity between the two
: reports can be confusing.  This patch solves this problem by including the
: metric in memory.stat as well, but only if it is also reported in
: memory.current (it would also be confusing if the value was reported in
: memory.stat, but not in memory.current).
: 
: Aside from the consistency between the two files, we also see benefits in
: observability.  Userspace might be interested in the hugeTLB footprint of
: cgroups for many reasons.  For instance, system admins might want to
: verify that hugeTLB usage is distributed as expected across tasks: i.e. 
: memory-intensive tasks are using more hugeTLB pages than tasks that don't
: consume a lot of memory, or are seen to fault frequently.  Note that this
: is separate from wanting to inspect the distribution for limiting purposes
: (in which case, hugeTLB controller makes more sense).
: 
: 2.  We already have a hugeTLB controller.  Why not use that?  It is true
: that hugeTLB tracks the exact value that we want.  In fact, by enabling
: the hugeTLB controller, we get all of the observability benefits that I
: mentioned above, and users can check the total hugeTLB usage, verify if it
: is distributed as expected, etc.
: 
: 3.  Implementation Details:
: In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio
: regardless of whether memcg accounts hugetlb.  mem_cgroup_commit_charge
: which is called from alloc_hugetlb_folio will set memcg for the folio
: only if the CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is
: used, so lruvec_stat_mod_folio accounts per-memcg hugetlb counters only
: if the feature is enabled.  Regardless of whether memcg accounts for
: hugetlb, the newly added global counter is updated and shown in
: /proc/vmstat.
: 
: The global counter is added because vmstats is the preferred framework
: for cgroup stats.  It makes stat items consistent between global and
: cgroups.  It also provides a per-node breakdown, which is useful. 
: Because it does not use cgroup-specific hooks, we also keep generic MM
: code separate from memcg code.
: 
: With this said, there are 2 problems:
: (a) They are still not reported in memory.stat, which means the
:     disparity between the memcg reports is still there.
: (b) We cannot reasonably expect users to enable the hugeTLB controller
:     just for the sake of hugeTLB usage reporting, especially since
:     they don't have any use for hugeTLB usage enforcement [2].
: 
: [1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
: [2] Of course, we can't make a new patch for every feature that can be
:     duplicated. However, since the existing solution of enabling the
:     hugeTLB controller is an imperfect solution that still leaves a
:     discrepancy between memory.stat and memory.current, I think that it
:     is reasonable to isolate the feature in this case.
Joshua Hahn Nov. 1, 2024, 6:33 p.m. UTC | #16
On Thu, Oct 31, 2024 at 9:34 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 31 Oct 2024 15:03:34 -0400 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> > Andrew -- I am sorry to ask again, but do you think you can replace
> > the 3rd section in the patch (3. Implementation Details) with the
> > following paragraphs?
>
> No problem.
> : 3.  Implementation Details:
> : In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio
> : regardless of whether memcg accounts hugetlb.  mem_cgroup_commit_charge
> : which is called from alloc_hugetlb_folio will set memcg for the folio
> : only if the CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is
> : used, so lruvec_stat_mod_folio accounts per-memcg hugetlb counters only
> : if the feature is enabled.  Regardless of whether memcg accounts for
> : hugetlb, the newly added global counter is updated and shown in
> : /proc/vmstat.
> :
> : The global counter is added because vmstats is the preferred framework
> : for cgroup stats.  It makes stat items consistent between global and
> : cgroups.  It also provides a per-node breakdown, which is useful.
> : Because it does not use cgroup-specific hooks, we also keep generic MM
> : code separate from memcg code.
> :
> : With this said, there are 2 problems:
> : (a) They are still not reported in memory.stat, which means the
> :     disparity between the memcg reports is still there.
> : (b) We cannot reasonably expect users to enable the hugeTLB controller
> :     just for the sake of hugeTLB usage reporting, especially since
> :     they don't have any use for hugeTLB usage enforcement [2].
> :
> : [1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
> : [2] Of course, we can't make a new patch for every feature that can be
> :     duplicated. However, since the existing solution of enabling the
> :     hugeTLB controller is an imperfect solution that still leaves a
> :     discrepancy between memory.stat and memory.current, I think that it
> :     is reasonable to isolate the feature in this case.
>

Hello Andrew,

Thank you for your help as always. I apologize for not being clear in my
original request -- the "With this said, there are 2 problems:" paragraph is
part of the 2nd section (2. We already have a hugeTLB controller...) So the
outline will be:

This patch introduces...

1. Why is this patch necessary?\n
Currently, memcg hugeTLB accounting...

Aside from the consistency between...

2. We already have a hugeTLB controller. Why not use that?\n
It is true that hugeTLB tracks...

With this said, there are 2 problems:\n
  (a) They are still not...
  (b) We cannot reasonably...

3. Implementation Details\n
In the alloc / free hugetlb functions, ...

The global counter is added because...

[1] https://lore.kernel.org/ ...
[2] Of course, we can't make a new patch...

Thank you for your patience. I promise that this is the last change
to the patch message, I apologize for the frequent requests for
modifications. I hope you have a great day!
Joshua
Andrew Morton Nov. 1, 2024, 8:02 p.m. UTC | #17
On Fri, 1 Nov 2024 14:33:36 -0400 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Thank you for your help as always. I apologize for not being clear in my
> original request -- the "With this said, there are 2 problems:" paragraph is
> part of the 2nd section (2. We already have a hugeTLB controller...) So the
> outline will be:

I think it would be best if you were to send a new version of the patch :)

And while there, please make the title "memcg/hugetlb: Add ...".
Joshua Hahn Nov. 1, 2024, 8:28 p.m. UTC | #18
On Fri, Nov 1, 2024 at 4:02 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> I think it would be best if you were to send a new version of the patch :)
>
> And while there, please make the title "memcg/hugetlb: Add ...".

Hello Andrew,

Makes a lot of sense to me :-) Sorry for all of the hassle! I'll send out a new
version shortly. I hope you have a great weekend!
Joshua
Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 69af2173555f..bd7e81c2aa2b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1646,6 +1646,11 @@  The following nested keys are defined.
 	  pgdemote_khugepaged
 		Number of pages demoted by khugepaged.
 
+	  hugetlb
+		Amount of memory used by hugetlb pages. This metric only shows
+		up if hugetlb usage is accounted for in memory.current (i.e.
+		cgroup is mounted with the memory_hugetlb_accounting option).
+
   memory.numa_stat
 	A read-only nested-keyed file which exists on non-root cgroups.
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..972795ae5946 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -220,6 +220,9 @@  enum node_stat_item {
 	PGDEMOTE_KSWAPD,
 	PGDEMOTE_DIRECT,
 	PGDEMOTE_KHUGEPAGED,
+#ifdef CONFIG_HUGETLB_PAGE
+	NR_HUGETLB,
+#endif
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 190fa05635f4..fbb10e52d7ea 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1925,6 +1925,7 @@  void free_huge_folio(struct folio *folio)
 				     pages_per_huge_page(h), folio);
 	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 					  pages_per_huge_page(h), folio);
+	lruvec_stat_mod_folio(folio, NR_HUGETLB, -pages_per_huge_page(h));
 	mem_cgroup_uncharge(folio);
 	if (restore_reserve)
 		h->resv_huge_pages++;
@@ -3093,6 +3094,7 @@  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	if (!memcg_charge_ret)
 		mem_cgroup_commit_charge(folio, memcg);
+	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
 	mem_cgroup_put(memcg);
 
 	return folio;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c57..5444d0e7bb64 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -310,6 +310,9 @@  static const unsigned int memcg_node_stat_items[] = {
 	PGDEMOTE_KSWAPD,
 	PGDEMOTE_DIRECT,
 	PGDEMOTE_KHUGEPAGED,
+#ifdef CONFIG_HUGETLB_PAGE
+	NR_HUGETLB,
+#endif
 };
 
 static const unsigned int memcg_stat_items[] = {
@@ -1346,6 +1349,9 @@  static const struct memory_stat memory_stats[] = {
 	{ "unevictable",		NR_UNEVICTABLE			},
 	{ "slab_reclaimable",		NR_SLAB_RECLAIMABLE_B		},
 	{ "slab_unreclaimable",		NR_SLAB_UNRECLAIMABLE_B		},
+#ifdef CONFIG_HUGETLB_PAGE
+	{ "hugetlb",			NR_HUGETLB			},
+#endif
 
 	/* The memory events */
 	{ "workingset_refault_anon",	WORKINGSET_REFAULT_ANON		},
@@ -1441,6 +1447,11 @@  static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
 
+#ifdef CONFIG_HUGETLB_PAGE
+		if (unlikely(memory_stats[i].idx == NR_HUGETLB) &&
+		    !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
+			continue;
+#endif
 		size = memcg_page_state_output(memcg, memory_stats[i].idx);
 		seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b5a4cea423e1..871566b04b79 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1273,6 +1273,9 @@  const char * const vmstat_text[] = {
 	"pgdemote_kswapd",
 	"pgdemote_direct",
 	"pgdemote_khugepaged",
+#ifdef CONFIG_HUGETLB_PAGE
+	"nr_hugetlb",
+#endif
 	/* system-wide enum vm_stat_item counters */
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",