
[v4,1/1] memcg/hugetlb: Add hugeTLB counters to memcg

Message ID 20241101204402.1885383-1-joshua.hahnjy@gmail.com (mailing list archive)
State New
Series [v4,1/1] memcg/hugetlb: Add hugeTLB counters to memcg

Commit Message

Joshua Hahn Nov. 1, 2024, 8:44 p.m. UTC
This patch introduces a new counter to memory.stat that tracks hugeTLB
usage, but only if hugeTLB accounting is done to memory.current. The
feature is enabled the same way hugeTLB accounting is enabled: via the
memory_hugetlb_accounting mount option for cgroup v2.
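
As a rough illustration (not part of this patch), the option can be
toggled from userspace with mount(2). This is only a sketch: it
assumes cgroup2 is mounted at /sys/fs/cgroup and that the running
kernel accepts the option on a remount rather than only on the
initial mount.

  /* Sketch: apply memory_hugetlb_accounting to an existing cgroup2
   * mount via remount. Needs root; error handling kept minimal. */
  #include <stdio.h>
  #include <string.h>
  #include <errno.h>
  #include <sys/mount.h>

  int main(void)
  {
          if (mount(NULL, "/sys/fs/cgroup", NULL, MS_REMOUNT,
                    "memory_hugetlb_accounting")) {
                  fprintf(stderr, "remount failed: %s\n", strerror(errno));
                  return 1;
          }
          return 0;
  }

In practice the option would normally be supplied when cgroup2 is
first mounted (e.g. by the init system); the remount above is just the
shortest way to demonstrate it.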

1. Why is this patch necessary?
Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
hugeTLB usage to memory.current. However, the metric is not reported in
memory.stat. Given that users often interpret memory.stat as a breakdown
of the value reported in memory.current, the disparity between the two
reports can be confusing. This patch solves that problem by including
the metric in memory.stat as well, but only if it is also reported in
memory.current (it would be equally confusing if the value were
reported in memory.stat but not in memory.current).

Aside from the consistency between the two files, we also see benefits
in observability. Userspace might be interested in the hugeTLB
footprint of cgroups for many reasons. For instance, system admins
might want to verify that hugeTLB usage is distributed as expected
across tasks, i.e. that memory-intensive tasks are using more hugeTLB
pages than tasks that don't consume a lot of memory or that are seen
to fault frequently. Note that this is separate from wanting to
inspect the distribution for limiting purposes (in which case the
hugeTLB controller makes more sense).
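
To sketch that kind of check (purely illustrative, not part of this
patch), a monitor could read the new "hugetlb" key straight out of a
cgroup's memory.stat. This assumes cgroup2 is mounted at
/sys/fs/cgroup with memory_hugetlb_accounting; the cgroup path below
is hypothetical.

  /* Sketch: extract the "hugetlb" value (bytes, per the usual
   * memory.stat convention) from a cgroup's memory.stat file. */
  #include <stdio.h>

  int main(void)
  {
          FILE *f = fopen("/sys/fs/cgroup/example.slice/memory.stat", "r");
          char line[256];
          unsigned long long bytes;

          if (!f) {
                  perror("fopen");
                  return 1;
          }
          while (fgets(line, sizeof(line), f)) {
                  if (sscanf(line, "hugetlb %llu", &bytes) == 1)
                          printf("hugetlb: %llu bytes\n", bytes);
          }
          fclose(f);
          return 0;
  }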

2. We already have a hugeTLB controller. Why not use that?
It is true that the hugeTLB controller tracks the exact value that we
want. In fact, by enabling it, we get all of the observability
benefits mentioned above, and users can check the total hugeTLB usage,
verify that it is distributed as expected, etc.

With this said, there are 2 problems:
(a) They are still not reported in memory.stat, which means the
    disparity between the memcg reports is still there.
(b) We cannot reasonably expect users to enable the hugeTLB controller
    just for the sake of hugeTLB usage reporting, especially if they
    have no use for hugeTLB usage enforcement [2].

3. Implementation Details:
In the hugetlb alloc / free paths, we call lruvec_stat_mod_folio
regardless of whether memcg accounts for hugetlb.
mem_cgroup_commit_charge, which is called from alloc_hugetlb_folio,
sets the memcg for the folio only if the
CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is used, so
lruvec_stat_mod_folio updates the per-memcg hugetlb counters only if
the feature is enabled. Regardless of whether memcg accounts for
hugetlb, the newly added global counter is updated and shown in
/proc/vmstat.
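
To make the "no explicit flag check at the call sites" point concrete,
below is a purely illustrative model of the behaviour described above.
The struct and counter names are made up; this is not the real
lruvec_stat_mod_folio implementation.

  /* Illustrative model only; not actual kernel code. */
  #include <assert.h>
  #include <stdbool.h>

  #define MAX_NODES 8

  struct folio_model {
          int  node;       /* NUMA node the folio sits on */
          bool has_memcg;  /* set at charge time only when the hierarchy
                              was mounted with memory_hugetlb_accounting */
  };

  static long node_counter[MAX_NODES]; /* surfaces in /proc/vmstat */
  static long memcg_counter;           /* surfaces in memory.stat  */

  static void hugetlb_stat_mod(const struct folio_model *folio, long nr_pages)
  {
          /* The global per-node counter is always updated. */
          node_counter[folio->node] += nr_pages;

          /* The per-memcg counter only moves when the folio was actually
           * charged to a memcg, so the hugetlb alloc/free paths need no
           * mount-option check of their own. */
          if (folio->has_memcg)
                  memcg_counter += nr_pages;
  }

  int main(void)
  {
          struct folio_model charged = { .node = 0, .has_memcg = true };
          struct folio_model uncharged = { .node = 0, .has_memcg = false };

          hugetlb_stat_mod(&charged, 512);   /* e.g. one 2MB folio */
          hugetlb_stat_mod(&uncharged, 512);
          assert(node_counter[0] == 1024 && memcg_counter == 512);
          return 0;
  }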

The global counter is added because vmstat is the preferred framework
for cgroup stats: it keeps stat items consistent between the global
and per-cgroup views, and it also provides a per-node breakdown, which
is useful. Because it does not use cgroup-specific hooks, it also
keeps generic MM code separate from memcg code.
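
For completeness, the per-node view can be read back from the per-node
vmstat files. A minimal sketch, assuming the counter shows up there
under the "nr_hugetlb" name added to vmstat_text and that node
directories are numbered consecutively from 0:

  /* Sketch: print nr_hugetlb (in pages) for each node directory. */
  #include <stdio.h>

  int main(void)
  {
          char path[128], line[256];
          unsigned long pages;

          for (int node = 0; ; node++) {
                  snprintf(path, sizeof(path),
                           "/sys/devices/system/node/node%d/vmstat", node);
                  FILE *f = fopen(path, "r");
                  if (!f)
                          break;  /* no more nodes */
                  while (fgets(line, sizeof(line), f))
                          if (sscanf(line, "nr_hugetlb %lu", &pages) == 1)
                                  printf("node%d: %lu pages\n", node, pages);
                  fclose(f);
          }
          return 0;
  }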

[1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
[2] Of course, we can't make a new patch for every feature that can be
    duplicated. However, since enabling the hugeTLB controller is an
    imperfect solution that still leaves a discrepancy between
    memory.stat and memory.current, I think that it is reasonable to
    isolate the feature in this case.
 
Suggested-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

---
Changelog
v4:
  * Added {acked,suggested,reviewed}-by to the list
  * Added an extra section detailing why having no checks for mount
    options or configs is ok (handled by lruvec_stat_mod_folio). Also
    includes justifications for adding a global counter
v3:
  * Removed check for whether CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is on,
    since this check is already handled by lruvec_stat_mod (and doing
    the check in hugetlb.c actually breaks the build if MEMCG is not
    enabled).
  * Because there is now only one check for the flags, I've opted to
    do all of the cleanup in a separate patch series.
  * Added hugetlb information in cgroup-v2.rst
  * Added Suggested-by: Shakeel Butt
v2:
  * Enables the feature only if memcg accounts for hugeTLB usage
  * Moves the counter from memcg_stat_item to node_stat_item
  * Expands on motivation & justification in commitlog
  * Added Suggested-by: Nhat Pham

 Documentation/admin-guide/cgroup-v2.rst |  5 +++++
 include/linux/mmzone.h                  |  3 +++
 mm/hugetlb.c                            |  2 ++
 mm/memcontrol.c                         | 11 +++++++++++
 mm/vmstat.c                             |  3 +++
 5 files changed, 24 insertions(+)

Comments

David Rientjes Nov. 10, 2024, 2:19 a.m. UTC | #1
On Fri, 1 Nov 2024, Joshua Hahn wrote:

> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 69af2173555f..bd7e81c2aa2b 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1646,6 +1646,11 @@ The following nested keys are defined.
>  	  pgdemote_khugepaged
>  		Number of pages demoted by khugepaged.
>  
> +	  hugetlb
> +		Amount of memory used by hugetlb pages. This metric only shows
> +		up if hugetlb usage is accounted for in memory.current (i.e.
> +		cgroup is mounted with the memory_hugetlb_accounting option).
> +
>    memory.numa_stat
>  	A read-only nested-keyed file which exists on non-root cgroups.
>  

Definitely makes sense to include this.

Any reason to not account different hugetlb page sizes separately in this 
stat, however?  IOW, should there be separate hugetlb_2048kB and 
hugetlb_1048576kB stats on x86?
Joshua Hahn Nov. 11, 2024, 3:58 p.m. UTC | #2
On Sat, Nov 9, 2024 at 9:19 PM David Rientjes <rientjes@google.com> wrote:
>
> On Fri, 1 Nov 2024, Joshua Hahn wrote:
>
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 69af2173555f..bd7e81c2aa2b 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1646,6 +1646,11 @@ The following nested keys are defined.
> >         pgdemote_khugepaged
> >               Number of pages demoted by khugepaged.
> >
> > +       hugetlb
> > +             Amount of memory used by hugetlb pages. This metric only shows
> > +             up if hugetlb usage is accounted for in memory.current (i.e.
> > +             cgroup is mounted with the memory_hugetlb_accounting option).
> > +
> >    memory.numa_stat
> >       A read-only nested-keyed file which exists on non-root cgroups.
> >
>
> Definitely makes sense to include this.
>
> Any reason to not account different hugetlb page sizes separately in this
> stat, however?  IOW, should there be separate hugetlb_2048kB and
> hugetlb_1048576kB stats on x86?

Hello David, Thank you for reviewing my patch!

The reason that I opted not to include a breakdown of each hugetlb
size in memory.stat is only because I wanted to keep the addition that
this patch makes as minimal as possible, while still addressing
the goal of bridging the gap between memory.stat and memory.current.
Users who are curious about this breakdown can see how much memory
is used by each hugetlb size by enabling the hugetlb controller as well.

It's true that this is the case as well for total hugetlb usage, but
I felt that not including hugetlb memory usage in memory.stat when it
is accounted by memory.current would cause confusion for users who
cannot see that memory.current = sum of memory.stat. On the other
hand, seeing the breakdown of how much memory each hugetlb size uses
felt more like an optimization, and not a solution that clears up a
confusion.

I have not had a scenario where I had to look at the breakdown of the
hugetlb sizes (without the hugetlb controller), or a scenario where not
knowing this causes some sort of confusion. If others have had this
problem, I would love to hear about it, and perhaps work on a solution
that can address this point as well!

I hope you have a great day!
Joshua
David Rientjes Nov. 11, 2024, 5:41 p.m. UTC | #3
On Mon, 11 Nov 2024, Joshua Hahn wrote:

> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > index 69af2173555f..bd7e81c2aa2b 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1646,6 +1646,11 @@ The following nested keys are defined.
> > >         pgdemote_khugepaged
> > >               Number of pages demoted by khugepaged.
> > >
> > > +       hugetlb
> > > +             Amount of memory used by hugetlb pages. This metric only shows
> > > +             up if hugetlb usage is accounted for in memory.current (i.e.
> > > +             cgroup is mounted with the memory_hugetlb_accounting option).
> > > +
> > >    memory.numa_stat
> > >       A read-only nested-keyed file which exists on non-root cgroups.
> > >
> >
> > Definitely makes sense to include this.
> >
> > Any reason to not account different hugetlb page sizes separately in this
> > stat, however?  IOW, should there be separate hugetlb_2048kB and
> > hugetlb_1048576kB stats on x86?
> 
> Hello David, Thank you for reviewing my patch!
> 
> The reason that I opted not to include a breakdown of each hugetlb
> size in memory.stat is only because I wanted to keep the addition that
> this patch makes as minimal as possible, while still addressing
> the goal of bridging the gap between memory.stat and memory.current.
> Users who are curious about this breakdown can see how much memory
> is used by each hugetlb size by enabling the hugetlb controller as well.
> 

While the patch may be minimal, this is solidifying a kernel API that 
users will start to count on.  Users who may be interested in their 
hugetlb usage may not have control over the configuration of their kernel?

Does it make sense to provide a breakdown in memory.stat so that users can 
differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb 
pages, which are different global resources?

> It's true that this is the case as well for total hugeltb usage, but
> I felt that not including hugetlb memory usage in memory.stat when it
> is accounted by memory.current would cause confusion for the users
> not being able to see that memory.current = sum of memory.stat. On the
> other hand, seeing the breakdown of how much each hugetlb size felt more
> like an optimization, and not a solution that bridges a confusion.
> 

If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for
example, users could still do sum of memory.stat, no?

> I have not had a scenario where I had to look at the breakdown of the
> hugetlb sizes (without the hugetlb controller), or a scenario where not
> knowing this causes some sort of confusion. If others have had this
> problem, I would love to hear about it, and perhaps work on a solution
> that can address this point as well!
> 
> I hope you have a great day!

You too!
David Rientjes Nov. 13, 2024, 10:42 p.m. UTC | #4
On Mon, 11 Nov 2024, David Rientjes wrote:

> > The reason that I opted not to include a breakdown of each hugetlb
> > size in memory.stat is only because I wanted to keep the addition that
> > this patch makes as minimal as possible, while still addressing
> > the goal of bridging the gap between memory.stat and memory.current.
> > Users who are curious about this breakdown can see how much memory
> > is used by each hugetlb size by enabling the hugetlb controller as well.
> > 
> 
> While the patch may be minimal, this is solidifying a kernel API that 
> users will start to count on.  Users who may be interested in their 
> hugetlb usage may not have control over the configuration of their kernel?
> 
> Does it make sense to provide a breakdown in memory.stat so that users can 
> differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb 
> pages, which are different global resources?
> 
> > It's true that this is the case as well for total hugeltb usage, but
> > I felt that not including hugetlb memory usage in memory.stat when it
> > is accounted by memory.current would cause confusion for the users
> > not being able to see that memory.current = sum of memory.stat. On the
> > other hand, seeing the breakdown of how much each hugetlb size felt more
> > like an optimization, and not a solution that bridges a confusion.
> > 
> 
> If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for 
> example, users could still do sum of memory.stat, no?>
> 

Friendly ping on this, would there be any objections to splitting the 
memory.stat metrics out to be per hugepage size?
Nhat Pham Nov. 14, 2024, 1:08 a.m. UTC | #5
On Wed, Nov 13, 2024 at 2:42 PM David Rientjes <rientjes@google.com> wrote:
>
> On Mon, 11 Nov 2024, David Rientjes wrote:
>
> >
> > If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for
> > example, users could still do sum of memory.stat, no?>
> >
>
> Friendly ping on this, would there be any objections to splitting the
> memory.stat metrics out to be per hugepage size?

My 2c is that it's extra complexity + overhead (IIUC these stats are
hierarchical and hence would contribute to flushing overhead). So we
should justify them with some concrete use cases if we are to add
them.

From my end, I need hugetlb usage when I want to reason about memory
dynamics. This is because hugetlb is a bit special/weird - it cannot
be swapped out, for example. So I have to subtract hugetlb usage from
the overall cgroup's memory usage before making any analysis that
involves swapping. For this use case, I just need to know how much
memory is consumed by hugetlb, regardless of the type (2M or 1G). I
assume many use cases are similar in that sense.

Do you or Google have a concrete use case in mind that requires
hugetlb categorization? :)
Johannes Weiner Nov. 14, 2024, 5:26 a.m. UTC | #6
On Wed, Nov 13, 2024 at 02:42:29PM -0800, David Rientjes wrote:
> On Mon, 11 Nov 2024, David Rientjes wrote:
> 
> > > The reason that I opted not to include a breakdown of each hugetlb
> > > size in memory.stat is only because I wanted to keep the addition that
> > > this patch makes as minimal as possible, while still addressing
> > > the goal of bridging the gap between memory.stat and memory.current.
> > > Users who are curious about this breakdown can see how much memory
> > > is used by each hugetlb size by enabling the hugetlb controller as well.
> > > 
> > 
> > While the patch may be minimal, this is solidifying a kernel API that 
> > users will start to count on.  Users who may be interested in their 
> > hugetlb usage may not have control over the configuration of their kernel?
> > 
> > Does it make sense to provide a breakdown in memory.stat so that users can 
> > differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb 
> > pages, which are different global resources?
> > 
> > > It's true that this is the case as well for total hugeltb usage, but
> > > I felt that not including hugetlb memory usage in memory.stat when it
> > > is accounted by memory.current would cause confusion for the users
> > > not being able to see that memory.current = sum of memory.stat. On the
> > > other hand, seeing the breakdown of how much each hugetlb size felt more
> > > like an optimization, and not a solution that bridges a confusion.
> > > 
> > 
> > If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for 
> > example, users could still do sum of memory.stat, no?>
> > 
> 
> Friendly ping on this, would there be any objections to splitting the 
> memory.stat metrics out to be per hugepage size?

I don't think it has to be either/or. We can add the total here, and a
per-size breakdown in a separate patch (with its own changelog)?

That said, a per-size breakdown might make more sense in the hugetlb
cgroup controller. You're mentioning separate global resources, which
suggests this is about more explicitly controlled hugetlb use.

From a memcg POV, all hugetlb is the same. It's just (non-swappable)
memory consumed by the cgroup.
Michal Hocko Nov. 14, 2024, 10:33 a.m. UTC | #7
On Thu 14-11-24 00:26:24, Johannes Weiner wrote:
> On Wed, Nov 13, 2024 at 02:42:29PM -0800, David Rientjes wrote:
> > On Mon, 11 Nov 2024, David Rientjes wrote:
> > 
> > > > The reason that I opted not to include a breakdown of each hugetlb
> > > > size in memory.stat is only because I wanted to keep the addition that
> > > > this patch makes as minimal as possible, while still addressing
> > > > the goal of bridging the gap between memory.stat and memory.current.
> > > > Users who are curious about this breakdown can see how much memory
> > > > is used by each hugetlb size by enabling the hugetlb controller as well.
> > > > 
> > > 
> > > While the patch may be minimal, this is solidifying a kernel API that 
> > > users will start to count on.  Users who may be interested in their 
> > > hugetlb usage may not have control over the configuration of their kernel?
> > > 
> > > Does it make sense to provide a breakdown in memory.stat so that users can 
> > > differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb 
> > > pages, which are different global resources?
> > > 
> > > > It's true that this is the case as well for total hugeltb usage, but
> > > > I felt that not including hugetlb memory usage in memory.stat when it
> > > > is accounted by memory.current would cause confusion for the users
> > > > not being able to see that memory.current = sum of memory.stat. On the
> > > > other hand, seeing the breakdown of how much each hugetlb size felt more
> > > > like an optimization, and not a solution that bridges a confusion.
> > > > 
> > > 
> > > If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for 
> > > example, users could still do sum of memory.stat, no?>
> > > 
> > 
> > Friendly ping on this, would there be any objections to splitting the 
> > memory.stat metrics out to be per hugepage size?
> 
> I don't think it has to be either/or. We can add the total here, and a
> per-size breakdown in a separate patch (with its own changelog)?
> 
> That said, a per-size breakdown might make more sense in the hugetlb
> cgroup controller. You're mentioning separate global resources, which
> suggests this is about more explicitly controlled hugetlb use.
> 
> From a memcg POV, all hugetlb is the same. It's just (non-swappable)
> memory consumed by the cgroup.

Completely agreed. From the memcg POV there is no way to control hugetlb
pages or manage per size charging/pools. In a sense this is not much
different from slab accounting. We do print overall SLAB accounted
memory and do not break down each slab consumer in the stat file.
Roman Gushchin Nov. 14, 2024, 4:36 p.m. UTC | #8
On Thu, Nov 14, 2024 at 12:26:24AM -0500, Johannes Weiner wrote:
> On Wed, Nov 13, 2024 at 02:42:29PM -0800, David Rientjes wrote:
> > On Mon, 11 Nov 2024, David Rientjes wrote:
> > 
> > > > The reason that I opted not to include a breakdown of each hugetlb
> > > > size in memory.stat is only because I wanted to keep the addition that
> > > > this patch makes as minimal as possible, while still addressing
> > > > the goal of bridging the gap between memory.stat and memory.current.
> > > > Users who are curious about this breakdown can see how much memory
> > > > is used by each hugetlb size by enabling the hugetlb controller as well.
> > > > 
> > > 
> > > While the patch may be minimal, this is solidifying a kernel API that 
> > > users will start to count on.  Users who may be interested in their 
> > > hugetlb usage may not have control over the configuration of their kernel?
> > > 
> > > Does it make sense to provide a breakdown in memory.stat so that users can 
> > > differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb 
> > > pages, which are different global resources?
> > > 
> > > > It's true that this is the case as well for total hugeltb usage, but
> > > > I felt that not including hugetlb memory usage in memory.stat when it
> > > > is accounted by memory.current would cause confusion for the users
> > > > not being able to see that memory.current = sum of memory.stat. On the
> > > > other hand, seeing the breakdown of how much each hugetlb size felt more
> > > > like an optimization, and not a solution that bridges a confusion.
> > > > 
> > > 
> > > If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for 
> > > example, users could still do sum of memory.stat, no?>
> > > 
> > 
> > Friendly ping on this, would there be any objections to splitting the 
> > memory.stat metrics out to be per hugepage size?
> 
> I don't think it has to be either/or. We can add the total here, and a
> per-size breakdown in a separate patch (with its own changelog)?
> 
> That said, a per-size breakdown might make more sense in the hugetlb
> cgroup controller. You're mentioning separate global resources, which
> suggests this is about more explicitly controlled hugetlb use.

+1

> From a memcg POV, all hugetlb is the same. It's just (non-swappable)
> memory consumed by the cgroup.

And here too.

Thanks!
Joshua Hahn Nov. 14, 2024, 4:42 p.m. UTC | #9
On Wed, 13 Nov 2024 14:42:29 -0800 (PST) David Rientjes <rientjes@google.com> wrote:

Hi David,

Sorry for the late response on this thread. To be completely transparent
with you, I am not someone who inspects hugetlb usage on a regular
basis, so I may not have the most relevant insights when it comes to
how much utility there would be from breaking down the usage by size.

With that said, I believe that over the past couple of days, there have
been some responses on this thread regarding how others use hugetlb. As you
know, I share Johannes's opinion that if there are people who would benefit
from splitting up the hugetlb usage across different page sizes, it should
happen in the hugetlb controller. 

> On Mon, 11 Nov 2024, David Rientjes wrote:
> > While the patch may be minimal, this is solidifying a kernel API that 
> > users will start to count on.  Users who may be interested in their 
> > hugetlb usage may not have control over the configuration of their kernel?

This is a good point. With that said, I believe that this is an instance
of a feature where both of our proposed ideas can co-exist; we can have the
total hugetlb usage reported in memcg for now, and if there is a consensus 
/ majority that would like to see the breakdown as well, we can introduce
it in a future patch without breaking the utility of this patch.

To quickly address a potential concern about bloating the already
large memcg stat file: including both the total and the breakdown
wouldn't be the first time a stat and its breakdown are both included;
there is already a precedent with slab_(un)reclaimable and slab.

> > Does it make sense to provide a breakdown in memory.stat so that users can 
> > differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb 
> > pages, which are different global resources?
> > 
> > > It's true that this is the case as well for total hugeltb usage, but
> > > I felt that not including hugetlb memory usage in memory.stat when it
> > > is accounted by memory.current would cause confusion for the users
> > > not being able to see that memory.current = sum of memory.stat. On the
> > > other hand, seeing the breakdown of how much each hugetlb size felt more
> > > like an optimization, and not a solution that bridges a confusion.
> > > 
> > 
> > If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for 
> > example, users could still do sum of memory.stat, no?>

This is true! I still think it would be nice to include the total
anyway, since for a lot of people who use this statistic (see Nhat's
response in this thread and Shakeel's response on v3 of this patch),
all they want is a quick check of how much memory is being used by
hugetlb so they can reason about memory dynamics. Again, I think that
if we are to include a breakdown in a future patch, it can coexist
with this one.

> Friendly ping on this, would there be any objections to splitting the 
> memory.stat metrics out to be per hugepage size?

Sorry for the late response again. I think that if you had examples
of use cases where having the different page sizes broken out would
help, it would help me better understand the motivation for including
the breakdown (I would be happy to write the patch for the breakdown
as well if there is a consensus!)

Thank you for your thoughts, have a great day!
Joshua
David Rientjes Nov. 17, 2024, 3:34 a.m. UTC | #10
On Thu, 14 Nov 2024, Johannes Weiner wrote:

> > > > The reason that I opted not to include a breakdown of each hugetlb
> > > > size in memory.stat is only because I wanted to keep the addition that
> > > > this patch makes as minimal as possible, while still addressing
> > > > the goal of bridging the gap between memory.stat and memory.current.
> > > > Users who are curious about this breakdown can see how much memory
> > > > is used by each hugetlb size by enabling the hugetlb controller as well.
> > > > 
> > > 
> > > While the patch may be minimal, this is solidifying a kernel API that 
> > > users will start to count on.  Users who may be interested in their 
> > > hugetlb usage may not have control over the configuration of their kernel?
> > > 
> > > Does it make sense to provide a breakdown in memory.stat so that users can 
> > > differentiate between mapping one 1GB hugetlb page and 512 2MB hugetlb 
> > > pages, which are different global resources?
> > > 
> > > > It's true that this is the case as well for total hugeltb usage, but
> > > > I felt that not including hugetlb memory usage in memory.stat when it
> > > > is accounted by memory.current would cause confusion for the users
> > > > not being able to see that memory.current = sum of memory.stat. On the
> > > > other hand, seeing the breakdown of how much each hugetlb size felt more
> > > > like an optimization, and not a solution that bridges a confusion.
> > > > 
> > > 
> > > If broken down into hugetlb_2048kB and hugetlb_1048576kB on x86, for 
> > > example, users could still do sum of memory.stat, no?>
> > > 
> > 
> > Friendly ping on this, would there be any objections to splitting the 
> > memory.stat metrics out to be per hugepage size?
> 
> I don't think it has to be either/or. We can add the total here, and a
> per-size breakdown in a separate patch (with its own changelog)?
> 
> That said, a per-size breakdown might make more sense in the hugetlb
> cgroup controller. You're mentioning separate global resources, which
> suggests this is about more explicitly controlled hugetlb use.
> 
> From a memcg POV, all hugetlb is the same. It's just (non-swappable)
> memory consumed by the cgroup.
> 

Ok, that's fair.  We have a local patch that tracks hugetlb usage, 
admittedly for all hugetlb sizes, in struct mem_cgroup_per_node so that we 
can provide a breakdown in memory.numa_stat because we can't get the 
per-node breakdown from hugetlb_cgroup.  If there is interest in that 
breakdown, we could easily propose the patch.

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 69af2173555f..bd7e81c2aa2b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1646,6 +1646,11 @@  The following nested keys are defined.
 	  pgdemote_khugepaged
 		Number of pages demoted by khugepaged.
 
+	  hugetlb
+		Amount of memory used by hugetlb pages. This metric only shows
+		up if hugetlb usage is accounted for in memory.current (i.e.
+		cgroup is mounted with the memory_hugetlb_accounting option).
+
   memory.numa_stat
 	A read-only nested-keyed file which exists on non-root cgroups.
 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..972795ae5946 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -220,6 +220,9 @@  enum node_stat_item {
 	PGDEMOTE_KSWAPD,
 	PGDEMOTE_DIRECT,
 	PGDEMOTE_KHUGEPAGED,
+#ifdef CONFIG_HUGETLB_PAGE
+	NR_HUGETLB,
+#endif
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 190fa05635f4..fbb10e52d7ea 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1925,6 +1925,7 @@  void free_huge_folio(struct folio *folio)
 				     pages_per_huge_page(h), folio);
 	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 					  pages_per_huge_page(h), folio);
+	lruvec_stat_mod_folio(folio, NR_HUGETLB, -pages_per_huge_page(h));
 	mem_cgroup_uncharge(folio);
 	if (restore_reserve)
 		h->resv_huge_pages++;
@@ -3093,6 +3094,7 @@  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	if (!memcg_charge_ret)
 		mem_cgroup_commit_charge(folio, memcg);
+	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
 	mem_cgroup_put(memcg);
 
 	return folio;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c57..5444d0e7bb64 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -310,6 +310,9 @@  static const unsigned int memcg_node_stat_items[] = {
 	PGDEMOTE_KSWAPD,
 	PGDEMOTE_DIRECT,
 	PGDEMOTE_KHUGEPAGED,
+#ifdef CONFIG_HUGETLB_PAGE
+	NR_HUGETLB,
+#endif
 };
 
 static const unsigned int memcg_stat_items[] = {
@@ -1346,6 +1349,9 @@  static const struct memory_stat memory_stats[] = {
 	{ "unevictable",		NR_UNEVICTABLE			},
 	{ "slab_reclaimable",		NR_SLAB_RECLAIMABLE_B		},
 	{ "slab_unreclaimable",		NR_SLAB_UNRECLAIMABLE_B		},
+#ifdef CONFIG_HUGETLB_PAGE
+	{ "hugetlb",			NR_HUGETLB			},
+#endif
 
 	/* The memory events */
 	{ "workingset_refault_anon",	WORKINGSET_REFAULT_ANON		},
@@ -1441,6 +1447,11 @@  static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
 
+#ifdef CONFIG_HUGETLB_PAGE
+		if (unlikely(memory_stats[i].idx == NR_HUGETLB) &&
+		    !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
+			continue;
+#endif
 		size = memcg_page_state_output(memcg, memory_stats[i].idx);
 		seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b5a4cea423e1..871566b04b79 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1273,6 +1273,9 @@  const char * const vmstat_text[] = {
 	"pgdemote_kswapd",
 	"pgdemote_direct",
 	"pgdemote_khugepaged",
+#ifdef CONFIG_HUGETLB_PAGE
+	"nr_hugetlb",
+#endif
 	/* system-wide enum vm_stat_item counters */
 	"nr_dirty_threshold",
 	"nr_dirty_background_threshold",