Message ID | 20250205222029.2979048-1-shakeel.butt@linux.dev (mailing list archive)
---|---
State | New
Series | memcg: add hierarchical effective limits for v2
On 2/6/25 09:20, Shakeel Butt wrote:
> Memcg-v1 exposes hierarchical_[memory|memsw]_limit counters in its
> memory.stat file which applications can use to get their effective limit
> which is the minimum of limits of itself and all of its ancestors. This
> is pretty useful in environments where cgroup namespace is used and the
> application does not have access to the full view of the cgroup
> hierarchy. Let's expose effective limits for memcg v2 as well.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Even without namespaces, in a hierarchy the application might be restricted
from reading the parent cgroup's information (read permission removed, for
example).

Otherwise looks good to me.

Reviewed-by: Balbir Singh <balbirs@nvidia.com>
Hello Shakeel.

On Wed, Feb 05, 2025 at 02:20:29PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> Memcg-v1 exposes hierarchical_[memory|memsw]_limit counters in its
> memory.stat file which applications can use to get their effective limit
> which is the minimum of limits of itself and all of its ancestors.

I was a fan of the same idea too [1]. The referenced series also tackles
change notifications (to make this complete for apps that really want to
scale based on the actual limit). I ceased to like it when I realized
there can be hierarchies where the effective value cannot be effectively
:) determined [2].

> This is pretty useful in environments where cgroup namespace is used
> and the application does not have access to the full view of the
> cgroup hierarchy. Let's expose effective limits for memcg v2 as well.

Also, the case for exposing this was never strongly built.
Why isn't PSI enough in your case?

Thanks,
Michal

[1] https://lore.kernel.org/r/20240606152232.20253-1-mkoutny@suse.com
[2] https://lore.kernel.org/r/7chi6d2sdhwdsfihoxqmtmi4lduea3dsgc7xorvonugkm4qz2j@gehs4slutmtg
On Thu, Feb 06, 2025 at 04:57:39PM +0100, Michal Koutný wrote:
> Hello Shakeel.
>
> On Wed, Feb 05, 2025 at 02:20:29PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > Memcg-v1 exposes hierarchical_[memory|memsw]_limit counters in its
> > memory.stat file which applications can use to get their effective limit
> > which is the minimum of limits of itself and all of its ancestors.
>
> I was a fan of the same idea too [1]. The referenced series also tackles
> change notifications (to make this complete for apps that really want to
> scale based on the actual limit). I ceased to like it when I realized
> there can be hierarchies where the effective value cannot be effectively
> :) determined [2].
>
> > This is pretty useful in environments where cgroup namespace is used
> > and the application does not have access to the full view of the
> > cgroup hierarchy. Let's expose effective limits for memcg v2 as well.
>
> Also, the case for exposing this was never strongly built.
> Why isn't PSI enough in your case?
>

Hi Michal,

Oh I totally forgot about your series. In my use-case, it is not about
dynamically knowing how much they can expand and adjust themselves but
rather knowing statically upfront what resources they have been given.
More concretely, these are workloads which used to completely occupy a
single machine, though within containers but without limits. These
workloads used to look at machine-level metrics at startup to see how
many resources are available.

Now these workloads are being moved to a multi-tenant environment but
the machine is still partitioned statically between the workloads. So,
these workloads need to know upfront how many resources are allocated to
them, and the way the cgroup hierarchy is set up, that information is a
bit above them in the tree.

I hope this clarifies the motivation behind this change, i.e. the target
is not dynamic load balancing but rather upfront static knowledge.

thanks,
Shakeel
On Thu, Feb 6, 2025 at 11:09 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Feb 06, 2025 at 04:57:39PM +0100, Michal Koutný wrote:
> > Hello Shakeel.
> >
> > On Wed, Feb 05, 2025 at 02:20:29PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > Memcg-v1 exposes hierarchical_[memory|memsw]_limit counters in its
> > > memory.stat file which applications can use to get their effective limit
> > > which is the minimum of limits of itself and all of its ancestors.
> >
> > I was a fan of the same idea too [1]. The referenced series also tackles
> > change notifications (to make this complete for apps that really want to
> > scale based on the actual limit). I ceased to like it when I realized
> > there can be hierarchies where the effective value cannot be effectively
> > :) determined [2].
> >
> > > This is pretty useful in environments where cgroup namespace is used
> > > and the application does not have access to the full view of the
> > > cgroup hierarchy. Let's expose effective limits for memcg v2 as well.
> >
> > Also, the case for exposing this was never strongly built.
> > Why isn't PSI enough in your case?
> >
>
> Hi Michal,
>
> Oh I totally forgot about your series. In my use-case, it is not about
> dynamically knowing how much they can expand and adjust themselves but
> rather knowing statically upfront what resources they have been given.
> More concretely, these are workloads which used to completely occupy a
> single machine, though within containers but without limits. These
> workloads used to look at machine-level metrics at startup to see how
> many resources are available.
>
> Now these workloads are being moved to a multi-tenant environment but
> the machine is still partitioned statically between the workloads. So,
> these workloads need to know upfront how many resources are allocated to
> them, and the way the cgroup hierarchy is set up, that information is a
> bit above them in the tree.
>
> I hope this clarifies the motivation behind this change, i.e. the target
> is not dynamic load balancing but rather upfront static knowledge.
>
> thanks,
> Shakeel
>

We've been thinking of using memcg to both protect (memory.min) and limit
(via memcg OOM) memory-hungry apps (games), while informing such apps of
their upper limit so they know how much they can allocate before risking
being killed. Visibility of the cgroup hierarchy isn't an issue, but
having a single file to read instead of walking up the tree with multiple
reads to calculate an effective limit would be nice. Partial memcg
activation in the hierarchy *is* an issue, but walking up to the closest
ancestor with memcg activated is better than reading all the way up.
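[Editor's illustration] The ancestor walk described above, i.e. what an
application has to do today without memory.max.effective, looks roughly
like the sketch below from userspace. This is illustrative only and not
part of the patch: it assumes a cgroup2 mount at /sys/fs/cgroup and that
every ancestor's memory.max is readable, which is exactly what cgroup
namespaces and restrictive permissions do not guarantee.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read memory.max from one cgroup directory; "max" or an unreadable
 * file is treated as "unlimited". */
static unsigned long long read_limit(const char *dir)
{
	char path[PATH_MAX], buf[64];
	unsigned long long val = ULLONG_MAX;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.max", dir);
	f = fopen(path, "r");
	if (!f)
		return ULLONG_MAX;
	if (fgets(buf, sizeof(buf), f) && strncmp(buf, "max", 3) != 0)
		val = strtoull(buf, NULL, 10);
	fclose(f);
	return val;
}

int main(void)
{
	char dir[PATH_MAX] = "/sys/fs/cgroup";
	char line[PATH_MAX];
	unsigned long long eff = ULLONG_MAX;
	FILE *f = fopen("/proc/self/cgroup", "r");

	/* find our own cgroup2 path: the "0::/<path>" line */
	while (f && fgets(line, sizeof(line), f)) {
		if (strncmp(line, "0::", 3) == 0) {
			line[strcspn(line, "\n")] = '\0';
			strncat(dir, line + 3, sizeof(dir) - strlen(dir) - 1);
			break;
		}
	}
	if (f)
		fclose(f);

	/* walk from our own cgroup up to the mount root, taking the minimum */
	for (;;) {
		unsigned long long v = read_limit(dir);
		char *slash;

		if (v < eff)
			eff = v;
		if (strcmp(dir, "/sys/fs/cgroup") == 0)
			break;
		slash = strrchr(dir, '/');
		if (!slash)
			break;
		*slash = '\0';
	}

	if (eff == ULLONG_MAX)
		printf("effective memory.max: max\n");
	else
		printf("effective memory.max: %llu bytes\n", eff);
	return 0;
}

With cgroup namespaces in play, the walk above stops at the namespace root
and silently misses any tighter limit configured above it, which is the
gap memory.max.effective is meant to close.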
Oops, I forgot to CC Andrew.

On Wed, Feb 05, 2025 at 02:20:29PM -0800, Shakeel Butt wrote:
> Memcg-v1 exposes hierarchical_[memory|memsw]_limit counters in its
> memory.stat file which applications can use to get their effective limit
> which is the minimum of limits of itself and all of its ancestors. This
> is pretty useful in environments where cgroup namespace is used and the
> application does not have access to the full view of the cgroup
> hierarchy. Let's expose effective limits for memcg v2 as well.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 24 +++++++++++++
>  mm/memcontrol.c                         | 48 +++++++++++++++++++++++++
>  2 files changed, 72 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index cb1b4e759b7e..175e9435ad5c 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1311,6 +1311,14 @@ PAGE_SIZE multiple when read back.
>          Caller could retry them differently, return into userspace
>          as -ENOMEM or silently ignore in cases like disk readahead.
>
> +  memory.max.effective
> +        A read-only single value file which exists on non-root cgroups.
> +
> +        The effective limit of the cgroup i.e. the minimum memory.max
> +        of all ancestors including itself. This is useful for environments
> +        where cgroup namespace is being used and the application does not
> +        have full view of the hierarchy.
> +
>    memory.reclaim
>          A write-only nested-keyed file which exists for all cgroups.
>
> @@ -1726,6 +1734,14 @@ The following nested keys are defined.
>          Swap usage hard limit. If a cgroup's swap usage reaches this
>          limit, anonymous memory of the cgroup will not be swapped out.
>
> +  memory.swap.max.effective
> +        A read-only single value file which exists on non-root cgroups.
> +
> +        The effective limit of the cgroup i.e. the minimum memory.swap.max
> +        of all ancestors including itself. This is useful for environments
> +        where cgroup namespace is being used and the application does not
> +        have full view of the hierarchy.
> +
>    memory.swap.events
>          A read-only flat-keyed file which exists on non-root cgroups.
>          The following entries are defined. Unless specified
> @@ -1766,6 +1782,14 @@ The following nested keys are defined.
>          limit, it will refuse to take any more stores before existing
>          entries fault back in or are written out to disk.
>
> +  memory.zswap.max.effective
> +        A read-only single value file which exists on non-root cgroups.
> +
> +        The effective limit of the cgroup i.e. the minimum memory.zswap.max
> +        of all ancestors including itself. This is useful for environments
> +        where cgroup namespace is being used and the application does not
> +        have full view of the hierarchy.
> +
>    memory.zswap.writeback
>          A read-write single value file. The default value is "1".
>          Note that this setting is hierarchical, i.e. the writeback would be
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cae1c2e0cc71..8d21c1a44220 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4161,6 +4161,17 @@ static int memory_max_show(struct seq_file *m, void *v)
>                 READ_ONCE(mem_cgroup_from_seq(m)->memory.max));
>  }
>
> +static int memory_max_effective_show(struct seq_file *m, void *v)
> +{
> +        unsigned long max = PAGE_COUNTER_MAX;
> +        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +        for (; memcg; memcg = parent_mem_cgroup(memcg))
> +                max = min(max, READ_ONCE(memcg->memory.max));
> +
> +        return seq_puts_memcg_tunable(m, max);
> +}
> +
>  static ssize_t memory_max_write(struct kernfs_open_file *of,
>                                  char *buf, size_t nbytes, loff_t off)
>  {
> @@ -4438,6 +4449,11 @@ static struct cftype memory_files[] = {
>                 .seq_show = memory_max_show,
>                 .write = memory_max_write,
>         },
> +       {
> +               .name = "max.effective",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .seq_show = memory_max_effective_show,
> +       },
>         {
>                 .name = "events",
>                 .flags = CFTYPE_NOT_ON_ROOT,
> @@ -5117,6 +5133,17 @@ static int swap_max_show(struct seq_file *m, void *v)
>                 READ_ONCE(mem_cgroup_from_seq(m)->swap.max));
>  }
>
> +static int swap_max_effective_show(struct seq_file *m, void *v)
> +{
> +        unsigned long max = PAGE_COUNTER_MAX;
> +        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +        for (; memcg; memcg = parent_mem_cgroup(memcg))
> +                max = min(max, READ_ONCE(memcg->swap.max));
> +
> +        return seq_puts_memcg_tunable(m, max);
> +}
> +
>  static ssize_t swap_max_write(struct kernfs_open_file *of,
>                               char *buf, size_t nbytes, loff_t off)
>  {
> @@ -5166,6 +5193,11 @@ static struct cftype swap_files[] = {
>                 .seq_show = swap_max_show,
>                 .write = swap_max_write,
>         },
> +       {
> +               .name = "swap.max.effective",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .seq_show = swap_max_effective_show,
> +       },
>         {
>                 .name = "swap.peak",
>                 .flags = CFTYPE_NOT_ON_ROOT,
> @@ -5308,6 +5340,17 @@ static int zswap_max_show(struct seq_file *m, void *v)
>                 READ_ONCE(mem_cgroup_from_seq(m)->zswap_max));
>  }
>
> +static int zswap_max_effective_show(struct seq_file *m, void *v)
> +{
> +        unsigned long max = PAGE_COUNTER_MAX;
> +        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +        for (; memcg; memcg = parent_mem_cgroup(memcg))
> +                max = min(max, READ_ONCE(memcg->zswap_max));
> +
> +        return seq_puts_memcg_tunable(m, max);
> +}
> +
>  static ssize_t zswap_max_write(struct kernfs_open_file *of,
>                                char *buf, size_t nbytes, loff_t off)
>  {
> @@ -5362,6 +5405,11 @@ static struct cftype zswap_files[] = {
>                 .seq_show = zswap_max_show,
>                 .write = zswap_max_write,
>         },
> +       {
> +               .name = "zswap.max.effective",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .seq_show = zswap_max_effective_show,
> +       },
>         {
>                 .name = "zswap.writeback",
>                 .seq_show = zswap_writeback_show,
> --
> 2.43.5
>
Hello.

On Thu, Feb 06, 2025 at 11:09:05AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> Oh I totally forgot about your series. In my use-case, it is not about
> dynamically knowing how much they can expand and adjust themselves but
> rather knowing statically upfront what resources they have been given.

From the memcg PoV, the effective value doesn't tell how much they were
given (because of sharing).

> More concretely, these are workloads which used to completely occupy a
> single machine, though within containers but without limits. These
> workloads used to look at machine-level metrics at startup to see how
> many resources are available.

I've been there but haven't found a convincing mapping of global to memcg
limits.

The issue is that such a value won't guarantee no OOM when below because
it can be (generally) effectively shared.

(Alas, apps typically don't express their memory needs in units of
PSI. So it boils down to a system wide monitor like systemd-oomd and
cooperation with it.)

> Now these workloads are being moved to a multi-tenant environment but
> the machine is still partitioned statically between the workloads. So,
> these workloads need to know upfront how many resources are allocated to
> them, and the way the cgroup hierarchy is set up, that information is a
> bit above them in the tree.

FTR, e.g. in systemd setups, this can be partially overcome by the exposed
EffectiveMemoryMax= (the service manager who configures the resources can
also do the ancestry traversal).
kubernetes has the downward API where generic resource info is shared into
containers, and I recall that lxcfs could mangle procfs memory info wrt
memory limits for legacy apps.

As I think about it, the cgroupns (in)visibility should be resolved by
assigning the proper limit to the namespace's root group memory.max (read
only for contained user) and the traversal...

On Thu, Feb 06, 2025 at 11:37:31AM -0800, "T.J. Mercier" <tjmercier@google.com> wrote:
> but having a single file to read instead of walking up the
> tree with multiple reads to calculate an effective limit would be
> nice.

...in the kernel is nice but the possible performance gain isn't worth
hiding the shareability of the effective limit.

So I wonder what is the current PoV of more MM people...

Michal
On Mon, Feb 10, 2025 at 05:24:17PM +0100, Michal Koutný wrote:
> Hello.
>
> On Thu, Feb 06, 2025 at 11:09:05AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > Oh I totally forgot about your series. In my use-case, it is not about
> > dynamically knowing how much they can expand and adjust themselves but
> > rather knowing statically upfront what resources they have been given.
>
> From the memcg PoV, the effective value doesn't tell how much they were
> given (because of sharing).
>
> > More concretely, these are workloads which used to completely occupy a
> > single machine, though within containers but without limits. These
> > workloads used to look at machine-level metrics at startup to see how
> > many resources are available.
>
> I've been there but haven't found a convincing mapping of global to memcg
> limits.
>
> The issue is that such a value won't guarantee no OOM when below because
> it can be (generally) effectively shared.
>
> (Alas, apps typically don't express their memory needs in units of
> PSI. So it boils down to a system wide monitor like systemd-oomd and
> cooperation with it.)
>

I think you missed the static partitioning of resources use-case I
mentioned. The issue you are pointing out exists for system-level metrics
as well, i.e. a workload looking at system metrics can't say how much it
has been given, but in my specific case, the workloads know they occupy
the full machine. Now we want to move such workloads to a multi-tenant
environment, but the resources are still statically partitioned and not
overcommitted, so the effective limit will tell them how much they have
been given.

> > Now these workloads are being moved to a multi-tenant environment but
> > the machine is still partitioned statically between the workloads. So,
> > these workloads need to know upfront how many resources are allocated to
> > them, and the way the cgroup hierarchy is set up, that information is a
> > bit above them in the tree.
>
> FTR, e.g. in systemd setups, this can be partially overcome by the exposed
> EffectiveMemoryMax= (the service manager who configures the resources can
> also do the ancestry traversal).
> kubernetes has the downward API where generic resource info is shared into
> containers, and I recall that lxcfs could mangle procfs memory info wrt
> memory limits for legacy apps.
>
> As I think about it, the cgroupns (in)visibility should be resolved by
> assigning the proper limit to the namespace's root group memory.max (read
> only for contained user) and the traversal...
>

I think your point here is: why not have a userspace-based solution? I
think it is possible but not convenient, and it adds an external
dependency to the workload.

>
> On Thu, Feb 06, 2025 at 11:37:31AM -0800, "T.J. Mercier" <tjmercier@google.com> wrote:
> > but having a single file to read instead of walking up the
> > tree with multiple reads to calculate an effective limit would be
> > nice.
>
> ...in the kernel is nice but the possible performance gain isn't worth
> hiding the shareability of the effective limit.
>
> So I wonder what is the current PoV of more MM people...

Yup, let's see more opinions on this. Thanks, Michal, for your feedback.
On Mon, Feb 10, 2025 at 05:24:17PM +0100, Michal Koutný wrote:
> Hello.
>
> On Thu, Feb 06, 2025 at 11:09:05AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > Oh I totally forgot about your series. In my use-case, it is not about
> > dynamically knowing how much they can expand and adjust themselves but
> > rather knowing statically upfront what resources they have been given.
>
> From the memcg PoV, the effective value doesn't tell how much they were
> given (because of sharing).

It's definitely true that if you have an ancestral limit for several
otherwise unlimited siblings, then interpreting this number as "this
is how much memory I have available" will be completely misleading.

I would also say that sharing a limit with several siblings requires a
certain degree of awareness and cooperation between them. From that
POV, IMO it would be fine to provide a metric with contextual caveats.

The problem is, what do we do with canned, unaware, maybe untrusted
applications? And they don't necessarily know which they are.

It depends heavily on the judgement of the administrator of any given
deployment. Some workloads might be completely untrusted and hard
limited. Another deployment might consider the same workload
reasonably predictable that it's configured only with a failsafe max
limit that is much higher than where the workload is *expected* to
operate. The allotment might happen altogether with min/low
protections and no max limit. Or there could be a combination of
protection slightly below and a limit slightly above the expected
workload size.

It seems basically impossible to write portable code against this
without knowing the intent of the person setting it up.

But how do we communicate intent down to the container? The two broad
options are implicitly or explicitly:

a) Provide a cgroup file that automatically derives intended target
   size from how min/low/high/max are set up.

   Right now those can be set up super loosely depending on what the
   administrator thinks about the application. In order for this to
   work, we'd likely have to define an idiomatic way of configuring
   the controller. E.g. if you set max by itself, we assume this is
   the target size. If you set low, with or without max, then low is
   the target size. Or if you set both, target is in between.

   I'm not completely convinced this is workable. It might require
   settings beyond what's actually needed for the safe containment of
   the workload, which carries the risk of excluding something useful.
   I don't mean enforced configuration rules, but rather the case where
   a configuration is reasonable and effective given the workload and
   environment, but now the target file shows nonsense.

b) Provide a cgroup file that is freely configurable by the
   administrator with the target size of the container.

   This has obvious drawbacks as well. What's the default value? Also,
   a lot of setups are dead simple: set a hard limit and expect the
   workload to adhere to that, period. Nobody is going to reliably set
   another cgroup file that a workload may or may not consume.

The third option is to wash our hands of all of this, provide the
static hierarchy settings to the leaves (like this patch, plus do it
for the other knobs as well) and let userspace figure it out.

Thoughts?
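[Editor's illustration] To make option a) concrete, here is a purely
illustrative sketch, not proposed kernel code: the function name and the
midpoint rule are made up for the example, and ULONG_MAX stands in for an
unconfigured knob. It shows the kind of convention such a derived file
would have to bake in.

#include <limits.h>

/*
 * Hypothetical "target size" heuristic for option a): max alone means
 * max is the target, low (with or without max) means low is the target,
 * and both set means somewhere in between; the midpoint is an arbitrary
 * choice for this sketch.
 */
static unsigned long derive_target_size(unsigned long low, unsigned long max)
{
	if (low == 0 && max == ULONG_MAX)
		return ULONG_MAX;		/* nothing configured: no target */
	if (low == 0)
		return max;			/* only a hard limit */
	if (max == ULONG_MAX)
		return low;			/* only a protection */
	return low + (max - low) / 2;		/* both set: in between */
}

The ambiguity Johannes points out remains: the same low/max combination
can express very different administrator intent, so any such derivation
hard-codes one configuration convention.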
On Mon, Feb 10, 2025 at 05:52:34PM -0500, Johannes Weiner wrote:
> On Mon, Feb 10, 2025 at 05:24:17PM +0100, Michal Koutný wrote:
> > Hello.
> >
> > On Thu, Feb 06, 2025 at 11:09:05AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > Oh I totally forgot about your series. In my use-case, it is not about
> > > dynamically knowing how much they can expand and adjust themselves but
> > > rather knowing statically upfront what resources they have been given.
> >
> > From the memcg PoV, the effective value doesn't tell how much they were
> > given (because of sharing).
>
> It's definitely true that if you have an ancestral limit for several
> otherwise unlimited siblings, then interpreting this number as "this
> is how much memory I have available" will be completely misleading.
>
> I would also say that sharing a limit with several siblings requires a
> certain degree of awareness and cooperation between them. From that
> POV, IMO it would be fine to provide a metric with contextual caveats.
>
> The problem is, what do we do with canned, unaware, maybe untrusted
> applications? And they don't necessarily know which they are.
>
> It depends heavily on the judgement of the administrator of any given
> deployment. Some workloads might be completely untrusted and hard
> limited. Another deployment might consider the same workload
> reasonably predictable that it's configured only with a failsafe max
> limit that is much higher than where the workload is *expected* to
> operate. The allotment might happen altogether with min/low
> protections and no max limit. Or there could be a combination of
> protection slightly below and a limit slightly above the expected
> workload size.
>
> It seems basically impossible to write portable code against this
> without knowing the intent of the person setting it up.
>
> But how do we communicate intent down to the container? The two broad
> options are implicitly or explicitly:
>
> a) Provide a cgroup file that automatically derives intended target
>    size from how min/low/high/max are set up.
>
>    Right now those can be set up super loosely depending on what the
>    administrator thinks about the application. In order for this to
>    work, we'd likely have to define an idiomatic way of configuring
>    the controller. E.g. if you set max by itself, we assume this is
>    the target size. If you set low, with or without max, then low is
>    the target size. Or if you set both, target is in between.
>
>    I'm not completely convinced this is workable.

This sounds like memory.available. It's hard to implement well,
especially taking into account things like NUMA, memory sharing,
estimating how much can be reclaimed, etc. But at the same time there is
value in providing such a metric. There is a clear use case. And it's
even harder to implement this in userspace.

> b) Provide a cgroup file that is freely configurable by the
>    administrator with the target size of the container.
>
>    This has obvious drawbacks as well. What's the default value? Also,
>    a lot of setups are dead simple: set a hard limit and expect the
>    workload to adhere to that, period. Nobody is going to reliably set
>    another cgroup file that a workload may or may not consume.

Yeah, this is a weird option.

> The third option is to wash our hands of all of this, provide the
> static hierarchy settings to the leaves (like this patch, plus do it
> for the other knobs as well) and let userspace figure it out.

Idk, I see very little value in it. I'm not necessarily opposing this
patchset, just not seeing a lot of value.

Maybe I'm missing something, but somehow it wasn't a problem for many years.
Nothing really changed here.

So maybe someone can come up with a better explanation of a specific problem
we're trying to solve here?

Thanks!
On Tue, Feb 11, 2025 at 04:55:33AM +0000, Roman Gushchin wrote:
[...]
>
> Maybe I'm missing something, but somehow it wasn't a problem for many years.
> Nothing really changed here.
>
> So maybe someone can come up with a better explanation of a specific problem
> we're trying to solve here?

The simplest explanation is visibility. Workloads that used to run solo
are being moved to a multi-tenant but non-overcommitted environment, and
they need to know their capacity, which they used to get from system
metrics. Now they have to get it from cgroup limit files, but the use of
cgroup namespaces prevents those workloads from extracting the needed
information.
Hello.

On Tue, Feb 11, 2025 at 05:08:03PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > So maybe someone can come up with a better explanation of a specific problem
> > we're trying to solve here?

In my experience, another factor is the switch from v1 to v2 (which
propagates slower to downstreams) and applications that rely on
memory.stat:hierarchical_memory_limit.

(Funnily enough, the commit fee7b548e6f2b ("memcg: show real limit under
hierarchy mode") introduces it primarily for debugging purposes (not
sizing): an application being killed with no apparent (immediate) limit
breach.)

Roman, you may also remember that it had already popped up ~a year ago [1].

> The simplest explanation is visibility. Workloads that used to run solo
> are being moved to a multi-tenant but non-overcommitted environment, and
> they need to know their capacity, which they used to get from system
> metrics. Now they have to get it from cgroup limit files, but the use of
> cgroup namespaces prevents those workloads from extracting the needed
> information.

I remember Shakeel said the limit may be set higher in the hierarchy for
container + siblings but then it's potentially overcommitted, no? I.e.
namespace visibility alone is not the problem.

The cgns root's memory.max is the shared medium between host and guest
through which the memory allowance can be passed -- that actually sounds
to me like Johannes' option b).

(Which leads me to an idea of memory.max.effective that'd only present
the value iff there's no sibling between the tightest ancestor..self. If
one looks at nr_tasks, it's partial but correct memory available. Not
that useful due to the partiality.)

Since I was originally a fan of the idea, I'm not a strong opponent of
plain memory.max.effective, especially when Johannes considers the option
of the kernel stepping back here, and it may help some users. But I'd
like to see the original incarnations [2] somehow linked (and maybe start
only with memory.max as that has some usecases).

Thanks,
Michal

[1] https://lore.kernel.org/all/ZcY7NmjkJMhGz8fP@host1.jankratochvil.net/
[2] https://lore.kernel.org/all/20240606152232.20253-1-mkoutny@suse.com/
Memcg-v1 exposes hierarchical_[memory|memsw]_limit counters in its
memory.stat file which applications can use to get their effective limit
which is the minimum of limits of itself and all of its ancestors. This
is pretty useful in environments where cgroup namespace is used and the
application does not have access to the full view of the cgroup
hierarchy. Let's expose effective limits for memcg v2 as well.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 Documentation/admin-guide/cgroup-v2.rst | 24 +++++++++++++
 mm/memcontrol.c                         | 48 +++++++++++++++++++++++++
 2 files changed, 72 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cb1b4e759b7e..175e9435ad5c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1311,6 +1311,14 @@ PAGE_SIZE multiple when read back.
         Caller could retry them differently, return into userspace
         as -ENOMEM or silently ignore in cases like disk readahead.

+  memory.max.effective
+        A read-only single value file which exists on non-root cgroups.
+
+        The effective limit of the cgroup i.e. the minimum memory.max
+        of all ancestors including itself. This is useful for environments
+        where cgroup namespace is being used and the application does not
+        have full view of the hierarchy.
+
   memory.reclaim
         A write-only nested-keyed file which exists for all cgroups.

@@ -1726,6 +1734,14 @@ The following nested keys are defined.
         Swap usage hard limit. If a cgroup's swap usage reaches this
         limit, anonymous memory of the cgroup will not be swapped out.

+  memory.swap.max.effective
+        A read-only single value file which exists on non-root cgroups.
+
+        The effective limit of the cgroup i.e. the minimum memory.swap.max
+        of all ancestors including itself. This is useful for environments
+        where cgroup namespace is being used and the application does not
+        have full view of the hierarchy.
+
   memory.swap.events
         A read-only flat-keyed file which exists on non-root cgroups.
         The following entries are defined. Unless specified
@@ -1766,6 +1782,14 @@ The following nested keys are defined.
         limit, it will refuse to take any more stores before existing
         entries fault back in or are written out to disk.

+  memory.zswap.max.effective
+        A read-only single value file which exists on non-root cgroups.
+
+        The effective limit of the cgroup i.e. the minimum memory.zswap.max
+        of all ancestors including itself. This is useful for environments
+        where cgroup namespace is being used and the application does not
+        have full view of the hierarchy.
+
   memory.zswap.writeback
         A read-write single value file. The default value is "1".
         Note that this setting is hierarchical, i.e. the writeback would be
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cae1c2e0cc71..8d21c1a44220 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4161,6 +4161,17 @@ static int memory_max_show(struct seq_file *m, void *v)
                READ_ONCE(mem_cgroup_from_seq(m)->memory.max));
 }

+static int memory_max_effective_show(struct seq_file *m, void *v)
+{
+        unsigned long max = PAGE_COUNTER_MAX;
+        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+        for (; memcg; memcg = parent_mem_cgroup(memcg))
+                max = min(max, READ_ONCE(memcg->memory.max));
+
+        return seq_puts_memcg_tunable(m, max);
+}
+
 static ssize_t memory_max_write(struct kernfs_open_file *of,
                                 char *buf, size_t nbytes, loff_t off)
 {
@@ -4438,6 +4449,11 @@ static struct cftype memory_files[] = {
                .seq_show = memory_max_show,
                .write = memory_max_write,
        },
+       {
+               .name = "max.effective",
+               .flags = CFTYPE_NOT_ON_ROOT,
+               .seq_show = memory_max_effective_show,
+       },
        {
                .name = "events",
                .flags = CFTYPE_NOT_ON_ROOT,
@@ -5117,6 +5133,17 @@ static int swap_max_show(struct seq_file *m, void *v)
                READ_ONCE(mem_cgroup_from_seq(m)->swap.max));
 }

+static int swap_max_effective_show(struct seq_file *m, void *v)
+{
+        unsigned long max = PAGE_COUNTER_MAX;
+        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+        for (; memcg; memcg = parent_mem_cgroup(memcg))
+                max = min(max, READ_ONCE(memcg->swap.max));
+
+        return seq_puts_memcg_tunable(m, max);
+}
+
 static ssize_t swap_max_write(struct kernfs_open_file *of,
                              char *buf, size_t nbytes, loff_t off)
 {
@@ -5166,6 +5193,11 @@ static struct cftype swap_files[] = {
                .seq_show = swap_max_show,
                .write = swap_max_write,
        },
+       {
+               .name = "swap.max.effective",
+               .flags = CFTYPE_NOT_ON_ROOT,
+               .seq_show = swap_max_effective_show,
+       },
        {
                .name = "swap.peak",
                .flags = CFTYPE_NOT_ON_ROOT,
@@ -5308,6 +5340,17 @@ static int zswap_max_show(struct seq_file *m, void *v)
                READ_ONCE(mem_cgroup_from_seq(m)->zswap_max));
 }

+static int zswap_max_effective_show(struct seq_file *m, void *v)
+{
+        unsigned long max = PAGE_COUNTER_MAX;
+        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+        for (; memcg; memcg = parent_mem_cgroup(memcg))
+                max = min(max, READ_ONCE(memcg->zswap_max));
+
+        return seq_puts_memcg_tunable(m, max);
+}
+
 static ssize_t zswap_max_write(struct kernfs_open_file *of,
                               char *buf, size_t nbytes, loff_t off)
 {
@@ -5362,6 +5405,11 @@ static struct cftype zswap_files[] = {
                .seq_show = zswap_max_show,
                .write = zswap_max_write,
        },
+       {
+               .name = "zswap.max.effective",
+               .flags = CFTYPE_NOT_ON_ROOT,
+               .seq_show = zswap_max_effective_show,
+       },
        {
                .name = "zswap.writeback",
                .seq_show = zswap_writeback_show,
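[Editor's illustration] With the patch applied, the consumer side reduces
to a single read. The following is a minimal sketch and not taken from
the series: it assumes a cgroup2 mount at /sys/fs/cgroup and a reader
sitting at the root of its cgroup namespace (the file exists on any
non-root cgroup), and it relies on the file emitting either "max" or a
value in bytes, as seq_puts_memcg_tunable() does.

#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[64];
	/* assumed path: the reader's own (namespace-root) cgroup directory */
	FILE *f = fopen("/sys/fs/cgroup/memory.max.effective", "r");

	if (!f) {
		perror("memory.max.effective");
		return 1;
	}
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return 1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';

	if (strcmp(buf, "max") == 0)
		printf("no effective memory limit anywhere in the ancestry\n");
	else
		printf("effective memory limit: %s bytes\n", buf);
	return 0;
}

memory.swap.max.effective and memory.zswap.max.effective added by the
same patch read the same way.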