Message ID | 20250205222029.2979048-1-shakeel.butt@linux.dev (mailing list archive)
---|---
State | New
Series | memcg: add hierarchical effective limits for v2
On 2/6/25 09:20, Shakeel Butt wrote:
> Memcg-v1 exposes hierarchical_[memory|memsw]_limit counters in its
> memory.stat file which applications can use to get their effective limit
> which is the minimum of limits of itself and all of its ancestors. This
> is pretty useful in environments where cgroup namespace is used and the
> application does not have access to the full view of the cgroup
> hierarchy. Let's expose effective limits for memcg v2 as well.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Even without namespaces, in a hierarchy the application might be restricted
from reading the parent cgroup's information (read permission removed, for
example).

Otherwise looks good to me.

Reviewed-by: Balbir Singh <balbirs@nvidia.com>
Hello Shakeel.

On Wed, Feb 05, 2025 at 02:20:29PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> Memcg-v1 exposes hierarchical_[memory|memsw]_limit counters in its
> memory.stat file which applications can use to get their effective limit
> which is the minimum of limits of itself and all of its ancestors.

I was a fan of the same idea too [1]. The referenced series also tackles
change notifications (to make this complete for apps that really want to
scale based on the actual limit). I ceased to like it when I realized
there can be hierarchies where the effective value cannot be effectively
:) determined [2].

> This is pretty useful in environments where cgroup namespace is used
> and the application does not have access to the full view of the
> cgroup hierarchy. Let's expose effective limits for memcg v2 as well.

Also, the case for exposing this was never strongly built.
Why isn't PSI enough in your case?

Thanks,
Michal

[1] https://lore.kernel.org/r/20240606152232.20253-1-mkoutny@suse.com
[2] https://lore.kernel.org/r/7chi6d2sdhwdsfihoxqmtmi4lduea3dsgc7xorvonugkm4qz2j@gehs4slutmtg
On Thu, Feb 06, 2025 at 04:57:39PM +0100, Michal Koutný wrote:
> Hello Shakeel.
>
> On Wed, Feb 05, 2025 at 02:20:29PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > Memcg-v1 exposes hierarchical_[memory|memsw]_limit counters in its
> > memory.stat file which applications can use to get their effective limit
> > which is the minimum of limits of itself and all of its ancestors.
>
> I was a fan of the same idea too [1]. The referenced series also tackles
> change notifications (to make this complete for apps that really want to
> scale based on the actual limit). I ceased to like it when I realized
> there can be hierarchies where the effective value cannot be effectively
> :) determined [2].
>
> > This is pretty useful in environments where cgroup namespace is used
> > and the application does not have access to the full view of the
> > cgroup hierarchy. Let's expose effective limits for memcg v2 as well.
>
> Also, the case for exposing this was never strongly built.
> Why isn't PSI enough in your case?
>

Hi Michal,

Oh I totally forgot about your series. In my use-case, it is not about
dynamically knowing how much they can expand and adjust themselves but
rather knowing statically upfront what resources they have been given.
More concretely, these are workloads which used to completely occupy a
single machine, though within containers but without limits. These
workloads used to look at machine-level metrics at startup to see how
many resources are available.

Now these workloads are being moved to a multi-tenant environment but
the machine is still partitioned statically between the workloads. So,
these workloads need to know upfront how many resources are allocated to
them, and the way the cgroup hierarchy is set up, that information is a
bit above them in the tree.

I hope this clarifies the motivation behind this change, i.e. the target
is not dynamic load balancing but rather upfront static knowledge.

thanks,
Shakeel
On Thu, Feb 6, 2025 at 11:09 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Feb 06, 2025 at 04:57:39PM +0100, Michal Koutný wrote:
> > Hello Shakeel.
> >
> > On Wed, Feb 05, 2025 at 02:20:29PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > Memcg-v1 exposes hierarchical_[memory|memsw]_limit counters in its
> > > memory.stat file which applications can use to get their effective limit
> > > which is the minimum of limits of itself and all of its ancestors.
> >
> > I was a fan of the same idea too [1]. The referenced series also tackles
> > change notifications (to make this complete for apps that really want to
> > scale based on the actual limit). I ceased to like it when I realized
> > there can be hierarchies where the effective value cannot be effectively
> > :) determined [2].
> >
> > > This is pretty useful in environments where cgroup namespace is used
> > > and the application does not have access to the full view of the
> > > cgroup hierarchy. Let's expose effective limits for memcg v2 as well.
> >
> > Also, the case for exposing this was never strongly built.
> > Why isn't PSI enough in your case?
> >
>
> Hi Michal,
>
> Oh I totally forgot about your series. In my use-case, it is not about
> dynamically knowing how much they can expand and adjust themselves but
> rather knowing statically upfront what resources they have been given.
> More concretely, these are workloads which used to completely occupy a
> single machine, though within containers but without limits. These
> workloads used to look at machine-level metrics at startup to see how
> many resources are available.
>
> Now these workloads are being moved to a multi-tenant environment but
> the machine is still partitioned statically between the workloads. So,
> these workloads need to know upfront how many resources are allocated to
> them, and the way the cgroup hierarchy is set up, that information is a
> bit above them in the tree.
>
> I hope this clarifies the motivation behind this change, i.e. the target
> is not dynamic load balancing but rather upfront static knowledge.
>
> thanks,
> Shakeel
>

We've been thinking of using memcg to both protect (memory.min) and limit
(via memcg OOM) memory-hungry apps (games), while informing such apps of
their upper limit so they know how much they can allocate before risking
being killed. Visibility of the cgroup hierarchy isn't an issue, but
having a single file to read instead of walking up the tree with multiple
reads to calculate an effective limit would be nice. Partial memcg
activation in the hierarchy *is* an issue, but walking up to the closest
ancestor with memcg activated is better than reading all the way up.
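[Editor's illustration] The ancestor walk described above, i.e. what an
application has to do today without memory.max.effective, looks roughly
like the sketch below from userspace. This is illustrative only and not
part of the patch: it assumes a cgroup2 mount at /sys/fs/cgroup and that
every ancestor's memory.max is readable, which is exactly what cgroup
namespaces and restrictive permissions do not guarantee.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read memory.max from one cgroup directory; "max" or an unreadable
 * file is treated as "unlimited". */
static unsigned long long read_limit(const char *dir)
{
	char path[PATH_MAX], buf[64];
	unsigned long long val = ULLONG_MAX;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.max", dir);
	f = fopen(path, "r");
	if (!f)
		return ULLONG_MAX;
	if (fgets(buf, sizeof(buf), f) && strncmp(buf, "max", 3) != 0)
		val = strtoull(buf, NULL, 10);
	fclose(f);
	return val;
}

int main(void)
{
	char dir[PATH_MAX] = "/sys/fs/cgroup";
	char line[PATH_MAX];
	unsigned long long eff = ULLONG_MAX;
	FILE *f = fopen("/proc/self/cgroup", "r");

	/* find our own cgroup2 path: the "0::/<path>" line */
	while (f && fgets(line, sizeof(line), f)) {
		if (strncmp(line, "0::", 3) == 0) {
			line[strcspn(line, "\n")] = '\0';
			strncat(dir, line + 3, sizeof(dir) - strlen(dir) - 1);
			break;
		}
	}
	if (f)
		fclose(f);

	/* walk from our own cgroup up to the mount root, taking the minimum */
	for (;;) {
		unsigned long long v = read_limit(dir);
		char *slash;

		if (v < eff)
			eff = v;
		if (strcmp(dir, "/sys/fs/cgroup") == 0)
			break;
		slash = strrchr(dir, '/');
		if (!slash)
			break;
		*slash = '\0';
	}

	if (eff == ULLONG_MAX)
		printf("effective memory.max: max\n");
	else
		printf("effective memory.max: %llu bytes\n", eff);
	return 0;
}

With cgroup namespaces in play, the walk above stops at the namespace root
and silently misses any tighter limit configured above it, which is the
gap memory.max.effective is meant to close.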
Oops, I forgot to CC Andrew.

On Wed, Feb 05, 2025 at 02:20:29PM -0800, Shakeel Butt wrote:
> Memcg-v1 exposes hierarchical_[memory|memsw]_limit counters in its
> memory.stat file which applications can use to get their effective limit
> which is the minimum of limits of itself and all of its ancestors. This
> is pretty useful in environments where cgroup namespace is used and the
> application does not have access to the full view of the cgroup
> hierarchy. Let's expose effective limits for memcg v2 as well.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 24 +++++++++++++
>  mm/memcontrol.c                         | 48 +++++++++++++++++++++++++
>  2 files changed, 72 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index cb1b4e759b7e..175e9435ad5c 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1311,6 +1311,14 @@ PAGE_SIZE multiple when read back.
>          Caller could retry them differently, return into userspace
>          as -ENOMEM or silently ignore in cases like disk readahead.
>
> +  memory.max.effective
> +        A read-only single value file which exists on non-root cgroups.
> +
> +        The effective limit of the cgroup i.e. the minimum memory.max
> +        of all ancestors including itself. This is useful for environments
> +        where cgroup namespace is being used and the application does not
> +        have full view of the hierarchy.
> +
>    memory.reclaim
>          A write-only nested-keyed file which exists for all cgroups.
>
> @@ -1726,6 +1734,14 @@ The following nested keys are defined.
>          Swap usage hard limit. If a cgroup's swap usage reaches this
>          limit, anonymous memory of the cgroup will not be swapped out.
>
> +  memory.swap.max.effective
> +        A read-only single value file which exists on non-root cgroups.
> +
> +        The effective limit of the cgroup i.e. the minimum memory.swap.max
> +        of all ancestors including itself. This is useful for environments
> +        where cgroup namespace is being used and the application does not
> +        have full view of the hierarchy.
> +
>    memory.swap.events
>          A read-only flat-keyed file which exists on non-root cgroups.
>          The following entries are defined. Unless specified
> @@ -1766,6 +1782,14 @@ The following nested keys are defined.
>          limit, it will refuse to take any more stores before existing
>          entries fault back in or are written out to disk.
>
> +  memory.zswap.max.effective
> +        A read-only single value file which exists on non-root cgroups.
> +
> +        The effective limit of the cgroup i.e. the minimum memory.zswap.max
> +        of all ancestors including itself. This is useful for environments
> +        where cgroup namespace is being used and the application does not
> +        have full view of the hierarchy.
> +
>    memory.zswap.writeback
>          A read-write single value file. The default value is "1".
>          Note that this setting is hierarchical, i.e. the writeback would be
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cae1c2e0cc71..8d21c1a44220 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4161,6 +4161,17 @@ static int memory_max_show(struct seq_file *m, void *v)
>                 READ_ONCE(mem_cgroup_from_seq(m)->memory.max));
>  }
>
> +static int memory_max_effective_show(struct seq_file *m, void *v)
> +{
> +        unsigned long max = PAGE_COUNTER_MAX;
> +        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +        for (; memcg; memcg = parent_mem_cgroup(memcg))
> +                max = min(max, READ_ONCE(memcg->memory.max));
> +
> +        return seq_puts_memcg_tunable(m, max);
> +}
> +
>  static ssize_t memory_max_write(struct kernfs_open_file *of,
>                                  char *buf, size_t nbytes, loff_t off)
>  {
> @@ -4438,6 +4449,11 @@ static struct cftype memory_files[] = {
>                 .seq_show = memory_max_show,
>                 .write = memory_max_write,
>         },
> +       {
> +               .name = "max.effective",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .seq_show = memory_max_effective_show,
> +       },
>         {
>                 .name = "events",
>                 .flags = CFTYPE_NOT_ON_ROOT,
> @@ -5117,6 +5133,17 @@ static int swap_max_show(struct seq_file *m, void *v)
>                 READ_ONCE(mem_cgroup_from_seq(m)->swap.max));
>  }
>
> +static int swap_max_effective_show(struct seq_file *m, void *v)
> +{
> +        unsigned long max = PAGE_COUNTER_MAX;
> +        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +        for (; memcg; memcg = parent_mem_cgroup(memcg))
> +                max = min(max, READ_ONCE(memcg->swap.max));
> +
> +        return seq_puts_memcg_tunable(m, max);
> +}
> +
>  static ssize_t swap_max_write(struct kernfs_open_file *of,
>                               char *buf, size_t nbytes, loff_t off)
>  {
> @@ -5166,6 +5193,11 @@ static struct cftype swap_files[] = {
>                 .seq_show = swap_max_show,
>                 .write = swap_max_write,
>         },
> +       {
> +               .name = "swap.max.effective",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .seq_show = swap_max_effective_show,
> +       },
>         {
>                 .name = "swap.peak",
>                 .flags = CFTYPE_NOT_ON_ROOT,
> @@ -5308,6 +5340,17 @@ static int zswap_max_show(struct seq_file *m, void *v)
>                 READ_ONCE(mem_cgroup_from_seq(m)->zswap_max));
>  }
>
> +static int zswap_max_effective_show(struct seq_file *m, void *v)
> +{
> +        unsigned long max = PAGE_COUNTER_MAX;
> +        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +        for (; memcg; memcg = parent_mem_cgroup(memcg))
> +                max = min(max, READ_ONCE(memcg->zswap_max));
> +
> +        return seq_puts_memcg_tunable(m, max);
> +}
> +
>  static ssize_t zswap_max_write(struct kernfs_open_file *of,
>                                char *buf, size_t nbytes, loff_t off)
>  {
> @@ -5362,6 +5405,11 @@ static struct cftype zswap_files[] = {
>                 .seq_show = zswap_max_show,
>                 .write = zswap_max_write,
>         },
> +       {
> +               .name = "zswap.max.effective",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .seq_show = zswap_max_effective_show,
> +       },
>         {
>                 .name = "zswap.writeback",
>                 .seq_show = zswap_writeback_show,
> --
> 2.43.5
>
Hello.

On Thu, Feb 06, 2025 at 11:09:05AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> Oh I totally forgot about your series. In my use-case, it is not about
> dynamically knowing how much they can expand and adjust themselves but
> rather knowing statically upfront what resources they have been given.

From the memcg PoV, the effective value doesn't tell how much they were
given (because of sharing).

> More concretely, these are workloads which used to completely occupy a
> single machine, though within containers but without limits. These
> workloads used to look at machine-level metrics at startup to see how
> many resources are available.

I've been there but haven't found a convincing mapping of global to memcg
limits.

The issue is that such a value won't guarantee no OOM when below because
it can be (generally) effectively shared.

(Alas, apps typically don't express their memory needs in units of
PSI. So it boils down to a system wide monitor like systemd-oomd and
cooperation with it.)

> Now these workloads are being moved to a multi-tenant environment but
> the machine is still partitioned statically between the workloads. So,
> these workloads need to know upfront how many resources are allocated to
> them, and the way the cgroup hierarchy is set up, that information is a
> bit above them in the tree.

FTR, e.g. in systemd setups, this can be partially overcome by the exposed
EffectiveMemoryMax= (the service manager who configures the resources can
also do the ancestry traversal).
kubernetes has the downward API where generic resource info is shared into
containers, and I recall that lxcfs could mangle procfs memory info wrt
memory limits for legacy apps.

As I think about it, the cgroupns (in)visibility should be resolved by
assigning the proper limit to the namespace's root group memory.max (read
only for contained user) and the traversal...

On Thu, Feb 06, 2025 at 11:37:31AM -0800, "T.J. Mercier" <tjmercier@google.com> wrote:
> but having a single file to read instead of walking up the
> tree with multiple reads to calculate an effective limit would be
> nice.

...in the kernel is nice but the possible performance gain isn't worth
hiding the shareability of the effective limit.

So I wonder what is the current PoV of more MM people...

Michal
On Mon, Feb 10, 2025 at 05:24:17PM +0100, Michal Koutný wrote:
> Hello.
>
> On Thu, Feb 06, 2025 at 11:09:05AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > Oh I totally forgot about your series. In my use-case, it is not about
> > dynamically knowing how much they can expand and adjust themselves but
> > rather knowing statically upfront what resources they have been given.
>
> From the memcg PoV, the effective value doesn't tell how much they were
> given (because of sharing).
>
> > More concretely, these are workloads which used to completely occupy a
> > single machine, though within containers but without limits. These
> > workloads used to look at machine-level metrics at startup to see how
> > many resources are available.
>
> I've been there but haven't found a convincing mapping of global to memcg
> limits.
>
> The issue is that such a value won't guarantee no OOM when below because
> it can be (generally) effectively shared.
>
> (Alas, apps typically don't express their memory needs in units of
> PSI. So it boils down to a system wide monitor like systemd-oomd and
> cooperation with it.)
>

I think you missed the static partitioning of resources use-case I
mentioned. The issue you are pointing out exists for system-level metrics
as well, i.e. a workload looking at system metrics can't say how much it
has been given, but in my specific case, the workloads know they occupy
the full machine. Now we want to move such workloads to a multi-tenant
environment, but the resources are still statically partitioned and not
overcommitted, so the effective limit will tell them how much they have
been given.

> > Now these workloads are being moved to a multi-tenant environment but
> > the machine is still partitioned statically between the workloads. So,
> > these workloads need to know upfront how many resources are allocated to
> > them, and the way the cgroup hierarchy is set up, that information is a
> > bit above them in the tree.
>
> FTR, e.g. in systemd setups, this can be partially overcome by the exposed
> EffectiveMemoryMax= (the service manager who configures the resources can
> also do the ancestry traversal).
> kubernetes has the downward API where generic resource info is shared into
> containers, and I recall that lxcfs could mangle procfs memory info wrt
> memory limits for legacy apps.
>
> As I think about it, the cgroupns (in)visibility should be resolved by
> assigning the proper limit to the namespace's root group memory.max (read
> only for contained user) and the traversal...
>

I think your point here is: why not have a userspace-based solution? I
think it is possible but not convenient, and it adds an external
dependency to the workload.

>
> On Thu, Feb 06, 2025 at 11:37:31AM -0800, "T.J. Mercier" <tjmercier@google.com> wrote:
> > but having a single file to read instead of walking up the
> > tree with multiple reads to calculate an effective limit would be
> > nice.
>
> ...in the kernel is nice but the possible performance gain isn't worth
> hiding the shareability of the effective limit.
>
> So I wonder what is the current PoV of more MM people...

Yup, let's see more opinions on this. Thanks, Michal, for your feedback.
On Mon, Feb 10, 2025 at 05:24:17PM +0100, Michal Koutný wrote:
> Hello.
>
> On Thu, Feb 06, 2025 at 11:09:05AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > Oh I totally forgot about your series. In my use-case, it is not about
> > dynamically knowing how much they can expand and adjust themselves but
> > rather knowing statically upfront what resources they have been given.
>
> From the memcg PoV, the effective value doesn't tell how much they were
> given (because of sharing).

It's definitely true that if you have an ancestral limit for several
otherwise unlimited siblings, then interpreting this number as "this
is how much memory I have available" will be completely misleading.

I would also say that sharing a limit with several siblings requires a
certain degree of awareness and cooperation between them. From that
POV, IMO it would be fine to provide a metric with contextual caveats.

The problem is, what do we do with canned, unaware, maybe untrusted
applications? And they don't necessarily know which they are.

It depends heavily on the judgement of the administrator of any given
deployment. Some workloads might be completely untrusted and hard
limited. Another deployment might consider the same workload
reasonably predictable that it's configured only with a failsafe max
limit that is much higher than where the workload is *expected* to
operate. The allotment might happen altogether with min/low
protections and no max limit. Or there could be a combination of
protection slightly below and a limit slightly above the expected
workload size.

It seems basically impossible to write portable code against this
without knowing the intent of the person setting it up.

But how do we communicate intent down to the container? The two broad
options are implicitly or explicitly:

a) Provide a cgroup file that automatically derives intended target
   size from how min/low/high/max are set up.

   Right now those can be set up super loosely depending on what the
   administrator thinks about the application. In order for this to
   work, we'd likely have to define an idiomatic way of configuring
   the controller. E.g. if you set max by itself, we assume this is
   the target size. If you set low, with or without max, then low is
   the target size. Or if you set both, target is in between.

   I'm not completely convinced this is workable. It might require
   settings beyond what's actually needed for the safe containment of
   the workload, which carries the risk of excluding something useful.
   I don't mean enforced configuration rules, but rather the case where
   a configuration is reasonable and effective given the workload and
   environment, but now the target file shows nonsense.

b) Provide a cgroup file that is freely configurable by the
   administrator with the target size of the container.

   This has obvious drawbacks as well. What's the default value? Also,
   a lot of setups are dead simple: set a hard limit and expect the
   workload to adhere to that, period. Nobody is going to reliably set
   another cgroup file that a workload may or may not consume.

The third option is to wash our hands of all of this, provide the
static hierarchy settings to the leaves (like this patch, plus do it
for the other knobs as well) and let userspace figure it out.

Thoughts?
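[Editor's illustration] To make option a) concrete, here is a purely
illustrative sketch, not proposed kernel code: the function name and the
midpoint rule are made up for the example, and ULONG_MAX stands in for an
unconfigured knob. It shows the kind of convention such a derived file
would have to bake in.

#include <limits.h>

/*
 * Hypothetical "target size" heuristic for option a): max alone means
 * max is the target, low (with or without max) means low is the target,
 * and both set means somewhere in between; the midpoint is an arbitrary
 * choice for this sketch.
 */
static unsigned long derive_target_size(unsigned long low, unsigned long max)
{
	if (low == 0 && max == ULONG_MAX)
		return ULONG_MAX;		/* nothing configured: no target */
	if (low == 0)
		return max;			/* only a hard limit */
	if (max == ULONG_MAX)
		return low;			/* only a protection */
	return low + (max - low) / 2;		/* both set: in between */
}

The ambiguity Johannes points out remains: the same low/max combination
can express very different administrator intent, so any such derivation
hard-codes one configuration convention.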
On Mon, Feb 10, 2025 at 05:52:34PM -0500, Johannes Weiner wrote:
> On Mon, Feb 10, 2025 at 05:24:17PM +0100, Michal Koutný wrote:
> > Hello.
> >
> > On Thu, Feb 06, 2025 at 11:09:05AM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > Oh I totally forgot about your series. In my use-case, it is not about
> > > dynamically knowing how much they can expand and adjust themselves but
> > > rather knowing statically upfront what resources they have been given.
> >
> > From the memcg PoV, the effective value doesn't tell how much they were
> > given (because of sharing).
>
> It's definitely true that if you have an ancestral limit for several
> otherwise unlimited siblings, then interpreting this number as "this
> is how much memory I have available" will be completely misleading.
>
> I would also say that sharing a limit with several siblings requires a
> certain degree of awareness and cooperation between them. From that
> POV, IMO it would be fine to provide a metric with contextual caveats.
>
> The problem is, what do we do with canned, unaware, maybe untrusted
> applications? And they don't necessarily know which they are.
>
> It depends heavily on the judgement of the administrator of any given
> deployment. Some workloads might be completely untrusted and hard
> limited. Another deployment might consider the same workload
> reasonably predictable that it's configured only with a failsafe max
> limit that is much higher than where the workload is *expected* to
> operate. The allotment might happen altogether with min/low
> protections and no max limit. Or there could be a combination of
> protection slightly below and a limit slightly above the expected
> workload size.
>
> It seems basically impossible to write portable code against this
> without knowing the intent of the person setting it up.
>
> But how do we communicate intent down to the container? The two broad
> options are implicitly or explicitly:
>
> a) Provide a cgroup file that automatically derives intended target
>    size from how min/low/high/max are set up.
>
>    Right now those can be set up super loosely depending on what the
>    administrator thinks about the application. In order for this to
>    work, we'd likely have to define an idiomatic way of configuring
>    the controller. E.g. if you set max by itself, we assume this is
>    the target size. If you set low, with or without max, then low is
>    the target size. Or if you set both, target is in between.
>
>    I'm not completely convinced this is workable.

This sounds like memory.available. It's hard to implement well,
especially taking into account things like NUMA, memory sharing,
estimating how much can be reclaimed, etc. But at the same time there is
value in providing such a metric. There is a clear use case. And it's
even harder to implement this in userspace.

> b) Provide a cgroup file that is freely configurable by the
>    administrator with the target size of the container.
>
>    This has obvious drawbacks as well. What's the default value? Also,
>    a lot of setups are dead simple: set a hard limit and expect the
>    workload to adhere to that, period. Nobody is going to reliably set
>    another cgroup file that a workload may or may not consume.

Yeah, this is a weird option.

> The third option is to wash our hands of all of this, provide the
> static hierarchy settings to the leaves (like this patch, plus do it
> for the other knobs as well) and let userspace figure it out.

Idk, I see very little value in it. I'm not necessarily opposing this
patchset, just not seeing a lot of value.

Maybe I'm missing something, but somehow it wasn't a problem for many years.
Nothing really changed here.

So maybe someone can come up with a better explanation of a specific problem
we're trying to solve here?

Thanks!
On Tue, Feb 11, 2025 at 04:55:33AM +0000, Roman Gushchin wrote:
[...]
>
> Maybe I'm missing something, but somehow it wasn't a problem for many years.
> Nothing really changed here.
>
> So maybe someone can come up with a better explanation of a specific problem
> we're trying to solve here?

The simplest explanation is visibility. Workloads that used to run solo
are being moved to a multi-tenant but non-overcommitted environment, and
they need to know their capacity, which they used to get from system
metrics. Now they have to get it from cgroup limit files, but the use of
cgroup namespaces prevents those workloads from extracting the needed
information.
Hello.

On Tue, Feb 11, 2025 at 05:08:03PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > So maybe someone can come up with a better explanation of a specific problem
> > we're trying to solve here?

In my experience, another factor is the switch from v1 to v2 (which
propagates slower to downstreams) and applications that rely on
memory.stat:hierarchical_memory_limit.

(Funnily enough, the commit fee7b548e6f2b ("memcg: show real limit under
hierarchy mode") introduces it primarily for debugging purposes (not
sizing): an application being killed with no apparent (immediate) limit
breach.)

Roman, you may also remember that it had already popped up ~a year ago [1].

> The simplest explanation is visibility. Workloads that used to run solo
> are being moved to a multi-tenant but non-overcommitted environment, and
> they need to know their capacity, which they used to get from system
> metrics. Now they have to get it from cgroup limit files, but the use of
> cgroup namespaces prevents those workloads from extracting the needed
> information.

I remember Shakeel said the limit may be set higher in the hierarchy for
container + siblings but then it's potentially overcommitted, no? I.e.
namespace visibility alone is not the problem.

The cgns root's memory.max is the shared medium between host and guest
through which the memory allowance can be passed -- that actually sounds
to me like Johannes' option b).

(Which leads me to an idea of memory.max.effective that'd only present
the value iff there's no sibling between the tightest ancestor..self. If
one looks at nr_tasks, it's partial but correct memory available. Not
that useful due to the partiality.)

Since I was originally a fan of the idea, I'm not a strong opponent of
plain memory.max.effective, especially when Johannes considers the option
of the kernel stepping back here, and it may help some users. But I'd
like to see the original incarnations [2] somehow linked (and maybe start
only with memory.max as that has some usecases).

Thanks,
Michal

[1] https://lore.kernel.org/all/ZcY7NmjkJMhGz8fP@host1.jankratochvil.net/
[2] https://lore.kernel.org/all/20240606152232.20253-1-mkoutny@suse.com/
Memcg-v1 exposes hierarchical_[memory|memsw]_limit counters in its
memory.stat file which applications can use to get their effective limit
which is the minimum of limits of itself and all of its ancestors. This
is pretty useful in environments where cgroup namespace is used and the
application does not have access to the full view of the cgroup
hierarchy. Let's expose effective limits for memcg v2 as well.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 Documentation/admin-guide/cgroup-v2.rst | 24 +++++++++++++
 mm/memcontrol.c                         | 48 +++++++++++++++++++++++++
 2 files changed, 72 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cb1b4e759b7e..175e9435ad5c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1311,6 +1311,14 @@ PAGE_SIZE multiple when read back.
         Caller could retry them differently, return into userspace
         as -ENOMEM or silently ignore in cases like disk readahead.

+  memory.max.effective
+        A read-only single value file which exists on non-root cgroups.
+
+        The effective limit of the cgroup i.e. the minimum memory.max
+        of all ancestors including itself. This is useful for environments
+        where cgroup namespace is being used and the application does not
+        have full view of the hierarchy.
+
   memory.reclaim
         A write-only nested-keyed file which exists for all cgroups.

@@ -1726,6 +1734,14 @@ The following nested keys are defined.
         Swap usage hard limit. If a cgroup's swap usage reaches this
         limit, anonymous memory of the cgroup will not be swapped out.

+  memory.swap.max.effective
+        A read-only single value file which exists on non-root cgroups.
+
+        The effective limit of the cgroup i.e. the minimum memory.swap.max
+        of all ancestors including itself. This is useful for environments
+        where cgroup namespace is being used and the application does not
+        have full view of the hierarchy.
+
   memory.swap.events
         A read-only flat-keyed file which exists on non-root cgroups.
         The following entries are defined. Unless specified
@@ -1766,6 +1782,14 @@ The following nested keys are defined.
         limit, it will refuse to take any more stores before existing
         entries fault back in or are written out to disk.

+  memory.zswap.max.effective
+        A read-only single value file which exists on non-root cgroups.
+
+        The effective limit of the cgroup i.e. the minimum memory.zswap.max
+        of all ancestors including itself. This is useful for environments
+        where cgroup namespace is being used and the application does not
+        have full view of the hierarchy.
+
   memory.zswap.writeback
         A read-write single value file. The default value is "1".
         Note that this setting is hierarchical, i.e. the writeback would be
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cae1c2e0cc71..8d21c1a44220 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4161,6 +4161,17 @@ static int memory_max_show(struct seq_file *m, void *v)
                READ_ONCE(mem_cgroup_from_seq(m)->memory.max));
 }

+static int memory_max_effective_show(struct seq_file *m, void *v)
+{
+        unsigned long max = PAGE_COUNTER_MAX;
+        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+        for (; memcg; memcg = parent_mem_cgroup(memcg))
+                max = min(max, READ_ONCE(memcg->memory.max));
+
+        return seq_puts_memcg_tunable(m, max);
+}
+
 static ssize_t memory_max_write(struct kernfs_open_file *of,
                                 char *buf, size_t nbytes, loff_t off)
 {
@@ -4438,6 +4449,11 @@ static struct cftype memory_files[] = {
                .seq_show = memory_max_show,
                .write = memory_max_write,
        },
+       {
+               .name = "max.effective",
+               .flags = CFTYPE_NOT_ON_ROOT,
+               .seq_show = memory_max_effective_show,
+       },
        {
                .name = "events",
                .flags = CFTYPE_NOT_ON_ROOT,
@@ -5117,6 +5133,17 @@ static int swap_max_show(struct seq_file *m, void *v)
                READ_ONCE(mem_cgroup_from_seq(m)->swap.max));
 }

+static int swap_max_effective_show(struct seq_file *m, void *v)
+{
+        unsigned long max = PAGE_COUNTER_MAX;
+        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+        for (; memcg; memcg = parent_mem_cgroup(memcg))
+                max = min(max, READ_ONCE(memcg->swap.max));
+
+        return seq_puts_memcg_tunable(m, max);
+}
+
 static ssize_t swap_max_write(struct kernfs_open_file *of,
                              char *buf, size_t nbytes, loff_t off)
 {
@@ -5166,6 +5193,11 @@ static struct cftype swap_files[] = {
                .seq_show = swap_max_show,
                .write = swap_max_write,
        },
+       {
+               .name = "swap.max.effective",
+               .flags = CFTYPE_NOT_ON_ROOT,
+               .seq_show = swap_max_effective_show,
+       },
        {
                .name = "swap.peak",
                .flags = CFTYPE_NOT_ON_ROOT,
@@ -5308,6 +5340,17 @@ static int zswap_max_show(struct seq_file *m, void *v)
                READ_ONCE(mem_cgroup_from_seq(m)->zswap_max));
 }

+static int zswap_max_effective_show(struct seq_file *m, void *v)
+{
+        unsigned long max = PAGE_COUNTER_MAX;
+        struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+        for (; memcg; memcg = parent_mem_cgroup(memcg))
+                max = min(max, READ_ONCE(memcg->zswap_max));
+
+        return seq_puts_memcg_tunable(m, max);
+}
+
 static ssize_t zswap_max_write(struct kernfs_open_file *of,
                               char *buf, size_t nbytes, loff_t off)
 {
@@ -5362,6 +5405,11 @@ static struct cftype zswap_files[] = {
                .seq_show = zswap_max_show,
                .write = zswap_max_write,
        },
+       {
+               .name = "zswap.max.effective",
+               .flags = CFTYPE_NOT_ON_ROOT,
+               .seq_show = zswap_max_effective_show,
+       },
        {
                .name = "zswap.writeback",
                .seq_show = zswap_writeback_show,
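[Editor's illustration] With the patch applied, the consumer side reduces
to a single read. The following is a minimal sketch and not taken from
the series: it assumes a cgroup2 mount at /sys/fs/cgroup and a reader
sitting at the root of its cgroup namespace (the file exists on any
non-root cgroup), and it relies on the file emitting either "max" or a
value in bytes, as seq_puts_memcg_tunable() does.

#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[64];
	/* assumed path: the reader's own (namespace-root) cgroup directory */
	FILE *f = fopen("/sys/fs/cgroup/memory.max.effective", "r");

	if (!f) {
		perror("memory.max.effective");
		return 1;
	}
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return 1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';

	if (strcmp(buf, "max") == 0)
		printf("no effective memory limit anywhere in the ancestry\n");
	else
		printf("effective memory limit: %s bytes\n", buf);
	return 0;
}

memory.swap.max.effective and memory.zswap.max.effective added by the
same patch read the same way.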