
memcg: introduce non-blocking limit setting interfaces

Message ID 20250418195956.64824-1-shakeel.butt@linux.dev (mailing list archive)
State New
Series: memcg: introduce non-blocking limit setting interfaces

Commit Message

Shakeel Butt April 18, 2025, 7:59 p.m. UTC
Setting the max and high limits can trigger synchronous reclaim and/or
oom-kill if the usage is higher than the given limit. This behavior is
fine for newly created cgroups but it can cause issues for the node
controller while setting limits for existing cgroups.

In our production multi-tenant and overcommitted environment, we are
seeing priority inversion when the node controller dynamically adjusts
the limits of running jobs of different priorities. Based on the system
situation, the node controller may reduce the limits of lower priority
jobs and increase the limits of higher priority jobs. However, we are
seeing the node controller get stuck for long periods of time while
setting the limits of lower priority jobs, reclaiming on their behalf
and burning a lot of its own CPU in the process.

One of the workarounds we have tried is to fork a new process which
sets the limit of the lower priority job, with an alarm armed so the
process gets killed if it gets stuck in reclaim for that job. However,
we have found this approach unreliable and costly: either we give the
alarm a large enough time buffer after setting the limit, potentially
spending a lot of CPU in reclaim, or we use much shorter and cheaper
(less reclaim) alarms and fail to reliably set the limit at all.
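For illustration, the fork+set-limit+alarm workaround looks roughly
like the following (a minimal sketch, not our actual node controller
code; the cgroup path, timeout and helper name are made up):

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write @limit to the cgroup file at @path from a disposable child.
 * If the write blocks in reclaim for more than @timeout_secs, the
 * child is killed by the default SIGALRM action and we report failure.
 */
static int set_limit_with_timeout(const char *path, const char *limit,
				  unsigned int timeout_secs)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		FILE *f;

		alarm(timeout_secs);	/* arm the kill switch */
		f = fopen(path, "w");
		if (!f)
			_exit(1);
		if (fputs(limit, f) == EOF)
			_exit(1);
		fclose(f);		/* flush; may block in reclaim */
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	/* Killed by SIGALRM: the limit write got stuck. */
	if (WIFSIGNALED(status))
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

A caller cannot tell, on timeout, whether the limit was actually
stored before the child was killed, which is part of why this scheme
is unreliable.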

Let's introduce new limit setting interfaces which do not trigger
reclaim and/or oom-kill, and instead let the processes in the target
cgroup trigger reclaim and/or throttling and/or oom-kill on their next
charge request. This will make the node controller much more reliable
in multi-tenant overcommitted environments.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 Documentation/admin-guide/cgroup-v2.rst | 16 +++++++++
 mm/memcontrol.c                         | 46 +++++++++++++++++++++++++
 2 files changed, 62 insertions(+)

Comments

Greg Thelen April 18, 2025, 8:18 p.m. UTC | #1
On Fri, Apr 18, 2025 at 1:00 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> [...]

Would opening the typical synchronous files (e.g. memory.max) with
O_NONBLOCK be a more general way to tell the kernel that the user
space controller doesn't want to wait? It's not quite consistent with
the traditional use of O_NONBLOCK, which makes operations either fully
succeed or fail rather than altering the operation being requested.
But O_NONBLOCK would allow for a semantics of non-blocking
reclaim, if that's fast enough for your controller.

> [...]
Shakeel Butt April 18, 2025, 8:30 p.m. UTC | #2
On Fri, Apr 18, 2025 at 01:18:53PM -0700, Greg Thelen wrote:
> On Fri, Apr 18, 2025 at 1:00 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > [...]
> 
> Would opening the typical synchronous files (e.g. memory.max) with
> O_NONBLOCK be a more general way to tell the kernel that the user
> space controller doesn't want to wait? It's not quite consistent with
> the traditional use of O_NONBLOCK, which makes operations either fully
> succeed or fail rather than altering the operation being requested.
> But O_NONBLOCK would allow for a semantics of non-blocking
> reclaim, if that's fast enough for your controller.
> 

We actually thought about O_NONBLOCK but the challenge with that is how
the node controller would know whether the underlying kernel's
O_NONBLOCK implies the no-reclaim/no-oom-kill behavior. I don't think
opening memory.max with O_NONBLOCK fails today, so the node controller
would still need to implement the complicated fork+set-limit+alarm
logic until the whole fleet has moved away from older kernels. Also I
have checked with the systemd folks and they are not happy to implement
that complicated fork+set-limit+alarm logic.
Roman Gushchin April 18, 2025, 10:07 p.m. UTC | #3
On Fri, Apr 18, 2025 at 01:30:03PM -0700, Shakeel Butt wrote:
> On Fri, Apr 18, 2025 at 01:18:53PM -0700, Greg Thelen wrote:
> > On Fri, Apr 18, 2025 at 1:00 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > [...]
> > 
> > Would opening the typical synchronous files (e.g. memory.max) with
> > O_NONBLOCK be a more general way to tell the kernel that the user
> > space controller doesn't want to wait? It's not quite consistent with
> > the traditional use of O_NONBLOCK, which makes operations either fully
> > succeed or fail rather than altering the operation being requested.
> > But O_NONBLOCK would allow for a semantics of non-blocking
> > reclaim, if that's fast enough for your controller.

+1

> > 
> 
> We actually thought about O_NONBLOCK but the challenge with that is how
> would the node controller knows if the underlying kernel has O_NONBLOCK
> implying no-reclaim/no-oom-kill feature. I don't think opening
> memory.max with O_NONBLOCK will fail today, so the node controller would
> still need to implement the complicated fork+set-limit+alarm logic
> until the whole fleet has moved away from older kernel. Also I have
> checked with systemd folks and they are not happy to implement that
> complicated fork+set-limit+alarm logic.

/sys/kernel/cgroup/features ?
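A flag there could be probed from the node controller roughly as
follows (a sketch; the "memory_nonblock" token is hypothetical, and it
assumes the one-token-per-line layout of that file):

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if @token appears on its own line in @features_path
 * (e.g. /sys/kernel/cgroup/features), 0 otherwise. */
static int has_feature(const char *features_path, const char *token)
{
	char line[256];
	FILE *f = fopen(features_path, "r");

	if (!f)
		return 0;	/* old kernel or file missing: assume absent */
	while (fgets(line, sizeof(line), f)) {
		line[strcspn(line, "\n")] = '\0';
		if (!strcmp(line, token)) {
			fclose(f);
			return 1;
		}
	}
	fclose(f);
	return 0;
}
```

With such a flag the controller could fall back to the
fork+set-limit+alarm scheme only on kernels that do not advertise the
feature.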
Tejun Heo April 19, 2025, 3:15 a.m. UTC | #4
On Fri, Apr 18, 2025 at 04:08:42PM -0700, Shakeel Butt wrote:
> Any reasons to prefer one over the other? To me having separate
> files/interfaces seem more clean and are more script friendly. Also
> let's see what others have to say or prefer.

I kinda like O_NONBLOCK. The subtlety level of the interface seems to match
that of the implemented behavior.

Thanks.
Shakeel Butt April 19, 2025, 4:36 p.m. UTC | #5
On Fri, Apr 18, 2025 at 05:15:38PM -1000, Tejun Heo wrote:
> On Fri, Apr 18, 2025 at 04:08:42PM -0700, Shakeel Butt wrote:
> > Any reasons to prefer one over the other? To me having separate
> > files/interfaces seem more clean and are more script friendly. Also
> > let's see what others have to say or prefer.
> 
> I kinda like O_NONBLOCK. The subtlety level of the interface seems to match
> that of the implemented behavior.
> 

Ok, it seems like more people prefer O_NONBLOCK, so be it. I will send
v2 soon.

Also, I would like to request backporting this to stable kernels. Let
me know if anyone has concerns.

I asked AI how to do the nonblock write in a script and got the following:

$ echo 10G | dd of=memory.max oflag=nonblock

Shakeel
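From a program, the O_NONBLOCK variant would presumably look like the
following (a sketch of the v2 semantics discussed above, not a released
interface; note that older kernels accept O_NONBLOCK on this file
while still reclaiming synchronously, hence the feature-detection
concern earlier in the thread):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write @value (e.g. "1073741824\n" or "max\n") to the cgroup file at
 * @file, opened with O_NONBLOCK so that, under the proposed semantics,
 * the kernel stores the limit without synchronous reclaim/oom-kill. */
static int set_limit_nonblock(const char *file, const char *value)
{
	ssize_t ret;
	int fd = open(file, O_WRONLY | O_NONBLOCK);

	if (fd < 0)
		return -1;
	ret = write(fd, value, strlen(value));
	close(fd);
	return ret < 0 ? -1 : 0;
}
```

The shell one-liner above and this C helper open the file the same
way; dd's oflag=nonblock simply adds O_NONBLOCK to its open() of the
output file.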

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 8fb14ffab7d1..7b459c821afa 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1299,6 +1299,14 @@  PAGE_SIZE multiple when read back.
 	monitors the limited cgroup to alleviate heavy reclaim
 	pressure.
 
+  memory.high.nonblock
+        This is the same limit as memory.high but with different
+        behaviour for the writer of this interface. The program
+        setting the limit will not trigger reclaim synchronously if
+        usage is above the limit; instead the processes in the target
+        cgroup trigger reclaim and/or get throttled when they hit the
+        high limit.
+
   memory.max
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
@@ -1316,6 +1324,14 @@  PAGE_SIZE multiple when read back.
 	Caller could retry them differently, return into userspace
 	as -ENOMEM or silently ignore in cases like disk readahead.
 
+  memory.max.nonblock
+        This is the same limit as memory.max but with different
+        behaviour for the writer of this interface. The program
+        setting the limit will not trigger reclaim and/or oom-kill
+        synchronously if usage is above the limit; instead the
+        processes in the target cgroup trigger reclaim and/or get
+        oom-killed when they hit their max limit.
+
   memory.reclaim
 	A write-only nested-keyed file which exists for all cgroups.
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5e2ea8b8a898..6ad1464b621a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4279,6 +4279,23 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static ssize_t memory_high_nonblock_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long high;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &high);
+	if (err)
+		return err;
+
+	page_counter_set_high(&memcg->memory, high);
+	memcg_wb_domain_size_changed(memcg);
+	return nbytes;
+}
+
 static int memory_max_show(struct seq_file *m, void *v)
 {
 	return seq_puts_memcg_tunable(m,
@@ -4333,6 +4350,23 @@  static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static ssize_t memory_max_nonblock_write(struct kernfs_open_file *of,
+					 char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long max;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &max);
+	if (err)
+		return err;
+
+	xchg(&memcg->memory.max, max);
+	memcg_wb_domain_size_changed(memcg);
+	return nbytes;
+}
+
 /*
  * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
  * if any new events become available.
@@ -4557,12 +4591,24 @@  static struct cftype memory_files[] = {
 		.seq_show = memory_high_show,
 		.write = memory_high_write,
 	},
+	{
+		.name = "high.nonblock",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_high_show,
+		.write = memory_high_nonblock_write,
+	},
 	{
 		.name = "max",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = memory_max_show,
 		.write = memory_max_write,
 	},
+	{
+		.name = "max.nonblock",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_max_show,
+		.write = memory_max_nonblock_write,
+	},
 	{
 		.name = "events",
 		.flags = CFTYPE_NOT_ON_ROOT,