Message ID: 20200817140831.30260-1-longman@redhat.com (mailing list archive)
Series: memcg: Enable fine-grained per process memory control
On Mon 17-08-20 10:08:23, Waiman Long wrote:
> Memory controller can be used to control and limit the amount of
> physical memory used by a task. When a limit is set in "memory.high" in
> a v2 non-root memory cgroup, the memory controller will try to reclaim
> memory if the limit has been exceeded. Normally, that will be enough
> to keep the physical memory consumption of tasks in the memory cgroup
> to be around or below the "memory.high" limit.
>
> Sometimes, memory reclaim may not be able to recover memory in a rate
> that can catch up to the physical memory allocation rate. In this case,
> the physical memory consumption will keep on increasing. When it reaches
> "memory.max" for memory cgroup v2 or when the system is running out of
> free memory, the OOM killer will be invoked to kill some tasks to free
> up additional memory. However, one has little control of which tasks
> are going to be killed by an OOM killer. Killing tasks that hold some
> important resources without freeing them first can create other system
> problems down the road.
>
> Users who do not want the OOM killer to be invoked to kill random
> tasks in an out-of-memory situation can use the memory control
> facility provided by this new patchset via prctl(2) to better manage
> the mitigation action that needs to be performed to various tasks when
> the specified memory limit is exceeded with memory cgroup v2 being used.
>
> The currently supported mitigation actions include the followings:
>
> 1) Return ENOMEM for some syscalls that allocate or handle memory
> 2) Slow down the process for memory reclaim to catch up
> 3) Send a specific signal to the task
> 4) Kill the task
>
> The users that want better memory control for their applicatons can
> either modify their applications to call the prctl(2) syscall directly
> with the new memory control command code or write the desired action to
> the newly provided memctl procfs files of their applications provided
> that those applications run in a non-root v2 memory cgroup.

prctl is fundamentally about per-process control, while cgroup (not only
memcg) is a group-of-processes interface. How do those two interact
together? In other words, what is the semantic when different processes
have different views on the same underlying memcg event?

Also, the above description doesn't really describe any usecase which
struggles with the existing interface. We already do allow slowing the
allocator down and, along with PSI, also provide user space control over
close-to-OOM situations.
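The opt-in model being questioned above is per task, via prctl(2). A minimal
sketch of what a caller might look like is shown below; PR_SET_MEMCTL, the
MEMCTL_ action value and the argument layout are illustrative placeholders
only, not the actual constants proposed by the patchset.

/* Hypothetical illustration of the proposed per-process opt-in.  None of
 * the PR_ or MEMCTL_ names below exist upstream or necessarily match the
 * patchset; they only sketch the "task picks its own mitigation action"
 * model that the thread goes on to question.
 */
#include <stdio.h>
#include <signal.h>
#include <sys/prctl.h>

#define PR_SET_MEMCTL   0x4d43434c      /* placeholder command code */
#define MEMCTL_SIGNAL   2               /* placeholder action: send a signal */

int main(void)
{
        /* "When my cgroup v2 usage crosses an extra threshold somewhere
         * between memory.high and memory.max, send me SIGUSR1 instead of
         * letting things run into the OOM killer." */
        unsigned long threshold = 512UL << 20;  /* 512 MiB, example value */

        if (prctl(PR_SET_MEMCTL, MEMCTL_SIGNAL, threshold, SIGUSR1, 0))
                perror("prctl");        /* expected to fail on a stock kernel */
        return 0;
}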
On 8/17/20 11:26 AM, Michal Hocko wrote:
> On Mon 17-08-20 10:08:23, Waiman Long wrote:
>> Memory controller can be used to control and limit the amount of
>> physical memory used by a task. [...]
>
> prctl is fundamentally about per-process control, while cgroup (not only
> memcg) is a group-of-processes interface. How do those two interact
> together? In other words, what is the semantic when different processes
> have different views on the same underlying memcg event?

As said in a previous mail, this patchset is derived from a customer
request, and per-process control is exactly what the customer wants. That
is why prctl() is used. This patchset is intended to supplement the
existing memory cgroup features. Processes in a memory cgroup that don't
use this new API will behave exactly like before. Only processes that opt
to use this new API will have additional mitigation actions applied to
them in case the additional limits are reached.

> Also, the above description doesn't really describe any usecase which
> struggles with the existing interface. We already do allow slowing the
> allocator down and, along with PSI, also provide user space control over
> close-to-OOM situations.

The customer that requested it was using Solaris. Solaris does allow
per-process memory control, and they have tools that rely on this
capability. This patchset will help them migrate off Solaris more easily.
I will look closer into how PSI can help here.

Thanks,
Longman
On Mon 17-08-20 11:55:37, Waiman Long wrote:
> On 8/17/20 11:26 AM, Michal Hocko wrote:
> > On Mon 17-08-20 10:08:23, Waiman Long wrote:
> > > Memory controller can be used to control and limit the amount of
> > > physical memory used by a task. [...]
> >
> > prctl is fundamentally about per-process control, while cgroup (not only
> > memcg) is a group-of-processes interface. How do those two interact
> > together? In other words, what is the semantic when different processes
> > have different views on the same underlying memcg event?
>
> As said in a previous mail, this patchset is derived from a customer
> request, and per-process control is exactly what the customer wants. That
> is why prctl() is used. This patchset is intended to supplement the
> existing memory cgroup features. Processes in a memory cgroup that don't
> use this new API will behave exactly like before. Only processes that opt
> to use this new API will have additional mitigation actions applied to
> them in case the additional limits are reached.

Please keep in mind that you are proposing a new user API that we will
have to maintain forever. That requires that the interface is consistent
and well defined. As I've said, the fundamental problem with this
interface is that you are trying to hammer a process-centric interface
into a framework that is fundamentally process-group oriented. Maybe there
is a sensible way to do that without all sorts of weird corner cases, but
I haven't seen any of that explained here.

Really, just try to describe the semantic when two different tasks in the
same memcg have a different opinion on the same event. One wants ENOMEM
and the other a specific signal to be delivered. Right now the behavior
will be timing specific because who hits the oom path is non-deterministic
from the userspace POV. Let's say that you can somehow handle that; now
how are you going to implement ENOMEM for any context other than the
current task? I am pretty sure the more specific the questions get, the
more awkward this will become.
On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: > Memory controller can be used to control and limit the amount of > physical memory used by a task. When a limit is set in "memory.high" in > a v2 non-root memory cgroup, the memory controller will try to reclaim > memory if the limit has been exceeded. Normally, that will be enough > to keep the physical memory consumption of tasks in the memory cgroup > to be around or below the "memory.high" limit. > > Sometimes, memory reclaim may not be able to recover memory in a rate > that can catch up to the physical memory allocation rate. In this case, > the physical memory consumption will keep on increasing. Then slow down the allocator? That's what we do for dirty pages too, we slow down the dirtier when we run against the limits.
On Tue 18-08-20 11:14:53, Peter Zijlstra wrote: > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: > > Memory controller can be used to control and limit the amount of > > physical memory used by a task. When a limit is set in "memory.high" in > > a v2 non-root memory cgroup, the memory controller will try to reclaim > > memory if the limit has been exceeded. Normally, that will be enough > > to keep the physical memory consumption of tasks in the memory cgroup > > to be around or below the "memory.high" limit. > > > > Sometimes, memory reclaim may not be able to recover memory in a rate > > that can catch up to the physical memory allocation rate. In this case, > > the physical memory consumption will keep on increasing. > > Then slow down the allocator? That's what we do for dirty pages too, we > slow down the dirtier when we run against the limits. This is what we actually do. Have a look at mem_cgroup_handle_over_high.
peterz@infradead.org writes: >On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: >> Memory controller can be used to control and limit the amount of >> physical memory used by a task. When a limit is set in "memory.high" in >> a v2 non-root memory cgroup, the memory controller will try to reclaim >> memory if the limit has been exceeded. Normally, that will be enough >> to keep the physical memory consumption of tasks in the memory cgroup >> to be around or below the "memory.high" limit. >> >> Sometimes, memory reclaim may not be able to recover memory in a rate >> that can catch up to the physical memory allocation rate. In this case, >> the physical memory consumption will keep on increasing. > >Then slow down the allocator? That's what we do for dirty pages too, we >slow down the dirtier when we run against the limits. We already do that since v5.4. I'm wondering whether Waiman's customer is just running with a too-old kernel without 0e4b01df865 ("mm, memcg: throttle allocators when failing reclaim over memory.high") backported.
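For anyone reproducing the setup under discussion, the knobs involved are
plain cgroup v2 files. A minimal sketch follows, assuming cgroup2 is mounted
at /sys/fs/cgroup and that creating a child group named "demo" is permitted
(typically requires root or proper delegation); paths and sizes are examples.

/* Sketch: create a cgroup v2 group, set the memory.high limit whose
 * over-limit throttling is being discussed plus a memory.max backstop,
 * and move the current process into it.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0)
                return -1;
        if (write(fd, val, strlen(val)) < 0) {
                close(fd);
                return -1;
        }
        return close(fd);
}

int main(void)
{
        char pid[32];

        mkdir("/sys/fs/cgroup/demo", 0755);                    /* ignore EEXIST */
        write_str("/sys/fs/cgroup/demo/memory.high", "100M");  /* throttle point */
        write_str("/sys/fs/cgroup/demo/memory.max", "200M");   /* hard limit / OOM */

        snprintf(pid, sizeof(pid), "%d", getpid());
        if (write_str("/sys/fs/cgroup/demo/cgroup.procs", pid))
                perror("cgroup.procs");
        return 0;
}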
On Tue, Aug 18, 2020 at 11:26:17AM +0200, Michal Hocko wrote: > On Tue 18-08-20 11:14:53, Peter Zijlstra wrote: > > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: > > > Memory controller can be used to control and limit the amount of > > > physical memory used by a task. When a limit is set in "memory.high" in > > > a v2 non-root memory cgroup, the memory controller will try to reclaim > > > memory if the limit has been exceeded. Normally, that will be enough > > > to keep the physical memory consumption of tasks in the memory cgroup > > > to be around or below the "memory.high" limit. > > > > > > Sometimes, memory reclaim may not be able to recover memory in a rate > > > that can catch up to the physical memory allocation rate. In this case, > > > the physical memory consumption will keep on increasing. > > > > Then slow down the allocator? That's what we do for dirty pages too, we > > slow down the dirtier when we run against the limits. > > This is what we actually do. Have a look at mem_cgroup_handle_over_high. But then how can it run-away like Waiman suggested? /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES. That's a fail... :-(
On Tue, Aug 18, 2020 at 10:27:37AM +0100, Chris Down wrote: > peterz@infradead.org writes: > > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: > > > Memory controller can be used to control and limit the amount of > > > physical memory used by a task. When a limit is set in "memory.high" in > > > a v2 non-root memory cgroup, the memory controller will try to reclaim > > > memory if the limit has been exceeded. Normally, that will be enough > > > to keep the physical memory consumption of tasks in the memory cgroup > > > to be around or below the "memory.high" limit. > > > > > > Sometimes, memory reclaim may not be able to recover memory in a rate > > > that can catch up to the physical memory allocation rate. In this case, > > > the physical memory consumption will keep on increasing. > > > > Then slow down the allocator? That's what we do for dirty pages too, we > > slow down the dirtier when we run against the limits. > > We already do that since v5.4. I'm wondering whether Waiman's customer is > just running with a too-old kernel without 0e4b01df865 ("mm, memcg: throttle > allocators when failing reclaim over memory.high") backported. That commit is fundamentally broken, it doesn't guarantee anything. Please go read how the dirty throttling works (unless people wrecked that since..).
On Tue 18-08-20 11:59:10, Peter Zijlstra wrote: > On Tue, Aug 18, 2020 at 11:26:17AM +0200, Michal Hocko wrote: > > On Tue 18-08-20 11:14:53, Peter Zijlstra wrote: > > > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: > > > > Memory controller can be used to control and limit the amount of > > > > physical memory used by a task. When a limit is set in "memory.high" in > > > > a v2 non-root memory cgroup, the memory controller will try to reclaim > > > > memory if the limit has been exceeded. Normally, that will be enough > > > > to keep the physical memory consumption of tasks in the memory cgroup > > > > to be around or below the "memory.high" limit. > > > > > > > > Sometimes, memory reclaim may not be able to recover memory in a rate > > > > that can catch up to the physical memory allocation rate. In this case, > > > > the physical memory consumption will keep on increasing. > > > > > > Then slow down the allocator? That's what we do for dirty pages too, we > > > slow down the dirtier when we run against the limits. > > > > This is what we actually do. Have a look at mem_cgroup_handle_over_high. > > But then how can it run-away like Waiman suggested? As Chris mentioned in other reply. This functionality is quite new. > /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES. We can certainly tune a different backoff delays but I suspect this is not the problem here. > That's a fail... :-(
peterz@infradead.org writes: >But then how can it run-away like Waiman suggested? Probably because he's not running with that commit at all. We and others use this to prevent runaway allocation on a huge range of production and desktop use cases and it works just fine. >/me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES. > >That's a fail... :-( I'd ask that you understand a bit more about the tradeoffs and intentions of the patch before rushing in to declare its failure, considering it works just fine :-) Clamping the maximal time allows the application to take some action to remediate the situation, while still being slowed down significantly. 2 seconds per allocation batch is still absolutely plenty for any use case I've come across. If you have evidence it isn't, then present that instead of vague notions of "wrongness".
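To make the behavior being argued about concrete, below is a small userspace
model of the clamped penalty curve: the sleep grows steeply with the relative
overage above memory.high and is capped at the 2 seconds per allocation batch
that Chris mentions. The constants and arithmetic are illustrative only; the
real logic lives in mem_cgroup_handle_over_high() and its helpers in
mm/memcontrol.c, which additionally scales the penalty by the size of the
charge batch and forgives very small penalties.

/* Toy model of the memory.high over-limit penalty discussed above.
 * Illustrative constants; not a copy of the kernel code.
 */
#include <stdint.h>
#include <stdio.h>

#define HZ              1000            /* jiffies per second in this model */
#define MAX_DELAY       (2 * HZ)        /* the 2s clamp per allocation batch */
#define PRECISION_SHIFT 20              /* fixed-point precision (model) */
#define SCALING_SHIFT   14              /* softens small overages (model) */

static uint64_t penalty_jiffies(uint64_t usage, uint64_t high)
{
        if (!high || usage <= high)
                return 0;

        /* relative overage, (usage - high) / high, in fixed point */
        uint64_t overage = ((usage - high) << PRECISION_SHIFT) / high;

        /* quadratic in the overage: tiny excursions are nearly free,
         * runaway growth is punished hard */
        uint64_t penalty = overage * overage * HZ;

        penalty >>= PRECISION_SHIFT;
        penalty >>= SCALING_SHIFT;

        return penalty < MAX_DELAY ? penalty : MAX_DELAY;
}

int main(void)
{
        const uint64_t high = 1 << 18;  /* e.g. a limit of 2^18 pages */

        for (int pct = 100; pct <= 130; pct += 5)
                printf("usage at %3d%% of high -> sleep %4llu ms per batch\n",
                       pct, (unsigned long long)
                       (penalty_jiffies(high * pct / 100, high) * 1000 / HZ));
        return 0;
}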
On Tue, Aug 18, 2020 at 12:05:16PM +0200, Michal Hocko wrote:
> > But then how can it run-away like Waiman suggested?
>
> As Chris mentioned in other reply. This functionality is quite new.
>
> > /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES.
>
> We can certainly tune a different backoff delays but I suspect this is
> not the problem here.

Tuning? That thing needs throwing out, it's fundamentally buggered. Why
didn't anybody look at how the I/O dirtying thing works first?

What you need is a feedback loop against the rate of freeing pages, and
when you near the saturation point, the allocation rate should exactly
match the freeing rate.

But this thing has nothing whatsoever like that.
On Tue, Aug 18, 2020 at 11:17:56AM +0100, Chris Down wrote: > I'd ask that you understand a bit more about the tradeoffs and intentions of > the patch before rushing in to declare its failure, considering it works > just fine :-) > > Clamping the maximal time allows the application to take some action to > remediate the situation, while still being slowed down significantly. 2 > seconds per allocation batch is still absolutely plenty for any use case > I've come across. If you have evidence it isn't, then present that instead > of vague notions of "wrongness". There is no feedback from the freeing rate, therefore it cannot be correct in maintaining a maximum amount of pages. 0.5 pages / sec is still non-zero, and if the free rate is 0, you'll crawl across whatever limit was set without any bounds. This is math 101. It's true that I haven't been paying attention to mm in a while, but I was one of the original authors of the I/O dirty balancing, I do think I understand how these things work.
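For what it's worth, the control scheme Peter is describing can be modeled in
a few lines: measure the recent freeing rate and, as usage approaches the
limit, pace the allocator toward that rate, so that with no freeing the
allowed allocation rate goes to zero instead of crawling past the limit. This
is only a toy illustration of the idea (the blend thresholds are arbitrary)
and not a description of what memcg currently does; the dirty throttling in
mm/page-writeback.c is the real-world reference.

/* Toy rate-matching throttle: not kernel code, just the feedback idea. */
#include <stdio.h>

struct state {
        double free_rate;       /* pages/sec freed recently (measured) */
        double usage;           /* pages currently charged */
        double limit;           /* the memory.high equivalent */
};

/* Allocation rate the task should be paced to. */
static double allowed_alloc_rate(const struct state *s, double full_speed)
{
        double fill = s->usage / s->limit;

        if (fill < 0.75)                /* far from the limit: no throttling */
                return full_speed;
        if (fill >= 1.0)                /* at/over the limit: rate-match */
                return s->free_rate;

        /* blend from full speed at 75% toward the freeing rate at 100% */
        double w = (fill - 0.75) / 0.25;
        return (1.0 - w) * full_speed + w * s->free_rate;
}

int main(void)
{
        struct state s = { .free_rate = 1000.0, .limit = 262144.0 };

        for (int pct = 50; pct <= 110; pct += 10) {
                s.usage = s.limit * pct / 100.0;
                printf("fill=%3d%% -> paced to %6.0f pages/s (free rate %.0f)\n",
                       pct, allowed_alloc_rate(&s, 50000.0), s.free_rate);
        }
        return 0;
}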
On Tue 18-08-20 12:18:44, Peter Zijlstra wrote:
> On Tue, Aug 18, 2020 at 12:05:16PM +0200, Michal Hocko wrote:
> > > But then how can it run-away like Waiman suggested?
> >
> > As Chris mentioned in other reply. This functionality is quite new.
> >
> > > /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES.
> >
> > We can certainly tune a different backoff delays but I suspect this is
> > not the problem here.
>
> Tuning? That thing needs throwing out, it's fundamentally buggered. Why
> didn't anybody look at how the I/O dirtying thing works first?
>
> What you need is a feedback loop against the rate of freeing pages, and
> when you near the saturation point, the allocation rate should exactly
> match the freeing rate.
>
> But this thing has nothing whatsoever like that.

Existing usecases seem to be doing fine with the existing implementation.
If we find out that this is insufficient then we can work on that, but I
believe this is tangential to this email thread. There are no indications
that the current implementation doesn't throttle enough. The proposal also
aims at a much richer interface to define the oom behavior.
peterz@infradead.org writes:
> On Tue, Aug 18, 2020 at 11:17:56AM +0100, Chris Down wrote:
> > I'd ask that you understand a bit more about the tradeoffs and
> > intentions of the patch before rushing in to declare its failure [...]
>
> There is no feedback from the freeing rate, therefore it cannot be
> correct in maintaining a maximum amount of pages.

memory.high is not about maintaining a maximum amount of pages. It's
strictly best-effort, and the ramifications of a breach are typically
fundamentally different than for dirty throttling.

> 0.5 pages / sec is still non-zero, and if the free rate is 0, you'll
> crawl across whatever limit was set without any bounds. This is math
> 101.
>
> It's true that I haven't been paying attention to mm in a while, but I
> was one of the original authors of the I/O dirty balancing, I do think I
> understand how these things work.

You're suggesting we replace a well-understood, easy-to-reason-about model
with something non-trivially more complex, all on the back of you
suggesting that the current approach is "wrong" without any evidence or
quantification.

Peter, we're not going to throw out perfectly functional memcg code simply
because of your say-so, especially when you've not asked for information
or context about the tradeoffs involved, or presented any evidence that
something perverse is actually happening. Prescribing a specific solution
modelled on some other code path here without producing evidence or
measurements specific to the nuances of this particular endpoint is not a
recipe for success.
On Tue, Aug 18, 2020 at 12:30:59PM +0200, Michal Hocko wrote:
> The proposal also aims at a much richer interface to define the
> oom behavior.

Oh yeah, I'm not defending any of that prctl() nonsense. Just saying that
from a math / control theory point of view, the current thing is an
abhorrent failure.
On Tue, Aug 18, 2020 at 12:04:44PM +0200, peterz@infradead.org wrote: > On Tue, Aug 18, 2020 at 10:27:37AM +0100, Chris Down wrote: > > peterz@infradead.org writes: > > > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: > > > > Memory controller can be used to control and limit the amount of > > > > physical memory used by a task. When a limit is set in "memory.high" in > > > > a v2 non-root memory cgroup, the memory controller will try to reclaim > > > > memory if the limit has been exceeded. Normally, that will be enough > > > > to keep the physical memory consumption of tasks in the memory cgroup > > > > to be around or below the "memory.high" limit. > > > > > > > > Sometimes, memory reclaim may not be able to recover memory in a rate > > > > that can catch up to the physical memory allocation rate. In this case, > > > > the physical memory consumption will keep on increasing. > > > > > > Then slow down the allocator? That's what we do for dirty pages too, we > > > slow down the dirtier when we run against the limits. > > > > We already do that since v5.4. I'm wondering whether Waiman's customer is > > just running with a too-old kernel without 0e4b01df865 ("mm, memcg: throttle > > allocators when failing reclaim over memory.high") backported. > > That commit is fundamentally broken, it doesn't guarantee anything. > > Please go read how the dirty throttling works (unless people wrecked > that since..). Of course they did. https://lore.kernel.org/linux-mm/ce7975cd-6353-3f29-b52c-7a81b1d07caa@kernel.dk/
On Tue, Aug 18, 2020 at 12:18:44PM +0200, peterz@infradead.org wrote: > What you need is a feeback loop against the rate of freeing pages, and > when you near the saturation point, the allocation rate should exactly > match the freeing rate. IO throttling solves a slightly different problem. IO occurs in parallel to the workload's execution stream, and you're trying to take the workload from dirtying at CPU speed to rate match to the independent IO stream. With memory allocations, though, freeing happens from inside the execution stream of the workload. If you throttle allocations, you're most likely throttling the freeing rate as well. And you'll slow down reclaim scanning by the same amount as the page references, so it's not making reclaim more successful either. The alloc/use/free (im)balance is an inherent property of the workload, regardless of the speed you're executing it at. So the goal here is different. We're not trying to pace the workload into some form of sustainability. Rather, it's for OOM handling. When we detect the workload's alloc/use/free pattern is unsustainable given available memory, we slow it down just enough to allow userspace to implement OOM policy and job priorities (on containerized hosts these tend to be too complex to express in the kernel's oom scoring system). The exponential curve makes it look like we're trying to do some type of feedback system, but it's really only to let minor infractions pass and throttle unsustainable expansion ruthlessly. Drop-behind reclaim can be a bit bumpy because we batch on the allocation side as well as on the reclaim side, hence the fuzz factor there.
On 8/17/20 3:26 PM, Michal Hocko wrote:
> On Mon 17-08-20 11:55:37, Waiman Long wrote:
>> [...]
>
> Please keep in mind that you are proposing a new user API that we will
> have to maintain forever. That requires that the interface is consistent
> and well defined. As I've said, the fundamental problem with this
> interface is that you are trying to hammer a process-centric interface
> into a framework that is fundamentally process-group oriented. Maybe there
> is a sensible way to do that without all sorts of weird corner cases, but
> I haven't seen any of that explained here.
>
> Really, just try to describe the semantic when two different tasks in the
> same memcg have a different opinion on the same event. One wants ENOMEM
> and the other a specific signal to be delivered. Right now the behavior
> will be timing specific because who hits the oom path is non-deterministic
> from the userspace POV. Let's say that you can somehow handle that; now
> how are you going to implement ENOMEM for any context other than the
> current task? I am pretty sure the more specific the questions get, the
> more awkward this will become.

The basic idea is to trigger a user-specified memory-over-high mitigation
when the actual memory usage exceeds a threshold which is supposed to be
between "high" and "max". The additional limit that is passed in is for
setting this additional threshold. We want to avoid OOM at all costs.

The ENOMEM error may not be suitable for all applications, as some of them
may not be able to handle ENOMEM gracefully; that action is for
applications that are designed to handle it.

Cheers,
Longman
On 8/18/20 5:14 AM, peterz@infradead.org wrote:
> On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
>> Memory controller can be used to control and limit the amount of
>> physical memory used by a task. [...]
>
> Then slow down the allocator? That's what we do for dirty pages too, we
> slow down the dirtier when we run against the limits.

I missed that allocator throttling is already done in upstream code. So I
will need to reexamine whether this patchset is necessary or not.

Thanks,
Longman
On 8/18/20 5:27 AM, Chris Down wrote:
> peterz@infradead.org writes:
>> On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
>>> Memory controller can be used to control and limit the amount of
>>> physical memory used by a task. [...]
>>
>> Then slow down the allocator? That's what we do for dirty pages too, we
>> slow down the dirtier when we run against the limits.
>
> We already do that since v5.4. I'm wondering whether Waiman's customer
> is just running with a too-old kernel without 0e4b01df865 ("mm, memcg:
> throttle allocators when failing reclaim over memory.high") backported.

The fact is that we don't have that in RHEL8 yet, and cgroup v2 is still
not the default at the moment. I am planning to backport the throttling
patches to RHEL and hopefully we can switch to cgroup v2 soon.

Cheers,
Longman
On Tue, Aug 18, 2020 at 01:55:59PM +0100, Matthew Wilcox wrote:
> On Tue, Aug 18, 2020 at 12:04:44PM +0200, peterz@infradead.org wrote:
> > [...]
> >
> > That commit is fundamentally broken, it doesn't guarantee anything.
> >
> > Please go read how the dirty throttling works (unless people wrecked
> > that since..).
>
> Of course they did.
>
> https://lore.kernel.org/linux-mm/ce7975cd-6353-3f29-b52c-7a81b1d07caa@kernel.dk/

Different thing. That's memory reclaim throttling, not dirty page
throttling. balance_dirty_pages() still works just fine as it does not
look at device congestion. The page cleaning rate is accounted in
test_clear_page_writeback(), and the page dirtying rate is accounted
directly in balance_dirty_pages(). That feedback loop has not been
broken...

And I completely agree with Peter here - the control theory we applied to
the dirty throttling problem is still 100% valid and so the algorithm
still just works all these years later. I've only been saying that
allocation should use the same feedback model for reclaim throttling
since ~2011...

Cheers,

Dave.
On Tue, Aug 18, 2020 at 09:49:00AM -0400, Johannes Weiner wrote: > On Tue, Aug 18, 2020 at 12:18:44PM +0200, peterz@infradead.org wrote: > > What you need is a feeback loop against the rate of freeing pages, and > > when you near the saturation point, the allocation rate should exactly > > match the freeing rate. > > IO throttling solves a slightly different problem. > > IO occurs in parallel to the workload's execution stream, and you're > trying to take the workload from dirtying at CPU speed to rate match > to the independent IO stream. > > With memory allocations, though, freeing happens from inside the > execution stream of the workload. If you throttle allocations, you're For a single task, but even then you're making the argument that we need to allocate memory to free memory, and we all know where that gets us. But we're actually talking about a cgroup here, which is a collection of tasks all doing things in parallel. > most likely throttling the freeing rate as well. And you'll slow down > reclaim scanning by the same amount as the page references, so it's > not making reclaim more successful either. The alloc/use/free > (im)balance is an inherent property of the workload, regardless of the > speed you're executing it at. Arguably seeing the rate drop to near 0 is a very good point to consider running cgroup-OOM.
On 8/18/20 6:17 AM, Chris Down wrote:
> peterz@infradead.org writes:
>> But then how can it run-away like Waiman suggested?
>
> Probably because he's not running with that commit at all. We and
> others use this to prevent runaway allocation on a huge range of
> production and desktop use cases and it works just fine.
>
>> /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES.
>>
>> That's a fail... :-(
>
> I'd ask that you understand a bit more about the tradeoffs and
> intentions of the patch before rushing in to declare its failure,
> considering it works just fine :-)
>
> Clamping the maximal time allows the application to take some action
> to remediate the situation, while still being slowed down
> significantly. 2 seconds per allocation batch is still absolutely
> plenty for any use case I've come across. If you have evidence it
> isn't, then present that instead of vague notions of "wrongness".

Sorry for the late reply. I ran some tests on the latest kernel and it
seems to work as expected. I was running the tests on an older kernel
that doesn't have this patch, and I was not aware of it beforehand.

Sorry for the confusion.

Cheers,
Longman
On Fri, Aug 21, 2020 at 09:37:16PM +0200, Peter Zijlstra wrote: > On Tue, Aug 18, 2020 at 09:49:00AM -0400, Johannes Weiner wrote: > > On Tue, Aug 18, 2020 at 12:18:44PM +0200, peterz@infradead.org wrote: > > > What you need is a feeback loop against the rate of freeing pages, and > > > when you near the saturation point, the allocation rate should exactly > > > match the freeing rate. > > > > IO throttling solves a slightly different problem. > > > > IO occurs in parallel to the workload's execution stream, and you're > > trying to take the workload from dirtying at CPU speed to rate match > > to the independent IO stream. > > > > With memory allocations, though, freeing happens from inside the > > execution stream of the workload. If you throttle allocations, you're > > For a single task, but even then you're making the argument that we need > to allocate memory to free memory, and we all know where that gets us. > > But we're actually talking about a cgroup here, which is a collection of > tasks all doing things in parallel. Right, but sharing a memory cgroup means sharing an LRU list, and that transfers memory pressure and allocation burden between otherwise independent tasks - if nothing else through cache misses on the executables and libraries. I doubt that one task can go through several comprehensive reclaim cycles on a shared LRU without completely annihilating the latency or throughput targets of everybody else in the group in most real world applications. > > most likely throttling the freeing rate as well. And you'll slow down > > reclaim scanning by the same amount as the page references, so it's > > not making reclaim more successful either. The alloc/use/free > > (im)balance is an inherent property of the workload, regardless of the > > speed you're executing it at. > > Arguably seeing the rate drop to near 0 is a very good point to consider > running cgroup-OOM. Agreed. In the past, that's actually what we did: In cgroup1, you could disable the kernel OOM killer, and when reclaim failed at the limit, the allocating task would be put on a waitqueue until woken up by a freeing event. Conceptually this is clean & straight-forward. However, 1. Putting allocation contexts with unknown locks to indefinite sleep caused deadlocks, for obvious reasons. Userspace OOM killing tends to take a lot of task-specific locks when scanning through /proc files for kill candidates, and can easily get stuck. Using bounded over indefinite waits is simply acknowledging that the deadlock potential when connecting arbitrary task stacks in the system through free->alloc ordering is equally difficult to plan out as alloc->free ordering. The non-cgroup OOM killer actually has the same deadlock potential, where the allocating/killing task can hold resources that the OOM victim requires to exit. The OOM reaper hides it, the static emergency reserves hide it - but to truly solve this problem, you would have to have full knowledge of memory & lock ordering dependencies of those tasks. And then can still end up with scenarios where the only answer is panic(). 2. I don't recall ever seeing situations in cgroup1 where the precise matching of allocation rate to freeing rate has allowed cgroups to run sustainably after reclaim has failed. The practical benefit of a complicated feedback loop over something crude & robust once we're in an OOM situation is not apparent to me. [ That's different from the IO-throttling *while still doing reclaim* that Dave brought up. 
*That* justifies the same effort we put into dirty throttling. I'm only talking about the situation where reclaim has already failed and we need to facilitate userspace OOM handling. ] So that was the motivation for the bounded sleeps. They do not guarantee containment, but they provide a reasonable amount of time for the userspace OOM handler to intervene, without deadlocking. That all being said, the semantics of the new 'high' limit in cgroup2 have allowed us to move reclaim/limit enforcement out of the allocation context and into the userspace return path. See the call to mem_cgroup_handle_over_high() from tracehook_notify_resume(), and the comments in try_charge() around set_notify_resume(). This already solves the free->alloc ordering problem by allowing the allocation to exceed the limit temporarily until at least all locks are dropped, we know we can sleep etc., before performing enforcement. That means we may not need the timed sleeps anymore for that purpose, and could bring back directed waits for freeing-events again. What do you think? Any hazards around indefinite sleeps in that resume path? It's called before __rseq_handle_notify_resume and the arch-specific resume callback (which appears to be a no-op currently). Chris, Michal, what are your thoughts? It would certainly be simpler conceptually on the memcg side.
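To make the hook points Johannes names easier to follow, here is a heavily
condensed paraphrase of how the deferral looked in kernels of that era. It is
not verbatim kernel code, and the over_high() predicate is an invented
stand-in for the actual usage-versus-high checks in try_charge().

/* Condensed paraphrase (not verbatim) of the deferral described above. */

/* mm/memcontrol.c: the charge path, which may run under arbitrary locks */
static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
                      unsigned int nr_pages)
{
        /* ... charge the page_counter, try direct reclaim, etc. ... */

        if (over_high(memcg)) {                 /* invented helper name */
                /* record the debt ... */
                current->memcg_nr_pages_over_high += nr_pages;
                /* ... and ask for a callback at a safe point */
                set_notify_resume(current);
        }
        return 0;
}

/* include/linux/tracehook.h: runs on the way back to userspace, where no
 * kernel locks are held and sleeping is safe */
static inline void tracehook_notify_resume(struct pt_regs *regs)
{
        /* ... task_work etc. ... */
        mem_cgroup_handle_over_high();  /* reclaim and/or apply the penalty sleep */
}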
Johannes Weiner writes: >That all being said, the semantics of the new 'high' limit in cgroup2 >have allowed us to move reclaim/limit enforcement out of the >allocation context and into the userspace return path. > >See the call to mem_cgroup_handle_over_high() from >tracehook_notify_resume(), and the comments in try_charge() around >set_notify_resume(). > >This already solves the free->alloc ordering problem by allowing the >allocation to exceed the limit temporarily until at least all locks >are dropped, we know we can sleep etc., before performing enforcement. > >That means we may not need the timed sleeps anymore for that purpose, >and could bring back directed waits for freeing-events again. > >What do you think? Any hazards around indefinite sleeps in that resume >path? It's called before __rseq_handle_notify_resume and the >arch-specific resume callback (which appears to be a no-op currently). > >Chris, Michal, what are your thoughts? It would certainly be simpler >conceptually on the memcg side. I'm not against that, although I personally don't feel very strongly about it either way, since the current behaviour clearly works in practice.
[Sorry, this slipped through the cracks]

On Mon 24-08-20 12:58:50, Johannes Weiner wrote:
> On Fri, Aug 21, 2020 at 09:37:16PM +0200, Peter Zijlstra wrote:
> [...]
> > Arguably seeing the rate drop to near 0 is a very good point to consider
> > running cgroup-OOM.
>
> Agreed. In the past, that's actually what we did: In cgroup1, you
> could disable the kernel OOM killer, and when reclaim failed at the
> limit, the allocating task would be put on a waitqueue until woken up
> by a freeing event. Conceptually this is clean & straight-forward.
>
> However,
>
> 1. Putting allocation contexts with unknown locks to indefinite sleep
>    caused deadlocks, for obvious reasons. [...] And then can still end
>    up with scenarios where the only answer is panic().

Yes. Even killing all eligible tasks is not guaranteed to help the
situation, because a) resources might not be bound to a process lifetime
(e.g. tmpfs), or b) an ineligible task might be holding resources that
block others from doing the proper cleanup. The OOM reaper is here to make
sure we reclaim some of the address space of the victim, and we go over
all eligible tasks rather than getting stuck at the first victim forever.

> 2. I don't recall ever seeing situations in cgroup1 where the precise
>    matching of allocation rate to freeing rate has allowed cgroups to
>    run sustainably after reclaim has failed. The practical benefit of
>    a complicated feedback loop over something crude & robust once
>    we're in an OOM situation is not apparent to me.

Yes, the usual outcome is to go OOM and kill something. Running on the
very edge of the (memcg) OOM doesn't tend to be sustainable, and I am not
sure it makes sense to optimize for.

> [ That's different from the IO-throttling *while still doing
>   reclaim* that Dave brought up. [...] ]
>
> So that was the motivation for the bounded sleeps. They do not
> guarantee containment, but they provide a reasonable amount of time
> for the userspace OOM handler to intervene, without deadlocking.

Yes, memory.high is mostly a best-effort containment. We do have the hard
limit to put a stop to runaways, or if you are watching PSI then the high
limit throttling would give you enough of a signal to take action from
userspace.

> That all being said, the semantics of the new 'high' limit in cgroup2
> have allowed us to move reclaim/limit enforcement out of the
> allocation context and into the userspace return path.
>
> See the call to mem_cgroup_handle_over_high() from
> tracehook_notify_resume(), and the comments in try_charge() around
> set_notify_resume(). [...]
>
> That means we may not need the timed sleeps anymore for that purpose,
> and could bring back directed waits for freeing-events again.
>
> What do you think? Any hazards around indefinite sleeps in that resume
> path? [...]
>
> Chris, Michal, what are your thoughts? It would certainly be simpler
> conceptually on the memcg side.

I would need a more specific description. But as I've already said, it
doesn't seem that we need to fix any practical problem here. The high
limit implementation has changed quite a lot recently. I would rather see
it settle for a while and see how it behaves in a wider variety of
workloads before changing the implementation again.
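Since PSI comes up here as the userspace signal for taking action before
running into memory.max, a minimal watcher sketch is shown below. The cgroup
path and the avg10 threshold are arbitrary example choices; the line being
parsed is the standard "some avg10=... avg60=... avg300=... total=..." format
of the cgroup v2 memory.pressure file.

/* Sketch: poll a cgroup's memory.pressure (PSI) and let userspace decide
 * what to do when the group stalls on memory, e.g. near memory.high.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const char *psi = "/sys/fs/cgroup/demo/memory.pressure"; /* example path */

        for (;;) {
                char line[256];
                double avg10 = 0.0;
                FILE *f = fopen(psi, "r");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                /* first line: "some avg10=X avg60=Y avg300=Z total=N" */
                if (fgets(line, sizeof(line), f))
                        sscanf(line, "some avg10=%lf", &avg10);
                fclose(f);

                if (avg10 > 10.0)       /* arbitrary example threshold */
                        fprintf(stderr, "memory pressure %.1f%%: shed load or "
                                "pick a victim here\n", avg10);
                sleep(2);
        }
}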