Message ID: 20200817140831.30260-1-longman@redhat.com (mailing list archive)
Series: memcg: Enable fine-grained per process memory control
On Mon 17-08-20 10:08:23, Waiman Long wrote:
> Memory controller can be used to control and limit the amount of
> physical memory used by a task. When a limit is set in "memory.high" in
> a v2 non-root memory cgroup, the memory controller will try to reclaim
> memory if the limit has been exceeded. Normally, that will be enough
> to keep the physical memory consumption of tasks in the memory cgroup
> to be around or below the "memory.high" limit.
>
> Sometimes, memory reclaim may not be able to recover memory in a rate
> that can catch up to the physical memory allocation rate. In this case,
> the physical memory consumption will keep on increasing. When it reaches
> "memory.max" for memory cgroup v2 or when the system is running out of
> free memory, the OOM killer will be invoked to kill some tasks to free
> up additional memory. However, one has little control of which tasks
> are going to be killed by an OOM killer. Killing tasks that hold some
> important resources without freeing them first can create other system
> problems down the road.
>
> Users who do not want the OOM killer to be invoked to kill random
> tasks in an out-of-memory situation can use the memory control
> facility provided by this new patchset via prctl(2) to better manage
> the mitigation action that needs to be performed to various tasks when
> the specified memory limit is exceeded with memory cgroup v2 being used.
>
> The currently supported mitigation actions include the followings:
>
> 1) Return ENOMEM for some syscalls that allocate or handle memory
> 2) Slow down the process for memory reclaim to catch up
> 3) Send a specific signal to the task
> 4) Kill the task
>
> The users that want better memory control for their applicatons can
> either modify their applications to call the prctl(2) syscall directly
> with the new memory control command code or write the desired action to
> the newly provided memctl procfs files of their applications provided
> that those applications run in a non-root v2 memory cgroup.

prctl is fundamentally about per-process control, while cgroup (not only
memcg) is a group-of-processes interface. How do those two interact
together? In other words, what is the semantic when different processes
have different views on the same underlying memcg event?

Also, the above description doesn't really describe any usecase which
struggles with the existing interface. We already do allow slowing the
allocator down and, along with PSI, also provide user space control over
close-to-OOM situations.
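The opt-in model being questioned above is per task, via prctl(2). A minimal
sketch of what a caller might look like is shown below; PR_SET_MEMCTL, the
MEMCTL_ action value and the argument layout are illustrative placeholders
only, not the actual constants proposed by the patchset.

/* Hypothetical illustration of the proposed per-process opt-in.  None of
 * the PR_ or MEMCTL_ names below exist upstream or necessarily match the
 * patchset; they only sketch the "task picks its own mitigation action"
 * model that the thread goes on to question.
 */
#include <stdio.h>
#include <signal.h>
#include <sys/prctl.h>

#define PR_SET_MEMCTL   0x4d43434c      /* placeholder command code */
#define MEMCTL_SIGNAL   2               /* placeholder action: send a signal */

int main(void)
{
        /* "When my cgroup v2 usage crosses an extra threshold somewhere
         * between memory.high and memory.max, send me SIGUSR1 instead of
         * letting things run into the OOM killer." */
        unsigned long threshold = 512UL << 20;  /* 512 MiB, example value */

        if (prctl(PR_SET_MEMCTL, MEMCTL_SIGNAL, threshold, SIGUSR1, 0))
                perror("prctl");        /* expected to fail on a stock kernel */
        return 0;
}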
On 8/17/20 11:26 AM, Michal Hocko wrote:
> On Mon 17-08-20 10:08:23, Waiman Long wrote:
>> Memory controller can be used to control and limit the amount of
>> physical memory used by a task. [...]
>
> prctl is fundamentally about per-process control, while cgroup (not only
> memcg) is a group-of-processes interface. How do those two interact
> together? In other words, what is the semantic when different processes
> have different views on the same underlying memcg event?

As said in a previous mail, this patchset is derived from a customer
request, and per-process control is exactly what the customer wants. That
is why prctl() is used. This patchset is intended to supplement the
existing memory cgroup features. Processes in a memory cgroup that don't
use this new API will behave exactly like before. Only processes that opt
to use this new API will have additional mitigation actions applied to
them in case the additional limits are reached.

> Also, the above description doesn't really describe any usecase which
> struggles with the existing interface. We already do allow slowing the
> allocator down and, along with PSI, also provide user space control over
> close-to-OOM situations.

The customer that requested it was using Solaris. Solaris does allow
per-process memory control, and they have tools that rely on this
capability. This patchset will help them migrate off Solaris more easily.
I will look closer into how PSI can help here.

Thanks,
Longman
On Mon 17-08-20 11:55:37, Waiman Long wrote:
> On 8/17/20 11:26 AM, Michal Hocko wrote:
> > On Mon 17-08-20 10:08:23, Waiman Long wrote:
> > > Memory controller can be used to control and limit the amount of
> > > physical memory used by a task. [...]
> >
> > prctl is fundamentally about per-process control, while cgroup (not only
> > memcg) is a group-of-processes interface. How do those two interact
> > together? In other words, what is the semantic when different processes
> > have different views on the same underlying memcg event?
>
> As said in a previous mail, this patchset is derived from a customer
> request, and per-process control is exactly what the customer wants. That
> is why prctl() is used. This patchset is intended to supplement the
> existing memory cgroup features. Processes in a memory cgroup that don't
> use this new API will behave exactly like before. Only processes that opt
> to use this new API will have additional mitigation actions applied to
> them in case the additional limits are reached.

Please keep in mind that you are proposing a new user API that we will
have to maintain forever. That requires that the interface is consistent
and well defined. As I've said, the fundamental problem with this
interface is that you are trying to hammer a process-centric interface
into a framework that is fundamentally process-group oriented. Maybe there
is a sensible way to do that without all sorts of weird corner cases, but
I haven't seen any of that explained here.

Really, just try to describe the semantic when two different tasks in the
same memcg have a different opinion on the same event. One wants ENOMEM
and the other a specific signal to be delivered. Right now the behavior
will be timing specific because who hits the oom path is non-deterministic
from the userspace POV. Let's say that you can somehow handle that; now
how are you going to implement ENOMEM for any context other than the
current task? I am pretty sure the more specific the questions get, the
more awkward this will become.
On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: > Memory controller can be used to control and limit the amount of > physical memory used by a task. When a limit is set in "memory.high" in > a v2 non-root memory cgroup, the memory controller will try to reclaim > memory if the limit has been exceeded. Normally, that will be enough > to keep the physical memory consumption of tasks in the memory cgroup > to be around or below the "memory.high" limit. > > Sometimes, memory reclaim may not be able to recover memory in a rate > that can catch up to the physical memory allocation rate. In this case, > the physical memory consumption will keep on increasing. Then slow down the allocator? That's what we do for dirty pages too, we slow down the dirtier when we run against the limits.
On Tue 18-08-20 11:14:53, Peter Zijlstra wrote: > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: > > Memory controller can be used to control and limit the amount of > > physical memory used by a task. When a limit is set in "memory.high" in > > a v2 non-root memory cgroup, the memory controller will try to reclaim > > memory if the limit has been exceeded. Normally, that will be enough > > to keep the physical memory consumption of tasks in the memory cgroup > > to be around or below the "memory.high" limit. > > > > Sometimes, memory reclaim may not be able to recover memory in a rate > > that can catch up to the physical memory allocation rate. In this case, > > the physical memory consumption will keep on increasing. > > Then slow down the allocator? That's what we do for dirty pages too, we > slow down the dirtier when we run against the limits. This is what we actually do. Have a look at mem_cgroup_handle_over_high.
peterz@infradead.org writes: >On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: >> Memory controller can be used to control and limit the amount of >> physical memory used by a task. When a limit is set in "memory.high" in >> a v2 non-root memory cgroup, the memory controller will try to reclaim >> memory if the limit has been exceeded. Normally, that will be enough >> to keep the physical memory consumption of tasks in the memory cgroup >> to be around or below the "memory.high" limit. >> >> Sometimes, memory reclaim may not be able to recover memory in a rate >> that can catch up to the physical memory allocation rate. In this case, >> the physical memory consumption will keep on increasing. > >Then slow down the allocator? That's what we do for dirty pages too, we >slow down the dirtier when we run against the limits. We already do that since v5.4. I'm wondering whether Waiman's customer is just running with a too-old kernel without 0e4b01df865 ("mm, memcg: throttle allocators when failing reclaim over memory.high") backported.
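For anyone reproducing the setup under discussion, the knobs involved are
plain cgroup v2 files. A minimal sketch follows, assuming cgroup2 is mounted
at /sys/fs/cgroup and that creating a child group named "demo" is permitted
(typically requires root or proper delegation); paths and sizes are examples.

/* Sketch: create a cgroup v2 group, set the memory.high limit whose
 * over-limit throttling is being discussed plus a memory.max backstop,
 * and move the current process into it.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0)
                return -1;
        if (write(fd, val, strlen(val)) < 0) {
                close(fd);
                return -1;
        }
        return close(fd);
}

int main(void)
{
        char pid[32];

        mkdir("/sys/fs/cgroup/demo", 0755);                    /* ignore EEXIST */
        write_str("/sys/fs/cgroup/demo/memory.high", "100M");  /* throttle point */
        write_str("/sys/fs/cgroup/demo/memory.max", "200M");   /* hard limit / OOM */

        snprintf(pid, sizeof(pid), "%d", getpid());
        if (write_str("/sys/fs/cgroup/demo/cgroup.procs", pid))
                perror("cgroup.procs");
        return 0;
}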
On Tue, Aug 18, 2020 at 11:26:17AM +0200, Michal Hocko wrote: > On Tue 18-08-20 11:14:53, Peter Zijlstra wrote: > > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: > > > Memory controller can be used to control and limit the amount of > > > physical memory used by a task. When a limit is set in "memory.high" in > > > a v2 non-root memory cgroup, the memory controller will try to reclaim > > > memory if the limit has been exceeded. Normally, that will be enough > > > to keep the physical memory consumption of tasks in the memory cgroup > > > to be around or below the "memory.high" limit. > > > > > > Sometimes, memory reclaim may not be able to recover memory in a rate > > > that can catch up to the physical memory allocation rate. In this case, > > > the physical memory consumption will keep on increasing. > > > > Then slow down the allocator? That's what we do for dirty pages too, we > > slow down the dirtier when we run against the limits. > > This is what we actually do. Have a look at mem_cgroup_handle_over_high. But then how can it run-away like Waiman suggested? /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES. That's a fail... :-(
On Tue, Aug 18, 2020 at 10:27:37AM +0100, Chris Down wrote: > peterz@infradead.org writes: > > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: > > > Memory controller can be used to control and limit the amount of > > > physical memory used by a task. When a limit is set in "memory.high" in > > > a v2 non-root memory cgroup, the memory controller will try to reclaim > > > memory if the limit has been exceeded. Normally, that will be enough > > > to keep the physical memory consumption of tasks in the memory cgroup > > > to be around or below the "memory.high" limit. > > > > > > Sometimes, memory reclaim may not be able to recover memory in a rate > > > that can catch up to the physical memory allocation rate. In this case, > > > the physical memory consumption will keep on increasing. > > > > Then slow down the allocator? That's what we do for dirty pages too, we > > slow down the dirtier when we run against the limits. > > We already do that since v5.4. I'm wondering whether Waiman's customer is > just running with a too-old kernel without 0e4b01df865 ("mm, memcg: throttle > allocators when failing reclaim over memory.high") backported. That commit is fundamentally broken, it doesn't guarantee anything. Please go read how the dirty throttling works (unless people wrecked that since..).
On Tue 18-08-20 11:59:10, Peter Zijlstra wrote: > On Tue, Aug 18, 2020 at 11:26:17AM +0200, Michal Hocko wrote: > > On Tue 18-08-20 11:14:53, Peter Zijlstra wrote: > > > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: > > > > Memory controller can be used to control and limit the amount of > > > > physical memory used by a task. When a limit is set in "memory.high" in > > > > a v2 non-root memory cgroup, the memory controller will try to reclaim > > > > memory if the limit has been exceeded. Normally, that will be enough > > > > to keep the physical memory consumption of tasks in the memory cgroup > > > > to be around or below the "memory.high" limit. > > > > > > > > Sometimes, memory reclaim may not be able to recover memory in a rate > > > > that can catch up to the physical memory allocation rate. In this case, > > > > the physical memory consumption will keep on increasing. > > > > > > Then slow down the allocator? That's what we do for dirty pages too, we > > > slow down the dirtier when we run against the limits. > > > > This is what we actually do. Have a look at mem_cgroup_handle_over_high. > > But then how can it run-away like Waiman suggested? As Chris mentioned in other reply. This functionality is quite new. > /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES. We can certainly tune a different backoff delays but I suspect this is not the problem here. > That's a fail... :-(
peterz@infradead.org writes: >But then how can it run-away like Waiman suggested? Probably because he's not running with that commit at all. We and others use this to prevent runaway allocation on a huge range of production and desktop use cases and it works just fine. >/me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES. > >That's a fail... :-( I'd ask that you understand a bit more about the tradeoffs and intentions of the patch before rushing in to declare its failure, considering it works just fine :-) Clamping the maximal time allows the application to take some action to remediate the situation, while still being slowed down significantly. 2 seconds per allocation batch is still absolutely plenty for any use case I've come across. If you have evidence it isn't, then present that instead of vague notions of "wrongness".
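To make the behavior being argued about concrete, below is a small userspace
model of the clamped penalty curve: the sleep grows steeply with the relative
overage above memory.high and is capped at the 2 seconds per allocation batch
that Chris mentions. The constants and arithmetic are illustrative only; the
real logic lives in mem_cgroup_handle_over_high() and its helpers in
mm/memcontrol.c, which additionally scales the penalty by the size of the
charge batch and forgives very small penalties.

/* Toy model of the memory.high over-limit penalty discussed above.
 * Illustrative constants; not a copy of the kernel code.
 */
#include <stdint.h>
#include <stdio.h>

#define HZ              1000            /* jiffies per second in this model */
#define MAX_DELAY       (2 * HZ)        /* the 2s clamp per allocation batch */
#define PRECISION_SHIFT 20              /* fixed-point precision (model) */
#define SCALING_SHIFT   14              /* softens small overages (model) */

static uint64_t penalty_jiffies(uint64_t usage, uint64_t high)
{
        if (!high || usage <= high)
                return 0;

        /* relative overage, (usage - high) / high, in fixed point */
        uint64_t overage = ((usage - high) << PRECISION_SHIFT) / high;

        /* quadratic in the overage: tiny excursions are nearly free,
         * runaway growth is punished hard */
        uint64_t penalty = overage * overage * HZ;

        penalty >>= PRECISION_SHIFT;
        penalty >>= SCALING_SHIFT;

        return penalty < MAX_DELAY ? penalty : MAX_DELAY;
}

int main(void)
{
        const uint64_t high = 1 << 18;  /* e.g. a limit of 2^18 pages */

        for (int pct = 100; pct <= 130; pct += 5)
                printf("usage at %3d%% of high -> sleep %4llu ms per batch\n",
                       pct, (unsigned long long)
                       (penalty_jiffies(high * pct / 100, high) * 1000 / HZ));
        return 0;
}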
On Tue, Aug 18, 2020 at 12:05:16PM +0200, Michal Hocko wrote:
> > But then how can it run-away like Waiman suggested?
>
> As Chris mentioned in other reply. This functionality is quite new.
>
> > /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES.
>
> We can certainly tune a different backoff delays but I suspect this is
> not the problem here.

Tuning? That thing needs throwing out, it's fundamentally buggered. Why
didn't anybody look at how the I/O dirtying thing works first?

What you need is a feedback loop against the rate of freeing pages, and
when you near the saturation point, the allocation rate should exactly
match the freeing rate.

But this thing has nothing whatsoever like that.
On Tue, Aug 18, 2020 at 11:17:56AM +0100, Chris Down wrote: > I'd ask that you understand a bit more about the tradeoffs and intentions of > the patch before rushing in to declare its failure, considering it works > just fine :-) > > Clamping the maximal time allows the application to take some action to > remediate the situation, while still being slowed down significantly. 2 > seconds per allocation batch is still absolutely plenty for any use case > I've come across. If you have evidence it isn't, then present that instead > of vague notions of "wrongness". There is no feedback from the freeing rate, therefore it cannot be correct in maintaining a maximum amount of pages. 0.5 pages / sec is still non-zero, and if the free rate is 0, you'll crawl across whatever limit was set without any bounds. This is math 101. It's true that I haven't been paying attention to mm in a while, but I was one of the original authors of the I/O dirty balancing, I do think I understand how these things work.
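For what it's worth, the control scheme Peter is describing can be modeled in
a few lines: measure the recent freeing rate and, as usage approaches the
limit, pace the allocator toward that rate, so that with no freeing the
allowed allocation rate goes to zero instead of crawling past the limit. This
is only a toy illustration of the idea (the blend thresholds are arbitrary)
and not a description of what memcg currently does; the dirty throttling in
mm/page-writeback.c is the real-world reference.

/* Toy rate-matching throttle: not kernel code, just the feedback idea. */
#include <stdio.h>

struct state {
        double free_rate;       /* pages/sec freed recently (measured) */
        double usage;           /* pages currently charged */
        double limit;           /* the memory.high equivalent */
};

/* Allocation rate the task should be paced to. */
static double allowed_alloc_rate(const struct state *s, double full_speed)
{
        double fill = s->usage / s->limit;

        if (fill < 0.75)                /* far from the limit: no throttling */
                return full_speed;
        if (fill >= 1.0)                /* at/over the limit: rate-match */
                return s->free_rate;

        /* blend from full speed at 75% toward the freeing rate at 100% */
        double w = (fill - 0.75) / 0.25;
        return (1.0 - w) * full_speed + w * s->free_rate;
}

int main(void)
{
        struct state s = { .free_rate = 1000.0, .limit = 262144.0 };

        for (int pct = 50; pct <= 110; pct += 10) {
                s.usage = s.limit * pct / 100.0;
                printf("fill=%3d%% -> paced to %6.0f pages/s (free rate %.0f)\n",
                       pct, allowed_alloc_rate(&s, 50000.0), s.free_rate);
        }
        return 0;
}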
On Tue 18-08-20 12:18:44, Peter Zijlstra wrote:
> On Tue, Aug 18, 2020 at 12:05:16PM +0200, Michal Hocko wrote:
> > > But then how can it run-away like Waiman suggested?
> >
> > As Chris mentioned in other reply. This functionality is quite new.
> >
> > > /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES.
> >
> > We can certainly tune a different backoff delays but I suspect this is
> > not the problem here.
>
> Tuning? That thing needs throwing out, it's fundamentally buggered. Why
> didn't anybody look at how the I/O dirtying thing works first?
>
> What you need is a feedback loop against the rate of freeing pages, and
> when you near the saturation point, the allocation rate should exactly
> match the freeing rate.
>
> But this thing has nothing whatsoever like that.

Existing usecases seem to be doing fine with the existing implementation.
If we find out that this is insufficient then we can work on that, but I
believe this is tangential to this email thread. There are no indications
that the current implementation doesn't throttle enough. The proposal also
aims at a much richer interface to define the oom behavior.
peterz@infradead.org writes:
> On Tue, Aug 18, 2020 at 11:17:56AM +0100, Chris Down wrote:
> > I'd ask that you understand a bit more about the tradeoffs and
> > intentions of the patch before rushing in to declare its failure [...]
>
> There is no feedback from the freeing rate, therefore it cannot be
> correct in maintaining a maximum amount of pages.

memory.high is not about maintaining a maximum amount of pages. It's
strictly best-effort, and the ramifications of a breach are typically
fundamentally different than for dirty throttling.

> 0.5 pages / sec is still non-zero, and if the free rate is 0, you'll
> crawl across whatever limit was set without any bounds. This is math
> 101.
>
> It's true that I haven't been paying attention to mm in a while, but I
> was one of the original authors of the I/O dirty balancing, I do think I
> understand how these things work.

You're suggesting we replace a well-understood, easy-to-reason-about model
with something non-trivially more complex, all on the back of you
suggesting that the current approach is "wrong" without any evidence or
quantification.

Peter, we're not going to throw out perfectly functional memcg code simply
because of your say-so, especially when you've not asked for information
or context about the tradeoffs involved, or presented any evidence that
something perverse is actually happening. Prescribing a specific solution
modelled on some other code path here without producing evidence or
measurements specific to the nuances of this particular endpoint is not a
recipe for success.
On Tue, Aug 18, 2020 at 12:30:59PM +0200, Michal Hocko wrote:
> The proposal also aims at a much richer interface to define the
> oom behavior.

Oh yeah, I'm not defending any of that prctl() nonsense. Just saying that
from a math / control theory point of view, the current thing is an
abhorrent failure.
On Tue, Aug 18, 2020 at 12:04:44PM +0200, peterz@infradead.org wrote: > On Tue, Aug 18, 2020 at 10:27:37AM +0100, Chris Down wrote: > > peterz@infradead.org writes: > > > On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote: > > > > Memory controller can be used to control and limit the amount of > > > > physical memory used by a task. When a limit is set in "memory.high" in > > > > a v2 non-root memory cgroup, the memory controller will try to reclaim > > > > memory if the limit has been exceeded. Normally, that will be enough > > > > to keep the physical memory consumption of tasks in the memory cgroup > > > > to be around or below the "memory.high" limit. > > > > > > > > Sometimes, memory reclaim may not be able to recover memory in a rate > > > > that can catch up to the physical memory allocation rate. In this case, > > > > the physical memory consumption will keep on increasing. > > > > > > Then slow down the allocator? That's what we do for dirty pages too, we > > > slow down the dirtier when we run against the limits. > > > > We already do that since v5.4. I'm wondering whether Waiman's customer is > > just running with a too-old kernel without 0e4b01df865 ("mm, memcg: throttle > > allocators when failing reclaim over memory.high") backported. > > That commit is fundamentally broken, it doesn't guarantee anything. > > Please go read how the dirty throttling works (unless people wrecked > that since..). Of course they did. https://lore.kernel.org/linux-mm/ce7975cd-6353-3f29-b52c-7a81b1d07caa@kernel.dk/
On Tue, Aug 18, 2020 at 12:18:44PM +0200, peterz@infradead.org wrote: > What you need is a feeback loop against the rate of freeing pages, and > when you near the saturation point, the allocation rate should exactly > match the freeing rate. IO throttling solves a slightly different problem. IO occurs in parallel to the workload's execution stream, and you're trying to take the workload from dirtying at CPU speed to rate match to the independent IO stream. With memory allocations, though, freeing happens from inside the execution stream of the workload. If you throttle allocations, you're most likely throttling the freeing rate as well. And you'll slow down reclaim scanning by the same amount as the page references, so it's not making reclaim more successful either. The alloc/use/free (im)balance is an inherent property of the workload, regardless of the speed you're executing it at. So the goal here is different. We're not trying to pace the workload into some form of sustainability. Rather, it's for OOM handling. When we detect the workload's alloc/use/free pattern is unsustainable given available memory, we slow it down just enough to allow userspace to implement OOM policy and job priorities (on containerized hosts these tend to be too complex to express in the kernel's oom scoring system). The exponential curve makes it look like we're trying to do some type of feedback system, but it's really only to let minor infractions pass and throttle unsustainable expansion ruthlessly. Drop-behind reclaim can be a bit bumpy because we batch on the allocation side as well as on the reclaim side, hence the fuzz factor there.
On 8/17/20 3:26 PM, Michal Hocko wrote:
> On Mon 17-08-20 11:55:37, Waiman Long wrote:
>> [...]
>
> Please keep in mind that you are proposing a new user API that we will
> have to maintain forever. That requires that the interface is consistent
> and well defined. As I've said, the fundamental problem with this
> interface is that you are trying to hammer a process-centric interface
> into a framework that is fundamentally process-group oriented. Maybe there
> is a sensible way to do that without all sorts of weird corner cases, but
> I haven't seen any of that explained here.
>
> Really, just try to describe the semantic when two different tasks in the
> same memcg have a different opinion on the same event. One wants ENOMEM
> and the other a specific signal to be delivered. Right now the behavior
> will be timing specific because who hits the oom path is non-deterministic
> from the userspace POV. Let's say that you can somehow handle that; now
> how are you going to implement ENOMEM for any context other than the
> current task? I am pretty sure the more specific the questions get, the
> more awkward this will become.

The basic idea is to trigger a user-specified memory-over-high mitigation
when the actual memory usage exceeds a threshold which is supposed to be
between "high" and "max". The additional limit that is passed in is for
setting this additional threshold. We want to avoid OOM at all costs.

The ENOMEM error may not be suitable for all applications, as some of them
may not be able to handle ENOMEM gracefully; that action is for
applications that are designed to handle it.

Cheers,
Longman
On 8/18/20 5:14 AM, peterz@infradead.org wrote:
> On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
>> Memory controller can be used to control and limit the amount of
>> physical memory used by a task. [...]
>
> Then slow down the allocator? That's what we do for dirty pages too, we
> slow down the dirtier when we run against the limits.

I missed that allocator throttling is already done in upstream code. So I
will need to reexamine whether this patchset is necessary or not.

Thanks,
Longman
On 8/18/20 5:27 AM, Chris Down wrote:
> peterz@infradead.org writes:
>> On Mon, Aug 17, 2020 at 10:08:23AM -0400, Waiman Long wrote:
>>> Memory controller can be used to control and limit the amount of
>>> physical memory used by a task. [...]
>>
>> Then slow down the allocator? That's what we do for dirty pages too, we
>> slow down the dirtier when we run against the limits.
>
> We already do that since v5.4. I'm wondering whether Waiman's customer
> is just running with a too-old kernel without 0e4b01df865 ("mm, memcg:
> throttle allocators when failing reclaim over memory.high") backported.

The fact is that we don't have that in RHEL8 yet, and cgroup v2 is still
not the default at the moment. I am planning to backport the throttling
patches to RHEL and hopefully we can switch to cgroup v2 soon.

Cheers,
Longman
On Tue, Aug 18, 2020 at 01:55:59PM +0100, Matthew Wilcox wrote:
> On Tue, Aug 18, 2020 at 12:04:44PM +0200, peterz@infradead.org wrote:
> > [...]
> >
> > That commit is fundamentally broken, it doesn't guarantee anything.
> >
> > Please go read how the dirty throttling works (unless people wrecked
> > that since..).
>
> Of course they did.
>
> https://lore.kernel.org/linux-mm/ce7975cd-6353-3f29-b52c-7a81b1d07caa@kernel.dk/

Different thing. That's memory reclaim throttling, not dirty page
throttling. balance_dirty_pages() still works just fine as it does not
look at device congestion. The page cleaning rate is accounted in
test_clear_page_writeback(), and the page dirtying rate is accounted
directly in balance_dirty_pages(). That feedback loop has not been
broken...

And I completely agree with Peter here - the control theory we applied to
the dirty throttling problem is still 100% valid and so the algorithm
still just works all these years later. I've only been saying that
allocation should use the same feedback model for reclaim throttling
since ~2011...

Cheers,

Dave.
On Tue, Aug 18, 2020 at 09:49:00AM -0400, Johannes Weiner wrote: > On Tue, Aug 18, 2020 at 12:18:44PM +0200, peterz@infradead.org wrote: > > What you need is a feeback loop against the rate of freeing pages, and > > when you near the saturation point, the allocation rate should exactly > > match the freeing rate. > > IO throttling solves a slightly different problem. > > IO occurs in parallel to the workload's execution stream, and you're > trying to take the workload from dirtying at CPU speed to rate match > to the independent IO stream. > > With memory allocations, though, freeing happens from inside the > execution stream of the workload. If you throttle allocations, you're For a single task, but even then you're making the argument that we need to allocate memory to free memory, and we all know where that gets us. But we're actually talking about a cgroup here, which is a collection of tasks all doing things in parallel. > most likely throttling the freeing rate as well. And you'll slow down > reclaim scanning by the same amount as the page references, so it's > not making reclaim more successful either. The alloc/use/free > (im)balance is an inherent property of the workload, regardless of the > speed you're executing it at. Arguably seeing the rate drop to near 0 is a very good point to consider running cgroup-OOM.
On 8/18/20 6:17 AM, Chris Down wrote:
> peterz@infradead.org writes:
>> But then how can it run-away like Waiman suggested?
>
> Probably because he's not running with that commit at all. We and
> others use this to prevent runaway allocation on a huge range of
> production and desktop use cases and it works just fine.
>
>> /me goes look... and finds MEMCG_MAX_HIGH_DELAY_JIFFIES.
>>
>> That's a fail... :-(
>
> I'd ask that you understand a bit more about the tradeoffs and
> intentions of the patch before rushing in to declare its failure,
> considering it works just fine :-)
>
> Clamping the maximal time allows the application to take some action
> to remediate the situation, while still being slowed down
> significantly. 2 seconds per allocation batch is still absolutely
> plenty for any use case I've come across. If you have evidence it
> isn't, then present that instead of vague notions of "wrongness".

Sorry for the late reply. I ran some tests on the latest kernel and it
seems to work as expected. I was running the tests on an older kernel
that doesn't have this patch, and I was not aware of it beforehand.

Sorry for the confusion.

Cheers,
Longman
On Fri, Aug 21, 2020 at 09:37:16PM +0200, Peter Zijlstra wrote: > On Tue, Aug 18, 2020 at 09:49:00AM -0400, Johannes Weiner wrote: > > On Tue, Aug 18, 2020 at 12:18:44PM +0200, peterz@infradead.org wrote: > > > What you need is a feeback loop against the rate of freeing pages, and > > > when you near the saturation point, the allocation rate should exactly > > > match the freeing rate. > > > > IO throttling solves a slightly different problem. > > > > IO occurs in parallel to the workload's execution stream, and you're > > trying to take the workload from dirtying at CPU speed to rate match > > to the independent IO stream. > > > > With memory allocations, though, freeing happens from inside the > > execution stream of the workload. If you throttle allocations, you're > > For a single task, but even then you're making the argument that we need > to allocate memory to free memory, and we all know where that gets us. > > But we're actually talking about a cgroup here, which is a collection of > tasks all doing things in parallel. Right, but sharing a memory cgroup means sharing an LRU list, and that transfers memory pressure and allocation burden between otherwise independent tasks - if nothing else through cache misses on the executables and libraries. I doubt that one task can go through several comprehensive reclaim cycles on a shared LRU without completely annihilating the latency or throughput targets of everybody else in the group in most real world applications. > > most likely throttling the freeing rate as well. And you'll slow down > > reclaim scanning by the same amount as the page references, so it's > > not making reclaim more successful either. The alloc/use/free > > (im)balance is an inherent property of the workload, regardless of the > > speed you're executing it at. > > Arguably seeing the rate drop to near 0 is a very good point to consider > running cgroup-OOM. Agreed. In the past, that's actually what we did: In cgroup1, you could disable the kernel OOM killer, and when reclaim failed at the limit, the allocating task would be put on a waitqueue until woken up by a freeing event. Conceptually this is clean & straight-forward. However, 1. Putting allocation contexts with unknown locks to indefinite sleep caused deadlocks, for obvious reasons. Userspace OOM killing tends to take a lot of task-specific locks when scanning through /proc files for kill candidates, and can easily get stuck. Using bounded over indefinite waits is simply acknowledging that the deadlock potential when connecting arbitrary task stacks in the system through free->alloc ordering is equally difficult to plan out as alloc->free ordering. The non-cgroup OOM killer actually has the same deadlock potential, where the allocating/killing task can hold resources that the OOM victim requires to exit. The OOM reaper hides it, the static emergency reserves hide it - but to truly solve this problem, you would have to have full knowledge of memory & lock ordering dependencies of those tasks. And then can still end up with scenarios where the only answer is panic(). 2. I don't recall ever seeing situations in cgroup1 where the precise matching of allocation rate to freeing rate has allowed cgroups to run sustainably after reclaim has failed. The practical benefit of a complicated feedback loop over something crude & robust once we're in an OOM situation is not apparent to me. [ That's different from the IO-throttling *while still doing reclaim* that Dave brought up. 
*That* justifies the same effort we put into dirty throttling. I'm only talking about the situation where reclaim has already failed and we need to facilitate userspace OOM handling. ] So that was the motivation for the bounded sleeps. They do not guarantee containment, but they provide a reasonable amount of time for the userspace OOM handler to intervene, without deadlocking. That all being said, the semantics of the new 'high' limit in cgroup2 have allowed us to move reclaim/limit enforcement out of the allocation context and into the userspace return path. See the call to mem_cgroup_handle_over_high() from tracehook_notify_resume(), and the comments in try_charge() around set_notify_resume(). This already solves the free->alloc ordering problem by allowing the allocation to exceed the limit temporarily until at least all locks are dropped, we know we can sleep etc., before performing enforcement. That means we may not need the timed sleeps anymore for that purpose, and could bring back directed waits for freeing-events again. What do you think? Any hazards around indefinite sleeps in that resume path? It's called before __rseq_handle_notify_resume and the arch-specific resume callback (which appears to be a no-op currently). Chris, Michal, what are your thoughts? It would certainly be simpler conceptually on the memcg side.
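To make the hook points Johannes names easier to follow, here is a heavily
condensed paraphrase of how the deferral looked in kernels of that era. It is
not verbatim kernel code, and the over_high() predicate is an invented
stand-in for the actual usage-versus-high checks in try_charge().

/* Condensed paraphrase (not verbatim) of the deferral described above. */

/* mm/memcontrol.c: the charge path, which may run under arbitrary locks */
static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
                      unsigned int nr_pages)
{
        /* ... charge the page_counter, try direct reclaim, etc. ... */

        if (over_high(memcg)) {                 /* invented helper name */
                /* record the debt ... */
                current->memcg_nr_pages_over_high += nr_pages;
                /* ... and ask for a callback at a safe point */
                set_notify_resume(current);
        }
        return 0;
}

/* include/linux/tracehook.h: runs on the way back to userspace, where no
 * kernel locks are held and sleeping is safe */
static inline void tracehook_notify_resume(struct pt_regs *regs)
{
        /* ... task_work etc. ... */
        mem_cgroup_handle_over_high();  /* reclaim and/or apply the penalty sleep */
}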
Johannes Weiner writes: >That all being said, the semantics of the new 'high' limit in cgroup2 >have allowed us to move reclaim/limit enforcement out of the >allocation context and into the userspace return path. > >See the call to mem_cgroup_handle_over_high() from >tracehook_notify_resume(), and the comments in try_charge() around >set_notify_resume(). > >This already solves the free->alloc ordering problem by allowing the >allocation to exceed the limit temporarily until at least all locks >are dropped, we know we can sleep etc., before performing enforcement. > >That means we may not need the timed sleeps anymore for that purpose, >and could bring back directed waits for freeing-events again. > >What do you think? Any hazards around indefinite sleeps in that resume >path? It's called before __rseq_handle_notify_resume and the >arch-specific resume callback (which appears to be a no-op currently). > >Chris, Michal, what are your thoughts? It would certainly be simpler >conceptually on the memcg side. I'm not against that, although I personally don't feel very strongly about it either way, since the current behaviour clearly works in practice.
[Sorry, this slipped through the cracks]

On Mon 24-08-20 12:58:50, Johannes Weiner wrote:
> On Fri, Aug 21, 2020 at 09:37:16PM +0200, Peter Zijlstra wrote:
> [...]
> > Arguably seeing the rate drop to near 0 is a very good point to consider
> > running cgroup-OOM.
>
> Agreed. In the past, that's actually what we did: In cgroup1, you
> could disable the kernel OOM killer, and when reclaim failed at the
> limit, the allocating task would be put on a waitqueue until woken up
> by a freeing event. Conceptually this is clean & straight-forward.
>
> However,
>
> 1. Putting allocation contexts with unknown locks to indefinite sleep
>    caused deadlocks, for obvious reasons. [...] And then can still end
>    up with scenarios where the only answer is panic().

Yes. Even killing all eligible tasks is not guaranteed to help the
situation, because a) resources might not be bound to a process lifetime
(e.g. tmpfs), or b) an ineligible task might be holding resources that
block others from doing the proper cleanup. The OOM reaper is here to make
sure we reclaim some of the address space of the victim, and we go over
all eligible tasks rather than getting stuck at the first victim forever.

> 2. I don't recall ever seeing situations in cgroup1 where the precise
>    matching of allocation rate to freeing rate has allowed cgroups to
>    run sustainably after reclaim has failed. The practical benefit of
>    a complicated feedback loop over something crude & robust once
>    we're in an OOM situation is not apparent to me.

Yes, the usual outcome is to go OOM and kill something. Running on the
very edge of the (memcg) OOM doesn't tend to be sustainable, and I am not
sure it makes sense to optimize for.

> [ That's different from the IO-throttling *while still doing
>   reclaim* that Dave brought up. [...] ]
>
> So that was the motivation for the bounded sleeps. They do not
> guarantee containment, but they provide a reasonable amount of time
> for the userspace OOM handler to intervene, without deadlocking.

Yes, memory.high is mostly a best-effort containment. We do have the hard
limit to put a stop to runaways, or if you are watching PSI then the high
limit throttling would give you enough of a signal to take action from
userspace.

> That all being said, the semantics of the new 'high' limit in cgroup2
> have allowed us to move reclaim/limit enforcement out of the
> allocation context and into the userspace return path.
>
> See the call to mem_cgroup_handle_over_high() from
> tracehook_notify_resume(), and the comments in try_charge() around
> set_notify_resume(). [...]
>
> That means we may not need the timed sleeps anymore for that purpose,
> and could bring back directed waits for freeing-events again.
>
> What do you think? Any hazards around indefinite sleeps in that resume
> path? [...]
>
> Chris, Michal, what are your thoughts? It would certainly be simpler
> conceptually on the memcg side.

I would need a more specific description. But as I've already said, it
doesn't seem that we need to fix any practical problem here. The high
limit implementation has changed quite a lot recently. I would rather see
it settle for a while and see how it behaves in a wider variety of
workloads before changing the implementation again.
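Since PSI comes up here as the userspace signal for taking action before
running into memory.max, a minimal watcher sketch is shown below. The cgroup
path and the avg10 threshold are arbitrary example choices; the line being
parsed is the standard "some avg10=... avg60=... avg300=... total=..." format
of the cgroup v2 memory.pressure file.

/* Sketch: poll a cgroup's memory.pressure (PSI) and let userspace decide
 * what to do when the group stalls on memory, e.g. near memory.high.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const char *psi = "/sys/fs/cgroup/demo/memory.pressure"; /* example path */

        for (;;) {
                char line[256];
                double avg10 = 0.0;
                FILE *f = fopen(psi, "r");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                /* first line: "some avg10=X avg60=Y avg300=Z total=N" */
                if (fgets(line, sizeof(line), f))
                        sscanf(line, "some avg10=%lf", &avg10);
                fclose(f);

                if (avg10 > 10.0)       /* arbitrary example threshold */
                        fprintf(stderr, "memory pressure %.1f%%: shed load or "
                                "pick a victim here\n", avg10);
                sleep(2);
        }
}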