Message ID | 20241211105336.380cb545@fangorn (mailing list archive) |
---|---|
State | New |
Series | memcg: allow exiting tasks to write back data to swap |
On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@surriel.com> wrote:
>
> A task already in exit can get stuck trying to allocate pages, if its
> cgroup is at the memory.max limit, the cgroup is using zswap, but
> zswap writeback is disabled, and the remaining memory in the cgroup is
> not compressible.
>
> This seems like an unlikely confluence of events, but it can happen
> quite easily if a cgroup is OOM killed due to exceeding its memory.max
> limit, and all the tasks in the cgroup are trying to exit simultaneously.
>
> When this happens, it can sometimes take hours for tasks to exit,
> as they are all trying to squeeze things into zswap to bring the group's
> memory consumption below memory.max.
>
> Allowing these exiting programs to push some memory from their own
> cgroup into swap allows them to quickly bring the cgroup's memory
> consumption below memory.max, and exit in seconds rather than hours.
>
> Loading this fix as a live patch on a system where a workload got stuck
> exiting allowed the workload to exit within a fraction of a second.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  mm/memcontrol.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..03d77e93087e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5371,6 +5371,15 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
>  	if (!zswap_is_enabled())
>  		return true;
>
> +	/*
> +	 * Always allow exiting tasks to push data to swap. A process in
> +	 * the middle of exit cannot get OOM killed, but may need to push
> +	 * uncompressible data to swap in order to get the cgroup memory
> +	 * use below the limit, and make progress with the exit.
> +	 */
> +	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> +		return true;
> +

I have a few questions:

(a) If the task is being OOM killed it should be able to charge memory
beyond memory.max, so why do we need to get the usage down below the
limit?

Looking at the other thread with Michal, it looks like it's because we
have to go into reclaim first before we get to the point of force
charging for dying tasks, and we spend too much time in reclaim. Is
that correct?

If that's the case, I am wondering if the real problem is that we check
mem_cgroup_zswap_writeback_enabled() too late in the process. Reclaim
ages the LRUs, isolates pages, unmaps them, and allocates swap entries,
only to realize it cannot write them out in swap_writepage().

Should we check for this in can_reclaim_anon_pages()? If zswap
writeback is disabled and we are already at the memcg limit (or the
zswap limit, for that matter), we should avoid scanning anon memory to
begin with. The problem is that if we race with memory being freed we
may have some extra OOM kills, but I am not sure how common this case
would be.

(b) Should we use mem_cgroup_is_descendant() or mm_match_cgroup() in
case we are reclaiming from an ancestor and we hit the limit of that
ancestor?

(c) mem_cgroup_from_task() should be called in an RCU read section (or
we need something like rcu_access_pointer() if we are not dereferencing
the pointer).

> 	for (; memcg; memcg = parent_mem_cgroup(memcg))
> 		if (!READ_ONCE(memcg->zswap_writeback))
> 			return false;
> --
> 2.47.0
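To make the can_reclaim_anon_pages() idea in (a) concrete, a rough
sketch of the early bail-out against mm/vmscan.c follows. This is an
illustration of the suggestion, not a tested patch;
memcg_zswap_at_limit() is a hypothetical helper standing in for the
"zswap cannot accept more pages" test that currently lives inside
obj_cgroup_may_zswap() in mm/zswap.c:

    static bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
                                       int nid, struct scan_control *sc)
    {
        /*
         * If reclaimed anon pages could go neither to a swapfile
         * (writeback disabled) nor to zswap (limit hit), scanning
         * the anon LRUs is wasted work: skip them entirely.
         */
        if (memcg && !mem_cgroup_zswap_writeback_enabled(memcg) &&
            memcg_zswap_at_limit(memcg))        /* hypothetical helper */
                return false;

        /* ... existing swap-space availability checks follow ... */
    }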
On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@surriel.com>
> wrote:
> >
> > +++ b/mm/memcontrol.c
> > @@ -5371,6 +5371,15 @@ bool
> > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> >  	if (!zswap_is_enabled())
> >  		return true;
> >
> > +	/*
> > +	 * Always allow exiting tasks to push data to swap. A process in
> > +	 * the middle of exit cannot get OOM killed, but may need to push
> > +	 * uncompressible data to swap in order to get the cgroup memory
> > +	 * use below the limit, and make progress with the exit.
> > +	 */
> > +	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> > +		return true;
> > +
>
> I have a few questions:
>
> (a) If the task is being OOM killed it should be able to charge memory
> beyond memory.max, so why do we need to get the usage down below the
> limit?

If it is a kernel directed memcg OOM kill, that is true.

However, if the exit comes from somewhere else, like a userspace oomd
kill, we might not hit that code path.

> Looking at the other thread with Michal, it looks like it's because we
> have to go into reclaim first before we get to the point of force
> charging for dying tasks, and we spend too much time in reclaim. Is
> that correct?
>
> If that's the case, I am wondering if the real problem is that we check
> mem_cgroup_zswap_writeback_enabled() too late in the process. Reclaim
> ages the LRUs, isolates pages, unmaps them, and allocates swap entries,
> only to realize it cannot write them out in swap_writepage().
>
> Should we check for this in can_reclaim_anon_pages()? If zswap
> writeback is disabled and we are already at the memcg limit (or the
> zswap limit, for that matter), we should avoid scanning anon memory to
> begin with. The problem is that if we race with memory being freed we
> may have some extra OOM kills, but I am not sure how common this case
> would be.

However, we don't know until the attempted zswap write whether the
memory is compressible, and whether doing a bunch of zswap writes will
help us bring our memcg down below its memory.max limit.

> (b) Should we use mem_cgroup_is_descendant() or mm_match_cgroup() in
> case we are reclaiming from an ancestor and we hit the limit of that
> ancestor?

I don't know if we need or want to reclaim from any other memcgs than
those of the exiting process itself.

A small blast radius seems like it could be desirable, but I'm open to
other ideas :)

> (c) mem_cgroup_from_task() should be called in an RCU read section (or
> we need something like rcu_access_pointer() if we are not dereferencing
> the pointer).

I'll add this in v2.
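For point (c), a minimal sketch of what the RCU-safe form of the check
in mem_cgroup_zswap_writeback_enabled() might look like in v2 (an
illustration only, not the actual v2 patch):

    if (current->flags & PF_EXITING) {
        struct mem_cgroup *task_memcg;
        bool match;

        /*
         * mem_cgroup_from_task() must be called under rcu_read_lock().
         * Only the pointer value is compared; the memcg itself is not
         * dereferenced here.
         */
        rcu_read_lock();
        task_memcg = mem_cgroup_from_task(current);
        match = (task_memcg == memcg);
        rcu_read_unlock();

        if (match)
                return true;
    }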
On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@surriel.com>
> > wrote:
> > >
> > > +++ b/mm/memcontrol.c
> > > @@ -5371,6 +5371,15 @@ bool
> > > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > >  	if (!zswap_is_enabled())
> > >  		return true;
> > >
> > > +	/*
> > > +	 * Always allow exiting tasks to push data to swap. A process in
> > > +	 * the middle of exit cannot get OOM killed, but may need to push
> > > +	 * uncompressible data to swap in order to get the cgroup memory
> > > +	 * use below the limit, and make progress with the exit.
> > > +	 */
> > > +	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> > > +		return true;
> > > +
> >
> > I have a few questions:
> >
> > (a) If the task is being OOM killed it should be able to charge memory
> > beyond memory.max, so why do we need to get the usage down below the
> > limit?
> >
> If it is a kernel directed memcg OOM kill, that is true.
>
> However, if the exit comes from somewhere else, like a userspace oomd
> kill, we might not hit that code path.

Why do we treat dying tasks differently based on the source of the
kill?

> > Looking at the other thread with Michal, it looks like it's because we
> > have to go into reclaim first before we get to the point of force
> > charging for dying tasks, and we spend too much time in reclaim. Is
> > that correct?
> >
> > If that's the case, I am wondering if the real problem is that we check
> > mem_cgroup_zswap_writeback_enabled() too late in the process. Reclaim
> > ages the LRUs, isolates pages, unmaps them, and allocates swap entries,
> > only to realize it cannot write them out in swap_writepage().
> >
> > Should we check for this in can_reclaim_anon_pages()? If zswap
> > writeback is disabled and we are already at the memcg limit (or the
> > zswap limit, for that matter), we should avoid scanning anon memory to
> > begin with. The problem is that if we race with memory being freed we
> > may have some extra OOM kills, but I am not sure how common this case
> > would be.
>
> However, we don't know until the attempted zswap write whether the
> memory is compressible, and whether doing a bunch of zswap writes will
> help us bring our memcg down below its memory.max limit.

If we are at memory.max (or memory.zswap.max), we can't compress pages
into zswap anyway, regardless of their compressibility.

So what I am saying is, if we are already at the limit (pages cannot go
into zswap), and writeback is disabled (pages cannot go into swapfiles),
then we should probably avoid scanning the anon LRUs and spending all
those wasted cycles trying to isolate, unmap, and reclaim them, only to
fail at the last step.

> > (b) Should we use mem_cgroup_is_descendant() or mm_match_cgroup() in
> > case we are reclaiming from an ancestor and we hit the limit of that
> > ancestor?
> >
> I don't know if we need or want to reclaim from any other memcgs than
> those of the exiting process itself.
>
> A small blast radius seems like it could be desirable, but I'm open to
> other ideas :)

The exiting process is part of all the ancestor cgroups by the
hierarchy. If we have the following hierarchy:

    root
     |
     A
     |
     B

then a process in cgroup B could be getting OOM killed due to hitting
the limit of A, not B. In that case, reclaiming from A helps us get
below the limit.

We can check if the cgroup is an ancestor and it hit its limit, but
maybe that's overkill.

> > (c) mem_cgroup_from_task() should be called in an RCU read section (or
> > we need something like rcu_access_pointer() if we are not dereferencing
> > the pointer).
> >
> I'll add this in v2.
>
> --
> All Rights Reversed.
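A sketch of the hierarchy-aware form of the check being discussed here
(illustrative only): mm_match_cgroup() already performs both the RCU
locking and the descendant walk, matching the task's memcg against the
limited cgroup or any of its ancestors:

    /*
     * Match if the memcg hitting its limit is the exiting task's own
     * cgroup or any ancestor of it. current->mm can already be NULL
     * this late in exit, so check it first.
     */
    if ((current->flags & PF_EXITING) && current->mm &&
        mm_match_cgroup(current->mm, memcg))
            return true;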
On Wed, 2024-12-11 at 09:00 -0800, Yosry Ahmed wrote:
> On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@surriel.com>
> wrote:
> >
> > On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> > > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@surriel.com>
> > > wrote:
> > > >
> > > > +++ b/mm/memcontrol.c
> > > > @@ -5371,6 +5371,15 @@ bool
> > > > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > > >  	if (!zswap_is_enabled())
> > > >  		return true;
> > > >
> > > > +	/*
> > > > +	 * Always allow exiting tasks to push data to swap. A process in
> > > > +	 * the middle of exit cannot get OOM killed, but may need to push
> > > > +	 * uncompressible data to swap in order to get the cgroup memory
> > > > +	 * use below the limit, and make progress with the exit.
> > > > +	 */
> > > > +	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> > > > +		return true;
> > > > +
> > >
> > > I have a few questions:
> > >
> > > (a) If the task is being OOM killed it should be able to charge memory
> > > beyond memory.max, so why do we need to get the usage down below the
> > > limit?
> > >
> > If it is a kernel directed memcg OOM kill, that is true.
> >
> > However, if the exit comes from somewhere else, like a userspace oomd
> > kill, we might not hit that code path.
>
> Why do we treat dying tasks differently based on the source of the
> kill?

Are you saying we should fail allocations for every dying task, and add
a check for PF_EXITING in here?

    if (unlikely(task_in_memcg_oom(current)))
        goto nomem;

> > However, we don't know until the attempted zswap write whether the
> > memory is compressible, and whether doing a bunch of zswap writes will
> > help us bring our memcg down below its memory.max limit.
>
> If we are at memory.max (or memory.zswap.max), we can't compress pages
> into zswap anyway, regardless of their compressibility.

Wait, this is news to me.

This seems like something we should fix, rather than live with, since
compressing the data to a smaller size could bring us below memory.max.

Is this "cannot compress when at memory.max" behavior intentional, or
just a side effect of how things happen to be?

Won't the allocations made from zswap_store() ignore the memory.max
limit because PF_MEMALLOC is set?

> > > (b) Should we use mem_cgroup_is_descendant() or mm_match_cgroup() in
> > > case we are reclaiming from an ancestor and we hit the limit of that
> > > ancestor?
> > >
> > I don't know if we need or want to reclaim from any other memcgs than
> > those of the exiting process itself.
> >
> > A small blast radius seems like it could be desirable, but I'm open to
> > other ideas :)
>
> The exiting process is part of all the ancestor cgroups by the
> hierarchy. If we have the following hierarchy:
>
>     root
>      |
>      A
>      |
>      B
>
> then a process in cgroup B could be getting OOM killed due to hitting
> the limit of A, not B. In that case, reclaiming from A helps us get
> below the limit. We can check if the cgroup is an ancestor and it hit
> its limit, but maybe that's overkill.

Since we're dealing with a corner case anyway, I suppose there's no
harm in using mm_match_cgroup(), which also happens to be cleaner than
the code I have right now.
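As a sketch, the alternative Rik is asking about would sit next to the
existing task_in_memcg_oom() check in try_charge_memcg() in
mm/memcontrol.c, roughly like this (illustrative only; whether failing
the charge is safe for every dying task is exactly the open question in
this thread):

    /*
     * Fail the charge quickly for any dying task, not only for tasks
     * already picked by the kernel OOM killer, so that a task killed
     * from userspace (e.g. by oomd) does not grind through reclaim
     * either.
     */
    if (unlikely(task_in_memcg_oom(current)) ||
        unlikely(current->flags & PF_EXITING))
            goto nomem;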
On Wed, Dec 11, 2024 at 9:20 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2024-12-11 at 09:00 -0800, Yosry Ahmed wrote:
> > On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@surriel.com>
> > wrote:
> > >
> > > On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> > > > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@surriel.com>
> > > > wrote:
> > > > >
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -5371,6 +5371,15 @@ bool
> > > > > mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > > > >  	if (!zswap_is_enabled())
> > > > >  		return true;
> > > > >
> > > > > +	/*
> > > > > +	 * Always allow exiting tasks to push data to swap. A process in
> > > > > +	 * the middle of exit cannot get OOM killed, but may need to push
> > > > > +	 * uncompressible data to swap in order to get the cgroup memory
> > > > > +	 * use below the limit, and make progress with the exit.
> > > > > +	 */
> > > > > +	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> > > > > +		return true;
> > > > > +
> > > >
> > > > I have a few questions:
> > > >
> > > > (a) If the task is being OOM killed it should be able to charge memory
> > > > beyond memory.max, so why do we need to get the usage down below the
> > > > limit?
> > > >
> > > If it is a kernel directed memcg OOM kill, that is true.
> > >
> > > However, if the exit comes from somewhere else, like a userspace oomd
> > > kill, we might not hit that code path.
> >
> > Why do we treat dying tasks differently based on the source of the
> > kill?
> >
> Are you saying we should fail allocations for every dying task, and add
> a check for PF_EXITING in here?

I am asking, not really suggesting anything :)

Does it matter from the kernel perspective if the task is dying due to
a kernel OOM kill or a userspace SIGKILL?

>
>     if (unlikely(task_in_memcg_oom(current)))
>         goto nomem;
>
> > > However, we don't know until the attempted zswap write whether the
> > > memory is compressible, and whether doing a bunch of zswap writes will
> > > help us bring our memcg down below its memory.max limit.
> >
> > If we are at memory.max (or memory.zswap.max), we can't compress pages
> > into zswap anyway, regardless of their compressibility.
> >
> Wait, this is news to me.
>
> This seems like something we should fix, rather than live with, since
> compressing the data to a smaller size could bring us below memory.max.
>
> Is this "cannot compress when at memory.max" behavior intentional, or
> just a side effect of how things happen to be?
>
> Won't the allocations made from zswap_store() ignore the memory.max
> limit because PF_MEMALLOC is set?

My bad, obj_cgroup_may_zswap() only checks the zswap limit, not
memory.max. Please ignore this.

The scenario I described where we scan the LRUs needlessly is if the
*zswap limit* is hit, and writeback is disabled. I am guessing this is
not the case you're running into.

So yeah, my only outstanding question is the one above about handling
userspace OOM kills differently. Thanks for bearing with me.
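For context, a condensed sketch of what obj_cgroup_may_zswap() in
mm/zswap.c checks (paraphrased from the kernel source of this era;
reference counting and the rstat flush are omitted for brevity, and
zswap_usage() is shorthand for the rstat-based usage read in the real
code). Note that only memcg->zswap_max is consulted, never memory.max:

    static bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
    {
        struct mem_cgroup *memcg;

        /* Walk up the hierarchy, comparing zswap usage to zswap_max. */
        for (memcg = get_mem_cgroup_from_objcg(objcg);
             !mem_cgroup_is_root(memcg);
             memcg = parent_mem_cgroup(memcg)) {
                unsigned long max = READ_ONCE(memcg->zswap_max);

                if (max == PAGE_COUNTER_MAX)    /* no limit set */
                        continue;
                if (max == 0 || zswap_usage(memcg) >= max)
                        return false;
        }
        return true;
    }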
On Wed, 2024-12-11 at 09:30 -0800, Yosry Ahmed wrote:
> On Wed, Dec 11, 2024 at 9:20 AM Rik van Riel <riel@surriel.com>
> wrote:
> >
> > On Wed, 2024-12-11 at 09:00 -0800, Yosry Ahmed wrote:
> > > On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@surriel.com>
> > > wrote:
> > > >
> > > > If it is a kernel directed memcg OOM kill, that is true.
> > > >
> > > > However, if the exit comes from somewhere else, like a userspace
> > > > oomd kill, we might not hit that code path.
> > >
> > > Why do we treat dying tasks differently based on the source of the
> > > kill?
> > >
> > Are you saying we should fail allocations for every dying task, and
> > add a check for PF_EXITING in here?
>
> I am asking, not really suggesting anything :)
>
> Does it matter from the kernel perspective if the task is dying due to
> a kernel OOM kill or a userspace SIGKILL?
>
Currently, it does. I'm not sure it should, but currently it does :/

We are dealing with two conflicting demands here.

On the one hand, we want the exit code to be able to access things like
futex memory, so it can properly clean up everything the program left
behind.

On the other hand, we don't want the exiting program to drive up cgroup
memory use, especially not with memory that won't be reclaimed by the
exit.

My patch is an attempt to satisfy both of these demands, in situations
where we currently exhibit a rather pathological behavior (glacially
slow exit).
On 12/12/24 02:53, Rik van Riel wrote:
> A task already in exit can get stuck trying to allocate pages, if its
> cgroup is at the memory.max limit, the cgroup is using zswap, but
> zswap writeback is disabled, and the remaining memory in the cgroup is
> not compressible.
>
> This seems like an unlikely confluence of events, but it can happen
> quite easily if a cgroup is OOM killed due to exceeding its memory.max
> limit, and all the tasks in the cgroup are trying to exit simultaneously.
>
> When this happens, it can sometimes take hours for tasks to exit,
> as they are all trying to squeeze things into zswap to bring the group's
> memory consumption below memory.max.
>
> Allowing these exiting programs to push some memory from their own
> cgroup into swap allows them to quickly bring the cgroup's memory
> consumption below memory.max, and exit in seconds rather than hours.
>
> Loading this fix as a live patch on a system where a workload got stuck
> exiting allowed the workload to exit within a fraction of a second.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
>  mm/memcontrol.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..03d77e93087e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5371,6 +5371,15 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
>  	if (!zswap_is_enabled())
>  		return true;
>
> +	/*
> +	 * Always allow exiting tasks to push data to swap. A process in
> +	 * the middle of exit cannot get OOM killed, but may need to push
> +	 * uncompressible data to swap in order to get the cgroup memory
> +	 * use below the limit, and make progress with the exit.
> +	 */
> +	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> +		return true;
> +
>  	for (; memcg; memcg = parent_mem_cgroup(memcg))
>  		if (!READ_ONCE(memcg->zswap_writeback))
>  			return false;

Rik, I am unable to understand the motivation here. So we want
mem_cgroup_zswap_writeback_enabled() to return true, but it only
returns false if a memcg in the hierarchy has zswap_writeback set to 0
(false). In my git-grep I can't seem to find how/why that may be the
case. I can see the memcg starts off with the value set to true, if
CONFIG_ZSWAP is enabled.

Your changelog above makes sense, but I am unable to map it to the code
changes.

Balbir
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..03d77e93087e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5371,6 +5371,15 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
 	if (!zswap_is_enabled())
 		return true;
 
+	/*
+	 * Always allow exiting tasks to push data to swap. A process in
+	 * the middle of exit cannot get OOM killed, but may need to push
+	 * uncompressible data to swap in order to get the cgroup memory
+	 * use below the limit, and make progress with the exit.
+	 */
+	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
+		return true;
+
 	for (; memcg; memcg = parent_mem_cgroup(memcg))
 		if (!READ_ONCE(memcg->zswap_writeback))
 			return false;
A task already in exit can get stuck trying to allocate pages, if its
cgroup is at the memory.max limit, the cgroup is using zswap, but
zswap writeback is disabled, and the remaining memory in the cgroup is
not compressible.

This seems like an unlikely confluence of events, but it can happen
quite easily if a cgroup is OOM killed due to exceeding its memory.max
limit, and all the tasks in the cgroup are trying to exit simultaneously.

When this happens, it can sometimes take hours for tasks to exit,
as they are all trying to squeeze things into zswap to bring the group's
memory consumption below memory.max.

Allowing these exiting programs to push some memory from their own
cgroup into swap allows them to quickly bring the cgroup's memory
consumption below memory.max, and exit in seconds rather than hours.

Loading this fix as a live patch on a system where a workload got stuck
exiting allowed the workload to exit within a fraction of a second.

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 mm/memcontrol.c | 9 +++++++++
 1 file changed, 9 insertions(+)