Message ID | 5b06a490-55bc-a6a0-6c85-690254f86fad@virtuozzo.com (mailing list archive) |
---|---|
State | New |
Series | [memcg] memcg: prohibit unconditional exceeding the limit of dying tasks |
On 2021/09/10 21:39, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.
> Moreover the lifetime of the allocated object can differ from
> the lifetime of the dying task.

Can't we add a fatal_signal_pending(current) test to the vmalloc() loop?

> Multiple such allocations running concurrently can not only overuse
> the memcg limit, but can lead to a global out of memory and,
> in the worst case, cause the host to panic.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
On 9/10/21 3:39 PM, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.
> Moreover the lifetime of the allocated object can differ from
> the lifetime of the dying task.
> Multiple such allocations running concurrently can not only overuse
> the memcg limit, but can lead to a global out of memory and,
> in the worst case, cause the host to panic.

btw, the should_force_charge() function name becomes wrong with this.
Does it make sense to replace it with something like is_task_dying()?

> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/memcontrol.c | 23 +++++------------------
>  1 file changed, 5 insertions(+), 18 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 389b5766e74f..67195fcfbddf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>  		return OOM_ASYNC;
>  	}
>
> +	if (should_force_charge())
> +		return OOM_SKIPPED;
> +
>  	mem_cgroup_mark_under_oom(memcg);
>
>  	locked = mem_cgroup_oom_trylock(memcg);
> @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_ATOMIC)
>  		goto force;
>
> -	/*
> -	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage. Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> -	 */
> -	if (unlikely(should_force_charge()))
> -		goto force;
> -
>  	/*
>  	 * Prevent unbounded recursion when reclaim operations need to
>  	 * allocate memory. This might exceed the limits temporarily,
> @@ -2688,9 +2682,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_RETRY_MAYFAIL)
>  		goto nomem;
>
> -	if (fatal_signal_pending(current))
> -		goto force;
> -
>  	/*
>  	 * keep retrying as long as the memcg oom killer is able to make
>  	 * a forward progress or bypass the charge if the oom killer
> @@ -2698,15 +2689,11 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
>  		       get_order(nr_pages * PAGE_SIZE));
> -	switch (oom_status) {
> -	case OOM_SUCCESS:
> +	if (oom_status == OOM_SUCCESS) {
>  		nr_retries = MAX_RECLAIM_RETRIES;
>  		goto retry;
> -	case OOM_FAILED:
> +	} else if (oom_status == OOM_FAILED)
>  		goto force;
> -	default:
> -		goto nomem;
> -	}
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
>  		return -ENOMEM;
>
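[Editorial note: the helper discussed above is not shown in the patch itself. A minimal sketch of what should_force_charge() looks like in mm/memcontrol.c of that era (reconstructed from kernels around v5.14/v5.15, so details may differ from the exact tree under review):]

```c
/*
 * Sketch of the existing helper discussed above; not copied verbatim
 * from the tree under review. It answers "is the current task dying?"
 * rather than "must this charge be forced?", which is why the rename
 * to something like is_task_dying()/task_is_dying() is being proposed.
 */
static bool should_force_charge(void)
{
	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
		(current->flags & PF_EXITING);
}
```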
On 9/10/21 4:04 PM, Tetsuo Handa wrote:
> On 2021/09/10 21:39, Vasily Averin wrote:
>> The kernel currently allows dying tasks to exceed the memcg limits.
>> The allocation is expected to be the last one and the occupied memory
>> will be freed soon.
>> This is not always true because it can be part of the huge vmalloc
>> allocation. Allowed once, they will repeat over and over again.
>> Moreover the lifetime of the allocated object can differ from
>> the lifetime of the dying task.
>
> Can't we add a fatal_signal_pending(current) test to the vmalloc() loop?

1) This has been done in the past but was reverted later.
2) Any vmalloc changes will affect non-memcg allocations too.
If we're doing memcg-related checks, it's better to do them in one place.
3) It is not a vmalloc-only issue. A huge number of kmalloc page allocations
from N concurrent threads will lead to the same problem.

>> Multiple such allocations running concurrently can not only overuse
>> the memcg limit, but can lead to a global out of memory and,
>> in the worst case, cause the host to panic.
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
On Fri 10-09-21 16:20:58, Vasily Averin wrote:
> On 9/10/21 4:04 PM, Tetsuo Handa wrote:
> > On 2021/09/10 21:39, Vasily Averin wrote:
> >> The kernel currently allows dying tasks to exceed the memcg limits.
> >> The allocation is expected to be the last one and the occupied memory
> >> will be freed soon.
> >> This is not always true because it can be part of the huge vmalloc
> >> allocation. Allowed once, they will repeat over and over again.
> >> Moreover the lifetime of the allocated object can differ from
> >> the lifetime of the dying task.
> >
> > Can't we add a fatal_signal_pending(current) test to the vmalloc() loop?

We can and we should.

> 1) This has been done in the past but was reverted later.

The reason for that should be addressed IIRC.

> 2) Any vmalloc changes will affect non-memcg allocations too.
> If we're doing memcg-related checks, it's better to do them in one place.

I think those two things are just orthogonal. Bailing out from vmalloc
early sounds reasonable to me on its own. Allocating a large thing that
is likely to go away with the allocating context is just a waste of
resources and a potential source of disruption to others.

> 3) It is not a vmalloc-only issue. A huge number of kmalloc page allocations
> from N concurrent threads will lead to the same problem.

Agreed. I do not think it is viable to sprinkle fatal_signal_pending or
similar checks all over the code. This should better be limited to
allocators and the charging function. The assumption described in the
code simply doesn't hold, and it is close to impossible to check all
charging paths to bail out properly, so I think we should just drop that
optimistic attitude and not force charges unless that is absolutely
necessary (e.g. __GFP_NOFAIL) or impractical (e.g. atomic allocations),
and keep those bounded.

I didn't get to review the patch yet. This is a tricky area with many
unobvious corner cases. I will try to find some time next week.

Thanks!
On 9/10/21 3:39 PM, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 389b5766e74f..67195fcfbddf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_ATOMIC)
>  		goto force;
>
> -	/*
> -	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage. Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> -	 */
> -	if (unlikely(should_force_charge()))
> -		goto force;
> -

Should we perhaps keep the current behaviour for the (current->flags & PF_EXITING) case?
It is set inside do_exit() only and (I hope) cannot trigger huge vmalloc allocations.

Thank you,
	Vasily Averin
On 9/10/21 5:55 PM, Michal Hocko wrote:
> On Fri 10-09-21 16:20:58, Vasily Averin wrote:
>> On 9/10/21 4:04 PM, Tetsuo Handa wrote:
>>> Can't we add a fatal_signal_pending(current) test to the vmalloc() loop?
>
> We can and we should.
>
>> 1) This has been done in the past but was reverted later.
>
> The reason for that should be addressed IIRC.

I don't know the details of this, and I need some time to investigate it.

>> 2) Any vmalloc changes will affect non-memcg allocations too.
>> If we're doing memcg-related checks, it's better to do them in one place.
>
> I think those two things are just orthogonal. Bailing out from vmalloc
> early sounds reasonable to me on its own. Allocating a large thing that
> is likely to go away with the allocating context is just a waste of
> resources and a potential source of disruption to others.

I doubt that a fatal signal should block any vmalloc allocation.
I assume there are situations where the rollback of some cancelled operation
uses vmalloc, or where saving a coredump to some remote storage uses vmalloc.

However, for me it is abnormal that even the OOM-killer cannot cancel a huge
vmalloc allocation. So I think a tsk_is_oom_victim(current) check should be
added to vm_area_alloc_pages() to break the vmalloc cycle.

Thank you,
	Vasily Averin
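[Editorial note: to illustrate the proposal above, here is a minimal sketch of the kind of bail-out being suggested for the vmalloc page-allocation loop. The surrounding function is a simplified stand-in: the real vm_area_alloc_pages() in mm/vmalloc.c has a different signature and also handles higher-order and bulk allocations, so treat this as an assumption-laden illustration rather than a patch.]

```c
#include <linux/gfp.h>
#include <linux/oom.h>
#include <linux/sched.h>

/*
 * Simplified stand-in for the vmalloc page-allocation loop; the name,
 * signature, and allocation details are illustrative only and are not
 * taken from mm/vmalloc.c.
 */
static unsigned int alloc_pages_loop_sketch(gfp_t gfp, int nid,
					    unsigned int nr_pages,
					    struct page **pages)
{
	unsigned int nr_allocated = 0;

	while (nr_allocated < nr_pages) {
		struct page *page;

		/*
		 * Proposed bail-out: stop feeding a huge allocation once
		 * the current task has been selected by the OOM killer.
		 */
		if (tsk_is_oom_victim(current))
			break;

		page = alloc_pages_node(nid, gfp, 0);
		if (!page)
			break;
		pages[nr_allocated++] = page;
	}

	return nr_allocated;
}
```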
On Mon 13-09-21 10:51:37, Vasily Averin wrote:
> On 9/10/21 3:39 PM, Vasily Averin wrote:
> > The kernel currently allows dying tasks to exceed the memcg limits.
> > The allocation is expected to be the last one and the occupied memory
> > will be freed soon.
> > This is not always true because it can be part of the huge vmalloc
> > allocation. Allowed once, they will repeat over and over again.
>
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 389b5766e74f..67195fcfbddf 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (gfp_mask & __GFP_ATOMIC)
> >  		goto force;
> >
> > -	/*
> > -	 * Unlike in global OOM situations, memcg is not in a physical
> > -	 * memory shortage. Allow dying and OOM-killed tasks to
> > -	 * bypass the last charges so that they can exit quickly and
> > -	 * free their memory.
> > -	 */
> > -	if (unlikely(should_force_charge()))
> > -		goto force;
> > -
>
> Should we perhaps keep the current behaviour for the (current->flags & PF_EXITING) case?

Why?

> It is set inside do_exit() only and (I hope) cannot trigger huge vmalloc allocations.

Allocations in this code path should be rare, but it is not like they are
non-existent. This is a rather hard-to-review area spread over many places,
so if we are deciding to make the existing model simpler (no bypassing) then
I would rather have no exceptions unless they are really necessary, and
document them if they are.
On Mon 13-09-21 11:29:37, Vasily Averin wrote:
> On 9/10/21 5:55 PM, Michal Hocko wrote:
> > On Fri 10-09-21 16:20:58, Vasily Averin wrote:
> >> On 9/10/21 4:04 PM, Tetsuo Handa wrote:
> >>> Can't we add a fatal_signal_pending(current) test to the vmalloc() loop?
> >
> > We can and we should.
> >
> >> 1) This has been done in the past but was reverted later.
> >
> > The reason for that should be addressed IIRC.
>
> I don't know the details of this, and I need some time to investigate it.

b8c8a338f75e ("Revert "vmalloc: back off when the current task is killed"")
should give a good insight and references.

> >> 2) Any vmalloc changes will affect non-memcg allocations too.
> >> If we're doing memcg-related checks, it's better to do them in one place.
> >
> > I think those two things are just orthogonal. Bailing out from vmalloc
> > early sounds reasonable to me on its own. Allocating a large thing that
> > is likely to go away with the allocating context is just a waste of
> > resources and a potential source of disruption to others.
>
> I doubt that a fatal signal should block any vmalloc allocation.
> I assume there are situations where the rollback of some cancelled operation
> uses vmalloc, or where saving a coredump to some remote storage uses vmalloc.

If there really are any such requirements then they should be properly documented.

> However, for me it is abnormal that even the OOM-killer cannot cancel a huge
> vmalloc allocation. So I think a tsk_is_oom_victim(current) check should be
> added to vm_area_alloc_pages() to break the vmalloc cycle.

Why should an OOM-killed task behave any differently from any other task
killed without a way to handle the signal?
On Fri 10-09-21 15:39:28, Vasily Averin wrote:
> The kernel currently allows dying tasks to exceed the memcg limits.
> The allocation is expected to be the last one and the occupied memory
> will be freed soon.
> This is not always true because it can be part of the huge vmalloc
> allocation. Allowed once, they will repeat over and over again.
> Moreover the lifetime of the allocated object can differ from
> the lifetime of the dying task.
> Multiple such allocations running concurrently can not only overuse
> the memcg limit, but can lead to a global out of memory and,
> in the worst case, cause the host to panic.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/memcontrol.c | 23 +++++------------------
>  1 file changed, 5 insertions(+), 18 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 389b5766e74f..67195fcfbddf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>  		return OOM_ASYNC;
>  	}
>
> +	if (should_force_charge())
> +		return OOM_SKIPPED;

mem_cgroup_out_of_memory() already checks for the bypass; now you are
duplicating that check with a different answer to the caller. This is
really messy. One of the two has to go away.
On 9/13/21 11:39 AM, Michal Hocko wrote:
> On Mon 13-09-21 10:51:37, Vasily Averin wrote:
>> On 9/10/21 3:39 PM, Vasily Averin wrote:
>>> The kernel currently allows dying tasks to exceed the memcg limits.
>>> The allocation is expected to be the last one and the occupied memory
>>> will be freed soon.
>>> This is not always true because it can be part of the huge vmalloc
>>> allocation. Allowed once, they will repeat over and over again.
>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 389b5766e74f..67195fcfbddf 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  	if (gfp_mask & __GFP_ATOMIC)
>>>  		goto force;
>>>
>>> -	/*
>>> -	 * Unlike in global OOM situations, memcg is not in a physical
>>> -	 * memory shortage. Allow dying and OOM-killed tasks to
>>> -	 * bypass the last charges so that they can exit quickly and
>>> -	 * free their memory.
>>> -	 */
>>> -	if (unlikely(should_force_charge()))
>>> -		goto force;
>>> -
>>
>> Should we perhaps keep the current behaviour for the (current->flags & PF_EXITING) case?
>
> Why?

At this stage the task really dies and mostly releases the resources it has taken.
It can still allocate, though, and such an allocation can reach the memcg limit
due to the activity of parallel memcg threads.

Nothing bad should happen if we reject this allocation, because the same thing
can happen in the non-memcg case too. However, I doubt misuse is possible here,
and we have the possibility to allow a graceful shutdown.

In other words: we are not obliged to allow such allocations, but we CAN do it
because we hope that it is safe and cannot be misused.

>> It is set inside do_exit() only and (I hope) cannot trigger huge vmalloc allocations.
>
> Allocations in this code path should be rare, but it is not like they are
> non-existent. This is a rather hard-to-review area spread over many places,
> so if we are deciding to make the existing model simpler (no bypassing) then
> I would rather have no exceptions unless they are really necessary, and
> document them if they are.

Thank you,
	Vasily Averin
On Mon 13-09-21 12:37:56, Vasily Averin wrote:
> On 9/13/21 11:39 AM, Michal Hocko wrote:
> > On Mon 13-09-21 10:51:37, Vasily Averin wrote:
> >> On 9/10/21 3:39 PM, Vasily Averin wrote:
> >>> The kernel currently allows dying tasks to exceed the memcg limits.
> >>> The allocation is expected to be the last one and the occupied memory
> >>> will be freed soon.
> >>> This is not always true because it can be part of the huge vmalloc
> >>> allocation. Allowed once, they will repeat over and over again.
> >>
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index 389b5766e74f..67195fcfbddf 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>>  	if (gfp_mask & __GFP_ATOMIC)
> >>>  		goto force;
> >>>
> >>> -	/*
> >>> -	 * Unlike in global OOM situations, memcg is not in a physical
> >>> -	 * memory shortage. Allow dying and OOM-killed tasks to
> >>> -	 * bypass the last charges so that they can exit quickly and
> >>> -	 * free their memory.
> >>> -	 */
> >>> -	if (unlikely(should_force_charge()))
> >>> -		goto force;
> >>> -
> >>
> >> Should we perhaps keep the current behaviour for the (current->flags & PF_EXITING) case?
> >
> > Why?
>
> At this stage the task really dies and mostly releases the resources it has taken.
> It can still allocate, though, and such an allocation can reach the memcg limit
> due to the activity of parallel memcg threads.
>
> Nothing bad should happen if we reject this allocation, because the same thing
> can happen in the non-memcg case too. However, I doubt misuse is possible here,
> and we have the possibility to allow a graceful shutdown.
>
> In other words: we are not obliged to allow such allocations, but we CAN do it
> because we hope that it is safe and cannot be misused.

This is a lot of hoping that has turned out to be a bad strategy in the
existing code. So let's stop hoping, and if we are shown that an exit path
really benefits from special treatment then we can add it with good
reasoning rather than "we hope it's gonna be ok".
On 9/13/21 11:53 AM, Michal Hocko wrote:
> On Fri 10-09-21 15:39:28, Vasily Averin wrote:
>> The kernel currently allows dying tasks to exceed the memcg limits.
>> The allocation is expected to be the last one and the occupied memory
>> will be freed soon.
>> This is not always true because it can be part of the huge vmalloc
>> allocation. Allowed once, they will repeat over and over again.
>> Moreover the lifetime of the allocated object can differ from
>> the lifetime of the dying task.
>> Multiple such allocations running concurrently can not only overuse
>> the memcg limit, but can lead to a global out of memory and,
>> in the worst case, cause the host to panic.
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>>  mm/memcontrol.c | 23 +++++------------------
>>  1 file changed, 5 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 389b5766e74f..67195fcfbddf 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>>  		return OOM_ASYNC;
>>  	}
>>
>> +	if (should_force_charge())
>> +		return OOM_SKIPPED;
>
> mem_cgroup_out_of_memory() already checks for the bypass; now you are
> duplicating that check with a different answer to the caller. This is
> really messy. One of the two has to go away.

In this case mem_cgroup_out_of_memory() takes locks and mutexes but does
nothing useful, and its success causes try_charge_memcg() to repeat the
loop unnecessarily.

I cannot change the mem_cgroup_out_of_memory() internals, because it is used
in other places too. The check inside mem_cgroup_out_of_memory() is required
because the situation can change after the check added into mem_cgroup_oom().

Though I take your point and will think about how to improve the patch.
Anyway, we'll need to do something with the name of the should_force_charge()
function, which will NOT lead to a forced charge anymore.

Thank you,
	Vasily Averin
On Mon 13-09-21 13:35:00, Vasily Averin wrote:
> On 9/13/21 11:53 AM, Michal Hocko wrote:
> > On Fri 10-09-21 15:39:28, Vasily Averin wrote:
> >> The kernel currently allows dying tasks to exceed the memcg limits.
> >> The allocation is expected to be the last one and the occupied memory
> >> will be freed soon.
> >> This is not always true because it can be part of the huge vmalloc
> >> allocation. Allowed once, they will repeat over and over again.
> >> Moreover the lifetime of the allocated object can differ from
> >> the lifetime of the dying task.
> >> Multiple such allocations running concurrently can not only overuse
> >> the memcg limit, but can lead to a global out of memory and,
> >> in the worst case, cause the host to panic.
> >>
> >> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> >> ---
> >>  mm/memcontrol.c | 23 +++++------------------
> >>  1 file changed, 5 insertions(+), 18 deletions(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 389b5766e74f..67195fcfbddf 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
> >>  		return OOM_ASYNC;
> >>  	}
> >>
> >> +	if (should_force_charge())
> >> +		return OOM_SKIPPED;
> >
> > mem_cgroup_out_of_memory() already checks for the bypass; now you are
> > duplicating that check with a different answer to the caller. This is
> > really messy. One of the two has to go away.
>
> In this case mem_cgroup_out_of_memory() takes locks and mutexes but does
> nothing useful, and its success causes try_charge_memcg() to repeat the
> loop unnecessarily.
>
> I cannot change the mem_cgroup_out_of_memory() internals, because it is used
> in other places too. The check inside mem_cgroup_out_of_memory() is required
> because the situation can change after the check added into mem_cgroup_oom().
>
> Though I take your point and will think about how to improve the patch.
> Anyway, we'll need to do something with the name of the should_force_charge()
> function, which will NOT lead to a forced charge anymore.

Here is what I would do.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 702a81dfe72d..58269721d438 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2588,6 +2588,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2622,15 +2623,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage. Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2688,8 +2680,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && should_force_charge())
+		goto nomem;
 
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2698,14 +2691,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
On 9/13/21 1:55 PM, Michal Hocko wrote:
> Here is what I would do.
Your patch fixes the problem. I just want to add a rename of the
should_force_charge() helper to task_is_dying(), and will submit a v2 version soon.
I think this patch is not optimal; however, its improvements can be discussed later.
Thank you,
Vasily Averin
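[Editorial note: a rough sketch of the rename mentioned above, an assumption about what v2 will do based only on this thread — the helper's body is expected to stay the same while the name describes what it actually checks:]

```c
/*
 * Hypothetical v2 shape of the helper: same checks as
 * should_force_charge(), renamed so that callers read as
 * "is this task dying?" instead of "should we force the charge?".
 */
static bool task_is_dying(void)
{
	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
		(current->flags & PF_EXITING);
}
```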
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 389b5766e74f..67195fcfbddf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1834,6 +1834,9 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 		return OOM_ASYNC;
 	}
 
+	if (should_force_charge())
+		return OOM_SKIPPED;
+
 	mem_cgroup_mark_under_oom(memcg);
 
 	locked = mem_cgroup_oom_trylock(memcg);
@@ -2622,15 +2625,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage. Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2688,9 +2682,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
-
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
 	 * a forward progress or bypass the charge if the oom killer
@@ -2698,15 +2689,11 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
+	} else if (oom_status == OOM_FAILED)
 		goto force;
-	default:
-		goto nomem;
-	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
The kernel currently allows dying tasks to exceed the memcg limits.
The allocation is expected to be the last one and the occupied memory
will be freed soon.
This is not always true because it can be part of the huge vmalloc
allocation. Allowed once, they will repeat over and over again.
Moreover the lifetime of the allocated object can differ from
the lifetime of the dying task.
Multiple such allocations running concurrently can not only overuse
the memcg limit, but can lead to a global out of memory and,
in the worst case, cause the host to panic.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c | 23 +++++------------------
 1 file changed, 5 insertions(+), 18 deletions(-)