Message ID | 20210607163103.632681-1-atomlin@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] mm/oom_kill: allow oom kill allocating task for non-global case |
On 6/7/21 12:31 PM, Aaron Tomlin wrote: > At the present time, in the context of memcg OOM, even when > sysctl_oom_kill_allocating_task is enabled/or set, the "allocating" > task cannot be selected, as a target for the OOM killer. > > This patch removes the restriction entirely. > > Signed-off-by: Aaron Tomlin <atomlin@redhat.com> > --- > mm/oom_kill.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index eefd3f5fde46..3bae33e2d9c2 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc) > oc->nodemask = NULL; > check_panic_on_oom(oc); > > - if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task && > - current->mm && !oom_unkillable_task(current) && > - oom_cpuset_eligible(current, oc) && > + if (sysctl_oom_kill_allocating_task && current->mm && > + !oom_unkillable_task(current) && > + oom_cpuset_eligible(current, oc) && > current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { > get_task_struct(current); > oc->chosen = current; To provide more context for this patch: we are actually seeing this in a customer report about an OOM that happened in a container where the dominating task used up most of the memory, and it happened to be the task that triggered the OOM, with the result that no killable process could be found. I don't see a reason why this should be limited to a global OOM only. Acked-by: Waiman Long <longman@redhat.com> Cheers, Longman
On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote: > > On 6/7/21 12:31 PM, Aaron Tomlin wrote: > > At the present time, in the context of memcg OOM, even when > > sysctl_oom_kill_allocating_task is enabled/or set, the "allocating" > > task cannot be selected, as a target for the OOM killer. > > > > This patch removes the restriction entirely. > > > > Signed-off-by: Aaron Tomlin <atomlin@redhat.com> > > --- > > mm/oom_kill.c | 6 +++--- > > 1 file changed, 3 insertions(+), 3 deletions(-) > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index eefd3f5fde46..3bae33e2d9c2 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc) > > oc->nodemask = NULL; > > check_panic_on_oom(oc); > > > > - if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task && > > - current->mm && !oom_unkillable_task(current) && > > - oom_cpuset_eligible(current, oc) && > > + if (sysctl_oom_kill_allocating_task && current->mm && > > + !oom_unkillable_task(current) && > > + oom_cpuset_eligible(current, oc) && > > current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { > > get_task_struct(current); > > oc->chosen = current; > > To provide more context for this patch, we are actually seeing that in a > customer report about OOM happened in a container where the dominating > task used up most of the memory and it happened to be the task that > triggered the OOM with the result that no killable process could be > found. Why was there no killable process? What about the process allocating the memory or is this remote memcg charging? > I don't see a reason why this should be limited to a global OOM only. > > Acked-by: Waiman Long <longman@redhat.com> > > Cheers, > Longman >
On 6/7/21 2:43 PM, Shakeel Butt wrote: > On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote: >> On 6/7/21 12:31 PM, Aaron Tomlin wrote: >>> At the present time, in the context of memcg OOM, even when >>> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating" >>> task cannot be selected, as a target for the OOM killer. >>> >>> This patch removes the restriction entirely. >>> >>> Signed-off-by: Aaron Tomlin <atomlin@redhat.com> >>> --- >>> mm/oom_kill.c | 6 +++--- >>> 1 file changed, 3 insertions(+), 3 deletions(-) >>> >>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >>> index eefd3f5fde46..3bae33e2d9c2 100644 >>> --- a/mm/oom_kill.c >>> +++ b/mm/oom_kill.c >>> @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc) >>> oc->nodemask = NULL; >>> check_panic_on_oom(oc); >>> >>> - if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task && >>> - current->mm && !oom_unkillable_task(current) && >>> - oom_cpuset_eligible(current, oc) && >>> + if (sysctl_oom_kill_allocating_task && current->mm && >>> + !oom_unkillable_task(current) && >>> + oom_cpuset_eligible(current, oc) && >>> current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { >>> get_task_struct(current); >>> oc->chosen = current; >> To provide more context for this patch, we are actually seeing that in a >> customer report about OOM happened in a container where the dominating >> task used up most of the memory and it happened to be the task that >> triggered the OOM with the result that no killable process could be >> found. > Why was there no killable process? What about the process allocating > the memory or is this remote memcg charging? It is because the other processes have an oom_score_adj of -1000, so they are non-killable. Anyway, they don't consume that much memory and killing them won't free up much. The process that uses most of the memory is the one that triggered the OOM kill in the first place, because the memory limit was reached during a new memory allocation. Based on the current logic, this process cannot be killed at all, even with oom_kill_allocating_task set to 1, if the OOM happens only within the memcg context rather than in a global OOM situation. This patch is to allow this process to be killed under this circumstance. Cheers, Longman
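To make the -1000 point concrete: a task opts out of OOM killing by writing OOM_SCORE_ADJ_MIN (-1000) to its /proc/<pid>/oom_score_adj, which is what container runtimes typically do for helpers such as conmon. A minimal, purely illustrative userspace snippet (not from the thread; lowering the value requires CAP_SYS_RESOURCE):

#include <stdio.h>

/* Mark the calling process as ineligible for the OOM killer. */
int main(void)
{
	FILE *f = fopen("/proc/self/oom_score_adj", "w");

	if (!f) {
		perror("oom_score_adj");
		return 1;
	}
	fprintf(f, "%d\n", -1000);	/* OOM_SCORE_ADJ_MIN */
	return fclose(f) ? 1 : 0;
}

Tasks marked this way are treated as ineligible by oom_badness() and never show up as usable OOM-kill candidates.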
On Mon 07-06-21 17:31:03, Aaron Tomlin wrote: > At the present time, in the context of memcg OOM, even when > sysctl_oom_kill_allocating_task is enabled/or set, the "allocating" > task cannot be selected, as a target for the OOM killer. > > This patch removes the restriction entirely. This is a global oom policy not a memcg specific one so a historical behavior would change. So I do not think we can change that. The policy can be implemented on the memcg level but this would require a much more detailed explanation of the usecase and the semantic (e.g. wrt. hierarchical behavior etc). > Signed-off-by: Aaron Tomlin <atomlin@redhat.com> > --- > mm/oom_kill.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index eefd3f5fde46..3bae33e2d9c2 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc) > oc->nodemask = NULL; > check_panic_on_oom(oc); > > - if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task && > - current->mm && !oom_unkillable_task(current) && > - oom_cpuset_eligible(current, oc) && > + if (sysctl_oom_kill_allocating_task && current->mm && > + !oom_unkillable_task(current) && > + oom_cpuset_eligible(current, oc) && > current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { > get_task_struct(current); > oc->chosen = current; > -- > 2.26.3
On Mon 07-06-21 14:51:05, Waiman Long wrote: > On 6/7/21 2:43 PM, Shakeel Butt wrote: > > On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote: > > > On 6/7/21 12:31 PM, Aaron Tomlin wrote: > > > > At the present time, in the context of memcg OOM, even when > > > > sysctl_oom_kill_allocating_task is enabled/or set, the "allocating" > > > > task cannot be selected, as a target for the OOM killer. > > > > > > > > This patch removes the restriction entirely. > > > > > > > > Signed-off-by: Aaron Tomlin <atomlin@redhat.com> > > > > --- > > > > mm/oom_kill.c | 6 +++--- > > > > 1 file changed, 3 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > > > index eefd3f5fde46..3bae33e2d9c2 100644 > > > > --- a/mm/oom_kill.c > > > > +++ b/mm/oom_kill.c > > > > @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc) > > > > oc->nodemask = NULL; > > > > check_panic_on_oom(oc); > > > > > > > > - if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task && > > > > - current->mm && !oom_unkillable_task(current) && > > > > - oom_cpuset_eligible(current, oc) && > > > > + if (sysctl_oom_kill_allocating_task && current->mm && > > > > + !oom_unkillable_task(current) && > > > > + oom_cpuset_eligible(current, oc) && > > > > current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { > > > > get_task_struct(current); > > > > oc->chosen = current; > > > To provide more context for this patch, we are actually seeing that in a > > > customer report about OOM happened in a container where the dominating > > > task used up most of the memory and it happened to be the task that > > > triggered the OOM with the result that no killable process could be > > > found. > > Why was there no killable process? What about the process allocating > > the memory or is this remote memcg charging? > > It is because the other processes have a oom_adjust_score of -1000. So they > are non-killable. Anyway, they don't consume that much memory and killing > them won't free up that much. > > The other process that uses most of the memory is the one that trigger the > OOM kill in the first place because the memory limit has been reached in new > memory allocation. Based on the current logic, this process cannot be killed > at all even if we set the oom_kill_allocating_task to 1 if the OOM happens > only within the memcg context, not in a global OOM situation. This patch is > to allow this process to be killed under this circumstance. Do you have the oom report? I do not see why the allocating task hasn't been chosen.
On Mon, Jun 7, 2021 at 11:51 AM Waiman Long <llong@redhat.com> wrote: > > On 6/7/21 2:43 PM, Shakeel Butt wrote: > > On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote: > >> On 6/7/21 12:31 PM, Aaron Tomlin wrote: > >>> At the present time, in the context of memcg OOM, even when > >>> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating" > >>> task cannot be selected, as a target for the OOM killer. > >>> > >>> This patch removes the restriction entirely. > >>> > >>> Signed-off-by: Aaron Tomlin <atomlin@redhat.com> > >>> --- > >>> mm/oom_kill.c | 6 +++--- > >>> 1 file changed, 3 insertions(+), 3 deletions(-) > >>> > >>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c > >>> index eefd3f5fde46..3bae33e2d9c2 100644 > >>> --- a/mm/oom_kill.c > >>> +++ b/mm/oom_kill.c > >>> @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc) > >>> oc->nodemask = NULL; > >>> check_panic_on_oom(oc); > >>> > >>> - if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task && > >>> - current->mm && !oom_unkillable_task(current) && > >>> - oom_cpuset_eligible(current, oc) && > >>> + if (sysctl_oom_kill_allocating_task && current->mm && > >>> + !oom_unkillable_task(current) && > >>> + oom_cpuset_eligible(current, oc) && > >>> current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { > >>> get_task_struct(current); > >>> oc->chosen = current; > >> To provide more context for this patch, we are actually seeing that in a > >> customer report about OOM happened in a container where the dominating > >> task used up most of the memory and it happened to be the task that > >> triggered the OOM with the result that no killable process could be > >> found. > > Why was there no killable process? What about the process allocating > > the memory or is this remote memcg charging? > > It is because the other processes have a oom_adjust_score of -1000. So > they are non-killable. Anyway, they don't consume that much memory and > killing them won't free up that much. > > The other process that uses most of the memory is the one that trigger > the OOM kill in the first place because the memory limit has been > reached in new memory allocation. Based on the current logic, this > process cannot be killed at all even if we set the > oom_kill_allocating_task to 1 if the OOM happens only within the memcg > context, not in a global OOM situation. I am not really against the patch but I am still not able to understand why select_bad_process() was not able to select the current process. mem_cgroup_scan_tasks() traverses all the processes in the target memcg hierarchy, so why the current was skipped. > This patch is to allow this > process to be killed under this circumstance. > > Cheers, > Longman >
On 6/7/21 3:04 PM, Michal Hocko wrote: > On Mon 07-06-21 14:51:05, Waiman Long wrote: >> On 6/7/21 2:43 PM, Shakeel Butt wrote: >>> On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote: >>>> On 6/7/21 12:31 PM, Aaron Tomlin wrote: >>>>> At the present time, in the context of memcg OOM, even when >>>>> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating" >>>>> task cannot be selected, as a target for the OOM killer. >>>>> >>>>> This patch removes the restriction entirely. >>>>> >>>>> Signed-off-by: Aaron Tomlin <atomlin@redhat.com> >>>>> --- >>>>> mm/oom_kill.c | 6 +++--- >>>>> 1 file changed, 3 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >>>>> index eefd3f5fde46..3bae33e2d9c2 100644 >>>>> --- a/mm/oom_kill.c >>>>> +++ b/mm/oom_kill.c >>>>> @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc) >>>>> oc->nodemask = NULL; >>>>> check_panic_on_oom(oc); >>>>> >>>>> - if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task && >>>>> - current->mm && !oom_unkillable_task(current) && >>>>> - oom_cpuset_eligible(current, oc) && >>>>> + if (sysctl_oom_kill_allocating_task && current->mm && >>>>> + !oom_unkillable_task(current) && >>>>> + oom_cpuset_eligible(current, oc) && >>>>> current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { >>>>> get_task_struct(current); >>>>> oc->chosen = current; >>>> To provide more context for this patch, we are actually seeing that in a >>>> customer report about OOM happened in a container where the dominating >>>> task used up most of the memory and it happened to be the task that >>>> triggered the OOM with the result that no killable process could be >>>> found. >>> Why was there no killable process? What about the process allocating >>> the memory or is this remote memcg charging? >> It is because the other processes have a oom_adjust_score of -1000. So they >> are non-killable. Anyway, they don't consume that much memory and killing >> them won't free up that much. >> >> The other process that uses most of the memory is the one that trigger the >> OOM kill in the first place because the memory limit has been reached in new >> memory allocation. Based on the current logic, this process cannot be killed >> at all even if we set the oom_kill_allocating_task to 1 if the OOM happens >> only within the memcg context, not in a global OOM situation. This patch is >> to allow this process to be killed under this circumstance. > Do you have the oom report? I do not see why the allocating task hasn't > been chosen. A partial OOM report below: [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116 : [ 8227.239769] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [ 8227.242495] [1611298] 0 1611298 35869 635 167936 0 -1000 conmon [ 8227.242518] [1702509] 0 1702509 35869 701 176128 0 -1000 conmon [ 8227.242522] [1703345] 1001050000 1703294 183440 0 2125824 0 999 node [ 8227.242706] Out of memory and no killable processes... [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999 [ 8227.242732] node cpuset=crio-b8ac7e23f7b520c0365461defb66738231918243586e287bfb9e206bb3a0227a.scope mems_allowed=0-1 So in this case, node cannot kill itself and no other processes are available to be killed. Cheers, Longman
On 6/7/21 3:01 PM, Michal Hocko wrote: > On Mon 07-06-21 17:31:03, Aaron Tomlin wrote: >> At the present time, in the context of memcg OOM, even when >> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating" >> task cannot be selected, as a target for the OOM killer. >> >> This patch removes the restriction entirely. > This is a global oom policy not a memcg specific one so a historical > behavior would change. So I do not think we can change that. The policy > can be implemented on the memcg level but this would require a much more > detailed explanation of the usecase and the semantic (e.g. wrt. > hierarchical behavior etc). Maybe we can extend the meaning of oom_kill_allocating_task such that memcg OOM killing of the allocating task is only enabled when bit 1 is set; see the sketch below. An existing application that just sets oom_kill_allocating_task to 1 would then not be affected. Cheers, Longman
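A rough, purely illustrative sketch of that idea follows; the mask names and the helper do not exist in the kernel (today the sysctl is effectively a boolean), so this only shows the proposed semantics:

/*
 * Hypothetical bit-mask interpretation of sysctl_oom_kill_allocating_task:
 *   bit 0 - kill the allocating task for global OOMs (current meaning of "1")
 *   bit 1 - additionally allow it for memcg OOMs
 */
#define OOM_KILL_ALLOCATING_GLOBAL	0x1	/* made-up name */
#define OOM_KILL_ALLOCATING_MEMCG	0x2	/* made-up name */

static bool oom_may_kill_allocating_task(struct oom_control *oc)
{
	int mode = sysctl_oom_kill_allocating_task;

	if (is_memcg_oom(oc))
		return mode & OOM_KILL_ALLOCATING_MEMCG;
	return mode & OOM_KILL_ALLOCATING_GLOBAL;
}

out_of_memory() would then test this helper instead of the open-coded !is_memcg_oom(oc) && sysctl_oom_kill_allocating_task check, leaving existing users of "1" unaffected.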
On Mon 07-06-21 15:18:38, Waiman Long wrote: > On 6/7/21 3:04 PM, Michal Hocko wrote: > > On Mon 07-06-21 14:51:05, Waiman Long wrote: > > > On 6/7/21 2:43 PM, Shakeel Butt wrote: > > > > On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote: > > > > > On 6/7/21 12:31 PM, Aaron Tomlin wrote: > > > > > > At the present time, in the context of memcg OOM, even when > > > > > > sysctl_oom_kill_allocating_task is enabled/or set, the "allocating" > > > > > > task cannot be selected, as a target for the OOM killer. > > > > > > > > > > > > This patch removes the restriction entirely. > > > > > > > > > > > > Signed-off-by: Aaron Tomlin <atomlin@redhat.com> > > > > > > --- > > > > > > mm/oom_kill.c | 6 +++--- > > > > > > 1 file changed, 3 insertions(+), 3 deletions(-) > > > > > > > > > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > > > > > index eefd3f5fde46..3bae33e2d9c2 100644 > > > > > > --- a/mm/oom_kill.c > > > > > > +++ b/mm/oom_kill.c > > > > > > @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc) > > > > > > oc->nodemask = NULL; > > > > > > check_panic_on_oom(oc); > > > > > > > > > > > > - if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task && > > > > > > - current->mm && !oom_unkillable_task(current) && > > > > > > - oom_cpuset_eligible(current, oc) && > > > > > > + if (sysctl_oom_kill_allocating_task && current->mm && > > > > > > + !oom_unkillable_task(current) && > > > > > > + oom_cpuset_eligible(current, oc) && > > > > > > current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { > > > > > > get_task_struct(current); > > > > > > oc->chosen = current; > > > > > To provide more context for this patch, we are actually seeing that in a > > > > > customer report about OOM happened in a container where the dominating > > > > > task used up most of the memory and it happened to be the task that > > > > > triggered the OOM with the result that no killable process could be > > > > > found. > > > > Why was there no killable process? What about the process allocating > > > > the memory or is this remote memcg charging? > > > It is because the other processes have a oom_adjust_score of -1000. So they > > > are non-killable. Anyway, they don't consume that much memory and killing > > > them won't free up that much. > > > > > > The other process that uses most of the memory is the one that trigger the > > > OOM kill in the first place because the memory limit has been reached in new > > > memory allocation. Based on the current logic, this process cannot be killed > > > at all even if we set the oom_kill_allocating_task to 1 if the OOM happens > > > only within the memcg context, not in a global OOM situation. This patch is > > > to allow this process to be killed under this circumstance. > > Do you have the oom report? I do not see why the allocating task hasn't > > been chosen. > > A partial OOM report below: Do you happen to have the full report? > [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116 > : > [ 8227.239769] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name > [ 8227.242495] [1611298] 0 1611298 35869 635 167936 0 -1000 conmon > [ 8227.242518] [1702509] 0 1702509 35869 701 176128 0 -1000 conmon > [ 8227.242522] [1703345] 1001050000 1703294 183440 0 2125824 0 999 node > [ 8227.242706] Out of memory and no killable processes... 
> [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999 > [ 8227.242732] node cpuset=crio-b8ac7e23f7b520c0365461defb66738231918243586e287bfb9e206bb3a0227a.scope mems_allowed=0-1 > > So in this case, node cannot kill itself and no other processes are > available to be killed. The process is clearly listed as eligible, so the oom killer should find it; if it hasn't, this should be investigated. Which kernel is this?
On Mon 07-06-21 15:26:00, Waiman Long wrote: > On 6/7/21 3:01 PM, Michal Hocko wrote: > > On Mon 07-06-21 17:31:03, Aaron Tomlin wrote: > > > At the present time, in the context of memcg OOM, even when > > > sysctl_oom_kill_allocating_task is enabled/or set, the "allocating" > > > task cannot be selected, as a target for the OOM killer. > > > > > > This patch removes the restriction entirely. > > This is a global oom policy not a memcg specific one so a historical > > behavior would change. So I do not think we can change that. The policy > > can be implemented on the memcg level but this would require a much more > > detailed explanation of the usecase and the semantic (e.g. wrt. > > hierarchical behavior etc). > > Maybe we can extend the meaning of oom_kill_allocating_task such that memcg > OOM killing of allocating task is only enabled when bit 1 is set. So if an > existing application just set oom_kill_allocating_task to 1, it will not be > impacted. panic_on_oom already allows applying an originally global policy at the memcg level, so if anything this policy should follow the same interface. Still, I think what you are seeing is either a bug or something else (e.g. the task being migrated while the oom is ongoing), and this should be properly investigated and explained. We cannot simply paper it over by telling people to use oom_kill_allocating_task to work around it. If there is a real usecase for such a policy for memcg oom killing, it can be discussed, of course.
On Mon 07-06-21 21:36:47, Michal Hocko wrote: > On Mon 07-06-21 15:18:38, Waiman Long wrote: [...] > > A partial OOM report below: > > Do you happen to have the full report? > > > [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116 > > : > > [ 8227.239769] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name > > [ 8227.242495] [1611298] 0 1611298 35869 635 167936 0 -1000 conmon > > [ 8227.242518] [1702509] 0 1702509 35869 701 176128 0 -1000 conmon > > [ 8227.242522] [1703345] 1001050000 1703294 183440 0 2125824 0 999 node > > [ 8227.242706] Out of memory and no killable processes... > > [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999 Btw it is surprising to not see _GFP_ACCOUNT here.
On 6/7/21 3:04 PM, Shakeel Butt wrote: > On Mon, Jun 7, 2021 at 11:51 AM Waiman Long <llong@redhat.com> wrote: >> On 6/7/21 2:43 PM, Shakeel Butt wrote: >>> On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote: >>>> On 6/7/21 12:31 PM, Aaron Tomlin wrote: >>>>> At the present time, in the context of memcg OOM, even when >>>>> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating" >>>>> task cannot be selected, as a target for the OOM killer. >>>>> >>>>> This patch removes the restriction entirely. >>>>> >>>>> Signed-off-by: Aaron Tomlin <atomlin@redhat.com> >>>>> --- >>>>> mm/oom_kill.c | 6 +++--- >>>>> 1 file changed, 3 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >>>>> index eefd3f5fde46..3bae33e2d9c2 100644 >>>>> --- a/mm/oom_kill.c >>>>> +++ b/mm/oom_kill.c >>>>> @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc) >>>>> oc->nodemask = NULL; >>>>> check_panic_on_oom(oc); >>>>> >>>>> - if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task && >>>>> - current->mm && !oom_unkillable_task(current) && >>>>> - oom_cpuset_eligible(current, oc) && >>>>> + if (sysctl_oom_kill_allocating_task && current->mm && >>>>> + !oom_unkillable_task(current) && >>>>> + oom_cpuset_eligible(current, oc) && >>>>> current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { >>>>> get_task_struct(current); >>>>> oc->chosen = current; >>>> To provide more context for this patch, we are actually seeing that in a >>>> customer report about OOM happened in a container where the dominating >>>> task used up most of the memory and it happened to be the task that >>>> triggered the OOM with the result that no killable process could be >>>> found. >>> Why was there no killable process? What about the process allocating >>> the memory or is this remote memcg charging? >> It is because the other processes have a oom_adjust_score of -1000. So >> they are non-killable. Anyway, they don't consume that much memory and >> killing them won't free up that much. >> >> The other process that uses most of the memory is the one that trigger >> the OOM kill in the first place because the memory limit has been >> reached in new memory allocation. Based on the current logic, this >> process cannot be killed at all even if we set the >> oom_kill_allocating_task to 1 if the OOM happens only within the memcg >> context, not in a global OOM situation. > I am not really against the patch but I am still not able to > understand why select_bad_process() was not able to select the current > process. mem_cgroup_scan_tasks() traverses all the processes in the > target memcg hierarchy, so why the current was skipped. Yes, you are right. Probably there is some problem with reaping so that the MMF_OOM_SKIP bit gets set. I don't have the answer yet. Regards, Longman
On 6/7/21 3:36 PM, Michal Hocko wrote: > On Mon 07-06-21 15:18:38, Waiman Long wrote: >> On 6/7/21 3:04 PM, Michal Hocko wrote: >>> On Mon 07-06-21 14:51:05, Waiman Long wrote: >>>> On 6/7/21 2:43 PM, Shakeel Butt wrote: >>>>> On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote: >>>>>> On 6/7/21 12:31 PM, Aaron Tomlin wrote: >>>>>>> At the present time, in the context of memcg OOM, even when >>>>>>> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating" >>>>>>> task cannot be selected, as a target for the OOM killer. >>>>>>> >>>>>>> This patch removes the restriction entirely. >>>>>>> >>>>>>> Signed-off-by: Aaron Tomlin <atomlin@redhat.com> >>>>>>> --- >>>>>>> mm/oom_kill.c | 6 +++--- >>>>>>> 1 file changed, 3 insertions(+), 3 deletions(-) >>>>>>> >>>>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >>>>>>> index eefd3f5fde46..3bae33e2d9c2 100644 >>>>>>> --- a/mm/oom_kill.c >>>>>>> +++ b/mm/oom_kill.c >>>>>>> @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc) >>>>>>> oc->nodemask = NULL; >>>>>>> check_panic_on_oom(oc); >>>>>>> >>>>>>> - if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task && >>>>>>> - current->mm && !oom_unkillable_task(current) && >>>>>>> - oom_cpuset_eligible(current, oc) && >>>>>>> + if (sysctl_oom_kill_allocating_task && current->mm && >>>>>>> + !oom_unkillable_task(current) && >>>>>>> + oom_cpuset_eligible(current, oc) && >>>>>>> current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) { >>>>>>> get_task_struct(current); >>>>>>> oc->chosen = current; >>>>>> To provide more context for this patch, we are actually seeing that in a >>>>>> customer report about OOM happened in a container where the dominating >>>>>> task used up most of the memory and it happened to be the task that >>>>>> triggered the OOM with the result that no killable process could be >>>>>> found. >>>>> Why was there no killable process? What about the process allocating >>>>> the memory or is this remote memcg charging? >>>> It is because the other processes have a oom_adjust_score of -1000. So they >>>> are non-killable. Anyway, they don't consume that much memory and killing >>>> them won't free up that much. >>>> >>>> The other process that uses most of the memory is the one that trigger the >>>> OOM kill in the first place because the memory limit has been reached in new >>>> memory allocation. Based on the current logic, this process cannot be killed >>>> at all even if we set the oom_kill_allocating_task to 1 if the OOM happens >>>> only within the memcg context, not in a global OOM situation. This patch is >>>> to allow this process to be killed under this circumstance. >>> Do you have the oom report? I do not see why the allocating task hasn't >>> been chosen. >> A partial OOM report below: > Do you happen to have the full report? I need to ask to see if I can release the full report. > >> [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116 >> : >> [ 8227.239769] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name >> [ 8227.242495] [1611298] 0 1611298 35869 635 167936 0 -1000 conmon >> [ 8227.242518] [1702509] 0 1702509 35869 701 176128 0 -1000 conmon >> [ 8227.242522] [1703345] 1001050000 1703294 183440 0 2125824 0 999 node >> [ 8227.242706] Out of memory and no killable processes... 
>> [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999 >> [ 8227.242732] node cpuset=crio-b8ac7e23f7b520c0365461defb66738231918243586e287bfb9e206bb3a0227a.scope mems_allowed=0-1 >> >> So in this case, node cannot kill itself and no other processes are >> available to be killed. > The process is clearly listed as eligible so the oom killer should find > it and if it hasn't then this should be investigated. Which kernel is > this? Right. I don't know why current cannot be selected. I think we may need to enhance the "Out of memory and no killable processes" report to list the reason a task is skipped, beyond its oom_score_adj. The kernel is a RHEL 8.2 kernel, which has OOM code pretty close to upstream. Cheers, Longman
On 6/7/21 4:03 PM, Michal Hocko wrote: > On Mon 07-06-21 21:36:47, Michal Hocko wrote: >> On Mon 07-06-21 15:18:38, Waiman Long wrote: > [...] >>> A partial OOM report below: >> Do you happen to have the full report? >> >>> [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116 >>> : >>> [ 8227.239769] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name >>> [ 8227.242495] [1611298] 0 1611298 35869 635 167936 0 -1000 conmon >>> [ 8227.242518] [1702509] 0 1702509 35869 701 176128 0 -1000 conmon >>> [ 8227.242522] [1703345] 1001050000 1703294 183440 0 2125824 0 999 node >>> [ 8227.242706] Out of memory and no killable processes... >>> [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999 > Btw it is surprising to not see _GFP_ACCOUNT here. There are a number of OOM kills in the kernel log and none of the tasks that invoke oom-killer has _GFP_ACCOUNT flag set. Regards, Longman
On Mon 2021-06-07 16:42 -0400, Waiman Long wrote: > Right. I don't know why the current cannot be selected. I think we may need > to enhance the OOM but no killable process report to list the reason a task > is skipped other than oom_score_adj. Actually, this might be a good idea.
On Mon 2021-06-07 21:01 +0200, Michal Hocko wrote: > On Mon 07-06-21 17:31:03, Aaron Tomlin wrote: > This is a global oom policy not a memcg specific one so a historical > behavior would change. So I do not think we can change that. I see. Fair enough.
On Mon 07-06-21 16:44:09, Waiman Long wrote: > On 6/7/21 4:03 PM, Michal Hocko wrote: > > On Mon 07-06-21 21:36:47, Michal Hocko wrote: > > > On Mon 07-06-21 15:18:38, Waiman Long wrote: > > [...] > > > > A partial OOM report below: > > > Do you happen to have the full report? > > > > > > > [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116 > > > > : > > > > [ 8227.239769] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name > > > > [ 8227.242495] [1611298] 0 1611298 35869 635 167936 0 -1000 conmon > > > > [ 8227.242518] [1702509] 0 1702509 35869 701 176128 0 -1000 conmon > > > > [ 8227.242522] [1703345] 1001050000 1703294 183440 0 2125824 0 999 node > > > > [ 8227.242706] Out of memory and no killable processes... > > > > [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999 > > Btw it is surprising to not see _GFP_ACCOUNT here. > > There are a number of OOM kills in the kernel log and none of the tasks that > invoke oom-killer has _GFP_ACCOUNT flag set. OK. A full report (including the backtrace) would tell us more what is the source of the charge. I thought that most #PF charging paths use the same gfp mask as the allocation (which would include other flags on top of GFP_KERNEL) but it seems we just use GFP_KERNEL at many places. There are also some direct callers of the charging API for kernel allocations. Not that this would be super important but it caught my attention. You are saying that there are other OOM kills going on. Are they all for the same memcg? Is it possible the only eligible task has been killed and oom reaped already?
On Tue 2021-06-08 08:22 +0200, Michal Hocko wrote: > You are saying that there are other OOM kills going on. Yes. > Are they all for the same memcg? No. > Is it possible the only eligible task has been killed and oom reaped > already? Unfortunately, as mentioned by Waiman, no other task is suitable for "selection" due to the OOM score (i.e. OOM_SCORE_ADJ_MIN); that said, I need access to the vmcore to check e.g. mm->flags - I'll get back to you as soon as I can. Kind regards,
On Tue 2021-06-08 08:22 +0200, Michal Hocko wrote: > OK. A full report (including the backtrace) would tell us more what is > the source of the charge. I thought that most #PF charging paths use the > same gfp mask as the allocation (which would include other flags on top > of GFP_KERNEL) but it seems we just use GFP_KERNEL at many places. The following is what I can provide for now:

[ 8227.242706] Out of memory and no killable processes...
[ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999
[ 8227.242732] node cpuset=XXXX mems_allowed=0-1
[ 8227.242736] CPU: 12 PID: 1703347 Comm: node Kdump: loaded Not tainted 4.18.0-193.51.1.el8_2.x86_64 #1
[ 8227.242737] Hardware name: XXXX
[ 8227.242738] Call Trace:
[ 8227.242746] dump_stack+0x5c/0x80
[ 8227.242751] dump_header+0x6e/0x27a
[ 8227.242753] out_of_memory.cold.31+0x39/0x8d
[ 8227.242756] mem_cgroup_out_of_memory+0x49/0x80
[ 8227.242758] try_charge+0x58c/0x780
[ 8227.242761] ? __alloc_pages_nodemask+0xef/0x280
[ 8227.242763] mem_cgroup_try_charge+0x8b/0x1a0
[ 8227.242764] mem_cgroup_try_charge_delay+0x1c/0x40
[ 8227.242767] do_anonymous_page+0xb5/0x360
[ 8227.242770] ? __switch_to_asm+0x35/0x70
[ 8227.242772] __handle_mm_fault+0x662/0x6a0
[ 8227.242774] handle_mm_fault+0xda/0x200
[ 8227.242778] __do_page_fault+0x22d/0x4e0
[ 8227.242780] do_page_fault+0x32/0x110
[ 8227.242782] ? page_fault+0x8/0x30
[ 8227.242783] page_fault+0x1e/0x30
On Tue 08-06-21 11:00:22, Aaron Tomlin wrote: > On Tue 2021-06-08 08:22 +0200, Michal Hocko wrote: > > OK. A full report (including the backtrace) would tell us more what is > > the source of the charge. I thought that most #PF charging paths use the > > same gfp mask as the allocation (which would include other flags on top > > of GFP_KERNEL) but it seems we just use GFP_KERNEL at many places. > > The following is what I can provide for now: > Let me add what we have from previous email > [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116 > : > [ 8227.239769] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name > [ 8227.242495] [1611298] 0 1611298 35869 635 167936 0 -1000 conmon > [ 8227.242518] [1702509] 0 1702509 35869 701 176128 0 -1000 conmon > [ 8227.242522] [1703345] 1001050000 1703294 183440 0 2125824 0 999 node > [ 8227.242706] Out of memory and no killable processes... I do not see this message to be ever printed on 4.18 for memcg oom: /* Found nothing?!?! Either we hang forever, or we panic. */ if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { dump_header(oc, NULL); panic("Out of memory and no killable processes...\n"); } So how come it got triggered here? Is it possible that there is a global oom killer somehow going on along with the memcg OOM? Because the below stack clearly points to a memcg OOM and a new one AFAICS. That being said, a full chain of oom events would be definitely useful to get a better idea. > [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999 > [ 8227.242732] node cpuset=XXXX mems_allowed=0-1 > [ 8227.242736] CPU: 12 PID: 1703347 Comm: node Kdump: loaded Not tainted 4.18.0-193.51.1.el8_2.x86_64 #1 > [ 8227.242737] Hardware name: XXXX > [ 8227.242738] Call Trace: > [ 8227.242746] dump_stack+0x5c/0x80 > [ 8227.242751] dump_header+0x6e/0x27a > [ 8227.242753] out_of_memory.cold.31+0x39/0x8d > [ 8227.242756] mem_cgroup_out_of_memory+0x49/0x80 > [ 8227.242758] try_charge+0x58c/0x780 > [ 8227.242761] ? __alloc_pages_nodemask+0xef/0x280 > [ 8227.242763] mem_cgroup_try_charge+0x8b/0x1a0 > [ 8227.242764] mem_cgroup_try_charge_delay+0x1c/0x40 > [ 8227.242767] do_anonymous_page+0xb5/0x360 > [ 8227.242770] ? __switch_to_asm+0x35/0x70 > [ 8227.242772] __handle_mm_fault+0x662/0x6a0 > [ 8227.242774] handle_mm_fault+0xda/0x200 > [ 8227.242778] __do_page_fault+0x22d/0x4e0 > [ 8227.242780] do_page_fault+0x32/0x110 > [ 8227.242782] ? page_fault+0x8/0x30 > [ 8227.242783] page_fault+0x1e/0x30 > > -- > Aaron Tomlin
On 2021/06/08 22:58, Michal Hocko wrote: > I do not see this message to be ever printed on 4.18 for memcg oom: > /* Found nothing?!?! Either we hang forever, or we panic. */ > if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { > dump_header(oc, NULL); > panic("Out of memory and no killable processes...\n"); > } > > So how come it got triggered here? Is it possible that there is a global > oom killer somehow going on along with the memcg OOM? Because the below > stack clearly points to a memcg OOM and a new one AFAICS. 4.18 does print this message, and panic() will be called if global OOM killer invocation were in progress. 4.18.0-193.51.1.el8.x86_64 is doing ---------- select_bad_process(oc); /* Found nothing?!?! */ if (!oc->chosen) { dump_header(oc, NULL); pr_warn("Out of memory and no killable processes...\n"); /* * If we got here due to an actual allocation at the * system level, we cannot survive this and will enter * an endless loop in the allocator. Bail out now. */ if (!is_sysrq_oom(oc) && !is_memcg_oom(oc)) panic("System is deadlocked on memory\n"); } ---------- and this message is printed when oom_evaluate_task() found that MMF_OOM_SKIP was already set on all (possibly the only and the last) OOM victims. ---------- static int oom_evaluate_task(struct task_struct *task, void *arg) { (...snipped...) /* * This task already has access to memory reserves and is being killed. * Don't allow any other task to have access to the reserves unless * the task has MMF_OOM_SKIP because chances that it would release * any memory is quite low. */ if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) { if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) goto next; goto abort; } (...snipped...) next: return 0; (...snipped...) } ---------- Since dump_tasks() from dump_header(oc, NULL) does not exclude tasks which already has MMF_OOM_SKIP set, it is possible that the last OOM killable victim was already OOM killed but the OOM reaper failed to reclaim memory and set MMF_OOM_SKIP. (Well, maybe we want to exclude (or annotate) MMF_OOM_SKIP tasks when showing OOM victim candidates...) Therefore, > > That being said, a full chain of oom events would be definitely useful > to get a better idea. I think checking whether pr_info("oom_reaper: unable to reap pid:%d (%s)\n", task_pid_nr(tsk), tsk->comm); and/or pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", task_pid_nr(tsk), tsk->comm, K(get_mm_counter(mm, MM_ANONPAGES)), K(get_mm_counter(mm, MM_FILEPAGES)), K(get_mm_counter(mm, MM_SHMEMPAGES))); message was already printed prior to starting infinite "Out of memory and no killable processes..." looping (this message is repeated forever, isn't it?) will be useful. Note that neither of these messages will be printed if hitting ---------- /* * If the mm has invalidate_{start,end}() notifiers that could block, * sleep to give the oom victim some more time. * TODO: we really want to get rid of this ugly hack and make sure that * notifiers cannot block for unbounded amount of time */ if (mm_has_blockable_invalidate_notifiers(mm)) { up_read(&mm->mmap_sem); schedule_timeout_idle(HZ); return true; } ---------- case, and also dmesg available in the vmcore might be too late to examine. Maybe better to check /var/log/messages instead of vmcore file.
On Wed 09-06-21 00:22:13, Tetsuo Handa wrote: > On 2021/06/08 22:58, Michal Hocko wrote: > > I do not see this message to be ever printed on 4.18 for memcg oom: > > /* Found nothing?!?! Either we hang forever, or we panic. */ > > if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { > > dump_header(oc, NULL); > > panic("Out of memory and no killable processes...\n"); > > } > > > > So how come it got triggered here? Is it possible that there is a global > > oom killer somehow going on along with the memcg OOM? Because the below > > stack clearly points to a memcg OOM and a new one AFAICS. > > 4.18 does print this message, and panic() will be called if global OOM > killer invocation were in progress. > > 4.18.0-193.51.1.el8.x86_64 is doing > > ---------- > select_bad_process(oc); > /* Found nothing?!?! */ > if (!oc->chosen) { > dump_header(oc, NULL); > pr_warn("Out of memory and no killable processes...\n"); > /* > * If we got here due to an actual allocation at the > * system level, we cannot survive this and will enter > * an endless loop in the allocator. Bail out now. > */ > if (!is_sysrq_oom(oc) && !is_memcg_oom(oc)) > panic("System is deadlocked on memory\n"); > } > ---------- Ahh, OK. That would explain that. I have looked at 4.18 Vanilla kernel. I do not have RHEL sources handy and neither checked the 4.18 stable tree. Thanks for the clarification! [...] > Since dump_tasks() from dump_header(oc, NULL) does not exclude tasks > which already has MMF_OOM_SKIP set, it is possible that the last OOM > killable victim was already OOM killed but the OOM reaper failed to reclaim > memory and set MMF_OOM_SKIP. (Well, maybe we want to exclude (or annotate) > MMF_OOM_SKIP tasks when showing OOM victim candidates...) Well, the allocating task was clearly alive and whether it has been reaped or not is not all that important as it should force the charge as an oom victim. This is actually the most puzzling part. Because the allocating task either is not a preexisting OOM victim and therefore could become one or it has been and should have skipped the memcg killer altogether. But I fail to see how it could be missed completely while looking for a victim.
On Tue 2021-06-08 08:22 +0200, Michal Hocko wrote: > Is it possible the only eligible task has been killed and oom reaped > already? Yes, I suspect so; I had a look at the vmcore and the task in the OOM report is no longer present. Therefore, I suspect the task "node" (i.e. PID 1703345) was OOM killed, i.e. a SIGKILL was sent, it was granted access to memory reserves, and it was chosen by the OOM reaper for termination; the victim then raised a page fault that triggered yet another "charge" in the memcg that exceeded the memory limit set on the container. Since no other task in the memcg had a suitable OOM score and the allocating task/victim was "unkillable", i.e. already selected for termination by the OOM reaper, we got the message: "Out of memory and no killable processes...".
On Wed 09-06-21 15:35:34, Aaron Tomlin wrote: > On Tue 2021-06-08 08:22 +0200, Michal Hocko wrote: > > Is it possible the only eligible task has been killed and oom reaped > > already? > > Yes, I suspect so; and I had a look at the vmcore, the task in the OOM > report is no longer present. Therefore, I suspect the task namely "node" > (i.e. PID 1703345) was OOM killed i.e. a SIGKILL was sent and was granted > access to memory reserves and selected/or choosen by the OOM reaper for > termination; the victim then raised a page fault that triggered yet > another "charge" in the memcg that exceeded the memory limit set on the > container; If that was the case then the allocating (charging) task would not hit the oom path at all; see stable/linux-4.18.y:mm/memcontrol.c try_charge():

	/*
	 * Unlike in global OOM situations, memcg is not in a physical
	 * memory shortage. Allow dying and OOM-killed tasks to
	 * bypass the last charges so that they can exit quickly and
	 * free their memory.
	 */
	if (unlikely(tsk_is_oom_victim(current) ||
		     fatal_signal_pending(current) ||
		     current->flags & PF_EXITING))
		goto force;

If you have a crash dump available then you can check the memcg associated with the allocating task and check whether it is really marked as an OOM victim. > and since no other task in the memcg had a suitable OOM score > and the allocating task/or victim was "unkillable" i.e. already selected > for termination by the OOM reaper, we got the message: "Out of memory and > no killable processes...". What do you mean by the allocating task being unkillable?
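For reference, the tsk_is_oom_victim() test used in that bypass amounts to checking signal->oom_mm, which mark_oom_victim() sets when the task is selected; the definition below is reproduced from memory for the 4.18-era include/linux/oom.h, so verify it against the exact tree being debugged:

static inline bool tsk_is_oom_victim(struct task_struct *tsk)
{
	return tsk->signal->oom_mm;
}

In other words, "marked as an OOM victim" in a crash dump translates to task->signal->oom_mm being non-NULL.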
On Thu 2021-06-10 12:00 +0200, Michal Hocko wrote: > If that was the case then the allocating (charging) task would not hit > the oom path at all Yes, you are correct. I was looking at another version of the source code. It does not make sense to consider the OOM code path at all in this context. The allocating task is marked as an "OOM victim" after SIGKILL is sent (see __oom_kill_process()). > What do you mean by allocating task being unkillable? Please disregard this statement, as per the above. Anyhow, I think we should exclude tasks that have MMF_OOM_SKIP set from dump_tasks(), as it could be misleading.
On Thu 10-06-21 13:23:23, Aaron Tomlin wrote: [...] > Anyhow, I think we should exclude tasks that have MMF_OOM_SKIP applied in > dump_tasks() as it could be misleading. Well, I am not sure this is a good thing in general. We do not want to hide tasks. We already display oom_score_adj_min tasks even though they are not eligible, and that can serve a good purpose in my experience; it gives us some picture at least. Maybe a flag to mark all ineligible tasks would be useful, but I wouldn't drop them from the list.
On Thu 2021-06-10 14:43 +0200, Michal Hocko wrote: > Well, I am not sure this is a good thing in general. We do not want to > hide tasks. We already do display oom_score_adj_min even though they are > not eligible and that can serve a good purpose from my experience. It > gives us some picture at least. Maybe a flag to mark all uneligible > tasks would be something useful but I wouldn't drop them from the list. Fair enough.
On 2021/06/10 22:36, Aaron Tomlin wrote: > On Thu 2021-06-10 14:43 +0200, Michal Hocko wrote: >> Well, I am not sure this is a good thing in general. We do not want to >> hide tasks. We already do display oom_score_adj_min even though they are >> not eligible and that can serve a good purpose from my experience. It >> gives us some picture at least. Maybe a flag to mark all uneligible >> tasks would be something useful but I wouldn't drop them from the list. > > Fair enough. Yes, marking ineligible (i.e. oom_badness() returning LONG_MIN) tasks would be useful; a possible approach is sketched below. By the way, was the task "node" (i.e. PID 1703345) a multithreaded program? Your kernel might want commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer").
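A minimal, hypothetical sketch of the kind of annotation being discussed; the helper below is not upstream code (dump_tasks() prints the full pid/uid/rss table), it only shows where the extra information could come from:

/* Hypothetical helper: explain why a listed task is not a usable victim. */
static const char *oom_ineligibility_note(struct task_struct *p)
{
	if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
		return " (oom_score_adj == OOM_SCORE_ADJ_MIN)";
	if (p->signal->oom_mm &&
	    test_bit(MMF_OOM_SKIP, &p->signal->oom_mm->flags))
		return " (oom victim already reaped: MMF_OOM_SKIP)";
	return "";
}

dump_tasks() could then append such a note to each row, which would have made the "Out of memory and no killable processes..." report in this thread self-explanatory.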
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eefd3f5fde46..3bae33e2d9c2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc)
 		oc->nodemask = NULL;
 	check_panic_on_oom(oc);
 
-	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
-	    current->mm && !oom_unkillable_task(current) &&
-	    oom_cpuset_eligible(current, oc) &&
+	if (sysctl_oom_kill_allocating_task && current->mm &&
+	    !oom_unkillable_task(current) &&
+	    oom_cpuset_eligible(current, oc) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
 		oc->chosen = current;
At the present time, in the context of memcg OOM, even when sysctl_oom_kill_allocating_task is set, the "allocating" task cannot be selected as a target for the OOM killer.

This patch removes the restriction entirely.

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
---
 mm/oom_kill.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)