
[RFC] mm/oom_kill: allow oom kill allocating task for non-global case

Message ID 20210607163103.632681-1-atomlin@redhat.com (mailing list archive)
State New, archived
Series [RFC] mm/oom_kill: allow oom kill allocating task for non-global case

Commit Message

Aaron Tomlin June 7, 2021, 4:31 p.m. UTC
At present, in the context of a memcg OOM, the "allocating" task cannot be
selected as a target for the OOM killer, even when
sysctl_oom_kill_allocating_task is set.

This patch removes the restriction entirely.

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
---
 mm/oom_kill.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Waiman Long June 7, 2021, 4:42 p.m. UTC | #1
On 6/7/21 12:31 PM, Aaron Tomlin wrote:
> At the present time, in the context of memcg OOM, even when
> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating"
> task cannot be selected, as a target for the OOM killer.
>
> This patch removes the restriction entirely.
>
> Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
> ---
>   mm/oom_kill.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index eefd3f5fde46..3bae33e2d9c2 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc)
>   		oc->nodemask = NULL;
>   	check_panic_on_oom(oc);
>   
> -	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
> -	    current->mm && !oom_unkillable_task(current) &&
> -	    oom_cpuset_eligible(current, oc) &&
> +	if (sysctl_oom_kill_allocating_task && current->mm &&
> +            !oom_unkillable_task(current) &&
> +            oom_cpuset_eligible(current, oc) &&
>   	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>   		get_task_struct(current);
>   		oc->chosen = current;

To provide more context for this patch: we are actually seeing this in a
customer report about an OOM that happened in a container where the
dominating task used up most of the memory and happened to be the task that
triggered the OOM, with the result that no killable process could be found.
I don't see a reason why this should be limited to a global OOM only.

Acked-by: Waiman Long <longman@redhat.com>

Cheers,
Longman
Shakeel Butt June 7, 2021, 6:43 p.m. UTC | #2
On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote:
>
> On 6/7/21 12:31 PM, Aaron Tomlin wrote:
> > At the present time, in the context of memcg OOM, even when
> > sysctl_oom_kill_allocating_task is enabled/or set, the "allocating"
> > task cannot be selected, as a target for the OOM killer.
> >
> > This patch removes the restriction entirely.
> >
> > Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
> > ---
> >   mm/oom_kill.c | 6 +++---
> >   1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index eefd3f5fde46..3bae33e2d9c2 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc)
> >               oc->nodemask = NULL;
> >       check_panic_on_oom(oc);
> >
> > -     if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
> > -         current->mm && !oom_unkillable_task(current) &&
> > -         oom_cpuset_eligible(current, oc) &&
> > +     if (sysctl_oom_kill_allocating_task && current->mm &&
> > +            !oom_unkillable_task(current) &&
> > +            oom_cpuset_eligible(current, oc) &&
> >           current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> >               get_task_struct(current);
> >               oc->chosen = current;
>
> To provide more context for this patch, we are actually seeing that in a
> customer report about OOM happened in a container where the dominating
> task used up most of the memory and it happened to be the task that
> triggered the OOM with the result that no killable process could be
> found.

Why was there no killable process? What about the process allocating
the memory or is this remote memcg charging?

> I don't see a reason why this should be limited to a global OOM only.
>
> Acked-by: Waiman Long <longman@redhat.com>
>
> Cheers,
> Longman
>
Waiman Long June 7, 2021, 6:51 p.m. UTC | #3
On 6/7/21 2:43 PM, Shakeel Butt wrote:
> On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote:
>> On 6/7/21 12:31 PM, Aaron Tomlin wrote:
>>> At the present time, in the context of memcg OOM, even when
>>> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating"
>>> task cannot be selected, as a target for the OOM killer.
>>>
>>> This patch removes the restriction entirely.
>>>
>>> Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
>>> ---
>>>    mm/oom_kill.c | 6 +++---
>>>    1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>> index eefd3f5fde46..3bae33e2d9c2 100644
>>> --- a/mm/oom_kill.c
>>> +++ b/mm/oom_kill.c
>>> @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc)
>>>                oc->nodemask = NULL;
>>>        check_panic_on_oom(oc);
>>>
>>> -     if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
>>> -         current->mm && !oom_unkillable_task(current) &&
>>> -         oom_cpuset_eligible(current, oc) &&
>>> +     if (sysctl_oom_kill_allocating_task && current->mm &&
>>> +            !oom_unkillable_task(current) &&
>>> +            oom_cpuset_eligible(current, oc) &&
>>>            current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>>>                get_task_struct(current);
>>>                oc->chosen = current;
>> To provide more context for this patch, we are actually seeing that in a
>> customer report about OOM happened in a container where the dominating
>> task used up most of the memory and it happened to be the task that
>> triggered the OOM with the result that no killable process could be
>> found.
> Why was there no killable process? What about the process allocating
> the memory or is this remote memcg charging?

It is because the other processes have an oom_score_adj of -1000, so they
are non-killable. Anyway, they don't consume that much memory and killing
them won't free up much.

The other process, the one that uses most of the memory, is the one that
triggered the OOM kill in the first place because the memory limit was
reached on a new memory allocation. Based on the current logic, this process
cannot be killed at all, even with oom_kill_allocating_task set to 1, if the
OOM happens only within the memcg context rather than in a global OOM
situation. This patch allows this process to be killed under this
circumstance.
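
For reference, a condensed sketch of the eligibility check in oom_badness()
that makes such tasks non-killable (simplified from upstream mm/oom_kill.c;
locking, the vfork check and the exact arithmetic are omitted, and details
vary between kernel versions):

----------
static long oom_badness_sketch(struct task_struct *p, unsigned long totalpages)
{
	long points, adj;

	if (oom_unkillable_task(p) || !p->mm)
		return LONG_MIN;

	adj = (long)p->signal->oom_score_adj;
	if (adj == OOM_SCORE_ADJ_MIN ||			/* -1000: opted out */
	    test_bit(MMF_OOM_SKIP, &p->mm->flags))	/* already reaped */
		return LONG_MIN;			/* never selected */

	/* memory footprint, biased by oom_score_adj */
	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
		 mm_pgtables_bytes(p->mm) / PAGE_SIZE;
	points += adj * (long)(totalpages / 1000);

	return points;
}
----------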

Cheers,
Longman
Michal Hocko June 7, 2021, 7:01 p.m. UTC | #4
On Mon 07-06-21 17:31:03, Aaron Tomlin wrote:
> At the present time, in the context of memcg OOM, even when
> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating"
> task cannot be selected, as a target for the OOM killer.
> 
> This patch removes the restriction entirely.

This is a global OOM policy, not a memcg-specific one, so a historical
behavior would change. So I do not think we can change that. The policy
could be implemented at the memcg level, but this would require a much more
detailed explanation of the usecase and the semantics (e.g. wrt.
hierarchical behavior etc).
 
> Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
> ---
>  mm/oom_kill.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index eefd3f5fde46..3bae33e2d9c2 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc)
>  		oc->nodemask = NULL;
>  	check_panic_on_oom(oc);
>  
> -	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
> -	    current->mm && !oom_unkillable_task(current) &&
> -	    oom_cpuset_eligible(current, oc) &&
> +	if (sysctl_oom_kill_allocating_task && current->mm &&
> +            !oom_unkillable_task(current) &&
> +            oom_cpuset_eligible(current, oc) &&
>  	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>  		get_task_struct(current);
>  		oc->chosen = current;
> -- 
> 2.26.3
Michal Hocko June 7, 2021, 7:04 p.m. UTC | #5
On Mon 07-06-21 14:51:05, Waiman Long wrote:
> On 6/7/21 2:43 PM, Shakeel Butt wrote:
> > On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote:
> > > On 6/7/21 12:31 PM, Aaron Tomlin wrote:
> > > > At the present time, in the context of memcg OOM, even when
> > > > sysctl_oom_kill_allocating_task is enabled/or set, the "allocating"
> > > > task cannot be selected, as a target for the OOM killer.
> > > > 
> > > > This patch removes the restriction entirely.
> > > > 
> > > > Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
> > > > ---
> > > >    mm/oom_kill.c | 6 +++---
> > > >    1 file changed, 3 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index eefd3f5fde46..3bae33e2d9c2 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc)
> > > >                oc->nodemask = NULL;
> > > >        check_panic_on_oom(oc);
> > > > 
> > > > -     if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
> > > > -         current->mm && !oom_unkillable_task(current) &&
> > > > -         oom_cpuset_eligible(current, oc) &&
> > > > +     if (sysctl_oom_kill_allocating_task && current->mm &&
> > > > +            !oom_unkillable_task(current) &&
> > > > +            oom_cpuset_eligible(current, oc) &&
> > > >            current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> > > >                get_task_struct(current);
> > > >                oc->chosen = current;
> > > To provide more context for this patch, we are actually seeing that in a
> > > customer report about OOM happened in a container where the dominating
> > > task used up most of the memory and it happened to be the task that
> > > triggered the OOM with the result that no killable process could be
> > > found.
> > Why was there no killable process? What about the process allocating
> > the memory or is this remote memcg charging?
> 
> It is because the other processes have a oom_adjust_score of -1000. So they
> are non-killable. Anyway, they don't consume that much memory and killing
> them won't free up that much.
> 
> The other process that uses most of the memory is the one that trigger the
> OOM kill in the first place because the memory limit has been reached in new
> memory allocation. Based on the current logic, this process cannot be killed
> at all even if we set the oom_kill_allocating_task to 1 if the OOM happens
> only within the memcg context, not in a global OOM situation. This patch is
> to allow this process to be killed under this circumstance.

Do you have the oom report? I do not see why the allocating task hasn't
been chosen.
Shakeel Butt June 7, 2021, 7:04 p.m. UTC | #6
On Mon, Jun 7, 2021 at 11:51 AM Waiman Long <llong@redhat.com> wrote:
>
> On 6/7/21 2:43 PM, Shakeel Butt wrote:
> > On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote:
> >> On 6/7/21 12:31 PM, Aaron Tomlin wrote:
> >>> At the present time, in the context of memcg OOM, even when
> >>> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating"
> >>> task cannot be selected, as a target for the OOM killer.
> >>>
> >>> This patch removes the restriction entirely.
> >>>
> >>> Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
> >>> ---
> >>>    mm/oom_kill.c | 6 +++---
> >>>    1 file changed, 3 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> >>> index eefd3f5fde46..3bae33e2d9c2 100644
> >>> --- a/mm/oom_kill.c
> >>> +++ b/mm/oom_kill.c
> >>> @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc)
> >>>                oc->nodemask = NULL;
> >>>        check_panic_on_oom(oc);
> >>>
> >>> -     if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
> >>> -         current->mm && !oom_unkillable_task(current) &&
> >>> -         oom_cpuset_eligible(current, oc) &&
> >>> +     if (sysctl_oom_kill_allocating_task && current->mm &&
> >>> +            !oom_unkillable_task(current) &&
> >>> +            oom_cpuset_eligible(current, oc) &&
> >>>            current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> >>>                get_task_struct(current);
> >>>                oc->chosen = current;
> >> To provide more context for this patch, we are actually seeing that in a
> >> customer report about OOM happened in a container where the dominating
> >> task used up most of the memory and it happened to be the task that
> >> triggered the OOM with the result that no killable process could be
> >> found.
> > Why was there no killable process? What about the process allocating
> > the memory or is this remote memcg charging?
>
> It is because the other processes have a oom_adjust_score of -1000. So
> they are non-killable. Anyway, they don't consume that much memory and
> killing them won't free up that much.
>
> The other process that uses most of the memory is the one that trigger
> the OOM kill in the first place because the memory limit has been
> reached in new memory allocation. Based on the current logic, this
> process cannot be killed at all even if we set the
> oom_kill_allocating_task to 1 if the OOM happens only within the memcg
> context, not in a global OOM situation.

I am not really against the patch, but I am still not able to understand
why select_bad_process() was not able to select the current process.
mem_cgroup_scan_tasks() traverses all the processes in the target memcg
hierarchy, so why was current skipped?
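
For reference, the selection path in question looks roughly like this
(condensed from upstream mm/oom_kill.c; details differ between versions):

----------
static void select_bad_process(struct oom_control *oc)
{
	oc->chosen_points = LONG_MIN;

	if (is_memcg_oom(oc)) {
		/* walks every task in the target memcg hierarchy */
		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
	} else {
		struct task_struct *p;

		rcu_read_lock();
		for_each_process(p)
			if (oom_evaluate_task(p, oc))
				break;
		rcu_read_unlock();
	}
}
----------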

> This patch is to allow this
> process to be killed under this circumstance.
>
> Cheers,
> Longman
>
Waiman Long June 7, 2021, 7:18 p.m. UTC | #7
On 6/7/21 3:04 PM, Michal Hocko wrote:
> On Mon 07-06-21 14:51:05, Waiman Long wrote:
>> On 6/7/21 2:43 PM, Shakeel Butt wrote:
>>> On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote:
>>>> On 6/7/21 12:31 PM, Aaron Tomlin wrote:
>>>>> At the present time, in the context of memcg OOM, even when
>>>>> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating"
>>>>> task cannot be selected, as a target for the OOM killer.
>>>>>
>>>>> This patch removes the restriction entirely.
>>>>>
>>>>> Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
>>>>> ---
>>>>>     mm/oom_kill.c | 6 +++---
>>>>>     1 file changed, 3 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>>>> index eefd3f5fde46..3bae33e2d9c2 100644
>>>>> --- a/mm/oom_kill.c
>>>>> +++ b/mm/oom_kill.c
>>>>> @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc)
>>>>>                 oc->nodemask = NULL;
>>>>>         check_panic_on_oom(oc);
>>>>>
>>>>> -     if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
>>>>> -         current->mm && !oom_unkillable_task(current) &&
>>>>> -         oom_cpuset_eligible(current, oc) &&
>>>>> +     if (sysctl_oom_kill_allocating_task && current->mm &&
>>>>> +            !oom_unkillable_task(current) &&
>>>>> +            oom_cpuset_eligible(current, oc) &&
>>>>>             current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>>>>>                 get_task_struct(current);
>>>>>                 oc->chosen = current;
>>>> To provide more context for this patch, we are actually seeing that in a
>>>> customer report about OOM happened in a container where the dominating
>>>> task used up most of the memory and it happened to be the task that
>>>> triggered the OOM with the result that no killable process could be
>>>> found.
>>> Why was there no killable process? What about the process allocating
>>> the memory or is this remote memcg charging?
>> It is because the other processes have a oom_adjust_score of -1000. So they
>> are non-killable. Anyway, they don't consume that much memory and killing
>> them won't free up that much.
>>
>> The other process that uses most of the memory is the one that trigger the
>> OOM kill in the first place because the memory limit has been reached in new
>> memory allocation. Based on the current logic, this process cannot be killed
>> at all even if we set the oom_kill_allocating_task to 1 if the OOM happens
>> only within the memcg context, not in a global OOM situation. This patch is
>> to allow this process to be killed under this circumstance.
> Do you have the oom report? I do not see why the allocating task hasn't
> been chosen.

A partial OOM report below:

[ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116
   :
[ 8227.239769] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents  oom_score_adj name
[ 8227.242495] [1611298]     0 1611298    35869      635 167936        0         -1000 conmon
[ 8227.242518] [1702509]     0 1702509    35869      701 176128        0         -1000 conmon
[ 8227.242522] [1703345] 1001050000 1703294   183440        0 2125824        0           999 node
[ 8227.242706] Out of memory and no killable processes...
[ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999
[ 8227.242732] node cpuset=crio-b8ac7e23f7b520c0365461defb66738231918243586e287bfb9e206bb3a0227a.scope mems_allowed=0-1

So in this case, node cannot kill itself and no other processes are 
available to be killed.

Cheers,
Longman
Waiman Long June 7, 2021, 7:26 p.m. UTC | #8
On 6/7/21 3:01 PM, Michal Hocko wrote:
> On Mon 07-06-21 17:31:03, Aaron Tomlin wrote:
>> At the present time, in the context of memcg OOM, even when
>> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating"
>> task cannot be selected, as a target for the OOM killer.
>>
>> This patch removes the restriction entirely.
> This is a global oom policy not a memcg specific one so a historical
> behavior would change. So I do not think we can change that. The policy
> can be implemented on the memcg level but this would require a much more
> detailed explanation of the usecase and the semantic (e.g. wrt.
> hierarchical behavior etc).

Maybe we can extend the meaning of oom_kill_allocating_task such that memcg
OOM killing of the allocating task is only enabled when bit 1 is set. That
way, if an existing application just sets oom_kill_allocating_task to 1, it
will not be impacted.
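
A minimal sketch of what such a bit-mask check in out_of_memory() could look
like (the OOM_KILL_ALLOCATING_TASK_* names below are made up for illustration
and are not an existing interface):

----------
#define OOM_KILL_ALLOCATING_TASK_GLOBAL	0x1	/* existing behaviour */
#define OOM_KILL_ALLOCATING_TASK_MEMCG	0x2	/* proposed memcg opt-in */

	bool kill_allocating = is_memcg_oom(oc) ?
		(sysctl_oom_kill_allocating_task & OOM_KILL_ALLOCATING_TASK_MEMCG) :
		(sysctl_oom_kill_allocating_task & OOM_KILL_ALLOCATING_TASK_GLOBAL);

	if (kill_allocating && current->mm && !oom_unkillable_task(current) &&
	    oom_cpuset_eligible(current, oc) &&
	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
		get_task_struct(current);
		oc->chosen = current;
	}
----------

That way an existing "echo 1 > /proc/sys/vm/oom_kill_allocating_task" keeps
today's global-only behaviour, while writing 3 would opt into the memcg case
as well.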

Cheers,
Longman
Michal Hocko June 7, 2021, 7:36 p.m. UTC | #9
On Mon 07-06-21 15:18:38, Waiman Long wrote:
> On 6/7/21 3:04 PM, Michal Hocko wrote:
> > On Mon 07-06-21 14:51:05, Waiman Long wrote:
> > > On 6/7/21 2:43 PM, Shakeel Butt wrote:
> > > > On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote:
> > > > > On 6/7/21 12:31 PM, Aaron Tomlin wrote:
> > > > > > At the present time, in the context of memcg OOM, even when
> > > > > > sysctl_oom_kill_allocating_task is enabled/or set, the "allocating"
> > > > > > task cannot be selected, as a target for the OOM killer.
> > > > > > 
> > > > > > This patch removes the restriction entirely.
> > > > > > 
> > > > > > Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
> > > > > > ---
> > > > > >     mm/oom_kill.c | 6 +++---
> > > > > >     1 file changed, 3 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > > index eefd3f5fde46..3bae33e2d9c2 100644
> > > > > > --- a/mm/oom_kill.c
> > > > > > +++ b/mm/oom_kill.c
> > > > > > @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc)
> > > > > >                 oc->nodemask = NULL;
> > > > > >         check_panic_on_oom(oc);
> > > > > > 
> > > > > > -     if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
> > > > > > -         current->mm && !oom_unkillable_task(current) &&
> > > > > > -         oom_cpuset_eligible(current, oc) &&
> > > > > > +     if (sysctl_oom_kill_allocating_task && current->mm &&
> > > > > > +            !oom_unkillable_task(current) &&
> > > > > > +            oom_cpuset_eligible(current, oc) &&
> > > > > >             current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> > > > > >                 get_task_struct(current);
> > > > > >                 oc->chosen = current;
> > > > > To provide more context for this patch, we are actually seeing that in a
> > > > > customer report about OOM happened in a container where the dominating
> > > > > task used up most of the memory and it happened to be the task that
> > > > > triggered the OOM with the result that no killable process could be
> > > > > found.
> > > > Why was there no killable process? What about the process allocating
> > > > the memory or is this remote memcg charging?
> > > It is because the other processes have a oom_adjust_score of -1000. So they
> > > are non-killable. Anyway, they don't consume that much memory and killing
> > > them won't free up that much.
> > > 
> > > The other process that uses most of the memory is the one that trigger the
> > > OOM kill in the first place because the memory limit has been reached in new
> > > memory allocation. Based on the current logic, this process cannot be killed
> > > at all even if we set the oom_kill_allocating_task to 1 if the OOM happens
> > > only within the memcg context, not in a global OOM situation. This patch is
> > > to allow this process to be killed under this circumstance.
> > Do you have the oom report? I do not see why the allocating task hasn't
> > been chosen.
> 
> A partial OOM report below:

Do you happen to have the full report?

> [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116
>   :
> [ 8227.239769] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents  oom_score_adj name
> [ 8227.242495] [1611298]     0 1611298    35869      635 167936        0         -1000 conmon
> [ 8227.242518] [1702509]     0 1702509    35869      701 176128        0         -1000 conmon
> [ 8227.242522] [1703345] 1001050000 1703294   183440        0 2125824        0           999 node
> [ 8227.242706] Out of memory and no killable processes...
> [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999
> [ 8227.242732] node cpuset=crio-b8ac7e23f7b520c0365461defb66738231918243586e287bfb9e206bb3a0227a.scope mems_allowed=0-1
> 
> So in this case, node cannot kill itself and no other processes are
> available to be killed.

The process is clearly listed as eligible, so the OOM killer should find
it, and if it hasn't then this should be investigated. Which kernel is
this?
Michal Hocko June 7, 2021, 7:47 p.m. UTC | #10
On Mon 07-06-21 15:26:00, Waiman Long wrote:
> On 6/7/21 3:01 PM, Michal Hocko wrote:
> > On Mon 07-06-21 17:31:03, Aaron Tomlin wrote:
> > > At the present time, in the context of memcg OOM, even when
> > > sysctl_oom_kill_allocating_task is enabled/or set, the "allocating"
> > > task cannot be selected, as a target for the OOM killer.
> > > 
> > > This patch removes the restriction entirely.
> > This is a global oom policy not a memcg specific one so a historical
> > behavior would change. So I do not think we can change that. The policy
> > can be implemented on the memcg level but this would require a much more
> > detailed explanation of the usecase and the semantic (e.g. wrt.
> > hierarchical behavior etc).
> 
> Maybe we can extend the meaning of oom_kill_allocating_task such that memcg
> OOM killing of allocating task is only enabled when bit 1 is set. So if an
> existing application just set oom_kill_allocating_task to 1, it will not be
> impacted.

panic_on_oom already allows applying an originally global policy at the
memcg level. So if anything, this policy should follow the same interface.
But still, I think what you are seeing is either a bug or something else
(e.g. the task being migrated while the OOM is ongoing), and this should be
properly investigated and explained. We cannot simply paper it over by
telling people to use oom_kill_allocating_task to work around it.

If there is a real usecase for such a policy for memcg OOM killing, it can
of course be discussed.
Michal Hocko June 7, 2021, 8:03 p.m. UTC | #11
On Mon 07-06-21 21:36:47, Michal Hocko wrote:
> On Mon 07-06-21 15:18:38, Waiman Long wrote:
[...]
> > A partial OOM report below:
> 
> Do you happen to have the full report?
> 
> > [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116
> >   :
> > [ 8227.239769] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents  oom_score_adj name
> > [ 8227.242495] [1611298]     0 1611298    35869      635 167936        0         -1000 conmon
> > [ 8227.242518] [1702509]     0 1702509    35869      701 176128        0         -1000 conmon
> > [ 8227.242522] [1703345] 1001050000 1703294   183440        0 2125824        0           999 node
> > [ 8227.242706] Out of memory and no killable processes...
> > [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999

Btw it is surprising to not see __GFP_ACCOUNT here.
Waiman Long June 7, 2021, 8:07 p.m. UTC | #12
On 6/7/21 3:04 PM, Shakeel Butt wrote:
> On Mon, Jun 7, 2021 at 11:51 AM Waiman Long <llong@redhat.com> wrote:
>> On 6/7/21 2:43 PM, Shakeel Butt wrote:
>>> On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote:
>>>> On 6/7/21 12:31 PM, Aaron Tomlin wrote:
>>>>> At the present time, in the context of memcg OOM, even when
>>>>> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating"
>>>>> task cannot be selected, as a target for the OOM killer.
>>>>>
>>>>> This patch removes the restriction entirely.
>>>>>
>>>>> Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
>>>>> ---
>>>>>     mm/oom_kill.c | 6 +++---
>>>>>     1 file changed, 3 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>>>> index eefd3f5fde46..3bae33e2d9c2 100644
>>>>> --- a/mm/oom_kill.c
>>>>> +++ b/mm/oom_kill.c
>>>>> @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc)
>>>>>                 oc->nodemask = NULL;
>>>>>         check_panic_on_oom(oc);
>>>>>
>>>>> -     if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
>>>>> -         current->mm && !oom_unkillable_task(current) &&
>>>>> -         oom_cpuset_eligible(current, oc) &&
>>>>> +     if (sysctl_oom_kill_allocating_task && current->mm &&
>>>>> +            !oom_unkillable_task(current) &&
>>>>> +            oom_cpuset_eligible(current, oc) &&
>>>>>             current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>>>>>                 get_task_struct(current);
>>>>>                 oc->chosen = current;
>>>> To provide more context for this patch, we are actually seeing that in a
>>>> customer report about OOM happened in a container where the dominating
>>>> task used up most of the memory and it happened to be the task that
>>>> triggered the OOM with the result that no killable process could be
>>>> found.
>>> Why was there no killable process? What about the process allocating
>>> the memory or is this remote memcg charging?
>> It is because the other processes have a oom_adjust_score of -1000. So
>> they are non-killable. Anyway, they don't consume that much memory and
>> killing them won't free up that much.
>>
>> The other process that uses most of the memory is the one that trigger
>> the OOM kill in the first place because the memory limit has been
>> reached in new memory allocation. Based on the current logic, this
>> process cannot be killed at all even if we set the
>> oom_kill_allocating_task to 1 if the OOM happens only within the memcg
>> context, not in a global OOM situation.
> I am not really against the patch but I am still not able to
> understand why select_bad_process() was not able to select the current
> process. mem_cgroup_scan_tasks() traverses all the processes in the
> target memcg hierarchy, so why the current was skipped.

Yes, you are right. Probably there is some problem with reaping so that 
the MMF_OOM_SKIP bit gets set. I don't have the answer yet.

Regards,
Longman
Waiman Long June 7, 2021, 8:42 p.m. UTC | #13
On 6/7/21 3:36 PM, Michal Hocko wrote:
> On Mon 07-06-21 15:18:38, Waiman Long wrote:
>> On 6/7/21 3:04 PM, Michal Hocko wrote:
>>> On Mon 07-06-21 14:51:05, Waiman Long wrote:
>>>> On 6/7/21 2:43 PM, Shakeel Butt wrote:
>>>>> On Mon, Jun 7, 2021 at 9:45 AM Waiman Long <llong@redhat.com> wrote:
>>>>>> On 6/7/21 12:31 PM, Aaron Tomlin wrote:
>>>>>>> At the present time, in the context of memcg OOM, even when
>>>>>>> sysctl_oom_kill_allocating_task is enabled/or set, the "allocating"
>>>>>>> task cannot be selected, as a target for the OOM killer.
>>>>>>>
>>>>>>> This patch removes the restriction entirely.
>>>>>>>
>>>>>>> Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
>>>>>>> ---
>>>>>>>      mm/oom_kill.c | 6 +++---
>>>>>>>      1 file changed, 3 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>>>>>> index eefd3f5fde46..3bae33e2d9c2 100644
>>>>>>> --- a/mm/oom_kill.c
>>>>>>> +++ b/mm/oom_kill.c
>>>>>>> @@ -1089,9 +1089,9 @@ bool out_of_memory(struct oom_control *oc)
>>>>>>>                  oc->nodemask = NULL;
>>>>>>>          check_panic_on_oom(oc);
>>>>>>>
>>>>>>> -     if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
>>>>>>> -         current->mm && !oom_unkillable_task(current) &&
>>>>>>> -         oom_cpuset_eligible(current, oc) &&
>>>>>>> +     if (sysctl_oom_kill_allocating_task && current->mm &&
>>>>>>> +            !oom_unkillable_task(current) &&
>>>>>>> +            oom_cpuset_eligible(current, oc) &&
>>>>>>>              current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>>>>>>>                  get_task_struct(current);
>>>>>>>                  oc->chosen = current;
>>>>>> To provide more context for this patch, we are actually seeing that in a
>>>>>> customer report about OOM happened in a container where the dominating
>>>>>> task used up most of the memory and it happened to be the task that
>>>>>> triggered the OOM with the result that no killable process could be
>>>>>> found.
>>>>> Why was there no killable process? What about the process allocating
>>>>> the memory or is this remote memcg charging?
>>>> It is because the other processes have a oom_adjust_score of -1000. So they
>>>> are non-killable. Anyway, they don't consume that much memory and killing
>>>> them won't free up that much.
>>>>
>>>> The other process that uses most of the memory is the one that trigger the
>>>> OOM kill in the first place because the memory limit has been reached in new
>>>> memory allocation. Based on the current logic, this process cannot be killed
>>>> at all even if we set the oom_kill_allocating_task to 1 if the OOM happens
>>>> only within the memcg context, not in a global OOM situation. This patch is
>>>> to allow this process to be killed under this circumstance.
>>> Do you have the oom report? I do not see why the allocating task hasn't
>>> been chosen.
>> A partial OOM report below:
> Do you happen to have the full report?
I need to ask to see if I can release the full report.
>
>> [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116
>>    :
>> [ 8227.239769] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents  oom_score_adj name
>> [ 8227.242495] [1611298]     0 1611298    35869      635 167936        0         -1000 conmon
>> [ 8227.242518] [1702509]     0 1702509    35869      701 176128        0         -1000 conmon
>> [ 8227.242522] [1703345] 1001050000 1703294   183440        0 2125824        0           999 node
>> [ 8227.242706] Out of memory and no killable processes...
>> [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999
>> [ 8227.242732] node cpuset=crio-b8ac7e23f7b520c0365461defb66738231918243586e287bfb9e206bb3a0227a.scope mems_allowed=0-1
>>
>> So in this case, node cannot kill itself and no other processes are
>> available to be killed.
> The process is clearly listed as eligible so the oom killer should find
> it and if it hasn't then this should be investigated. Which kernel is
> this?

Right. I don't know why current cannot be selected. I think we may need to
enhance the "no killable processes" OOM report to list the reason a task is
skipped, beyond oom_score_adj. The kernel is a RHEL 8.2 kernel, which has
OOM code pretty close to upstream.

Cheers,
Longman
Waiman Long June 7, 2021, 8:44 p.m. UTC | #14
On 6/7/21 4:03 PM, Michal Hocko wrote:
> On Mon 07-06-21 21:36:47, Michal Hocko wrote:
>> On Mon 07-06-21 15:18:38, Waiman Long wrote:
> [...]
>>> A partial OOM report below:
>> Do you happen to have the full report?
>>
>>> [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116
>>>    :
>>> [ 8227.239769] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents  oom_score_adj name
>>> [ 8227.242495] [1611298]     0 1611298    35869      635 167936        0         -1000 conmon
>>> [ 8227.242518] [1702509]     0 1702509    35869      701 176128        0         -1000 conmon
>>> [ 8227.242522] [1703345] 1001050000 1703294   183440        0 2125824        0           999 node
>>> [ 8227.242706] Out of memory and no killable processes...
>>> [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999
> Btw it is surprising to not see _GFP_ACCOUNT here.

There are a number of OOM kills in the kernel log and none of the tasks
that invoked the oom-killer has the __GFP_ACCOUNT flag set.

Regards,
Longman
Aaron Tomlin June 7, 2021, 9:16 p.m. UTC | #15
On Mon 2021-06-07 16:42 -0400, Waiman Long wrote:
> Right. I don't know why the current cannot be selected. I think we may need
> to enhance the OOM but no killable process report to list the reason a task
> is skipped other than oom_score_adj.

Actually, this might be a good idea.
Aaron Tomlin June 7, 2021, 9:17 p.m. UTC | #16
On Mon 2021-06-07 21:01 +0200, Michal Hocko wrote:
> On Mon 07-06-21 17:31:03, Aaron Tomlin wrote:
> This is a global oom policy not a memcg specific one so a historical
> behavior would change. So I do not think we can change that.

I see. Fair enough.
Michal Hocko June 8, 2021, 6:22 a.m. UTC | #17
On Mon 07-06-21 16:44:09, Waiman Long wrote:
> On 6/7/21 4:03 PM, Michal Hocko wrote:
> > On Mon 07-06-21 21:36:47, Michal Hocko wrote:
> > > On Mon 07-06-21 15:18:38, Waiman Long wrote:
> > [...]
> > > > A partial OOM report below:
> > > Do you happen to have the full report?
> > > 
> > > > [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116
> > > >    :
> > > > [ 8227.239769] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents  oom_score_adj name
> > > > [ 8227.242495] [1611298]     0 1611298    35869      635 167936        0         -1000 conmon
> > > > [ 8227.242518] [1702509]     0 1702509    35869      701 176128        0         -1000 conmon
> > > > [ 8227.242522] [1703345] 1001050000 1703294   183440        0 2125824        0           999 node
> > > > [ 8227.242706] Out of memory and no killable processes...
> > > > [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999
> > Btw it is surprising to not see _GFP_ACCOUNT here.
> 
> There are a number of OOM kills in the kernel log and none of the tasks that
> invoke oom-killer has _GFP_ACCOUNT flag set.

OK. A full report (including the backtrace) would tell us more about the
source of the charge. I thought that most #PF charging paths use the same
gfp mask as the allocation (which would include other flags on top of
GFP_KERNEL), but it seems we just use GFP_KERNEL in many places. There are
also some direct callers of the charging API for kernel allocations. Not
that this is super important, but it caught my attention.

You are saying that there are other OOM kills going on. Are they all for
the same memcg? Is it possible the only eligible task has been killed and
oom reaped already?
Aaron Tomlin June 8, 2021, 9:39 a.m. UTC | #18
On Tue 2021-06-08 08:22 +0200, Michal Hocko wrote:
> You are saying that there are other OOM kills going on.

Yes.

> Are they all for the same memcg?

No.

> Is it possible the only eligible task has been killed and oom reaped
> already?

Unfortunately, as mentioned by Waiman, no other task is suitable for
"selection" due to the OOM score adjustment (i.e. OOM_SCORE_ADJ_MIN); that
said, I need access to the vmcore to check, e.g., mm->flags - I'll get back
to you as soon as I can.


Kind regards,
Aaron Tomlin June 8, 2021, 10 a.m. UTC | #19
On Tue 2021-06-08 08:22 +0200, Michal Hocko wrote:
> OK. A full report (including the backtrace) would tell us more what is
> the source of the charge. I thought that most #PF charging paths use the
> same gfp mask as the allocation (which would include other flags on top
> of GFP_KERNEL) but it seems we just use GFP_KERNEL at many places.

The following is what I can provide for now:

  [ 8227.242706] Out of memory and no killable processes...
  [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999
  [ 8227.242732] node cpuset=XXXX mems_allowed=0-1
  [ 8227.242736] CPU: 12 PID: 1703347 Comm: node Kdump: loaded Not tainted 4.18.0-193.51.1.el8_2.x86_64 #1
  [ 8227.242737] Hardware name: XXXX
  [ 8227.242738] Call Trace:
  [ 8227.242746]  dump_stack+0x5c/0x80
  [ 8227.242751]  dump_header+0x6e/0x27a
  [ 8227.242753]  out_of_memory.cold.31+0x39/0x8d
  [ 8227.242756]  mem_cgroup_out_of_memory+0x49/0x80
  [ 8227.242758]  try_charge+0x58c/0x780
  [ 8227.242761]  ? __alloc_pages_nodemask+0xef/0x280
  [ 8227.242763]  mem_cgroup_try_charge+0x8b/0x1a0
  [ 8227.242764]  mem_cgroup_try_charge_delay+0x1c/0x40
  [ 8227.242767]  do_anonymous_page+0xb5/0x360
  [ 8227.242770]  ? __switch_to_asm+0x35/0x70
  [ 8227.242772]  __handle_mm_fault+0x662/0x6a0
  [ 8227.242774]  handle_mm_fault+0xda/0x200
  [ 8227.242778]  __do_page_fault+0x22d/0x4e0
  [ 8227.242780]  do_page_fault+0x32/0x110
  [ 8227.242782]  ? page_fault+0x8/0x30
  [ 8227.242783]  page_fault+0x1e/0x30
Michal Hocko June 8, 2021, 1:58 p.m. UTC | #20
On Tue 08-06-21 11:00:22, Aaron Tomlin wrote:
> On Tue 2021-06-08 08:22 +0200, Michal Hocko wrote:
> > OK. A full report (including the backtrace) would tell us more what is
> > the source of the charge. I thought that most #PF charging paths use the
> > same gfp mask as the allocation (which would include other flags on top
> > of GFP_KERNEL) but it seems we just use GFP_KERNEL at many places.
> 
> The following is what I can provide for now:
> 
Let me add what we have from the previous email:

> [ 8221.433608] memory: usage 21280kB, limit 204800kB, failcnt 49116
>   :
> [ 8227.239769] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents  oom_score_adj name
> [ 8227.242495] [1611298]     0 1611298    35869      635 167936        0         -1000 conmon
> [ 8227.242518] [1702509]     0 1702509    35869      701 176128        0         -1000 conmon
> [ 8227.242522] [1703345] 1001050000 1703294   183440        0 2125824        0           999 node
> [ 8227.242706] Out of memory and no killable processes...

I do not see how this message could ever be printed on 4.18 for a memcg OOM:
        /* Found nothing?!?! Either we hang forever, or we panic. */
        if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
                dump_header(oc, NULL);
                panic("Out of memory and no killable processes...\n");
        }

So how come it got triggered here? Is it possible that there is a global
OOM kill somehow going on along with the memcg OOM? Because the stack below
clearly points to a memcg OOM, and a new one AFAICS.

That being said, a full chain of oom events would be definitely useful
to get a better idea.

> [ 8227.242731] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999
> [ 8227.242732] node cpuset=XXXX mems_allowed=0-1
> [ 8227.242736] CPU: 12 PID: 1703347 Comm: node Kdump: loaded Not tainted 4.18.0-193.51.1.el8_2.x86_64 #1
> [ 8227.242737] Hardware name: XXXX
> [ 8227.242738] Call Trace:
> [ 8227.242746]  dump_stack+0x5c/0x80
> [ 8227.242751]  dump_header+0x6e/0x27a
> [ 8227.242753]  out_of_memory.cold.31+0x39/0x8d
> [ 8227.242756]  mem_cgroup_out_of_memory+0x49/0x80
> [ 8227.242758]  try_charge+0x58c/0x780
> [ 8227.242761]  ? __alloc_pages_nodemask+0xef/0x280
> [ 8227.242763]  mem_cgroup_try_charge+0x8b/0x1a0
> [ 8227.242764]  mem_cgroup_try_charge_delay+0x1c/0x40
> [ 8227.242767]  do_anonymous_page+0xb5/0x360
> [ 8227.242770]  ? __switch_to_asm+0x35/0x70
> [ 8227.242772]  __handle_mm_fault+0x662/0x6a0
> [ 8227.242774]  handle_mm_fault+0xda/0x200
> [ 8227.242778]  __do_page_fault+0x22d/0x4e0
> [ 8227.242780]  do_page_fault+0x32/0x110
> [ 8227.242782]  ? page_fault+0x8/0x30
> [ 8227.242783]  page_fault+0x1e/0x30
> 
> -- 
> Aaron Tomlin
Tetsuo Handa June 8, 2021, 3:22 p.m. UTC | #21
On 2021/06/08 22:58, Michal Hocko wrote:
> I do not see this message to be ever printed on 4.18 for memcg oom:
>         /* Found nothing?!?! Either we hang forever, or we panic. */
>         if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
>                 dump_header(oc, NULL);
>                 panic("Out of memory and no killable processes...\n");
>         }
> 
> So how come it got triggered here? Is it possible that there is a global
> oom killer somehow going on along with the memcg OOM? Because the below
> stack clearly points to a memcg OOM and a new one AFAICS.

4.18 does print this message; panic() would only be called if a global OOM
killer invocation were in progress.

4.18.0-193.51.1.el8.x86_64 is doing

----------
        select_bad_process(oc);
        /* Found nothing?!?! */
        if (!oc->chosen) {
                dump_header(oc, NULL);
                pr_warn("Out of memory and no killable processes...\n");
                /*
                 * If we got here due to an actual allocation at the
                 * system level, we cannot survive this and will enter
                 * an endless loop in the allocator. Bail out now.
                 */
                if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
                        panic("System is deadlocked on memory\n");
        }
----------

and this message is printed when oom_evaluate_task() finds that MMF_OOM_SKIP
is already set on all (possibly the only and last) OOM victims.

----------
static int oom_evaluate_task(struct task_struct *task, void *arg)
{
(...snipped...)
        /*
         * This task already has access to memory reserves and is being killed.
         * Don't allow any other task to have access to the reserves unless
         * the task has MMF_OOM_SKIP because chances that it would release
         * any memory is quite low.
         */
        if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
                if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
                        goto next;
                goto abort;
        }
(...snipped...)
next:
        return 0;
(...snipped...)
}
----------

Since dump_tasks() from dump_header(oc, NULL) does not exclude tasks which
already have MMF_OOM_SKIP set, it is possible that the last OOM-killable
victim was already OOM killed but the OOM reaper failed to reclaim memory
and set MMF_OOM_SKIP. (Well, maybe we want to exclude (or annotate)
MMF_OOM_SKIP tasks when showing OOM victim candidates...)

Therefore,

> 
> That being said, a full chain of oom events would be definitely useful
> to get a better idea.

I think checking whether

        pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
                task_pid_nr(tsk), tsk->comm);

and/or

        pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
                        task_pid_nr(tsk), tsk->comm,
                        K(get_mm_counter(mm, MM_ANONPAGES)),
                        K(get_mm_counter(mm, MM_FILEPAGES)),
                        K(get_mm_counter(mm, MM_SHMEMPAGES)));

message was already printed prior to entering the infinite
"Out of memory and no killable processes..." loop
(this message is repeated forever, isn't it?) would be useful.

Note that neither of these messages will be printed if hitting

----------
        /*
         * If the mm has invalidate_{start,end}() notifiers that could block,
         * sleep to give the oom victim some more time.
         * TODO: we really want to get rid of this ugly hack and make sure that
         * notifiers cannot block for unbounded amount of time
         */
        if (mm_has_blockable_invalidate_notifiers(mm)) {
                up_read(&mm->mmap_sem);
                schedule_timeout_idle(HZ);
                return true;
        }
----------

case. Also, the dmesg available in the vmcore might be too late to examine;
it may be better to check /var/log/messages instead of the vmcore file.
Michal Hocko June 8, 2021, 4:17 p.m. UTC | #22
On Wed 09-06-21 00:22:13, Tetsuo Handa wrote:
> On 2021/06/08 22:58, Michal Hocko wrote:
> > I do not see this message to be ever printed on 4.18 for memcg oom:
> >         /* Found nothing?!?! Either we hang forever, or we panic. */
> >         if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
> >                 dump_header(oc, NULL);
> >                 panic("Out of memory and no killable processes...\n");
> >         }
> > 
> > So how come it got triggered here? Is it possible that there is a global
> > oom killer somehow going on along with the memcg OOM? Because the below
> > stack clearly points to a memcg OOM and a new one AFAICS.
> 
> 4.18 does print this message, and panic() will be called if global OOM
> killer invocation were in progress. 
> 
> 4.18.0-193.51.1.el8.x86_64 is doing
> 
> ----------
>         select_bad_process(oc);
>         /* Found nothing?!?! */
>         if (!oc->chosen) {
>                 dump_header(oc, NULL);
>                 pr_warn("Out of memory and no killable processes...\n");
>                 /*
>                  * If we got here due to an actual allocation at the
>                  * system level, we cannot survive this and will enter
>                  * an endless loop in the allocator. Bail out now.
>                  */
>                 if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
>                         panic("System is deadlocked on memory\n");
>         }
> ----------

Ahh, OK. That would explain it. I have looked at the vanilla 4.18 kernel;
I do not have the RHEL sources handy, nor have I checked the 4.18 stable
tree. Thanks for the clarification!
 
[...]
> Since dump_tasks() from dump_header(oc, NULL) does not exclude tasks
> which already has MMF_OOM_SKIP set, it is possible that the last OOM
> killable victim was already OOM killed but the OOM reaper failed to reclaim
> memory and set MMF_OOM_SKIP. (Well, maybe we want to exclude (or annotate)
> MMF_OOM_SKIP tasks when showing OOM victim candidates...)

Well, the allocating task was clearly alive, and whether it has been reaped
or not is not all that important, as it should force the charge as an OOM
victim. This is actually the most puzzling part, because the allocating task
either is not a preexisting OOM victim (and therefore could become one) or
it is one (and should have skipped the memcg OOM killer altogether). But I
fail to see how it could be missed completely while looking for a victim.
Aaron Tomlin June 9, 2021, 2:35 p.m. UTC | #23
On Tue 2021-06-08 08:22 +0200, Michal Hocko wrote:
> Is it possible the only eligible task has been killed and oom reaped
> already?

Yes, I suspect so; I had a look at the vmcore, and the task in the OOM
report is no longer present. Therefore, I suspect the task named "node"
(i.e. PID 1703345) was OOM killed, i.e. a SIGKILL was sent, it was granted
access to memory reserves, and it was selected by the OOM reaper; the victim
then raised a page fault that triggered yet another "charge" in the memcg
that exceeded the memory limit set on the container; and since no other task
in the memcg had a suitable OOM score and the allocating task/victim was
"unkillable", i.e. already selected for termination by the OOM reaper, we
got the message: "Out of memory and no killable processes...".
Michal Hocko June 10, 2021, 10 a.m. UTC | #24
On Wed 09-06-21 15:35:34, Aaron Tomlin wrote:
> On Tue 2021-06-08 08:22 +0200, Michal Hocko wrote:
> > Is it possible the only eligible task has been killed and oom reaped
> > already?
> 
> Yes, I suspect so; and I had a look at the vmcore, the task in the OOM
> report is no longer present. Therefore, I suspect the task namely "node"
> (i.e. PID 1703345) was OOM killed i.e. a SIGKILL was sent and was granted
> access to memory reserves and selected/or choosen by the OOM reaper for
> termination; the victim then raised a page fault that triggered yet
> another "charge" in the memcg that exceeded the memory limit set on the
> container;

If that were the case, then the allocating (charging) task would not hit
the OOM path at all:
stable/linux-4.18.y:mm/memcontrol.c
try_charge()
        /*
         * Unlike in global OOM situations, memcg is not in a physical
         * memory shortage.  Allow dying and OOM-killed tasks to
         * bypass the last charges so that they can exit quickly and
         * free their memory.
         */
        if (unlikely(tsk_is_oom_victim(current) ||
                     fatal_signal_pending(current) ||
                     current->flags & PF_EXITING))
                goto force;

If you have a crash dump available, then you can check the memcg associated
with the allocating task and check whether it is really marked as an OOM
victim.

> and since no other task in the memcg had a suitable OOM score
> and the allocating task/or victim was "unkillable" i.e. already selected
> for termination by the OOM reaper, we got the message: "Out of memory and
> no killable processes...".

What do you mean by allocating task being unkillable?
Aaron Tomlin June 10, 2021, 12:23 p.m. UTC | #25
On Thu 2021-06-10 12:00 +0200, Michal Hocko wrote:
> If that was the case then the allocating (charging) task would not hit
> the oom path at all

Yes, you are correct. I was looking at another version of the source code.
It does not make sense to consider the OOM code path at all in this context.
The allocating task is marked as an "OOM victim" after SIGKILL is sent (see
__oom_kill_process()).

> What do you mean by allocating task being unkillable?

Please disregard this statement, as per the above.
Anyhow, I think we should exclude tasks that have MMF_OOM_SKIP set from
dump_tasks(), as it could be misleading.
Michal Hocko June 10, 2021, 12:43 p.m. UTC | #26
On Thu 10-06-21 13:23:23, Aaron Tomlin wrote:
[...]
> Anyhow, I think we should exclude tasks that have MMF_OOM_SKIP applied in
> dump_tasks() as it could be misleading.

Well, I am not sure this is a good thing in general. We do not want to hide
tasks. We already display tasks with OOM_SCORE_ADJ_MIN even though they are
not eligible, and that can serve a good purpose in my experience. It gives
us some picture at least. Maybe a flag to mark all ineligible tasks would be
useful, but I wouldn't drop them from the list.
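
One possible shape for such a marker, as a rough sketch only (the helper name
and the annotation strings are illustrative, not an agreed interface):

----------
/* Could be appended to the per-task line that dump_tasks() already prints. */
static const char *oom_eligibility_note(struct task_struct *task)
{
	if (oom_unkillable_task(task))
		return " (unkillable)";
	if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
		return " (oom_score_adj min)";
	if (task->mm && test_bit(MMF_OOM_SKIP, &task->mm->flags))
		return " (already reaped)";
	return "";
}
----------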
Aaron Tomlin June 10, 2021, 1:36 p.m. UTC | #27
On Thu 2021-06-10 14:43 +0200, Michal Hocko wrote:
> Well, I am not sure this is a good thing in general. We do not want to
> hide tasks. We already do display oom_score_adj_min even though they are
> not eligible and that can serve a good purpose from my experience. It
> gives us some picture at least. Maybe a flag to mark all uneligible
> tasks would be something useful but I wouldn't drop them from the list.

Fair enough.
Tetsuo Handa June 10, 2021, 2:06 p.m. UTC | #28
On 2021/06/10 22:36, Aaron Tomlin wrote:
> On Thu 2021-06-10 14:43 +0200, Michal Hocko wrote:
>> Well, I am not sure this is a good thing in general. We do not want to
>> hide tasks. We already do display oom_score_adj_min even though they are
>> not eligible and that can serve a good purpose from my experience. It
>> gives us some picture at least. Maybe a flag to mark all uneligible
>> tasks would be something useful but I wouldn't drop them from the list.
> 
> Fair enough.

Yes, marking ineligible (i.e. oom_badness() returning LONG_MIN) tasks would be useful.

By the way, was the task named "node" (i.e. PID 1703345) a multithreaded
program? Your kernel might want commit 7775face207922ea ("memcg: killed
threads should not invoke memcg OOM killer").

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eefd3f5fde46..3bae33e2d9c2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1089,9 +1089,9 @@  bool out_of_memory(struct oom_control *oc)
 		oc->nodemask = NULL;
 	check_panic_on_oom(oc);
 
-	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
-	    current->mm && !oom_unkillable_task(current) &&
-	    oom_cpuset_eligible(current, oc) &&
+	if (sysctl_oom_kill_allocating_task && current->mm &&
+            !oom_unkillable_task(current) &&
+            oom_cpuset_eligible(current, oc) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 		get_task_struct(current);
 		oc->chosen = current;