Message ID: 9d10df01-0127-fb40-81c3-cc53c9733c3e@virtuozzo.com (mailing list archive)
Series: false global OOM triggered by memcg-limited task
On Mon 18-10-21 11:13:52, Vasily Averin wrote:
[...]
> How could this happen?
>
> A user-space task inside a memcg-limited container generated a page fault,
> its handler do_user_addr_fault() called handle_mm_fault(), which could not
> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
> out_of_memory() without any memcg set.
>
> This problem partially depends on one of my recent patches, which disabled
> unlimited memory allocation for dying tasks. However I think the problem can
> happen on non-killed tasks too, for example because of the kmem limit.

Could you be more specific on how this can happen without your patch? I
have to say I hadn't realized this side effect when we discussed it.

I will be honest that I am not really happy about pagefault_out_of_memory.
I have tried to remove it in the past. Without much success back then,
unfortunately[1]. Maybe we should get rid of it finally. The OOM is always
triggered from inside the allocator where we have much more information
about the allocation context. A first step would be to skip
pagefault_out_of_memory for killed or exiting processes.

[1] I do not have the msg-id so I cannot provide a lore link, but google
pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html
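For readers following the thread: the function under discussion looked
roughly like this at the time (a paraphrase of mm/oom_kill.c from the
v5.15 era, not a verbatim copy). Michal's suggested first step would
amount to an early bail-out for dying tasks at the top of it.

	/* Paraphrased sketch of pagefault_out_of_memory(), v5.15 era. */
	void pagefault_out_of_memory(void)
	{
		struct oom_control oc = {
			.zonelist = NULL,
			.nodemask = NULL,
			.memcg = NULL,		/* note: no memcg context here */
			.gfp_mask = 0,
			.order = 0,
		};

		/* If a memcg OOM is already in progress, let it finish instead. */
		if (mem_cgroup_oom_synchronize(true))
			return;

		if (!mutex_trylock(&oom_lock))
			return;
		/* The global OOM killer runs without any memcg set, as described above. */
		out_of_memory(&oc);
		mutex_unlock(&oom_lock);
	}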
On 18.10.2021 12:04, Michal Hocko wrote:
> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> [...]
>> How could this happen?
>>
>> A user-space task inside a memcg-limited container generated a page fault,
>> its handler do_user_addr_fault() called handle_mm_fault(), which could not
>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
>> out_of_memory() without any memcg set.
>>
>> This problem partially depends on one of my recent patches, which disabled
>> unlimited memory allocation for dying tasks. However I think the problem can
>> happen on non-killed tasks too, for example because of the kmem limit.
>
> Could you be more specific on how this can happen without your patch? I
> have to say I hadn't realized this side effect when we discussed it.

We can reach obj_cgroup_charge_pages() for example via

 do_user_addr_fault
  handle_mm_fault
   __handle_mm_fault
    p4d_alloc
     __p4d_alloc
      p4d_alloc_one
       get_zeroed_page
        __get_free_pages
         alloc_pages
          __alloc_pages
           __memcg_kmem_charge_page
            obj_cgroup_charge_pages

Here we call try_charge_memcg(), which returns success and approves the
allocation; however, we then hit the kmem limit and fail the allocation.

If required I can try to find out how try_charge_memcg() can reject a page
allocation of a non-dying task too.

> I will be honest that I am not really happy about pagefault_out_of_memory.
> I have tried to remove it in the past. Without much success back then,
> unfortunately[1].
> Maybe we should get rid of it finally. The OOM is always triggered from
> inside the allocator where we have much more information about the
> allocation context. A first step would be to skip pagefault_out_of_memory
> for killed or exiting processes.

I like this idea, however it may not be enough, at least in the scenario
described above.

> [1] I do not have the msg-id so I cannot provide a lore link, but google
> pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html

Thank you, I'll read this discussion.

	Vasily Averin
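The failure mode described here comes from the separate v1 kmem page
counter, which is checked only after the main memory counter has accepted
the charge. A paraphrased sketch of the v5.15-era obj_cgroup_charge_pages()
(not a verbatim copy; this check was later removed by the kmem-limit
deprecation work):

	static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
					   unsigned int nr_pages)
	{
		struct page_counter *counter;
		struct mem_cgroup *memcg;
		int ret;

		memcg = get_mem_cgroup_from_objcg(objcg);

		/* Step 1: the main memory counter accepts the charge... */
		ret = try_charge_memcg(memcg, gfp, nr_pages);
		if (ret)
			goto out;

		/* Step 2: ...but the v1-only kmem counter may still reject it. */
		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
		    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
			if (gfp & __GFP_NOFAIL) {
				page_counter_charge(&memcg->kmem, nr_pages);
				goto out;
			}
			cancel_charge(memcg, nr_pages);
			ret = -ENOMEM;	/* surfaces as VM_FAULT_OOM in the fault path */
		}
	out:
		css_put(&memcg->css);
		return ret;
	}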
On 18.10.2021 13:05, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> This problem partially depends on one of my recent patches, which disabled
>>> unlimited memory allocation for dying tasks. However I think the problem can
>>> happen on non-killed tasks too, for example because of the kmem limit.
>>
>> Could you be more specific on how this can happen without your patch? I
>> have to say I hadn't realized this side effect when we discussed it.
>
> We can reach obj_cgroup_charge_pages() for example via
>
>  do_user_addr_fault
>   handle_mm_fault
>    __handle_mm_fault
>     p4d_alloc
>      __p4d_alloc
>       p4d_alloc_one
>        get_zeroed_page
>         __get_free_pages
>          alloc_pages
>           __alloc_pages
>            __memcg_kmem_charge_page
>             obj_cgroup_charge_pages
>
> Here we call try_charge_memcg(), which returns success and approves the
> allocation; however, we then hit the kmem limit and fail the allocation.

Btw. in OpenVZ kernels we try to clean up memory when a task hits the kmem
limit; therefore we moved the kmem limit check into try_charge_memcg().
Would any improvements to the kmem controller be interesting for upstream?

Thank you,
	Vasily Averin
On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
> > On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> > [...]
> >> How could this happen?
> >>
> >> A user-space task inside a memcg-limited container generated a page fault,
> >> its handler do_user_addr_fault() called handle_mm_fault(), which could not
> >> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
> >> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
> >> out_of_memory() without any memcg set.
> >>
> >> This problem partially depends on one of my recent patches, which disabled
> >> unlimited memory allocation for dying tasks. However I think the problem can
> >> happen on non-killed tasks too, for example because of the kmem limit.
> >
> > Could you be more specific on how this can happen without your patch? I
> > have to say I hadn't realized this side effect when we discussed it.
>
> We can reach obj_cgroup_charge_pages() for example via
>
>  do_user_addr_fault
>   handle_mm_fault
>    __handle_mm_fault
>     p4d_alloc
>      __p4d_alloc
>       p4d_alloc_one
>        get_zeroed_page
>         __get_free_pages
>          alloc_pages
>           __alloc_pages
>            __memcg_kmem_charge_page
>             obj_cgroup_charge_pages
>
> Here we call try_charge_memcg(), which returns success and approves the
> allocation; however, we then hit the kmem limit and fail the allocation.

Just to make sure I understand: this would be for the v1 kmem explicit
limit, correct?

> If required I can try to find out how try_charge_memcg() can reject a page
> allocation of a non-dying task too.

Yes.

> > I will be honest that I am not really happy about pagefault_out_of_memory.
> > I have tried to remove it in the past. Without much success back then,
> > unfortunately[1].
> > Maybe we should get rid of it finally. The OOM is always triggered from
> > inside the allocator where we have much more information about the
> > allocation context. A first step would be to skip pagefault_out_of_memory
> > for killed or exiting processes.
>
> I like this idea, however it may not be enough, at least in the scenario
> described above.

My original patch removed the oom killer completely.
[restore the cc list]

On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> On 18.10.2021 14:53, Michal Hocko wrote:
> > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >> On 18.10.2021 12:04, Michal Hocko wrote:
> >> Here we call try_charge_memcg(), which returns success and approves the
> >> allocation; however, we then hit the kmem limit and fail the allocation.
> >
> > Just to make sure I understand: this would be for the v1 kmem explicit
> > limit, correct?
>
> yes, I mean this limit.

OK, thanks for the clarification. This is a known problem. Have a look at
0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
reporting the deprecated and to-be-removed status since 2019 without
receiving any of the usecase reports the kernel message asks for. Maybe we
should try to remove it and see whether that prompts some pushback.
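For reference, the warning added by that commit reads roughly as follows
(paraphrased from mm/memcontrol.c; the exact wording may differ slightly):

	/* Emitted when a value is written to memory.kmem.limit_in_bytes. */
	pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
		     "Please report your usecase to linux-mm@kvack.org if you "
		     "depend on this functionality.\n");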
On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>
> [restore the cc list]
>
> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > On 18.10.2021 14:53, Michal Hocko wrote:
> > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > >> Here we call try_charge_memcg(), which returns success and approves the
> > >> allocation; however, we then hit the kmem limit and fail the allocation.
> > >
> > > Just to make sure I understand: this would be for the v1 kmem explicit
> > > limit, correct?
> >
> > yes, I mean this limit.
>
> OK, thanks for the clarification. This is a known problem. Have a look at
> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
> reporting the deprecated and to-be-removed status since 2019 without
> receiving any of the usecase reports the kernel message asks for. Maybe we
> should try to remove it and see whether that prompts some pushback.
>

Yes, I think now should be the right time to take the next step for
deprecation of kmem limits:
https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
On Mon 18-10-21 08:07:20, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [restore the cc list]
> >
> > On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > > On 18.10.2021 14:53, Michal Hocko wrote:
> > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > > >> Here we call try_charge_memcg(), which returns success and approves the
> > > >> allocation; however, we then hit the kmem limit and fail the allocation.
> > > >
> > > > Just to make sure I understand: this would be for the v1 kmem explicit
> > > > limit, correct?
> > >
> > > yes, I mean this limit.
> >
> > OK, thanks for the clarification. This is a known problem. Have a look at
> > 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
> > reporting the deprecated and to-be-removed status since 2019 without
> > receiving any of the usecase reports the kernel message asks for. Maybe we
> > should try to remove it and see whether that prompts some pushback.
> >
>
> Yes, I think now should be the right time to take the next step for
> deprecation of kmem limits:
> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/

I completely forgot about your patch. Anyway, it usually takes us years to
deprecate something, so let's stick with it and consider 2 years as
years ;)
On Mon, Oct 18, 2021 at 9:51 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 18-10-21 08:07:20, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > [restore the cc list]
> > >
> > > On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > > > On 18.10.2021 14:53, Michal Hocko wrote:
> > > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > > > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > > > >> Here we call try_charge_memcg(), which returns success and approves the
> > > > >> allocation; however, we then hit the kmem limit and fail the allocation.
> > > > >
> > > > > Just to make sure I understand: this would be for the v1 kmem explicit
> > > > > limit, correct?
> > > >
> > > > yes, I mean this limit.
> > >
> > > OK, thanks for the clarification. This is a known problem. Have a look at
> > > 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
> > > reporting the deprecated and to-be-removed status since 2019 without
> > > receiving any of the usecase reports the kernel message asks for. Maybe we
> > > should try to remove it and see whether that prompts some pushback.
> > >
> >
> > Yes, I think now should be the right time to take the next step for
> > deprecation of kmem limits:
> > https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>
> I completely forgot about your patch. Anyway, it usually takes us years
> to deprecate something, so let's stick with it and consider 2 years as
> years ;)
>

Sure, I will rebase and resend.
On 18.10.2021 18:07, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> [restore the cc list]
>>
>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>> Here we call try_charge_memcg(), which returns success and approves the
>>>>> allocation; however, we then hit the kmem limit and fail the allocation.
>>>>
>>>> Just to make sure I understand: this would be for the v1 kmem explicit
>>>> limit, correct?
>>>
>>> yes, I mean this limit.
>>
>> OK, thanks for the clarification. This is a known problem. Have a look at
>> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
>> reporting the deprecated and to-be-removed status since 2019 without
>> receiving any of the usecase reports the kernel message asks for. Maybe we
>> should try to remove it and see whether that prompts some pushback.
>>
>
> Yes, I think now should be the right time to take the next step for
> deprecation of kmem limits:
> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/

Are you going to push it to stable kernels too?

Thank you,
	Vasily Averin
On 18.10.2021 21:52, Vasily Averin wrote:
> On 18.10.2021 18:07, Shakeel Butt wrote:
>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>>>
>>> [restore the cc list]
>>>
>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>>> Here we call try_charge_memcg(), which returns success and approves the
>>>>>> allocation; however, we then hit the kmem limit and fail the allocation.
>>>>>
>>>>> Just to make sure I understand: this would be for the v1 kmem explicit
>>>>> limit, correct?
>>>>
>>>> yes, I mean this limit.
>>>
>>> OK, thanks for the clarification. This is a known problem. Have a look at
>>> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
>>> reporting the deprecated and to-be-removed status since 2019 without
>>> receiving any of the usecase reports the kernel message asks for. Maybe we
>>> should try to remove it and see whether that prompts some pushback.
>>>
>>
>> Yes, I think now should be the right time to take the next step for
>> deprecation of kmem limits:
>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>
> Are you going to push it to stable kernels too?

Btw, CONFIG_MEMCG_KMEM=y is set both in RHEL8 kernels and in the Ubuntu
20.04 LTS kernel 5.11.0-37.
On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 18.10.2021 18:07, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >> [restore the cc list]
> >>
> >> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>> Here we call try_charge_memcg(), which returns success and approves the
> >>>>> allocation; however, we then hit the kmem limit and fail the allocation.
> >>>>
> >>>> Just to make sure I understand: this would be for the v1 kmem explicit
> >>>> limit, correct?
> >>>
> >>> yes, I mean this limit.
> >>
> >> OK, thanks for the clarification. This is a known problem. Have a look at
> >> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
> >> reporting the deprecated and to-be-removed status since 2019 without
> >> receiving any of the usecase reports the kernel message asks for. Maybe we
> >> should try to remove it and see whether that prompts some pushback.
> >>
> >
> > Yes, I think now should be the right time to take the next step for
> > deprecation of kmem limits:
> > https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>
> Are you going to push it to stable kernels too?
>

Not really. Is there a reason I should? More exposure to catch breakage?
On Mon, Oct 18, 2021 at 12:19 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 18.10.2021 21:52, Vasily Averin wrote:
> > On 18.10.2021 18:07, Shakeel Butt wrote:
> >> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >>>
> >>> [restore the cc list]
> >>>
> >>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>>> Here we call try_charge_memcg(), which returns success and approves the
> >>>>>> allocation; however, we then hit the kmem limit and fail the allocation.
> >>>>>
> >>>>> Just to make sure I understand: this would be for the v1 kmem explicit
> >>>>> limit, correct?
> >>>>
> >>>> yes, I mean this limit.
> >>>
> >>> OK, thanks for the clarification. This is a known problem. Have a look at
> >>> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
> >>> reporting the deprecated and to-be-removed status since 2019 without
> >>> receiving any of the usecase reports the kernel message asks for. Maybe we
> >>> should try to remove it and see whether that prompts some pushback.
> >>>
> >>
> >> Yes, I think now should be the right time to take the next step for
> >> deprecation of kmem limits:
> >> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
> >
> > Are you going to push it to stable kernels too?
>
> Btw, CONFIG_MEMCG_KMEM=y is set both in RHEL8 kernels and in the Ubuntu
> 20.04 LTS kernel 5.11.0-37.
>

CONFIG_MEMCG_KMEM is orthogonal to setting kmem limits. We are not
disabling the kmem accounting.
On 18.10.2021 14:53, Michal Hocko wrote:
> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>> On 18.10.2021 12:04, Michal Hocko wrote:
>>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> [...]
>>>> How could this happen?
>>>>
>>>> A user-space task inside a memcg-limited container generated a page fault,
>>>> its handler do_user_addr_fault() called handle_mm_fault(), which could not
>>>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>>>> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
>>>> out_of_memory() without any memcg set.
>>>>
>>>> This problem partially depends on one of my recent patches, which disabled
>>>> unlimited memory allocation for dying tasks. However I think the problem can
>>>> happen on non-killed tasks too, for example because of the kmem limit.
>>>
>>> Could you be more specific on how this can happen without your patch? I
>>> have to say I hadn't realized this side effect when we discussed it.
>>
>> If required I can try to find out how try_charge_memcg() can reject a page
>> allocation of a non-dying task too.
>
> Yes.

Now I think that such a failure was very unlikely (without my patch and the
kmem limit). I cannot exclude it completely, because I have not finished
this review and perhaps I missed something, but I checked most of the code
and found nothing.

With my patch ("memcg: prohibit unconditional exceeding the limit of dying
tasks") try_charge_memcg() can fail:
a) due to a fatal signal
b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
   returns false (when select_bad_process() found nothing)

To handle a) we can follow your suggestion and skip the execution of
out_of_memory() in pagefault_out_of_memory().
To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.

However, all these cases can be successfully handled by my new patch
"memcg: prevent false global OOM triggered by memcg limited task", and I
think it is a better solution.

Thank you,
	Vasily Averin
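A minimal sketch of case a), assuming the check sits at the top of
pagefault_out_of_memory(); this is a hypothetical illustration of the
suggestion, not a patch from this thread:

	void pagefault_out_of_memory(void)
	{
		/*
		 * Case a): the charge failed only because current is dying.
		 * It will release its memory on the way out of the fault,
		 * so invoking the global OOM killer is pointless.
		 */
		if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
			return;

		if (mem_cgroup_oom_synchronize(true))
			return;

		/* ... the global out_of_memory() invocation follows here ... */
	}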
On 19.10.2021 08:33, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> On 18.10.2021 18:07, Shakeel Butt wrote:
>>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>>>>
>>>> [restore the cc list]
>>>>
>>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>>>> Here we call try_charge_memcg(), which returns success and approves the
>>>>>>> allocation; however, we then hit the kmem limit and fail the allocation.
>>>>>>
>>>>>> Just to make sure I understand: this would be for the v1 kmem explicit
>>>>>> limit, correct?
>>>>>
>>>>> yes, I mean this limit.
>>>>
>>>> OK, thanks for the clarification. This is a known problem. Have a look at
>>>> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
>>>> reporting the deprecated and to-be-removed status since 2019 without
>>>> receiving any of the usecase reports the kernel message asks for. Maybe we
>>>> should try to remove it and see whether that prompts some pushback.
>>>
>>> Yes, I think now should be the right time to take the next step for
>>> deprecation of kmem limits:
>>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>>
>> Are you going to push it to stable kernels too?
>
> Not really. Is there a reason I should? More exposure to catch breakage?

There is a problem: the kmem limit can trigger a fake global OOM.
To fix it in upstream you can remove the kmem controller.

However, how do we handle this problem in stable and LTS kernels?
My current patch resolves the problem too, and it can be backported.
However, I am afraid nobody will do it if the patch is not approved in
upstream.

Thank you,
	Vasily Averin
On Tue 19-10-21 09:42:38, Vasily Averin wrote:
> On 19.10.2021 08:33, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>
> >> On 18.10.2021 18:07, Shakeel Butt wrote:
> >>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >>>>
> >>>> [restore the cc list]
> >>>>
> >>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>>>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>>>> Here we call try_charge_memcg(), which returns success and approves the
> >>>>>>> allocation; however, we then hit the kmem limit and fail the allocation.
> >>>>>>
> >>>>>> Just to make sure I understand: this would be for the v1 kmem explicit
> >>>>>> limit, correct?
> >>>>>
> >>>>> yes, I mean this limit.
> >>>>
> >>>> OK, thanks for the clarification. This is a known problem. Have a look at
> >>>> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
> >>>> reporting the deprecated and to-be-removed status since 2019 without
> >>>> receiving any of the usecase reports the kernel message asks for. Maybe we
> >>>> should try to remove it and see whether that prompts some pushback.
> >>>
> >>> Yes, I think now should be the right time to take the next step for
> >>> deprecation of kmem limits:
> >>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
> >>
> >> Are you going to push it to stable kernels too?
> >
> > Not really. Is there a reason I should? More exposure to catch breakage?
>
> There is a problem: the kmem limit can trigger a fake global OOM.
> To fix it in upstream you can remove the kmem controller.
>
> However, how do we handle this problem in stable and LTS kernels?

I do not see any bug reports coming in, and I strongly suspect this is
because the functionality is simply not used widely enough, or not in the
mode where it would matter (U != 0, K < U from our documentation). If
there are relevant usecases for this setup then we really want to hear
about them, because we do not want to break userspace. Handling that setup
would be far from trivial, and the global oom killer is not the only
problem there.

> My current patch resolves the problem too, and it can be backported.
> However, I am afraid nobody will do it if the patch is not approved in
> upstream.

I do not think your current approach is the right one. We do not really
want yet another flag to tell we are in a memcg OOM. We already have one.
A proper way is to deal with the pagefault oom killer trigger, but that
might be just too risky for stable kernels.
On Tue 19-10-21 09:30:18, Vasily Averin wrote:
[...]
> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
> tasks") try_charge_memcg() can fail:
> a) due to a fatal signal
> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
>    returns false (when select_bad_process() found nothing)
>
> To handle a) we can follow your suggestion and skip the execution of
> out_of_memory() in pagefault_out_of_memory().
> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.

How is b) possible without current being killed? Do we allow remote
charging?

> However, all these cases can be successfully handled by my new patch
> "memcg: prevent false global OOM triggered by memcg limited task", and I
> think it is a better solution.

I have already replied to your approach in another email. Sorry, our
replies have crossed.
On 19.10.2021 11:49, Michal Hocko wrote:
> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> [...]
>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
>> tasks") try_charge_memcg() can fail:
>> a) due to a fatal signal
>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
>>    returns false (when select_bad_process() found nothing)
>>
>> To handle a) we can follow your suggestion and skip the execution of
>> out_of_memory() in pagefault_out_of_memory().
>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>
> How is b) possible without current being killed? Do we allow remote
> charging?

out_of_memory for memcg_oom
 select_bad_process
  mem_cgroup_scan_tasks
   oom_evaluate_task
    oom_badness

	/*
	 * Do not even consider tasks which are explicitly marked oom
	 * unkillable or have been already oom reaped or the are in
	 * the middle of vfork
	 */
	adj = (long)p->signal->oom_score_adj;
	if (adj == OOM_SCORE_ADJ_MIN ||
			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
			in_vfork(p)) {
		task_unlock(p);
		return LONG_MIN;
	}

This time we are handling a userspace page fault, so we cannot be a kernel
thread and cannot be in_vfork(). However, the task can be marked as oom
unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
It can be set via oom_score_adj_write().

Thank you,
	Vasily Averin
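When every task in the memcg scores LONG_MIN this way, no victim is chosen
and the failure propagates back to the charge path; paraphrased from
out_of_memory() in mm/oom_kill.c (not a verbatim copy):

	select_bad_process(oc);
	/* Found nothing?!?! */
	if (!oc->chosen) {
		dump_header(oc, NULL);
		pr_warn("Out of memory and no killable processes...\n");
		/*
		 * A global OOM cannot survive this, but a memcg (or sysrq)
		 * OOM simply reports the failure to its caller.
		 */
		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
			panic("System is deadlocked on memory\n");
	}
	/* ...so mem_cgroup_out_of_memory() sees "false" here. */
	return !!oc->chosen;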
On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> On 19.10.2021 11:49, Michal Hocko wrote:
> > On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> > [...]
> >> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
> >> tasks") try_charge_memcg() can fail:
> >> a) due to a fatal signal
> >> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
> >>    returns false (when select_bad_process() found nothing)
> >>
> >> To handle a) we can follow your suggestion and skip the execution of
> >> out_of_memory() in pagefault_out_of_memory().
> >> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>
> > How is b) possible without current being killed? Do we allow remote
> > charging?
>
> out_of_memory for memcg_oom
>  select_bad_process
>   mem_cgroup_scan_tasks
>    oom_evaluate_task
>     oom_badness
>
> 	/*
> 	 * Do not even consider tasks which are explicitly marked oom
> 	 * unkillable or have been already oom reaped or the are in
> 	 * the middle of vfork
> 	 */
> 	adj = (long)p->signal->oom_score_adj;
> 	if (adj == OOM_SCORE_ADJ_MIN ||
> 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> 			in_vfork(p)) {
> 		task_unlock(p);
> 		return LONG_MIN;
> 	}
>
> This time we are handling a userspace page fault, so we cannot be a kernel
> thread and cannot be in_vfork(). However, the task can be marked as oom
> unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.

You are right. I am not sure there is a way out of this though. The task
can only retry forever in this case. There is nothing actionable here.
We cannot kill the task and there is no other way to release the memory.
On Tue 19-10-21 13:54:42, Michal Hocko wrote:
> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> > On 19.10.2021 11:49, Michal Hocko wrote:
> > > On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> > > [...]
> > >> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
> > >> tasks") try_charge_memcg() can fail:
> > >> a) due to a fatal signal
> > >> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
> > >>    returns false (when select_bad_process() found nothing)
> > >>
> > >> To handle a) we can follow your suggestion and skip the execution of
> > >> out_of_memory() in pagefault_out_of_memory().
> > >> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
> >
> > > How is b) possible without current being killed? Do we allow remote
> > > charging?
> >
> > out_of_memory for memcg_oom
> >  select_bad_process
> >   mem_cgroup_scan_tasks
> >    oom_evaluate_task
> >     oom_badness
> >
> > 	/*
> > 	 * Do not even consider tasks which are explicitly marked oom
> > 	 * unkillable or have been already oom reaped or the are in
> > 	 * the middle of vfork
> > 	 */
> > 	adj = (long)p->signal->oom_score_adj;
> > 	if (adj == OOM_SCORE_ADJ_MIN ||
> > 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> > 			in_vfork(p)) {
> > 		task_unlock(p);
> > 		return LONG_MIN;
> > 	}
> >
> > This time we are handling a userspace page fault, so we cannot be a kernel
> > thread and cannot be in_vfork(). However, the task can be marked as oom
> > unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
>
> You are right. I am not sure there is a way out of this though. The task
> can only retry forever in this case. There is nothing actionable here.
> We cannot kill the task and there is no other way to release the memory.

Btw. don't we force the charge in that case?
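The force path Michal refers to sits in try_charge_memcg(); a paraphrase
of the v5.15-era mainline code (before Vasily's patch), not a verbatim
copy:

	/* Near the top of try_charge_memcg(): a dying task bypasses the
	 * limit entirely instead of failing the charge. */
	if (unlikely(should_force_charge()))
		goto force;
	/* ... reclaim, retries and the memcg OOM killer come in between ... */
force:
	/*
	 * The allocation either can't fail or will lead to more memory
	 * being freed very soon. Allow usage to go over the limit
	 * temporarily by force charging it.
	 */
	page_counter_charge(&memcg->memory, nr_pages);
	if (do_memsw_account())
		page_counter_charge(&memcg->memsw, nr_pages);
	return 0;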
On 19.10.2021 15:04, Michal Hocko wrote:
> On Tue 19-10-21 13:54:42, Michal Hocko wrote:
>> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
>>> On 19.10.2021 11:49, Michal Hocko wrote:
>>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
>>>> [...]
>>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
>>>>> tasks") try_charge_memcg() can fail:
>>>>> a) due to a fatal signal
>>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
>>>>>    returns false (when select_bad_process() found nothing)
>>>>>
>>>>> To handle a) we can follow your suggestion and skip the execution of
>>>>> out_of_memory() in pagefault_out_of_memory().
>>>>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>>>
>>>> How is b) possible without current being killed? Do we allow remote
>>>> charging?
>>>
>>> out_of_memory for memcg_oom
>>>  select_bad_process
>>>   mem_cgroup_scan_tasks
>>>    oom_evaluate_task
>>>     oom_badness
>>>
>>> 	/*
>>> 	 * Do not even consider tasks which are explicitly marked oom
>>> 	 * unkillable or have been already oom reaped or the are in
>>> 	 * the middle of vfork
>>> 	 */
>>> 	adj = (long)p->signal->oom_score_adj;
>>> 	if (adj == OOM_SCORE_ADJ_MIN ||
>>> 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>>> 			in_vfork(p)) {
>>> 		task_unlock(p);
>>> 		return LONG_MIN;
>>> 	}
>>>
>>> This time we are handling a userspace page fault, so we cannot be a kernel
>>> thread and cannot be in_vfork(). However, the task can be marked as oom
>>> unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
>>
>> You are right. I am not sure there is a way out of this though. The task
>> can only retry forever in this case. There is nothing actionable here.
>> We cannot kill the task and there is no other way to release the memory.
>
> Btw. don't we force the charge in that case?

We should force the charge for allocations from inside the page fault
handler, to prevent an endless cycle of retried page faults. However, we
should not do it for allocations from task context, to prevent
memcg-limited vmalloc-eaters from consuming all host memory.

Also I would like to return to the following hunk.

@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);

 unlock:
 	mutex_unlock(&oom_lock);

Now I think it is better to keep the task_is_dying() check here: if the
task is dying, it is not necessary to push another task to free the
memory. We broke the vmalloc cycle already, so it looks like nothing
should prevent us from returning to userspace, handling the fatal signal,
exiting and freeing the memory.

Thank you,
	Vasily Averin
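For context, the hunk above lands in a function shaped roughly like this
(a paraphrase of mem_cgroup_out_of_memory() with the patch applied;
details may differ across versions):

	static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
					     int order)
	{
		struct oom_control oc = {
			.zonelist = NULL,
			.nodemask = NULL,
			.memcg = memcg,
			.gfp_mask = gfp_mask,
			.order = order,
		};
		bool ret = false;

		if (mutex_lock_killable(&oom_lock))
			return true;

		/* Another OOM kill may have released enough memory already. */
		if (mem_cgroup_margin(memcg) >= (1 << order))
			goto unlock;

		/*
		 * A few threads which were not waiting at mutex_lock_killable() can
		 * fail to bail out. Therefore, check again after holding oom_lock.
		 */
		ret = task_is_dying() || out_of_memory(&oc);

	unlock:
		mutex_unlock(&oom_lock);
		return ret;
	}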
On Tue 19-10-21 16:26:50, Vasily Averin wrote:
> On 19.10.2021 15:04, Michal Hocko wrote:
> > On Tue 19-10-21 13:54:42, Michal Hocko wrote:
> >> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> >>> On 19.10.2021 11:49, Michal Hocko wrote:
> >>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> >>>> [...]
> >>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
> >>>>> tasks") try_charge_memcg() can fail:
> >>>>> a) due to a fatal signal
> >>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
> >>>>>    returns false (when select_bad_process() found nothing)
> >>>>>
> >>>>> To handle a) we can follow your suggestion and skip the execution of
> >>>>> out_of_memory() in pagefault_out_of_memory().
> >>>>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
> >>>
> >>>> How is b) possible without current being killed? Do we allow remote
> >>>> charging?
> >>>
> >>> out_of_memory for memcg_oom
> >>>  select_bad_process
> >>>   mem_cgroup_scan_tasks
> >>>    oom_evaluate_task
> >>>     oom_badness
> >>>
> >>> 	/*
> >>> 	 * Do not even consider tasks which are explicitly marked oom
> >>> 	 * unkillable or have been already oom reaped or the are in
> >>> 	 * the middle of vfork
> >>> 	 */
> >>> 	adj = (long)p->signal->oom_score_adj;
> >>> 	if (adj == OOM_SCORE_ADJ_MIN ||
> >>> 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> >>> 			in_vfork(p)) {
> >>> 		task_unlock(p);
> >>> 		return LONG_MIN;
> >>> 	}
> >>>
> >>> This time we are handling a userspace page fault, so we cannot be a kernel
> >>> thread and cannot be in_vfork(). However, the task can be marked as oom
> >>> unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
> >>
> >> You are right. I am not sure there is a way out of this though. The task
> >> can only retry forever in this case. There is nothing actionable here.
> >> We cannot kill the task and there is no other way to release the memory.
> >
> > Btw. don't we force the charge in that case?
>
> We should force the charge for allocations from inside the page fault
> handler, to prevent an endless cycle of retried page faults. However, we
> should not do it for allocations from task context, to prevent
> memcg-limited vmalloc-eaters from consuming all host memory.

I don't see a big difference between those two, because the #PF could
result in the very same situation, depleting all the memory by
overcharging. A different behavior just leads to confusion and unexpected
behavior. E.g. in the past we only triggered the memcg OOM killer from the
#PF path and failed the charge otherwise. That is something different, but
it shows problems we hadn't anticipated and had user-visible consequences.
See 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge
path").

> Also I would like to return to the following hunk.
>
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = should_force_charge() || out_of_memory(&oc);
> +	ret = task_is_dying() || out_of_memory(&oc);
>
>  unlock:
>  	mutex_unlock(&oom_lock);
>
> Now I think it is better to keep the task_is_dying() check here: if the
> task is dying, it is not necessary to push another task to free the
> memory. We broke the vmalloc cycle already, so it looks like nothing
> should prevent us from returning to userspace, handling the fatal signal,
> exiting and freeing the memory.

That patch has to be discussed in its full length. There were other
details I brought up, AFAIU.
On Tue 19-10-21 16:13:19, Michal Hocko wrote:
[...]
> That patch has to be discussed in its full length. There were other
> details I brought up, AFAIU.

And btw. thanks for walking through that maze! It is far from trivial,
with side effects all over the place which are far from obvious and
heavily underdocumented.
On 19.10.2021 17:13, Michal Hocko wrote:
> On Tue 19-10-21 16:26:50, Vasily Averin wrote:
>> On 19.10.2021 15:04, Michal Hocko wrote:
>>> On Tue 19-10-21 13:54:42, Michal Hocko wrote:
>>>> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
>>>>> On 19.10.2021 11:49, Michal Hocko wrote:
>>>>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
>>>>>> [...]
>>>>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
>>>>>>> tasks") try_charge_memcg() can fail:
>>>>>>> a) due to a fatal signal
>>>>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
>>>>>>>    returns false (when select_bad_process() found nothing)
>>>>>>>
>>>>>>> To handle a) we can follow your suggestion and skip the execution of
>>>>>>> out_of_memory() in pagefault_out_of_memory().
>>>>>>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>>>>>
>>>>>> How is b) possible without current being killed? Do we allow remote
>>>>>> charging?
>>>>>
>>>>> out_of_memory for memcg_oom
>>>>>  select_bad_process
>>>>>   mem_cgroup_scan_tasks
>>>>>    oom_evaluate_task
>>>>>     oom_badness
>>>>>
>>>>> 	/*
>>>>> 	 * Do not even consider tasks which are explicitly marked oom
>>>>> 	 * unkillable or have been already oom reaped or the are in
>>>>> 	 * the middle of vfork
>>>>> 	 */
>>>>> 	adj = (long)p->signal->oom_score_adj;
>>>>> 	if (adj == OOM_SCORE_ADJ_MIN ||
>>>>> 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>>>>> 			in_vfork(p)) {
>>>>> 		task_unlock(p);
>>>>> 		return LONG_MIN;
>>>>> 	}
>>>>>
>>>>> This time we are handling a userspace page fault, so we cannot be a kernel
>>>>> thread and cannot be in_vfork(). However, the task can be marked as oom
>>>>> unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
>>>>
>>>> You are right. I am not sure there is a way out of this though. The task
>>>> can only retry forever in this case. There is nothing actionable here.
>>>> We cannot kill the task and there is no other way to release the memory.
>>>
>>> Btw. don't we force the charge in that case?
>>
>> We should force the charge for allocations from inside the page fault
>> handler, to prevent an endless cycle of retried page faults. However, we
>> should not do it for allocations from task context, to prevent
>> memcg-limited vmalloc-eaters from consuming all host memory.
>
> I don't see a big difference between those two, because the #PF could
> result in the very same situation, depleting all the memory by
> overcharging. A different behavior just leads to confusion and unexpected
> behavior. E.g. in the past we only triggered the memcg OOM killer from the
> #PF path and failed the charge otherwise. That is something different, but
> it shows problems we hadn't anticipated and had user-visible consequences.
> See 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge
> path").

In this case I think we should fail this allocation. It is better not to
allow overcharging at all, neither in #PF nor in regular allocations.
However, this failure will trigger a false global OOM in
pagefault_out_of_memory(), and we need to find some way to prevent it.

Thank you,
	Vasily Averin
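The false global OOM arises because the arch fault handler maps any
VM_FAULT_OOM to the global path, whether or not there is a real
system-wide shortage; a loose paraphrase of the x86 fault-handling code
(not verbatim, and the helper names vary between kernel versions):

	if (fault & VM_FAULT_OOM) {
		/* Kernel-mode faults take the exception-fixup/oops path. */
		if (!user_mode(regs)) {
			no_context(regs, error_code, address,
				   SIGSEGV, SEGV_MAPERR);
			return;
		}
		/*
		 * For a user-mode fault the kernel calls the global OOM
		 * killer -- even when the -ENOMEM came from a memcg charge
		 * failure, which is the "false global OOM" in question.
		 */
		pagefault_out_of_memory();
	}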
On 18.10.2021 12:04, Michal Hocko wrote:
> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> [...]
>> How could this happen?
>>
>> A user-space task inside a memcg-limited container generated a page fault,
>> its handler do_user_addr_fault() called handle_mm_fault(), which could not
>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
>> out_of_memory() without any memcg set.
>
> I will be honest that I am not really happy about pagefault_out_of_memory.
> I have tried to remove it in the past. Without much success back then,
> unfortunately[1].
>
> [1] I do not have the msg-id so I cannot provide a lore link, but google
> pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html

I re-read this discussion and in general I support your position.
As far as I understand, your opponents could not explain why "random kill"
is mandatory here; they were just afraid it might be useful and did not
want to remove it completely.

Ok, let's allow them to keep it. Moreover, I'm ready to keep it as the
default behavior.

However, I would like to have some choice at this point.

In general we can:
- continue to use "random kill" and rely on the wisdom of the ancestors;
- do nothing, repeat the #PF and rely on fate: "nothing bad will happen if
  we do it again";
- add some (progressive) killable delay, rely on the good will of the
  (unkillable) neighbors and wait for them to release the required memory;
- mark the current task as cycling in #PF and somehow use this mark in the
  allocator;
- make sure that the current task is really cycling and making no
  progress, then send it a fatal signal to kill it and break the cycle;
- implement any better ideas;
- use any combination of the previous points.

We could select the required strategy, for example, via sysctl.

For me "random kill" is the worst choice. Why can't we just kill the
looping process instead? It can be marked oom-unkillable, so the OOM
killer is unable to select it. However, I doubt that means "never kill
it"; for me it is something like a "last possible victim" priority.

Thank you,
	Vasily Averin
On Thu 21-10-21 11:03:43, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
> > On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> > [...]
> >> How could this happen?
> >>
> >> A user-space task inside a memcg-limited container generated a page fault,
> >> its handler do_user_addr_fault() called handle_mm_fault(), which could not
> >> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
> >> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
> >> out_of_memory() without any memcg set.
> >
> > I will be honest that I am not really happy about pagefault_out_of_memory.
> > I have tried to remove it in the past. Without much success back then,
> > unfortunately[1].
> >
> > [1] I do not have the msg-id so I cannot provide a lore link, but google
> > pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html
>
> I re-read this discussion and in general I support your position.
> As far as I understand, your opponents could not explain why "random kill"
> is mandatory here; they were just afraid it might be useful and did not
> want to remove it completely.

That aligns with my recollection.

> Ok, let's allow them to keep it. Moreover, I'm ready to keep it as the
> default behavior.
>
> However, I would like to have some choice at this point.
>
> In general we can:
> - continue to use "random kill" and rely on the wisdom of the ancestors;

I do not follow. Does that mean to preserve the existing oom killer from
#PF?

> - do nothing, repeat the #PF and rely on fate: "nothing bad will happen if
>   we do it again";
> - add some (progressive) killable delay, rely on the good will of the
>   (unkillable) neighbors and wait for them to release the required memory;

Again, not really sure what you mean.

> - mark the current task as cycling in #PF and somehow use this mark in the
>   allocator;

How?

> - make sure that the current task is really cycling and making no
>   progress, then send it a fatal signal to kill it and break the cycle;

No! We cannot really kill the task; if we could, we would have done it in
the oom killer already.

> - implement any better ideas;
> - use any combination of the previous points.
>
> We could select the required strategy, for example, via sysctl.

Absolutely not! How can an admin know any better than the kernel?

> For me "random kill" is the worst choice. Why can't we just kill the
> looping process instead?

See above.

> It can be marked oom-unkillable, so the OOM killer is unable to select it.
> However, I doubt that means "never kill it"; for me it is something like a
> "last possible victim" priority.

It means never kill it because of OOM. If it is retrying because of OOM
then it is effectively the same thing.

The oom killer from the #PF doesn't really provide any clear advantage
these days AFAIK. On the other hand it allows for very disruptive
behavior. In the worst case it can lead to a system panic if the
VM_FAULT_OOM is not really caused by a memory shortage but rather by wrong
error handling. If a task is looping there without any progress then it is
still killable, which is a much saner behavior IMHO.
On 21.10.2021 14:49, Michal Hocko wrote:
> On Thu 21-10-21 11:03:43, Vasily Averin wrote:
>> I re-read this discussion and in general I support your position.
>> As far as I understand, your opponents could not explain why "random kill"
>> is mandatory here; they were just afraid it might be useful and did not
>> want to remove it completely.
>
> That aligns with my recollection.
>
>> Ok, let's allow them to keep it. Moreover, I'm ready to keep it as the
>> default behavior.
>>
>> However, I would like to have some choice at this point.
>>
>> In general we can:
>> - continue to use "random kill" and rely on the wisdom of the ancestors;
>
> I do not follow. Does that mean to preserve the existing oom killer from
> #PF?
>
>> - do nothing, repeat the #PF and rely on fate: "nothing bad will happen if
>>   we do it again";
>> - add some (progressive) killable delay, rely on the good will of the
>>   (unkillable) neighbors and wait for them to release the required memory;
>
> Again, not really sure what you mean.
>
>> - mark the current task as cycling in #PF and somehow use this mark in the
>>   allocator;
>
> How?
>
>> - make sure that the current task is really cycling and making no
>>   progress, then send it a fatal signal to kill it and break the cycle;
>
> No! We cannot really kill the task; if we could, we would have done it in
> the oom killer already.
>
>> - implement any better ideas;
>> - use any combination of the previous points.
>>
>> We could select the required strategy, for example, via sysctl.
>
> Absolutely not! How can an admin know any better than the kernel?
>
>> For me "random kill" is the worst choice. Why can't we just kill the
>> looping process instead?
>
> See above.
>
>> It can be marked oom-unkillable, so the OOM killer is unable to select it.
>> However, I doubt that means "never kill it"; for me it is something like a
>> "last possible victim" priority.
>
> It means never kill it because of OOM. If it is retrying because of OOM
> then it is effectively the same thing.
>
> The oom killer from the #PF doesn't really provide any clear advantage
> these days AFAIK. On the other hand it allows for very disruptive
> behavior. In the worst case it can lead to a system panic if the
> VM_FAULT_OOM is not really caused by a memory shortage but rather by wrong
> error handling. If a task is looping there without any progress then it is
> still killable, which is a much saner behavior IMHO.

Let's continue this discussion in the "Re: [PATCH memcg 3/3] memcg: handle
memcg oom failures" thread.