Message ID: 9d10df01-0127-fb40-81c3-cc53c9733c3e@virtuozzo.com (mailing list archive)
Series: false global OOM triggered by memcg-limited task
On Mon 18-10-21 11:13:52, Vasily Averin wrote:
[...]
> How could this happen?
>
> A user-space task inside a memcg-limited container generated a page fault,
> its handler do_user_addr_fault() called handle_mm_fault(), which could not
> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
> out_of_memory() without any memcg set.
>
> This problem partially depends on one of my recent patches, which disabled
> unlimited memory allocation for dying tasks. However I think the problem can
> happen on non-killed tasks too, for example because of the kmem limit.

Could you be more specific on how this can happen without your patch? I
have to say I hadn't realized this side effect when we discussed it.

I will be honest that I am not really happy about pagefault_out_of_memory.
I have tried to remove it in the past. Without much success back then,
unfortunately[1]. Maybe we should get rid of it finally. The OOM is always
triggered from inside the allocator where we have much more information
about the allocation context. A first step would be to skip
pagefault_out_of_memory for killed or exiting processes.

[1] I do not have the msg-id so I cannot provide a lore link, but google
pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html
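For readers following the thread: the function under discussion looked
roughly like this at the time (a paraphrase of mm/oom_kill.c from the
v5.15 era, not a verbatim copy). Michal's suggested first step would
amount to an early bail-out for dying tasks at the top of it.

	/* Paraphrased sketch of pagefault_out_of_memory(), v5.15 era. */
	void pagefault_out_of_memory(void)
	{
		struct oom_control oc = {
			.zonelist = NULL,
			.nodemask = NULL,
			.memcg = NULL,		/* note: no memcg context here */
			.gfp_mask = 0,
			.order = 0,
		};

		/* If a memcg OOM is already in progress, let it finish instead. */
		if (mem_cgroup_oom_synchronize(true))
			return;

		if (!mutex_trylock(&oom_lock))
			return;
		/* The global OOM killer runs without any memcg set, as described above. */
		out_of_memory(&oc);
		mutex_unlock(&oom_lock);
	}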
On 18.10.2021 12:04, Michal Hocko wrote:
> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> [...]
>> How could this happen?
>>
>> A user-space task inside a memcg-limited container generated a page fault,
>> its handler do_user_addr_fault() called handle_mm_fault(), which could not
>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
>> out_of_memory() without any memcg set.
>>
>> This problem partially depends on one of my recent patches, which disabled
>> unlimited memory allocation for dying tasks. However I think the problem can
>> happen on non-killed tasks too, for example because of the kmem limit.
>
> Could you be more specific on how this can happen without your patch? I
> have to say I hadn't realized this side effect when we discussed it.

We can reach obj_cgroup_charge_pages() for example via

 do_user_addr_fault
  handle_mm_fault
   __handle_mm_fault
    p4d_alloc
     __p4d_alloc
      p4d_alloc_one
       get_zeroed_page
        __get_free_pages
         alloc_pages
          __alloc_pages
           __memcg_kmem_charge_page
            obj_cgroup_charge_pages

Here we call try_charge_memcg(), which returns success and approves the
allocation; however, we then hit the kmem limit and fail the allocation.

If required I can try to find out how try_charge_memcg() can reject a page
allocation of a non-dying task too.

> I will be honest that I am not really happy about pagefault_out_of_memory.
> I have tried to remove it in the past. Without much success back then,
> unfortunately[1].
> Maybe we should get rid of it finally. The OOM is always triggered from
> inside the allocator where we have much more information about the
> allocation context. A first step would be to skip pagefault_out_of_memory
> for killed or exiting processes.

I like this idea, however it may not be enough, at least in the scenario
described above.

> [1] I do not have the msg-id so I cannot provide a lore link, but google
> pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html

Thank you, I'll read this discussion.

	Vasily Averin
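The failure mode described here comes from the separate v1 kmem page
counter, which is checked only after the main memory counter has accepted
the charge. A paraphrased sketch of the v5.15-era obj_cgroup_charge_pages()
(not a verbatim copy; this check was later removed by the kmem-limit
deprecation work):

	static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
					   unsigned int nr_pages)
	{
		struct page_counter *counter;
		struct mem_cgroup *memcg;
		int ret;

		memcg = get_mem_cgroup_from_objcg(objcg);

		/* Step 1: the main memory counter accepts the charge... */
		ret = try_charge_memcg(memcg, gfp, nr_pages);
		if (ret)
			goto out;

		/* Step 2: ...but the v1-only kmem counter may still reject it. */
		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
		    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
			if (gfp & __GFP_NOFAIL) {
				page_counter_charge(&memcg->kmem, nr_pages);
				goto out;
			}
			cancel_charge(memcg, nr_pages);
			ret = -ENOMEM;	/* surfaces as VM_FAULT_OOM in the fault path */
		}
	out:
		css_put(&memcg->css);
		return ret;
	}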
On 18.10.2021 13:05, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> This problem partially depends on one of my recent patches, which disabled
>>> unlimited memory allocation for dying tasks. However I think the problem can
>>> happen on non-killed tasks too, for example because of the kmem limit.
>>
>> Could you be more specific on how this can happen without your patch? I
>> have to say I hadn't realized this side effect when we discussed it.
>
> We can reach obj_cgroup_charge_pages() for example via
>
>  do_user_addr_fault
>   handle_mm_fault
>    __handle_mm_fault
>     p4d_alloc
>      __p4d_alloc
>       p4d_alloc_one
>        get_zeroed_page
>         __get_free_pages
>          alloc_pages
>           __alloc_pages
>            __memcg_kmem_charge_page
>             obj_cgroup_charge_pages
>
> Here we call try_charge_memcg(), which returns success and approves the
> allocation; however, we then hit the kmem limit and fail the allocation.

Btw. in OpenVZ kernels we try to clean up memory when a task hits the kmem
limit; therefore we moved the kmem limit check into try_charge_memcg().
Would any improvements to the kmem controller be interesting for upstream?

Thank you,
	Vasily Averin
On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
> > On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> > [...]
> >> How could this happen?
> >>
> >> A user-space task inside a memcg-limited container generated a page fault,
> >> its handler do_user_addr_fault() called handle_mm_fault(), which could not
> >> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
> >> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
> >> out_of_memory() without any memcg set.
> >>
> >> This problem partially depends on one of my recent patches, which disabled
> >> unlimited memory allocation for dying tasks. However I think the problem can
> >> happen on non-killed tasks too, for example because of the kmem limit.
> >
> > Could you be more specific on how this can happen without your patch? I
> > have to say I hadn't realized this side effect when we discussed it.
>
> We can reach obj_cgroup_charge_pages() for example via
>
>  do_user_addr_fault
>   handle_mm_fault
>    __handle_mm_fault
>     p4d_alloc
>      __p4d_alloc
>       p4d_alloc_one
>        get_zeroed_page
>         __get_free_pages
>          alloc_pages
>           __alloc_pages
>            __memcg_kmem_charge_page
>             obj_cgroup_charge_pages
>
> Here we call try_charge_memcg(), which returns success and approves the
> allocation; however, we then hit the kmem limit and fail the allocation.

Just to make sure I understand: this would be for the v1 kmem explicit
limit, correct?

> If required I can try to find out how try_charge_memcg() can reject a page
> allocation of a non-dying task too.

Yes.

> > I will be honest that I am not really happy about pagefault_out_of_memory.
> > I have tried to remove it in the past. Without much success back then,
> > unfortunately[1].
> > Maybe we should get rid of it finally. The OOM is always triggered from
> > inside the allocator where we have much more information about the
> > allocation context. A first step would be to skip pagefault_out_of_memory
> > for killed or exiting processes.
>
> I like this idea, however it may not be enough, at least in the scenario
> described above.

My original patch removed the oom killer completely.
[restore the cc list]

On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> On 18.10.2021 14:53, Michal Hocko wrote:
> > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >> On 18.10.2021 12:04, Michal Hocko wrote:
> >> Here we call try_charge_memcg(), which returns success and approves the
> >> allocation; however, we then hit the kmem limit and fail the allocation.
> >
> > Just to make sure I understand: this would be for the v1 kmem explicit
> > limit, correct?
>
> yes, I mean this limit.

OK, thanks for the clarification. This is a known problem. Have a look at
0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
reporting the deprecated and to-be-removed status since 2019 without
receiving any of the usecase reports the kernel message asks for. Maybe we
should try to remove it and see whether that prompts some pushback.
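For reference, the warning added by that commit reads roughly as follows
(paraphrased from mm/memcontrol.c; the exact wording may differ slightly):

	/* Emitted when a value is written to memory.kmem.limit_in_bytes. */
	pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
		     "Please report your usecase to linux-mm@kvack.org if you "
		     "depend on this functionality.\n");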
On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>
> [restore the cc list]
>
> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > On 18.10.2021 14:53, Michal Hocko wrote:
> > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > >> Here we call try_charge_memcg(), which returns success and approves the
> > >> allocation; however, we then hit the kmem limit and fail the allocation.
> > >
> > > Just to make sure I understand: this would be for the v1 kmem explicit
> > > limit, correct?
> >
> > yes, I mean this limit.
>
> OK, thanks for the clarification. This is a known problem. Have a look at
> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
> reporting the deprecated and to-be-removed status since 2019 without
> receiving any of the usecase reports the kernel message asks for. Maybe we
> should try to remove it and see whether that prompts some pushback.
>

Yes, I think now should be the right time to take the next step for
deprecation of kmem limits:
https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
On Mon 18-10-21 08:07:20, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [restore the cc list]
> >
> > On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > > On 18.10.2021 14:53, Michal Hocko wrote:
> > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > > >> Here we call try_charge_memcg(), which returns success and approves the
> > > >> allocation; however, we then hit the kmem limit and fail the allocation.
> > > >
> > > > Just to make sure I understand: this would be for the v1 kmem explicit
> > > > limit, correct?
> > >
> > > yes, I mean this limit.
> >
> > OK, thanks for the clarification. This is a known problem. Have a look at
> > 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
> > reporting the deprecated and to-be-removed status since 2019 without
> > receiving any of the usecase reports the kernel message asks for. Maybe we
> > should try to remove it and see whether that prompts some pushback.
> >
>
> Yes, I think now should be the right time to take the next step for
> deprecation of kmem limits:
> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/

I completely forgot about your patch. Anyway, it usually takes us years to
deprecate something, so let's stick with it and consider 2 years as
years ;)
On Mon, Oct 18, 2021 at 9:51 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 18-10-21 08:07:20, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > [restore the cc list]
> > >
> > > On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > > > On 18.10.2021 14:53, Michal Hocko wrote:
> > > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > > > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > > > >> Here we call try_charge_memcg(), which returns success and approves the
> > > > >> allocation; however, we then hit the kmem limit and fail the allocation.
> > > > >
> > > > > Just to make sure I understand: this would be for the v1 kmem explicit
> > > > > limit, correct?
> > > >
> > > > yes, I mean this limit.
> > >
> > > OK, thanks for the clarification. This is a known problem. Have a look at
> > > 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
> > > reporting the deprecated and to-be-removed status since 2019 without
> > > receiving any of the usecase reports the kernel message asks for. Maybe we
> > > should try to remove it and see whether that prompts some pushback.
> > >
> >
> > Yes, I think now should be the right time to take the next step for
> > deprecation of kmem limits:
> > https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>
> I completely forgot about your patch. Anyway, it usually takes us years
> to deprecate something, so let's stick with it and consider 2 years as
> years ;)
>

Sure, I will rebase and resend.
On 18.10.2021 18:07, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> [restore the cc list]
>>
>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>> Here we call try_charge_memcg(), which returns success and approves the
>>>>> allocation; however, we then hit the kmem limit and fail the allocation.
>>>>
>>>> Just to make sure I understand: this would be for the v1 kmem explicit
>>>> limit, correct?
>>>
>>> yes, I mean this limit.
>>
>> OK, thanks for the clarification. This is a known problem. Have a look at
>> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
>> reporting the deprecated and to-be-removed status since 2019 without
>> receiving any of the usecase reports the kernel message asks for. Maybe we
>> should try to remove it and see whether that prompts some pushback.
>>
>
> Yes, I think now should be the right time to take the next step for
> deprecation of kmem limits:
> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/

Are you going to push it to stable kernels too?

Thank you,
	Vasily Averin
On 18.10.2021 21:52, Vasily Averin wrote:
> On 18.10.2021 18:07, Shakeel Butt wrote:
>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>>>
>>> [restore the cc list]
>>>
>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>>> Here we call try_charge_memcg(), which returns success and approves the
>>>>>> allocation; however, we then hit the kmem limit and fail the allocation.
>>>>>
>>>>> Just to make sure I understand: this would be for the v1 kmem explicit
>>>>> limit, correct?
>>>>
>>>> yes, I mean this limit.
>>>
>>> OK, thanks for the clarification. This is a known problem. Have a look at
>>> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
>>> reporting the deprecated and to-be-removed status since 2019 without
>>> receiving any of the usecase reports the kernel message asks for. Maybe we
>>> should try to remove it and see whether that prompts some pushback.
>>>
>>
>> Yes, I think now should be the right time to take the next step for
>> deprecation of kmem limits:
>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>
> Are you going to push it to stable kernels too?

Btw, CONFIG_MEMCG_KMEM=y is set both in RHEL8 kernels and in the Ubuntu
20.04 LTS kernel 5.11.0-37.
On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 18.10.2021 18:07, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >> [restore the cc list]
> >>
> >> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>> Here we call try_charge_memcg(), which returns success and approves the
> >>>>> allocation; however, we then hit the kmem limit and fail the allocation.
> >>>>
> >>>> Just to make sure I understand: this would be for the v1 kmem explicit
> >>>> limit, correct?
> >>>
> >>> yes, I mean this limit.
> >>
> >> OK, thanks for the clarification. This is a known problem. Have a look at
> >> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
> >> reporting the deprecated and to-be-removed status since 2019 without
> >> receiving any of the usecase reports the kernel message asks for. Maybe we
> >> should try to remove it and see whether that prompts some pushback.
> >>
> >
> > Yes, I think now should be the right time to take the next step for
> > deprecation of kmem limits:
> > https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>
> Are you going to push it to stable kernels too?
>

Not really. Is there a reason I should? More exposure to catch breakage?
On Mon, Oct 18, 2021 at 12:19 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 18.10.2021 21:52, Vasily Averin wrote:
> > On 18.10.2021 18:07, Shakeel Butt wrote:
> >> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >>>
> >>> [restore the cc list]
> >>>
> >>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>>> Here we call try_charge_memcg(), which returns success and approves the
> >>>>>> allocation; however, we then hit the kmem limit and fail the allocation.
> >>>>>
> >>>>> Just to make sure I understand: this would be for the v1 kmem explicit
> >>>>> limit, correct?
> >>>>
> >>>> yes, I mean this limit.
> >>>
> >>> OK, thanks for the clarification. This is a known problem. Have a look at
> >>> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
> >>> reporting the deprecated and to-be-removed status since 2019 without
> >>> receiving any of the usecase reports the kernel message asks for. Maybe we
> >>> should try to remove it and see whether that prompts some pushback.
> >>>
> >>
> >> Yes, I think now should be the right time to take the next step for
> >> deprecation of kmem limits:
> >> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
> >
> > Are you going to push it to stable kernels too?
>
> Btw, CONFIG_MEMCG_KMEM=y is set both in RHEL8 kernels and in the Ubuntu
> 20.04 LTS kernel 5.11.0-37.
>

CONFIG_MEMCG_KMEM is orthogonal to setting kmem limits. We are not
disabling the kmem accounting.
On 18.10.2021 14:53, Michal Hocko wrote:
> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>> On 18.10.2021 12:04, Michal Hocko wrote:
>>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> [...]
>>>> How could this happen?
>>>>
>>>> A user-space task inside a memcg-limited container generated a page fault,
>>>> its handler do_user_addr_fault() called handle_mm_fault(), which could not
>>>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>>>> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
>>>> out_of_memory() without any memcg set.
>>>>
>>>> This problem partially depends on one of my recent patches, which disabled
>>>> unlimited memory allocation for dying tasks. However I think the problem can
>>>> happen on non-killed tasks too, for example because of the kmem limit.
>>>
>>> Could you be more specific on how this can happen without your patch? I
>>> have to say I hadn't realized this side effect when we discussed it.
>>
>> If required I can try to find out how try_charge_memcg() can reject a page
>> allocation of a non-dying task too.
>
> Yes.

Now I think that such a failure was very unlikely (without my patch and the
kmem limit). I cannot exclude it completely, because I have not finished
this review and perhaps I missed something, but I checked most of the code
and found nothing.

With my patch ("memcg: prohibit unconditional exceeding the limit of dying
tasks") try_charge_memcg() can fail:
a) due to a fatal signal
b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
   returns false (when select_bad_process() found nothing)

To handle a) we can follow your suggestion and skip the execution of
out_of_memory() in pagefault_out_of_memory().
To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.

However, all these cases can be successfully handled by my new patch
"memcg: prevent false global OOM triggered by memcg limited task", and I
think it is a better solution.

Thank you,
	Vasily Averin
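A minimal sketch of case a), assuming the check sits at the top of
pagefault_out_of_memory(); this is a hypothetical illustration of the
suggestion, not a patch from this thread:

	void pagefault_out_of_memory(void)
	{
		/*
		 * Case a): the charge failed only because current is dying.
		 * It will release its memory on the way out of the fault,
		 * so invoking the global OOM killer is pointless.
		 */
		if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
			return;

		if (mem_cgroup_oom_synchronize(true))
			return;

		/* ... the global out_of_memory() invocation follows here ... */
	}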
On 19.10.2021 08:33, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> On 18.10.2021 18:07, Shakeel Butt wrote:
>>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>>>>
>>>> [restore the cc list]
>>>>
>>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>>>> Here we call try_charge_memcg(), which returns success and approves the
>>>>>>> allocation; however, we then hit the kmem limit and fail the allocation.
>>>>>>
>>>>>> Just to make sure I understand: this would be for the v1 kmem explicit
>>>>>> limit, correct?
>>>>>
>>>>> yes, I mean this limit.
>>>>
>>>> OK, thanks for the clarification. This is a known problem. Have a look at
>>>> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
>>>> reporting the deprecated and to-be-removed status since 2019 without
>>>> receiving any of the usecase reports the kernel message asks for. Maybe we
>>>> should try to remove it and see whether that prompts some pushback.
>>>
>>> Yes, I think now should be the right time to take the next step for
>>> deprecation of kmem limits:
>>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>>
>> Are you going to push it to stable kernels too?
>
> Not really. Is there a reason I should? More exposure to catch breakage?

There is a problem: the kmem limit can trigger a fake global OOM.
To fix it in upstream you can remove the kmem controller.

However, how do we handle this problem in stable and LTS kernels?
My current patch resolves the problem too, and it can be backported.
However, I am afraid nobody will do it if the patch is not approved in
upstream.

Thank you,
	Vasily Averin
On Tue 19-10-21 09:42:38, Vasily Averin wrote:
> On 19.10.2021 08:33, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>
> >> On 18.10.2021 18:07, Shakeel Butt wrote:
> >>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >>>>
> >>>> [restore the cc list]
> >>>>
> >>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>>>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>>>> Here we call try_charge_memcg(), which returns success and approves the
> >>>>>>> allocation; however, we then hit the kmem limit and fail the allocation.
> >>>>>>
> >>>>>> Just to make sure I understand: this would be for the v1 kmem explicit
> >>>>>> limit, correct?
> >>>>>
> >>>>> yes, I mean this limit.
> >>>>
> >>>> OK, thanks for the clarification. This is a known problem. Have a look at
> >>>> 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
> >>>> reporting the deprecated and to-be-removed status since 2019 without
> >>>> receiving any of the usecase reports the kernel message asks for. Maybe we
> >>>> should try to remove it and see whether that prompts some pushback.
> >>>
> >>> Yes, I think now should be the right time to take the next step for
> >>> deprecation of kmem limits:
> >>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
> >>
> >> Are you going to push it to stable kernels too?
> >
> > Not really. Is there a reason I should? More exposure to catch breakage?
>
> There is a problem: the kmem limit can trigger a fake global OOM.
> To fix it in upstream you can remove the kmem controller.
>
> However, how do we handle this problem in stable and LTS kernels?

I do not see any bug reports coming in, and I strongly suspect this is
because the functionality is simply not used widely enough, or not in the
mode where it would matter (U != 0, K < U from our documentation). If
there are relevant usecases for this setup then we really want to hear
about them, because we do not want to break userspace. Handling that setup
would be far from trivial, and the global oom killer is not the only
problem there.

> My current patch resolves the problem too, and it can be backported.
> However, I am afraid nobody will do it if the patch is not approved in
> upstream.

I do not think your current approach is the right one. We do not really
want yet another flag to tell we are in a memcg OOM. We already have one.
A proper way is to deal with the pagefault oom killer trigger, but that
might be just too risky for stable kernels.
On Tue 19-10-21 09:30:18, Vasily Averin wrote:
[...]
> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
> tasks") try_charge_memcg() can fail:
> a) due to a fatal signal
> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
>    returns false (when select_bad_process() found nothing)
>
> To handle a) we can follow your suggestion and skip the execution of
> out_of_memory() in pagefault_out_of_memory().
> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.

How is b) possible without current being killed? Do we allow remote
charging?

> However, all these cases can be successfully handled by my new patch
> "memcg: prevent false global OOM triggered by memcg limited task", and I
> think it is a better solution.

I have already replied to your approach in another email. Sorry, our
replies have crossed.
On 19.10.2021 11:49, Michal Hocko wrote:
> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> [...]
>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
>> tasks") try_charge_memcg() can fail:
>> a) due to a fatal signal
>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
>>    returns false (when select_bad_process() found nothing)
>>
>> To handle a) we can follow your suggestion and skip the execution of
>> out_of_memory() in pagefault_out_of_memory().
>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>
> How is b) possible without current being killed? Do we allow remote
> charging?

out_of_memory for memcg_oom
 select_bad_process
  mem_cgroup_scan_tasks
   oom_evaluate_task
    oom_badness

	/*
	 * Do not even consider tasks which are explicitly marked oom
	 * unkillable or have been already oom reaped or the are in
	 * the middle of vfork
	 */
	adj = (long)p->signal->oom_score_adj;
	if (adj == OOM_SCORE_ADJ_MIN ||
			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
			in_vfork(p)) {
		task_unlock(p);
		return LONG_MIN;
	}

This time we are handling a userspace page fault, so we cannot be a kernel
thread and cannot be in_vfork(). However, the task can be marked as oom
unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
It can be set via oom_score_adj_write().

Thank you,
	Vasily Averin
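When every task in the memcg scores LONG_MIN this way, no victim is chosen
and the failure propagates back to the charge path; paraphrased from
out_of_memory() in mm/oom_kill.c (not a verbatim copy):

	select_bad_process(oc);
	/* Found nothing?!?! */
	if (!oc->chosen) {
		dump_header(oc, NULL);
		pr_warn("Out of memory and no killable processes...\n");
		/*
		 * A global OOM cannot survive this, but a memcg (or sysrq)
		 * OOM simply reports the failure to its caller.
		 */
		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
			panic("System is deadlocked on memory\n");
	}
	/* ...so mem_cgroup_out_of_memory() sees "false" here. */
	return !!oc->chosen;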
On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> On 19.10.2021 11:49, Michal Hocko wrote:
> > On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> > [...]
> >> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
> >> tasks") try_charge_memcg() can fail:
> >> a) due to a fatal signal
> >> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
> >>    returns false (when select_bad_process() found nothing)
> >>
> >> To handle a) we can follow your suggestion and skip the execution of
> >> out_of_memory() in pagefault_out_of_memory().
> >> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>
> > How is b) possible without current being killed? Do we allow remote
> > charging?
>
> out_of_memory for memcg_oom
>  select_bad_process
>   mem_cgroup_scan_tasks
>    oom_evaluate_task
>     oom_badness
>
> 	/*
> 	 * Do not even consider tasks which are explicitly marked oom
> 	 * unkillable or have been already oom reaped or the are in
> 	 * the middle of vfork
> 	 */
> 	adj = (long)p->signal->oom_score_adj;
> 	if (adj == OOM_SCORE_ADJ_MIN ||
> 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> 			in_vfork(p)) {
> 		task_unlock(p);
> 		return LONG_MIN;
> 	}
>
> This time we are handling a userspace page fault, so we cannot be a kernel
> thread and cannot be in_vfork(). However, the task can be marked as oom
> unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.

You are right. I am not sure there is a way out of this though. The task
can only retry forever in this case. There is nothing actionable here.
We cannot kill the task and there is no other way to release the memory.
On Tue 19-10-21 13:54:42, Michal Hocko wrote:
> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> > On 19.10.2021 11:49, Michal Hocko wrote:
> > > On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> > > [...]
> > >> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
> > >> tasks") try_charge_memcg() can fail:
> > >> a) due to a fatal signal
> > >> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
> > >>    returns false (when select_bad_process() found nothing)
> > >>
> > >> To handle a) we can follow your suggestion and skip the execution of
> > >> out_of_memory() in pagefault_out_of_memory().
> > >> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
> >
> > > How is b) possible without current being killed? Do we allow remote
> > > charging?
> >
> > out_of_memory for memcg_oom
> >  select_bad_process
> >   mem_cgroup_scan_tasks
> >    oom_evaluate_task
> >     oom_badness
> >
> > 	/*
> > 	 * Do not even consider tasks which are explicitly marked oom
> > 	 * unkillable or have been already oom reaped or the are in
> > 	 * the middle of vfork
> > 	 */
> > 	adj = (long)p->signal->oom_score_adj;
> > 	if (adj == OOM_SCORE_ADJ_MIN ||
> > 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> > 			in_vfork(p)) {
> > 		task_unlock(p);
> > 		return LONG_MIN;
> > 	}
> >
> > This time we are handling a userspace page fault, so we cannot be a kernel
> > thread and cannot be in_vfork(). However, the task can be marked as oom
> > unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
>
> You are right. I am not sure there is a way out of this though. The task
> can only retry forever in this case. There is nothing actionable here.
> We cannot kill the task and there is no other way to release the memory.

Btw. don't we force the charge in that case?
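The force path Michal refers to sits in try_charge_memcg(); a paraphrase
of the v5.15-era mainline code (before Vasily's patch), not a verbatim
copy:

	/* Near the top of try_charge_memcg(): a dying task bypasses the
	 * limit entirely instead of failing the charge. */
	if (unlikely(should_force_charge()))
		goto force;
	/* ... reclaim, retries and the memcg OOM killer come in between ... */
force:
	/*
	 * The allocation either can't fail or will lead to more memory
	 * being freed very soon. Allow usage to go over the limit
	 * temporarily by force charging it.
	 */
	page_counter_charge(&memcg->memory, nr_pages);
	if (do_memsw_account())
		page_counter_charge(&memcg->memsw, nr_pages);
	return 0;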
On 19.10.2021 15:04, Michal Hocko wrote:
> On Tue 19-10-21 13:54:42, Michal Hocko wrote:
>> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
>>> On 19.10.2021 11:49, Michal Hocko wrote:
>>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
>>>> [...]
>>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
>>>>> tasks") try_charge_memcg() can fail:
>>>>> a) due to a fatal signal
>>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
>>>>>    returns false (when select_bad_process() found nothing)
>>>>>
>>>>> To handle a) we can follow your suggestion and skip the execution of
>>>>> out_of_memory() in pagefault_out_of_memory().
>>>>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>>>
>>>> How is b) possible without current being killed? Do we allow remote
>>>> charging?
>>>
>>> out_of_memory for memcg_oom
>>>  select_bad_process
>>>   mem_cgroup_scan_tasks
>>>    oom_evaluate_task
>>>     oom_badness
>>>
>>> 	/*
>>> 	 * Do not even consider tasks which are explicitly marked oom
>>> 	 * unkillable or have been already oom reaped or the are in
>>> 	 * the middle of vfork
>>> 	 */
>>> 	adj = (long)p->signal->oom_score_adj;
>>> 	if (adj == OOM_SCORE_ADJ_MIN ||
>>> 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>>> 			in_vfork(p)) {
>>> 		task_unlock(p);
>>> 		return LONG_MIN;
>>> 	}
>>>
>>> This time we are handling a userspace page fault, so we cannot be a kernel
>>> thread and cannot be in_vfork(). However, the task can be marked as oom
>>> unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
>>
>> You are right. I am not sure there is a way out of this though. The task
>> can only retry forever in this case. There is nothing actionable here.
>> We cannot kill the task and there is no other way to release the memory.
>
> Btw. don't we force the charge in that case?

We should force the charge for allocations from inside the page fault
handler, to prevent an endless cycle of retried page faults. However, we
should not do it for allocations from task context, to prevent
memcg-limited vmalloc-eaters from consuming all host memory.

Also I would like to return to the following hunk.

@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);

 unlock:
 	mutex_unlock(&oom_lock);

Now I think it is better to keep the task_is_dying() check here: if the
task is dying, it is not necessary to push another task to free the
memory. We broke the vmalloc cycle already, so it looks like nothing
should prevent us from returning to userspace, handling the fatal signal,
exiting and freeing the memory.

Thank you,
	Vasily Averin
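For context, the hunk above lands in a function shaped roughly like this
(a paraphrase of mem_cgroup_out_of_memory() with the patch applied;
details may differ across versions):

	static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
					     int order)
	{
		struct oom_control oc = {
			.zonelist = NULL,
			.nodemask = NULL,
			.memcg = memcg,
			.gfp_mask = gfp_mask,
			.order = order,
		};
		bool ret = false;

		if (mutex_lock_killable(&oom_lock))
			return true;

		/* Another OOM kill may have released enough memory already. */
		if (mem_cgroup_margin(memcg) >= (1 << order))
			goto unlock;

		/*
		 * A few threads which were not waiting at mutex_lock_killable() can
		 * fail to bail out. Therefore, check again after holding oom_lock.
		 */
		ret = task_is_dying() || out_of_memory(&oc);

	unlock:
		mutex_unlock(&oom_lock);
		return ret;
	}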
On Tue 19-10-21 16:26:50, Vasily Averin wrote:
> On 19.10.2021 15:04, Michal Hocko wrote:
> > On Tue 19-10-21 13:54:42, Michal Hocko wrote:
> >> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> >>> On 19.10.2021 11:49, Michal Hocko wrote:
> >>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> >>>> [...]
> >>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
> >>>>> tasks") try_charge_memcg() can fail:
> >>>>> a) due to a fatal signal
> >>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
> >>>>>    returns false (when select_bad_process() found nothing)
> >>>>>
> >>>>> To handle a) we can follow your suggestion and skip the execution of
> >>>>> out_of_memory() in pagefault_out_of_memory().
> >>>>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
> >>>
> >>>> How is b) possible without current being killed? Do we allow remote
> >>>> charging?
> >>>
> >>> out_of_memory for memcg_oom
> >>>  select_bad_process
> >>>   mem_cgroup_scan_tasks
> >>>    oom_evaluate_task
> >>>     oom_badness
> >>>
> >>> 	/*
> >>> 	 * Do not even consider tasks which are explicitly marked oom
> >>> 	 * unkillable or have been already oom reaped or the are in
> >>> 	 * the middle of vfork
> >>> 	 */
> >>> 	adj = (long)p->signal->oom_score_adj;
> >>> 	if (adj == OOM_SCORE_ADJ_MIN ||
> >>> 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> >>> 			in_vfork(p)) {
> >>> 		task_unlock(p);
> >>> 		return LONG_MIN;
> >>> 	}
> >>>
> >>> This time we are handling a userspace page fault, so we cannot be a kernel
> >>> thread and cannot be in_vfork(). However, the task can be marked as oom
> >>> unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
> >>
> >> You are right. I am not sure there is a way out of this though. The task
> >> can only retry forever in this case. There is nothing actionable here.
> >> We cannot kill the task and there is no other way to release the memory.
> >
> > Btw. don't we force the charge in that case?
>
> We should force the charge for allocations from inside the page fault
> handler, to prevent an endless cycle of retried page faults. However, we
> should not do it for allocations from task context, to prevent
> memcg-limited vmalloc-eaters from consuming all host memory.

I don't see a big difference between those two, because the #PF could
result in the very same situation, depleting all the memory by
overcharging. A different behavior just leads to confusion and unexpected
behavior. E.g. in the past we only triggered the memcg OOM killer from the
#PF path and failed the charge otherwise. That is something different, but
it shows problems we hadn't anticipated and had user-visible consequences.
See 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge
path").

> Also I would like to return to the following hunk.
>
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = should_force_charge() || out_of_memory(&oc);
> +	ret = task_is_dying() || out_of_memory(&oc);
>
>  unlock:
>  	mutex_unlock(&oom_lock);
>
> Now I think it is better to keep the task_is_dying() check here: if the
> task is dying, it is not necessary to push another task to free the
> memory. We broke the vmalloc cycle already, so it looks like nothing
> should prevent us from returning to userspace, handling the fatal signal,
> exiting and freeing the memory.

That patch has to be discussed in its full length. There were other
details I brought up, AFAIU.
On Tue 19-10-21 16:13:19, Michal Hocko wrote:
[...]
> That patch has to be discussed in its full length. There were other
> details I brought up, AFAIU.

And btw. thanks for walking through that maze! It is far from trivial,
with side effects all over the place which are far from obvious and
heavily underdocumented.
On 19.10.2021 17:13, Michal Hocko wrote:
> On Tue 19-10-21 16:26:50, Vasily Averin wrote:
>> On 19.10.2021 15:04, Michal Hocko wrote:
>>> On Tue 19-10-21 13:54:42, Michal Hocko wrote:
>>>> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
>>>>> On 19.10.2021 11:49, Michal Hocko wrote:
>>>>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
>>>>>> [...]
>>>>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying
>>>>>>> tasks") try_charge_memcg() can fail:
>>>>>>> a) due to a fatal signal
>>>>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory()
>>>>>>>    returns false (when select_bad_process() found nothing)
>>>>>>>
>>>>>>> To handle a) we can follow your suggestion and skip the execution of
>>>>>>> out_of_memory() in pagefault_out_of_memory().
>>>>>>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>>>>>
>>>>>> How is b) possible without current being killed? Do we allow remote
>>>>>> charging?
>>>>>
>>>>> out_of_memory for memcg_oom
>>>>>  select_bad_process
>>>>>   mem_cgroup_scan_tasks
>>>>>    oom_evaluate_task
>>>>>     oom_badness
>>>>>
>>>>> 	/*
>>>>> 	 * Do not even consider tasks which are explicitly marked oom
>>>>> 	 * unkillable or have been already oom reaped or the are in
>>>>> 	 * the middle of vfork
>>>>> 	 */
>>>>> 	adj = (long)p->signal->oom_score_adj;
>>>>> 	if (adj == OOM_SCORE_ADJ_MIN ||
>>>>> 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>>>>> 			in_vfork(p)) {
>>>>> 		task_unlock(p);
>>>>> 		return LONG_MIN;
>>>>> 	}
>>>>>
>>>>> This time we are handling a userspace page fault, so we cannot be a kernel
>>>>> thread and cannot be in_vfork(). However, the task can be marked as oom
>>>>> unkillable, i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
>>>>
>>>> You are right. I am not sure there is a way out of this though. The task
>>>> can only retry forever in this case. There is nothing actionable here.
>>>> We cannot kill the task and there is no other way to release the memory.
>>>
>>> Btw. don't we force the charge in that case?
>>
>> We should force the charge for allocations from inside the page fault
>> handler, to prevent an endless cycle of retried page faults. However, we
>> should not do it for allocations from task context, to prevent
>> memcg-limited vmalloc-eaters from consuming all host memory.
>
> I don't see a big difference between those two, because the #PF could
> result in the very same situation, depleting all the memory by
> overcharging. A different behavior just leads to confusion and unexpected
> behavior. E.g. in the past we only triggered the memcg OOM killer from the
> #PF path and failed the charge otherwise. That is something different, but
> it shows problems we hadn't anticipated and had user-visible consequences.
> See 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge
> path").

In this case I think we should fail this allocation. It is better not to
allow overcharging at all, neither in #PF nor in regular allocations.
However, this failure will trigger a false global OOM in
pagefault_out_of_memory(), and we need to find some way to prevent it.

Thank you,
	Vasily Averin
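The false global OOM arises because the arch fault handler maps any
VM_FAULT_OOM to the global path, whether or not there is a real
system-wide shortage; a loose paraphrase of the x86 fault-handling code
(not verbatim, and the helper names vary between kernel versions):

	if (fault & VM_FAULT_OOM) {
		/* Kernel-mode faults take the exception-fixup/oops path. */
		if (!user_mode(regs)) {
			no_context(regs, error_code, address,
				   SIGSEGV, SEGV_MAPERR);
			return;
		}
		/*
		 * For a user-mode fault the kernel calls the global OOM
		 * killer -- even when the -ENOMEM came from a memcg charge
		 * failure, which is the "false global OOM" in question.
		 */
		pagefault_out_of_memory();
	}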
On 18.10.2021 12:04, Michal Hocko wrote:
> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> [...]
>> How could this happen?
>>
>> A user-space task inside a memcg-limited container generated a page fault,
>> its handler do_user_addr_fault() called handle_mm_fault(), which could not
>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
>> out_of_memory() without any memcg set.
>
> I will be honest that I am not really happy about pagefault_out_of_memory.
> I have tried to remove it in the past. Without much success back then,
> unfortunately[1].
>
> [1] I do not have the msg-id so I cannot provide a lore link, but google
> pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html

I re-read this discussion and in general I support your position.
As far as I understand, your opponents could not explain why "random kill"
is mandatory here; they were just afraid it might be useful and did not
want to remove it completely.

Ok, let's allow them to keep it. Moreover, I'm ready to keep it as the
default behavior.

However, I would like to have some choice at this point.

In general we can:
- continue to use "random kill" and rely on the wisdom of the ancestors;
- do nothing, repeat the #PF and rely on fate: "nothing bad will happen if
  we do it again";
- add some (progressive) killable delay, rely on the good will of the
  (unkillable) neighbors and wait for them to release the required memory;
- mark the current task as cycling in #PF and somehow use this mark in the
  allocator;
- make sure that the current task is really cycling and making no
  progress, then send it a fatal signal to kill it and break the cycle;
- implement any better ideas;
- use any combination of the previous points.

We could select the required strategy, for example, via sysctl.

For me "random kill" is the worst choice. Why can't we just kill the
looping process instead? It can be marked oom-unkillable, so the OOM
killer is unable to select it. However, I doubt that means "never kill
it"; for me it is something like a "last possible victim" priority.

Thank you,
	Vasily Averin
On Thu 21-10-21 11:03:43, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
> > On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> > [...]
> >> How could this happen?
> >>
> >> A user-space task inside a memcg-limited container generated a page fault,
> >> its handler do_user_addr_fault() called handle_mm_fault(), which could not
> >> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
> >> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
> >> out_of_memory() without any memcg set.
> >
> > I will be honest that I am not really happy about pagefault_out_of_memory.
> > I have tried to remove it in the past. Without much success back then,
> > unfortunately[1].
> >
> > [1] I do not have the msg-id so I cannot provide a lore link, but google
> > pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html
>
> I re-read this discussion and in general I support your position.
> As far as I understand, your opponents could not explain why "random kill"
> is mandatory here; they were just afraid it might be useful and did not
> want to remove it completely.

That aligns with my recollection.

> Ok, let's allow them to keep it. Moreover, I'm ready to keep it as the
> default behavior.
>
> However, I would like to have some choice at this point.
>
> In general we can:
> - continue to use "random kill" and rely on the wisdom of the ancestors;

I do not follow. Does that mean to preserve the existing oom killer from
#PF?

> - do nothing, repeat the #PF and rely on fate: "nothing bad will happen if
>   we do it again";
> - add some (progressive) killable delay, rely on the good will of the
>   (unkillable) neighbors and wait for them to release the required memory;

Again, not really sure what you mean.

> - mark the current task as cycling in #PF and somehow use this mark in the
>   allocator;

How?

> - make sure that the current task is really cycling and making no
>   progress, then send it a fatal signal to kill it and break the cycle;

No! We cannot really kill the task; if we could, we would have done it in
the oom killer already.

> - implement any better ideas;
> - use any combination of the previous points.
>
> We could select the required strategy, for example, via sysctl.

Absolutely not! How can an admin know any better than the kernel?

> For me "random kill" is the worst choice. Why can't we just kill the
> looping process instead?

See above.

> It can be marked oom-unkillable, so the OOM killer is unable to select it.
> However, I doubt that means "never kill it"; for me it is something like a
> "last possible victim" priority.

It means never kill it because of OOM. If it is retrying because of OOM
then it is effectively the same thing.

The oom killer from the #PF doesn't really provide any clear advantage
these days AFAIK. On the other hand it allows for very disruptive
behavior. In the worst case it can lead to a system panic if the
VM_FAULT_OOM is not really caused by a memory shortage but rather by wrong
error handling. If a task is looping there without any progress then it is
still killable, which is a much saner behavior IMHO.
On 21.10.2021 14:49, Michal Hocko wrote:
> On Thu 21-10-21 11:03:43, Vasily Averin wrote:
>> I re-read this discussion and in general I support your position.
>> As far as I understand, your opponents could not explain why "random kill"
>> is mandatory here; they were just afraid it might be useful and did not
>> want to remove it completely.
>
> That aligns with my recollection.
>
>> Ok, let's allow them to keep it. Moreover, I'm ready to keep it as the
>> default behavior.
>>
>> However, I would like to have some choice at this point.
>>
>> In general we can:
>> - continue to use "random kill" and rely on the wisdom of the ancestors;
>
> I do not follow. Does that mean to preserve the existing oom killer from
> #PF?
>
>> - do nothing, repeat the #PF and rely on fate: "nothing bad will happen if
>>   we do it again";
>> - add some (progressive) killable delay, rely on the good will of the
>>   (unkillable) neighbors and wait for them to release the required memory;
>
> Again, not really sure what you mean.
>
>> - mark the current task as cycling in #PF and somehow use this mark in the
>>   allocator;
>
> How?
>
>> - make sure that the current task is really cycling and making no
>>   progress, then send it a fatal signal to kill it and break the cycle;
>
> No! We cannot really kill the task; if we could, we would have done it in
> the oom killer already.
>
>> - implement any better ideas;
>> - use any combination of the previous points.
>>
>> We could select the required strategy, for example, via sysctl.
>
> Absolutely not! How can an admin know any better than the kernel?
>
>> For me "random kill" is the worst choice. Why can't we just kill the
>> looping process instead?
>
> See above.
>
>> It can be marked oom-unkillable, so the OOM killer is unable to select it.
>> However, I doubt that means "never kill it"; for me it is something like a
>> "last possible victim" priority.
>
> It means never kill it because of OOM. If it is retrying because of OOM
> then it is effectively the same thing.
>
> The oom killer from the #PF doesn't really provide any clear advantage
> these days AFAIK. On the other hand it allows for very disruptive
> behavior. In the worst case it can lead to a system panic if the
> VM_FAULT_OOM is not really caused by a memory shortage but rather by wrong
> error handling. If a task is looping there without any progress then it is
> still killable, which is a much saner behavior IMHO.

Let's continue this discussion in the "Re: [PATCH memcg 3/3] memcg: handle
memcg oom failures" thread.