diff mbox series

[mm,v2] vmalloc: back off when the current task is OOM-killed

Message ID 83efc664-3a65-2adb-d7c4-2885784cf109@virtuozzo.com (mailing list archive)
State New
Headers show
Series [mm,v2] vmalloc: back off when the current task is OOM-killed | expand

Commit Message

Vasily Averin Oct. 5, 2021, 1:52 p.m. UTC
Huge vmalloc allocation on heavy loaded node can lead to a global
memory shortage. Task called vmalloc can have worst badness and
be selected by OOM-killer, however taken fatal signal does not
interrupt allocation cycle. Vmalloc repeat page allocaions
again and again, exacerbating the crisis and consuming the memory
freed up by another killed tasks.

After a successful completion of the allocation procedure, a fatal
signal will be processed and task will be destroyed finally.
However it may not release the consumed memory, since the allocated
object may have a lifetime unrelated to the completed task.
In the worst case, this can lead to the host will panic
due to "Out of memory and no killable processes..."

This patch allows OOM-killer to break vmalloc cycle, makes OOM more
effective and avoid host panic. It does not check oom condition directly,
however, and breaks page allocation cycle when fatal signal was received.

This may trigger some hidden problems, when caller does not handle
vmalloc failures, or when rollaback after failed vmalloc calls own
vmallocs inside. However all of these scenarios are incorrect:
vmalloc does not guarantee successful allocation, it has never been called
with __GFP_NOFAIL and threfore either should not be used for any rollbacks
or should handle such errors correctly and not lead to critical
failures.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v2: tsk_is_oom_victim() check replaced by fatal_signal_pending(current),
    removed check inside __alloc_pages_bulk(),
    according to feedback from mhocko@.
    Updated patch description.
---
 mm/vmalloc.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Vasily Averin Oct. 5, 2021, 2 p.m. UTC | #1
On 10/5/21 4:52 PM, Vasily Averin wrote:
> Huge vmalloc allocation on heavy loaded node can lead to a global
> memory shortage. Task called vmalloc can have worst badness and
> be selected by OOM-killer, however taken fatal signal does not
> interrupt allocation cycle. Vmalloc repeat page allocaions
> again and again, exacerbating the crisis and consuming the memory
> freed up by another killed tasks.
> 
> After a successful completion of the allocation procedure, a fatal
> signal will be processed and task will be destroyed finally.
> However it may not release the consumed memory, since the allocated
> object may have a lifetime unrelated to the completed task.
> In the worst case, this can lead to the host will panic
> due to "Out of memory and no killable processes..."
> 
> This patch allows OOM-killer to break vmalloc cycle, makes OOM more
> effective and avoid host panic. It does not check oom condition directly,
> however, and breaks page allocation cycle when fatal signal was received.
> 
> This may trigger some hidden problems, when caller does not handle
> vmalloc failures, or when rollaback after failed vmalloc calls own
> vmallocs inside. However all of these scenarios are incorrect:
> vmalloc does not guarantee successful allocation, it has never been called
> with __GFP_NOFAIL and threfore either should not be used for any rollbacks
> or should handle such errors correctly and not lead to critical
> failures.

I briefly checked this patch together with 
 v3 memcg: prohibit unconditional exceeding the limit of dying tasks
 over v5.15-rc4.
I executed LTP on host, all oom, cgroup and memcg tests was successfully finished.
and then experimented with memcg limited LXC containers.
I did not noticed any troubles on my test node.
Thank you,
	Vasily Averin
Michal Hocko Oct. 7, 2021, 10:47 a.m. UTC | #2
On Tue 05-10-21 16:52:40, Vasily Averin wrote:
> Huge vmalloc allocation on heavy loaded node can lead to a global
> memory shortage. Task called vmalloc can have worst badness and
> be selected by OOM-killer, however taken fatal signal does not
> interrupt allocation cycle. Vmalloc repeat page allocaions
> again and again, exacerbating the crisis and consuming the memory
> freed up by another killed tasks.
> 
> After a successful completion of the allocation procedure, a fatal
> signal will be processed and task will be destroyed finally.
> However it may not release the consumed memory, since the allocated
> object may have a lifetime unrelated to the completed task.
> In the worst case, this can lead to the host will panic
> due to "Out of memory and no killable processes..."
> 
> This patch allows OOM-killer to break vmalloc cycle, makes OOM more
> effective and avoid host panic. It does not check oom condition directly,
> however, and breaks page allocation cycle when fatal signal was received.

This will allow also interrupting a user space requist which happens to
trigger a large vmalloc, hence the reason for going for
fatal_signal_pending rather than oom specific condition.

> This may trigger some hidden problems, when caller does not handle
> vmalloc failures, or when rollaback after failed vmalloc calls own
> vmallocs inside. However all of these scenarios are incorrect:
> vmalloc does not guarantee successful allocation, it has never been called
> with __GFP_NOFAIL and threfore either should not be used for any rollbacks
> or should handle such errors correctly and not lead to critical
> failures.

__GFP_NOFAIL semantic is explicitly not supported for vmalloc.

> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Acked-by: Michal Hocko <mhocko@suse.com>

I would keep it sitting in the linux-next for some time to sort out
potential fallouts and have them fixed before this one gets merged.

Thanks!
> ---
> v2: tsk_is_oom_victim() check replaced by fatal_signal_pending(current),
>     removed check inside __alloc_pages_bulk(),
>     according to feedback from mhocko@.
>     Updated patch description.
> ---
>  mm/vmalloc.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d77830ff604c..71706f5447f0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2860,6 +2860,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  		struct page *page;
>  		int i;
>  
> +		if (fatal_signal_pending(current))
> +			break;
> +
>  		page = alloc_pages_node(nid, gfp, order);
>  		if (unlikely(!page))
>  			break;
> -- 
> 2.31.1
Andrew Morton Oct. 7, 2021, 7:55 p.m. UTC | #3
On Tue, 5 Oct 2021 16:52:40 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:

> Huge vmalloc allocation on heavy loaded node can lead to a global
> memory shortage. Task called vmalloc can have worst badness and
> be selected by OOM-killer, however taken fatal signal does not
> interrupt allocation cycle. Vmalloc repeat page allocaions
> again and again, exacerbating the crisis and consuming the memory
> freed up by another killed tasks.
> 
> After a successful completion of the allocation procedure, a fatal
> signal will be processed and task will be destroyed finally.
> However it may not release the consumed memory, since the allocated
> object may have a lifetime unrelated to the completed task.
> In the worst case, this can lead to the host will panic
> due to "Out of memory and no killable processes..."
> 
> This patch allows OOM-killer to break vmalloc cycle, makes OOM more
> effective and avoid host panic. It does not check oom condition directly,
> however, and breaks page allocation cycle when fatal signal was received.
> 
> This may trigger some hidden problems, when caller does not handle
> vmalloc failures, or when rollaback after failed vmalloc calls own
> vmallocs inside. However all of these scenarios are incorrect:
> vmalloc does not guarantee successful allocation, it has never been called
> with __GFP_NOFAIL and threfore either should not be used for any rollbacks
> or should handle such errors correctly and not lead to critical
> failures.
> 

This needed a little rework due to
https://lkml.kernel.org/r/20210928121040.2547407-1-chenwandun@huawei.com.
Please check and retest sometime?

--- a/mm/vmalloc.c~vmalloc-back-off-when-the-current-task-is-oom-killed
+++ a/mm/vmalloc.c
@@ -2887,6 +2887,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 
 	page = NULL;
 	while (nr_allocated < nr_pages) {
+		if (fatal_signal_pending(current))
+			break;
+
 		if (nid == NUMA_NO_NODE)
 			page = alloc_pages(gfp, order);
 		else
diff mbox series

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d77830ff604c..71706f5447f0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2860,6 +2860,9 @@  vm_area_alloc_pages(gfp_t gfp, int nid,
 		struct page *page;
 		int i;
 
+		if (fatal_signal_pending(current))
+			break;
+
 		page = alloc_pages_node(nid, gfp, order);
 		if (unlikely(!page))
 			break;