[v4] mm/page_alloc: bail out on fatal signal during reclaim/compaction retry attempt

Message ID 20210520142901.3371299-1-atomlin@redhat.com (mailing list archive)
State New, archived
Series [v4] mm/page_alloc: bail out on fatal signal during reclaim/compaction retry attempt

Commit Message

Aaron Tomlin May 20, 2021, 2:29 p.m. UTC
A customer experienced a low-memory situation and decided to issue a
SIGKILL (i.e. a fatal signal). Instead of promptly terminating as one
would expect, the aforementioned task remained unresponsive.

Further investigation indicated that the task was "stuck" in the
reclaim/compaction retry loop. Now, it does not make sense to retry
compaction when a fatal signal is pending.

try_to_compact_pages() can indeed return COMPACT_SKIPPED, although not
every zone on the zone list is considered when a fatal signal is found
to be pending.
Yet should_compact_retry(), given the last known compaction result,
can check each zone on the zone list
(see compaction_zonelist_suitable()). For example, if the check succeeds
for a zone, then reclaim/compaction is tried again
(notwithstanding the above).

This patch ensures that compaction is not needlessly retried,
irrespective of the last known compaction result (e.g. if it was
skipped), in the unlikely case that a fatal signal is pending.
So, OOM is at least attempted.

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
---
 mm/page_alloc.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Vlastimil Babka May 28, 2021, 12:53 p.m. UTC | #1
On 5/20/21 4:29 PM, Aaron Tomlin wrote:
> A customer experienced a low-memory situation and decided to issue a
> SIGKILL (i.e. a fatal signal). Instead of promptly terminating as one
> would expect, the aforementioned task remained unresponsive.
> 
> Further investigation indicated that the task was "stuck" in the
> reclaim/compaction retry loop. Now, it does not make sense to retry
> compaction when a fatal signal is pending.
> 
> try_to_compact_pages() can indeed return COMPACT_SKIPPED, although not
> every zone on the zone list is considered when a fatal signal is found
> to be pending.
> Yet should_compact_retry(), given the last known compaction result,
> can check each zone on the zone list
> (see compaction_zonelist_suitable()). For example, if the check succeeds
> for a zone, then reclaim/compaction is tried again
> (notwithstanding the above).
> 
> This patch ensures that compaction is not needlessly retried,
> irrespective of the last known compaction result (e.g. if it was
> skipped), in the unlikely case that a fatal signal is pending.
> So, OOM is at least attempted.
> 
> Signed-off-by: Aaron Tomlin <atomlin@redhat.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index aaa1655cf682..b317057ac186 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4252,6 +4252,9 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>  	if (!order)
>  		return false;
>  
> +	if (fatal_signal_pending(current))
> +		return false;
> +
>  	if (compaction_made_progress(compact_result))
>  		(*compaction_retries)++;
>  
>

Michal Hocko May 31, 2021, 11:33 a.m. UTC | #2
On Thu 20-05-21 15:29:01, Aaron Tomlin wrote:
> A customer experienced a low-memory situation and decided to issue a
> SIGKILL (i.e. a fatal signal). Instead of promptly terminating as one
> would expect, the aforementioned task remained unresponsive.
> 
> Further investigation indicated that the task was "stuck" in the
> reclaim/compaction retry loop. Now, it does not make sense to retry
> compaction when a fatal signal is pending.

Is this really true in general? The memory reclaim is retried even when
fatal signals are pending. Why should compaction be different? I do
agree that retrying way too much is bad but is there any reason why this
special case doesn't follow the max retry logic?

Vlastimil Babka May 31, 2021, 11:35 a.m. UTC | #3
On 5/31/21 1:33 PM, Michal Hocko wrote:
> On Thu 20-05-21 15:29:01, Aaron Tomlin wrote:
>> A customer experienced a low-memory situation and decided to issue a
>> SIGKILL (i.e. a fatal signal). Instead of promptly terminating as one
>> would expect, the aforementioned task remained unresponsive.
>> 
>> Further investigation indicated that the task was "stuck" in the
>> reclaim/compaction retry loop. Now, it does not make sense to retry
>> compaction when a fatal signal is pending.
> 
> Is this really true in general? The memory reclaim is retried even when
> fatal signals are pending. Why should compaction be different? I do
> agree that retrying way too much is bad but is there any reason why this
> special case doesn't follow the max retry logic?

Compaction doesn't do anything if a fatal signal is pending; it bails out
immediately, and the checks are rather frequent. So why retry?

Michal Hocko May 31, 2021, 1:21 p.m. UTC | #4
On Mon 31-05-21 13:35:31, Vlastimil Babka wrote:
> On 5/31/21 1:33 PM, Michal Hocko wrote:
> > On Thu 20-05-21 15:29:01, Aaron Tomlin wrote:
> >> A customer experienced a low-memory situation and decided to issue a
> >> SIGKILL (i.e. a fatal signal). Instead of promptly terminating as one
> >> would expect, the aforementioned task remained unresponsive.
> >> 
> >> Further investigation indicated that the task was "stuck" in the
> >> reclaim/compaction retry loop. Now, it does not make sense to retry
> >> compaction when a fatal signal is pending.
> > 
> > Is this really true in general? The memory reclaim is retried even when
> > fatal signals are pending. Why should compaction be different? I do
> > agree that retrying way too much is bad but is there any reason why this
> > special case doesn't follow the max retry logic?
> 
> Compaction doesn't do anything if a fatal signal is pending; it bails out
> immediately, and the checks are rather frequent. So why retry?

OK, I was not aware of that and it would be helpful to have that
mentioned in the changelog.
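
The point made in comment #3 can be illustrated with a small user-space
model. This is only a sketch, not kernel code: every name prefixed with
model_ or MODEL_ is hypothetical, the struct fields stand in for
fatal_signal_pending(current) and compaction_zonelist_suitable(), and the
cap exists only so that the model terminates. It counts how many compaction
attempts a killed task makes with and without the early signal check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; none of these are kernel APIs. */
enum model_result { MODEL_SKIPPED, MODEL_PROGRESS };

struct model_task {
	bool fatal_signal;	/* models fatal_signal_pending(current) */
	bool zone_suitable;	/* models compaction_zonelist_suitable() */
};

/* A compaction attempt that bails out at once when the task is being killed. */
static enum model_result model_compact(const struct model_task *t)
{
	return t->fatal_signal ? MODEL_SKIPPED : MODEL_PROGRESS;
}

/* Retry decision; check_signal selects the behaviour proposed by the patch. */
static bool model_should_retry(const struct model_task *t,
			       enum model_result result, bool check_signal)
{
	if (check_signal && t->fatal_signal)
		return false;
	if (result == MODEL_SKIPPED)
		return t->zone_suitable;
	return false;	/* simplification: progress satisfies the request */
}

/* Count compaction attempts; cap keeps the unpatched model finite. */
static int model_attempts(const struct model_task *t, bool check_signal,
			  int cap)
{
	int attempts = 0;
	enum model_result result;

	assert(cap > 0);
	do {
		result = model_compact(t);
		attempts++;
	} while (attempts < cap && model_should_retry(t, result, check_signal));

	return attempts;
}
```

For a task with both fields set to true, model_attempts() runs into the cap
when check_signal is false and returns 1 when it is true, which is the
"why retry?" argument in miniature.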

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aaa1655cf682..b317057ac186 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4252,6 +4252,9 @@  should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	if (!order)
 		return false;
 
+	if (fatal_signal_pending(current))
+		return false;
+
 	if (compaction_made_progress(compact_result))
 		(*compaction_retries)++;