mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon

Message ID alpine.DEB.2.22.394.2004241347310.70176@chino.kir.corp.google.com (mailing list archive)
State New, archived
Series mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon

Commit Message

David Rientjes April 24, 2020, 8:48 p.m. UTC
If GFP_ATOMIC allocations will start failing soon because the amount of
free memory is substantially under per-zone min watermarks, it is better
to oom kill a process rather than continue to reclaim.

This intends to significantly reduce the number of page allocation
failures that are encountered when the demands of user and atomic
allocations overwhelm the ability of reclaim to keep up.  We can see this
with a high ingress of networking traffic where memory allocated in irq
context can overwhelm the ability to reclaim fast enough such that user
memory consistently loops.  In that case, we have reclaimable memory, and
reclaiming is successful, but we've fully depleted memory reserves that
are allowed for non-blockable allocations.

Commit 400e22499dd9 ("mm: don't warn about allocations which stall for
too long") removed evidence of user allocations stalling because of this,
but the situation can apply anytime we get "page allocation failures"
where reclaim is happening but per-zone min watermarks are starved:

Node 0 Normal free:87356kB min:221984kB low:416984kB high:611984kB active_anon:123009936kB inactive_anon:67647652kB active_file:429612kB inactive_file:209980kB unevictable:112348kB writepending:260kB present:198180864kB managed:195027624kB mlocked:81756kB kernel_stack:24040kB pagetables:11460kB bounce:0kB free_pcp:940kB local_pcp:96kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:105616kB min:225568kB low:423716kB high:621864kB active_anon:122124196kB inactive_anon:74112696kB active_file:39172kB inactive_file:103696kB unevictable:204480kB writepending:180kB present:201326592kB managed:198174372kB mlocked:204480kB kernel_stack:11328kB pagetables:3680kB bounce:0kB free_pcp:1140kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
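
As a rough illustration using the Node 0 figures above -- a minimal userspace
model only, assuming the conventional ALLOC_HIGH behaviour of halving the min
watermark and ignoring lowmem_reserve and the highatomic reserve -- free memory
is already below the cutoff that the check added by this patch (see the hunk at
the end of this message) compares against:

#include <stdio.h>
#include <stdbool.h>

/* Simplified model of the order-0 watermark check with ALLOC_HIGH:
 * ALLOC_HIGH is assumed to halve the effective min watermark;
 * lowmem_reserve and the highatomic reserve are ignored here.
 */
static bool wmark_ok(unsigned long free_pages, unsigned long min_wmark)
{
	unsigned long min = min_wmark - min_wmark / 2;	/* ALLOC_HIGH */

	return free_pages > min;
}

int main(void)
{
	/* Node 0 Normal from the report above, converted from kB to 4kB pages */
	unsigned long free = 87356 / 4;		/* 21839 pages */
	unsigned long min_wmark = 221984 / 4;	/* 55496 pages */

	printf("free=%lu min=%lu atomic_cutoff=%lu -> %s\n",
	       free, min_wmark, min_wmark - min_wmark / 2,
	       wmark_ok(free, min_wmark) ?
			"keep retrying reclaim" : "stop retrying, prefer oom kill");
	return 0;
}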

Without this patch, there is no guarantee that user memory allocations
will ever be successful when non-blockable allocations overwhelm the
ability to get above per-zone min watermarks.

This doesn't solve page allocation failures entirely since it's a
preemptive measure based on watermarks that requires concurrent blockable
allocations to trigger the oom kill.  To completely solve page allocation
failures, it would be possible to do the same watermark check for non-
blockable allocations and then queue a worker to asynchronously oom kill
if it finds watermarks to be sufficiently low as well.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Comments

Tetsuo Handa April 25, 2020, 12:32 a.m. UTC | #1
On 2020/04/25 5:48, David Rientjes wrote:
> @@ -4372,11 +4372,21 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  					ac->nodemask) {
>  		unsigned long available;
>  		unsigned long reclaimable;
> +		unsigned long free;
>  		unsigned long min_wmark = min_wmark_pages(zone);
>  		bool wmark;
>  
> +		free = zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +		/*
> +		 * If this zone is approaching the point where even order-0
> +		 * GFP_ATOMIC allocations will fail, stop considering reclaim.
> +		 */
> +		if (!__zone_watermark_ok(zone, 0, min_wmark, ac_classzone_idx(ac),
> +					 alloc_flags | ALLOC_HIGH, free))
> +			continue;

This is to trigger the OOM killer more aggressively, isn't it? But where is the
guarantee that this is an allocation request which can trigger the OOM killer?
If this is an allocation request which cannot trigger the OOM killer, wouldn't
this cause premature allocation failures?

> +
>  		available = reclaimable = zone_reclaimable_pages(zone);
> -		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +		available += free;
>  
>  		/*
>  		 * Would the allocation succeed if we reclaimed all
>
Andrew Morton April 26, 2020, 12:27 a.m. UTC | #2
On Fri, 24 Apr 2020 13:48:06 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> If GFP_ATOMIC allocations will start failing soon because the amount of
> free memory is substantially under per-zone min watermarks, it is better
> to oom kill a process rather than continue to reclaim.
> 
> This intends to significantly reduce the number of page allocation
> failures that are encountered when the demands of user and atomic
> allocations overwhelm the ability of reclaim to keep up.  We can see this
> with a high ingress of networking traffic where memory allocated in irq
> context can overwhelm the ability to reclaim fast enough such that user
> memory consistently loops.  In that case, we have reclaimable memory, and

"user memory allocation", I assume?  Or maybe "blockable memory
allocatoins".

> reclaiming is successful, but we've fully depleted memory reserves that
> are allowed for non-blockable allocations.
> 
> Commit 400e22499dd9 ("mm: don't warn about allocations which stall for
> too long") removed evidence of user allocations stalling because of this,
> but the situation can apply anytime we get "page allocation failures"
> where reclaim is happening but per-zone min watermarks are starved:
> 
> Node 0 Normal free:87356kB min:221984kB low:416984kB high:611984kB active_anon:123009936kB inactive_anon:67647652kB active_file:429612kB inactive_file:209980kB unevictable:112348kB writepending:260kB present:198180864kB managed:195027624kB mlocked:81756kB kernel_stack:24040kB pagetables:11460kB bounce:0kB free_pcp:940kB local_pcp:96kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 0
> Node 1 Normal free:105616kB min:225568kB low:423716kB high:621864kB active_anon:122124196kB inactive_anon:74112696kB active_file:39172kB inactive_file:103696kB unevictable:204480kB writepending:180kB present:201326592kB managed:198174372kB mlocked:204480kB kernel_stack:11328kB pagetables:3680kB bounce:0kB free_pcp:1140kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 0
> 
> Without this patch, there is no guarantee that user memory allocations
> will ever be successful when non-blockable allocations overwhelm the
> ability to get above per-zone min watermarks.
> 
> This doesn't solve page allocation failures entirely since it's a
> preemptive measure based on watermarks that requires concurrent blockable
> allocations to trigger the oom kill.  To complete solve page allocation
> failures, it would be possible to do the same watermark check for non-
> blockable allocations and then queue a worker to asynchronously oom kill
> if it finds watermarks to be sufficiently low as well.
> 

Well, what's really going on here?

Is networking potentially consuming an unbounded amount of memory?  If
so, then killing a process will just cause networking to consume more
memory then hit against the same thing.  So presumably the answer is
"no, the watermarks are inappropriately set for this workload".

So would it not be sensible to dynamically adjust the watermarks in
response to this condition?  Maintain a larger pool of memory for these
allocations?  Or possibly push back on networking and tell it to reduce
its queue sizes?  So that stuff doesn't keep on getting oom-killed?
Tetsuo Handa April 26, 2020, 3:04 a.m. UTC | #3
On 2020/04/26 9:27, Andrew Morton wrote:
> Well, what's really going on here?
> 
> Is networking potentially consuming an unbounded amount of memory?  If
> so, then killing a process will just cause networking to consume more
> memory then hit against the same thing.  So presumably the answer is
> "no, the watermarks are inappropriately set for this workload".
Maybe somebody is abusing __GFP_MEMALLOC allocations. David, please include whole
context (e.g. which function from which process) instead of only memory stat lines.
David Rientjes April 27, 2020, 3:12 a.m. UTC | #4
On Sat, 25 Apr 2020, Andrew Morton wrote:

> > If GFP_ATOMIC allocations will start failing soon because the amount of
> > free memory is substantially under per-zone min watermarks, it is better
> > to oom kill a process rather than continue to reclaim.
> > 
> > This intends to significantly reduce the number of page allocation
> > failures that are encountered when the demands of user and atomic
> > allocations overwhelm the ability of reclaim to keep up.  We can see this
> > with a high ingress of networking traffic where memory allocated in irq
> > context can overwhelm the ability to reclaim fast enough such that user
> > memory consistently loops.  In that case, we have reclaimable memory, and
> 
> "user memory allocation", I assume?  Or maybe "blockable memory
> allocatoins".
> 

"user memory allocations consistently loop", yeah.  Thanks.

> > reclaiming is successful, but we've fully depleted memory reserves that
> > are allowed for non-blockable allocations.
> > 
> > Commit 400e22499dd9 ("mm: don't warn about allocations which stall for
> > too long") removed evidence of user allocations stalling because of this,
> > but the situation can apply anytime we get "page allocation failures"
> > where reclaim is happening but per-zone min watermarks are starved:
> > 
> > Node 0 Normal free:87356kB min:221984kB low:416984kB high:611984kB active_anon:123009936kB inactive_anon:67647652kB active_file:429612kB inactive_file:209980kB unevictable:112348kB writepending:260kB present:198180864kB managed:195027624kB mlocked:81756kB kernel_stack:24040kB pagetables:11460kB bounce:0kB free_pcp:940kB local_pcp:96kB free_cma:0kB
> > lowmem_reserve[]: 0 0 0 0
> > Node 1 Normal free:105616kB min:225568kB low:423716kB high:621864kB active_anon:122124196kB inactive_anon:74112696kB active_file:39172kB inactive_file:103696kB unevictable:204480kB writepending:180kB present:201326592kB managed:198174372kB mlocked:204480kB kernel_stack:11328kB pagetables:3680kB bounce:0kB free_pcp:1140kB local_pcp:0kB free_cma:0kB
> > lowmem_reserve[]: 0 0 0 0
> > 
> > Without this patch, there is no guarantee that user memory allocations
> > will ever be successful when non-blockable allocations overwhelm the
> > ability to get above per-zone min watermarks.
> > 
> > This doesn't solve page allocation failures entirely since it's a
> > preemptive measure based on watermarks that requires concurrent blockable
> > allocations to trigger the oom kill.  To complete solve page allocation
> > failures, it would be possible to do the same watermark check for non-
> > blockable allocations and then queue a worker to asynchronously oom kill
> > if it finds watermarks to be sufficiently low as well.
> > 
> 
> Well, what's really going on here?
> 
> Is networking potentially consuming an unbounded amount of memory?  If
> so, then killing a process will just cause networking to consume more
> memory then hit against the same thing.  So presumably the answer is
> "no, the watermarks are inappropriately set for this workload".
> 
> So would it not be sensible to dynamically adjust the watermarks in
> response to this condition?  Maintain a larger pool of memory for these
> allocations?  Or possibly push back on networking and tell it to reduce
> its queue sizes?  So that stuff doesn't keep on getting oom-killed?
> 

No - that would actually make the problem worse.

Today, per-zone min watermarks dictate when user allocations will loop or 
oom kill.  should_reclaim_retry() currently loops if reclaim has succeeded 
in the past few tries and we should be able to allocate if we are able to 
reclaim the amount of memory that we think we can.

The issue is that this supposes that looping to reclaim more will result 
in more free memory.  That doesn't always happen if there are concurrent 
memory allocators.

GFP_ATOMIC allocators can access below these per-zone watermarks.  So the 
issue is that per-zone free pages stays between ALLOC_HIGH watermarks 
(the watermark that GFP_ATOMIC allocators can allocate to) and min 
watermarks.  We never reclaim enough memory to get back to min watermarks 
because reclaim cannot keep up with the amount of GFP_ATOMIC allocations.

In production scenarios, this results in userspace processes going out to 
lunch because they are constantly looping in the page allocator reclaiming 
only for the benefit of GFP_ATOMIC allocations.

In fact, when we hit ALLOC_HIGH watermarks and we start getting "page 
allocation failures" in the kernel log, there is also no guarantee that 
kswapd's reclaim will outpace GFP_ATOMIC allocations.  Thus, an oom kill 
is really the best policy at this point to provide an actual guarantee of 
net positive memory freeing.

This isn't a matter of any specific networking stack; the scope of 
allocations that can trigger this is the set of all GFP_ATOMIC (or 
GFP_MEMALLOC) allocations in the kernel.

Tetsuo: the specific allocation that triggers a page allocation failure is 
not interesting; we have tens of thousands of examples.  Each example is 
simply the unlucky last GFP_ATOMIC allocation that fails; the interesting 
point is the amount of free memory.  In other words, when free memory is 
below ALLOC_HIGH watermarks, we assume that we have depleted memory 
reserves *faster* than when user allocations started to fail.  In the 
interest of userspace being responsive, we should oom kill here.
Tetsuo Handa April 27, 2020, 5:03 a.m. UTC | #5
On 2020/04/27 12:12, David Rientjes wrote:
> Tetsuo: the specific allocation that triggers a page allocation failure is 
> not interesting; we have tens of thousands of examples.  Each example is 
> simply the unlucky last GFP_ATOMIC allocation that fails; the interesting 
> point is the amount of free memory.  In other words, when free memory is 
> below ALLOC_HIGH watermarks, we assume that we have depleted memory 
> reserves *faster* than when user allocations started to fail.  In the 
> interest of userspace being responsive, we should oom kill here.

My interest is, which function (and which process if process context) is [ab]using
GFP_ATOMIC (or __GFP_MEMALLOC) allocations enough to hit memory allocation failure.
GFP_NOWAIT (or __GFP_NOMEMALLOC) could be used if that allocation can't sleep and
can't shortly recover free memory.
Peter Enderborg April 27, 2020, 8:20 a.m. UTC | #6
On 4/26/20 2:27 AM, Andrew Morton wrote:
> On Fri, 24 Apr 2020 13:48:06 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
>
>> If GFP_ATOMIC allocations will start failing soon because the amount of
>> free memory is substantially under per-zone min watermarks, it is better
>> to oom kill a process rather than continue to reclaim.
>>
>> This intends to significantly reduce the number of page allocation
>> failures that are encountered when the demands of user and atomic
>> allocations overwhelm the ability of reclaim to keep up.  We can see this
>> with a high ingress of networking traffic where memory allocated in irq
>> context can overwhelm the ability to reclaim fast enough such that user
>> memory consistently loops.  In that case, we have reclaimable memory, and
> "user memory allocation", I assume?  Or maybe "blockable memory
> allocatoins".
>
>> reclaiming is successful, but we've fully depleted memory reserves that
>> are allowed for non-blockable allocations.
>>
>> Commit 400e22499dd9 ("mm: don't warn about allocations which stall for
>> too long") removed evidence of user allocations stalling because of this,
>> but the situation can apply anytime we get "page allocation failures"
>> where reclaim is happening but per-zone min watermarks are starved:
>>
>> Node 0 Normal free:87356kB min:221984kB low:416984kB high:611984kB active_anon:123009936kB inactive_anon:67647652kB active_file:429612kB inactive_file:209980kB unevictable:112348kB writepending:260kB present:198180864kB managed:195027624kB mlocked:81756kB kernel_stack:24040kB pagetables:11460kB bounce:0kB free_pcp:940kB local_pcp:96kB free_cma:0kB
>> lowmem_reserve[]: 0 0 0 0
>> Node 1 Normal free:105616kB min:225568kB low:423716kB high:621864kB active_anon:122124196kB inactive_anon:74112696kB active_file:39172kB inactive_file:103696kB unevictable:204480kB writepending:180kB present:201326592kB managed:198174372kB mlocked:204480kB kernel_stack:11328kB pagetables:3680kB bounce:0kB free_pcp:1140kB local_pcp:0kB free_cma:0kB
>> lowmem_reserve[]: 0 0 0 0
>>
>> Without this patch, there is no guarantee that user memory allocations
>> will ever be successful when non-blockable allocations overwhelm the
>> ability to get above per-zone min watermarks.
>>
>> This doesn't solve page allocation failures entirely since it's a
>> preemptive measure based on watermarks that requires concurrent blockable
>> allocations to trigger the oom kill.  To complete solve page allocation
>> failures, it would be possible to do the same watermark check for non-
>> blockable allocations and then queue a worker to asynchronously oom kill
>> if it finds watermarks to be sufficiently low as well.
>>
> Well, what's really going on here?
>
> Is networking potentially consuming an unbounded amount of memory?  If
> so, then killing a process will just cause networking to consume more
> memory then hit against the same thing.  So presumably the answer is
> "no, the watermarks are inappropriately set for this workload".
>
> So would it not be sensible to dynamically adjust the watermarks in
> response to this condition?  Maintain a larger pool of memory for these
> allocations?  Or possibly push back on networking and tell it to reduce
> its queue sizes?  So that stuff doesn't keep on getting oom-killed?
>
I think I have seen similar issues when dma-buf allocates a lot.  But that is on older kernels and out of tree.
So networking is maybe not the only cause; dma-buf is used a lot for camera stuff in Android.
Michal Hocko April 27, 2020, 3:01 p.m. UTC | #7
On Fri 24-04-20 13:48:06, David Rientjes wrote:
> If GFP_ATOMIC allocations will start failing soon because the amount of
> free memory is substantially under per-zone min watermarks, it is better
> to oom kill a process rather than continue to reclaim.
> 
> This intends to significantly reduce the number of page allocation
> failures that are encountered when the demands of user and atomic
> allocations overwhelm the ability of reclaim to keep up.  We can see this
> with a high ingress of networking traffic where memory allocated in irq
> context can overwhelm the ability to reclaim fast enough such that user
> memory consistently loops.  In that case, we have reclaimable memory, and
> reclaiming is successful, but we've fully depleted memory reserves that
> are allowed for non-blockable allocations.
> 
> Commit 400e22499dd9 ("mm: don't warn about allocations which stall for
> too long") removed evidence of user allocations stalling because of this,
> but the situation can apply anytime we get "page allocation failures"
> where reclaim is happening but per-zone min watermarks are starved:
> 
> Node 0 Normal free:87356kB min:221984kB low:416984kB high:611984kB active_anon:123009936kB inactive_anon:67647652kB active_file:429612kB inactive_file:209980kB unevictable:112348kB writepending:260kB present:198180864kB managed:195027624kB mlocked:81756kB kernel_stack:24040kB pagetables:11460kB bounce:0kB free_pcp:940kB local_pcp:96kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 0
> Node 1 Normal free:105616kB min:225568kB low:423716kB high:621864kB active_anon:122124196kB inactive_anon:74112696kB active_file:39172kB inactive_file:103696kB unevictable:204480kB writepending:180kB present:201326592kB managed:198174372kB mlocked:204480kB kernel_stack:11328kB pagetables:3680kB bounce:0kB free_pcp:1140kB local_pcp:0kB free_cma:0kB
> lowmem_reserve[]: 0 0 0 0
> 
> Without this patch, there is no guarantee that user memory allocations
> will ever be successful when non-blockable allocations overwhelm the
> ability to get above per-zone min watermarks.

We have never had any guarantee and we will not have any after this
patch either. The fundamental problem is that direct reclaim doesn't
provide any guarantee that the reclaimed memory is going to be used for
the reclaimer. You can see the same if the memory demand is higher
than the reclaim can keep up with. GFP_ATOMIC allocations are only different in that they
are not throttled by reclaiming and consume what is available right away,
which makes the problem worse.
 
> This doesn't solve page allocation failures entirely since it's a
> preemptive measure based on watermarks that requires concurrent blockable
> allocations to trigger the oom kill.  To complete solve page allocation
> failures, it would be possible to do the same watermark check for non-
> blockable allocations and then queue a worker to asynchronously oom kill
> if it finds watermarks to be sufficiently low as well.

I do not think this is the right approach. This patch is also quite
dangerous as pointed out by Tetsuo. You are effectively allowing a
remote DoS via OOM killer. If the reclaim is making progress then the
issue seems to be more on the configuration side, I believe. Have you
tried to tune the watermarks, e.g. via watermark_scale_factor? Another potential
problem might be that the kswapd is not making sufficient progress
because it is blocked on something.
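
For reference, the knob mentioned here is the /proc/sys/vm/watermark_scale_factor
sysctl (default 10, in units of 1/10000 of memory; larger values widen the gap
between the min/low/high watermarks so kswapd starts reclaiming earlier). A
minimal sketch of raising it from C -- the value 200 is purely illustrative and
not a recommendation:

#include <stdio.h>

int main(void)
{
	/* Widen the watermark gaps so kswapd keeps more memory free.
	 * 200 is an example value only; tune for the actual workload. */
	FILE *f = fopen("/proc/sys/vm/watermark_scale_factor", "w");

	if (!f) {
		perror("watermark_scale_factor");
		return 1;
	}
	fprintf(f, "%d\n", 200);
	fclose(f);
	return 0;
}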

> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/page_alloc.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4372,11 +4372,21 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  					ac->nodemask) {
>  		unsigned long available;
>  		unsigned long reclaimable;
> +		unsigned long free;
>  		unsigned long min_wmark = min_wmark_pages(zone);
>  		bool wmark;
>  
> +		free = zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +		/*
> +		 * If this zone is approaching the point where even order-0
> +		 * GFP_ATOMIC allocations will fail, stop considering reclaim.
> +		 */
> +		if (!__zone_watermark_ok(zone, 0, min_wmark, ac_classzone_idx(ac),
> +					 alloc_flags | ALLOC_HIGH, free))
> +			continue;
> +
>  		available = reclaimable = zone_reclaimable_pages(zone);
> -		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +		available += free;
>  
>  		/*
>  		 * Would the allocation succeed if we reclaimed all
Andrew Morton April 27, 2020, 8:30 p.m. UTC | #8
On Sun, 26 Apr 2020 20:12:58 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> > > blockable allocations and then queue a worker to asynchronously oom kill
> > > if it finds watermarks to be sufficiently low as well.
> > > 
> > 
> > Well, what's really going on here?
> > 
> > Is networking potentially consuming an unbounded amount of memory?  If
> > so, then killing a process will just cause networking to consume more
> > memory then hit against the same thing.  So presumably the answer is
> > "no, the watermarks are inappropriately set for this workload".
> > 
> > So would it not be sensible to dynamically adjust the watermarks in
> > response to this condition?  Maintain a larger pool of memory for these
> > allocations?  Or possibly push back on networking and tell it to reduce
> > its queue sizes?  So that stuff doesn't keep on getting oom-killed?
> > 
> 
> No - that would actually make the problem worse.
> 
> Today, per-zone min watermarks dictate when user allocations will loop or 
> oom kill.  should_reclaim_retry() currently loops if reclaim has succeeded 
> in the past few tries and we should be able to allocate if we are able to 
> reclaim the amount of memory that we think we can.
> 
> The issue is that this supposes that looping to reclaim more will result 
> in more free memory.  That doesn't always happen if there are concurrent 
> memory allocators.
> 
> GFP_ATOMIC allocators can access below these per-zone watermarks.  So the 
> issue is that per-zone free pages stays between ALLOC_HIGH watermarks 
> (the watermark that GFP_ATOMIC allocators can allocate to) and min 
> watermarks.  We never reclaim enough memory to get back to min watermarks 
> because reclaim cannot keep up with the amount of GFP_ATOMIC allocations.

But there should be an upper bound upon the total amount of in-flight
GFP_ATOMIC memory at any point in time?  These aren't like pagecache
which will take more if we give it more.  Setting the various
thresholds appropriately should ensure that blockable allocations don't
get their memory stolen by GFP_ATOMIC allocations?

I took a look at doing a quick-fix for the
direct-reclaimers-get-their-stuff-stolen issue about a million years
ago.  I don't recall where it ended up.  It's pretty trivial for the
direct reclaimer to free pages into current->reclaimed_pages and to
take a look in there on the allocation path, etc.  But it's only
practical for order-0 pages.
David Rientjes April 27, 2020, 11:03 p.m. UTC | #9
On Mon, 27 Apr 2020, Andrew Morton wrote:

> > No - that would actually make the problem worse.
> > 
> > Today, per-zone min watermarks dictate when user allocations will loop or 
> > oom kill.  should_reclaim_retry() currently loops if reclaim has succeeded 
> > in the past few tries and we should be able to allocate if we are able to 
> > reclaim the amount of memory that we think we can.
> > 
> > The issue is that this supposes that looping to reclaim more will result 
> > in more free memory.  That doesn't always happen if there are concurrent 
> > memory allocators.
> > 
> > GFP_ATOMIC allocators can access below these per-zone watermarks.  So the 
> > issue is that per-zone free pages stays between ALLOC_HIGH watermarks 
> > (the watermark that GFP_ATOMIC allocators can allocate to) and min 
> > watermarks.  We never reclaim enough memory to get back to min watermarks 
> > because reclaim cannot keep up with the amount of GFP_ATOMIC allocations.
> 
> But there should be an upper bound upon the total amount of in-flight
> GFP_ATOMIC memory at any point in time?  These aren't like pagecache
> which will take more if we give it more.  Setting the various
> thresholds appropriately should ensure that blockable allocations don't
> get their memory stolen by GPP_ATOMIC allocations?
> 

Certainly if that upper bound is defined and enforced somewhere we would 
not have run into this issue causing all userspace to become completely 
unresponsive.  Do you have links to patches that proposed enforcing this 
upper bound?  It seems like it would have to be generic to 
__alloc_pages_slowpath() because otherwise multiple different GFP_ATOMIC 
allocators, all from different sources, couldn't orchestrate their memory 
allocations amongst themselves to enforce this upper bound.  They would 
need to work together to ensure they don't conspire to cause this 
depletion.  I'd be happy to take a look if there are links to other 
approaches.
Andrew Morton April 27, 2020, 11:35 p.m. UTC | #10
On Mon, 27 Apr 2020 16:03:56 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> On Mon, 27 Apr 2020, Andrew Morton wrote:
> 
> > > No - that would actually make the problem worse.
> > > 
> > > Today, per-zone min watermarks dictate when user allocations will loop or 
> > > oom kill.  should_reclaim_retry() currently loops if reclaim has succeeded 
> > > in the past few tries and we should be able to allocate if we are able to 
> > > reclaim the amount of memory that we think we can.
> > > 
> > > The issue is that this supposes that looping to reclaim more will result 
> > > in more free memory.  That doesn't always happen if there are concurrent 
> > > memory allocators.
> > > 
> > > GFP_ATOMIC allocators can access below these per-zone watermarks.  So the 
> > > issue is that per-zone free pages stays between ALLOC_HIGH watermarks 
> > > (the watermark that GFP_ATOMIC allocators can allocate to) and min 
> > > watermarks.  We never reclaim enough memory to get back to min watermarks 
> > > because reclaim cannot keep up with the amount of GFP_ATOMIC allocations.
> > 
> > But there should be an upper bound upon the total amount of in-flight
> > GFP_ATOMIC memory at any point in time?  These aren't like pagecache
> > which will take more if we give it more.  Setting the various
> > thresholds appropriately should ensure that blockable allocations don't
> > get their memory stolen by GPP_ATOMIC allocations?
> > 
> 
> Certainly if that upper bound is defined and enforced somewhere we would 
> not have run into this issue causing all userspace to become completely 
> unresponsive.  Do you have links to patches that proposed enforcing this 
> upper bound?

There is no such enforcement and there are no such patches, as I'm sure
you know.

No consumer of GFP_ATOMIC memory should consume an unbounded amount of
it.  Subsystems such as networking will consume a certain amount and
will then start recycling it.  The total amount in-flight will vary
over the longer term as workloads change.  A dynamically tuning
threshold system will need to adapt rapidly enough to sudden load
shifts, which might require unreasonable amounts of headroom.

Michal asked relevant questions regarding watermark tuning - an answer
to those would be interesting.  To amplify that, is it possible to
manually tune this system so that the problem no longer exhibits?  If
so, then why can't that tuning be performed automatically?
Michal Hocko April 28, 2020, 7:43 a.m. UTC | #11
On Mon 27-04-20 16:35:58, Andrew Morton wrote:
[...]
> No consumer of GFP_ATOMIC memory should consume an unbounded amount of
> it.
> Subsystems such as networking will consume a certain amount and
> will then start recycling it.  The total amount in-flight will vary
> over the longer term as workloads change.  A dynamically tuning
> threshold system will need to adapt rapidly enough to sudden load
> shifts, which might require unreasonable amounts of headroom.

I do agree. __GFP_HIGH/__GFP_ATOMIC are bound by the size of the
reserves under memory pressure. Then allocations start failing very
quickly and users have to cope with that, usually by deferring to a
sleepable context. Tuning reserves dynamically for heavy reserves
consumers would be possible but I am worried that this is far from
trivial.

We definitely need to understand what is going on here.  Why don't
kswapd + N*direct reclaimers provide enough memory to satisfy
both the N threads + reserves consumers? How many times do those direct
reclaimers have to retry?

We used to have the allocation stall warning as David mentioned in the
patch description and I have seen it triggering without heavy reserves
consumers (i.e. the reported free pages corresponded to the min watermark).
The underlying problem was usually kswapd being stuck on some FS locks,
direct reclaimers stuck in shrinkers, or a way too overloaded system with
dozens if not hundreds of processes stuck in the page allocator, each
racing with the reclaim and betting on luck. The last problem was the
most annoying because it is really hard to tune for.
Vlastimil Babka April 28, 2020, 9:38 a.m. UTC | #12
On 4/27/20 10:30 PM, Andrew Morton wrote:
> On Sun, 26 Apr 2020 20:12:58 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
> 
>> 
>> GFP_ATOMIC allocators can access below these per-zone watermarks.  So the 
>> issue is that per-zone free pages stays between ALLOC_HIGH watermarks 
>> (the watermark that GFP_ATOMIC allocators can allocate to) and min 
>> watermarks.  We never reclaim enough memory to get back to min watermarks 
>> because reclaim cannot keep up with the amount of GFP_ATOMIC allocations.
> 
> But there should be an upper bound upon the total amount of in-flight
> GFP_ATOMIC memory at any point in time?  These aren't like pagecache

If it's a network receive path, then this is effectively bounded by link speed 
vs ability to deal with the packets quickly and free the buffers. And the bursts 
of incoming packets might be out of control of the admin. With my "enterprise 
kernel support" hat on, it's it's annoying enough to explain GFP_ATOMIC failures 
(usually high-order) in dmesg every once in a while (the usual suggestion is to 
bump min_free_kbytes and stress that unless they are frequent, there's no actual 
harm as networking can defer the allocation to non-atomic context). If there was 
an OOM kill as a result that could not be disabled, I can well imagine we would 
have to revert such a patch in our kernel due to the DoS (intentional 
or not) potential.

> which will take more if we give it more.  Setting the various
> thresholds appropriately should ensure that blockable allocations don't
> get their memory stolen by GPP_ATOMIC allocations?

I agree with the view that GFP_ATOMIC is only a (perhaps more visible) part of 
the problem that there's no fairness guarantee in reclaim, and allocators can 
steal from each other. GFP_ATOMIC allocations just have it easier thanks to 
lower thresholds.

> I took a look at doing a quick-fix for the
> direct-reclaimers-get-their-stuff-stolen issue about a million years
> ago.  I don't recall where it ended up.  It's pretty trivial for the
> direct reclaimer to free pages into current->reclaimed_pages and to
> take a look in there on the allocation path, etc.  But it's only
> practical for order-0 pages.

FWIW there's already such an approach added to compaction by Mel some time ago, so 
order>0 allocations are covered to some extent. But in this case I imagine that 
compaction won't even start because order-0 watermarks are too low.

The order-0 reclaim capture might work though - as a result the GFP_ATOMIC 
allocations would more likely fail and defer to their fallback context.
David Rientjes April 28, 2020, 9:48 p.m. UTC | #13
On Tue, 28 Apr 2020, Vlastimil Babka wrote:

> > I took a look at doing a quick-fix for the
> > direct-reclaimers-get-their-stuff-stolen issue about a million years
> > ago.  I don't recall where it ended up.  It's pretty trivial for the
> > direct reclaimer to free pages into current->reclaimed_pages and to
> > take a look in there on the allocation path, etc.  But it's only
> > practical for order-0 pages.
> 
> FWIW there's already such approach added to compaction by Mel some time ago,
> so order>0 allocations are covered to some extent. But in this case I imagine
> that compaction won't even start because order-0 watermarks are too low.
> 
> The order-0 reclaim capture might work though - as a result the GFP_ATOMIC
> allocations would more likely fail and defer to their fallback context.
> 

Yes, order-0 reclaim capture is interesting since the issue being reported 
here is userspace going out to lunch because it loops for an unbounded 
amount of time trying to get above a watermark where it's allowed to 
allocate and other consumers are depleting that resource.

We actually prefer to oom kill earlier rather than being put in a 
perpetual state of aggressive reclaim that affects all allocators and the 
unbounded nature of those allocations leads to very poor results for 
everybody.

I'm happy to scope this solely to an order-0 reclaim capture.  I'm not 
sure whether this has been worked on before and whether patches 
existed in the past?

Somewhat related to what I described in the changelog: we lost the "page 
allocation stalls" artifacts in the kernel log for 4.15.  The commit 
description references an asynchronous mechanism for getting this 
information; I don't know where this mechanism currently lives.
Tetsuo Handa April 28, 2020, 11:37 p.m. UTC | #14
On 2020/04/29 6:48, David Rientjes wrote:
> Somewhat related to what I described in the changelog: we lost the "page 
> allocation stalls" artifacts in the kernel log for 4.15.  The commit 
> description references an asynchronous mechanism for getting this 
> information; I don't know where this mechanism currently lives.

The asynchronous mechanism I meant in that commit is
https://lkml.kernel.org/r/c9c87635-6e00-5ce7-b05a-966011c8fe3f@I-love.SAKURA.ne.jp .
We are still forced to live in a world where we cannot get a clue for stalled allocations. :-(
Vlastimil Babka April 29, 2020, 7:51 a.m. UTC | #15
On 4/28/20 11:48 PM, David Rientjes wrote:
> On Tue, 28 Apr 2020, Vlastimil Babka wrote:
> 
> Yes, order-0 reclaim capture is interesting since the issue being reported 
> here is userspace going out to lunch because it loops for an unbounded 
> amount of time trying to get above a watermark where it's allowed to 
> allocate and other consumers are depleting that resource.
> 
> We actually prefer to oom kill earlier rather than being put in a 
> perpetual state of aggressive reclaim that affects all allocators and the 
> unbounded nature of those allocations leads to very poor results for 
> everybody.

Sure. My vague impression is that your (and similar cloud companies') kind of
workloads are designed to maximize machine utilization, and overshooting and
killing something as a result is no big deal. Then you perhaps have more
probability of hitting this state, and on the other hand, even an occasional
premature oom kill is not a big deal?

My concerns are workloads not designed in such a way, where a premature oom kill
due to temporarily higher reclaim activity together with a burst of incoming network
packets will result in e.g. killing an important database. There, the tradeoff
looks different.

> I'm happy to scope this solely to an order-0 reclaim capture.  I'm not 
> sure if I'm clear on whether this has been worked on before and patches 
> existed in the past?

Andrew mentioned some. I don't recall any, so it might have been before my time.

> Somewhat related to what I described in the changelog: we lost the "page 
> allocation stalls" artifacts in the kernel log for 4.15.  The commit 
> description references an asynchronous mechanism for getting this 
> information; I don't know where this mechanism currently lives.
>
Peter Enderborg April 29, 2020, 8:31 a.m. UTC | #16
On 4/28/20 9:43 AM, Michal Hocko wrote:
> On Mon 27-04-20 16:35:58, Andrew Morton wrote:
> [...]
>> No consumer of GFP_ATOMIC memory should consume an unbounded amount of
>> it.
>> Subsystems such as networking will consume a certain amount and
>> will then start recycling it.  The total amount in-flight will vary
>> over the longer term as workloads change.  A dynamically tuning
>> threshold system will need to adapt rapidly enough to sudden load
>> shifts, which might require unreasonable amounts of headroom.
> I do agree. __GFP_HIGH/__GFP_ATOMIC are bound by the size of the
> reserves under memory pressure. Then allocatios start failing very
> quickly and users have to cope with that, usually by deferring to a
> sleepable context. Tuning reserves dynamically for heavy reserves
> consumers would be possible but I am worried that this is far from
> trivial.
>
> We definitely need to understand what is going on here.  Why doesn't
> kswapd + N*direct reclaimers do not provide enough memory to satisfy
> both N threads + reserves consumers? How many times those direct
> reclaimers have to retry?

Was this not supposed to be avoided with PSI? User-space should get
a fair chance to take action before it goes bad in user-space.


> We used to have the allocation stall warning as David mentioned in the
> patch description and I have seen it triggering without heavy reserves
> consumers (aka reported free pages corresponded to the min watermark).
> The underlying problem was usually kswapd being stuck on some FS locks,
> direct reclaimers stuck in shrinkers or way too overloaded system with
> dozens if not hundreds of processes stuck in the page allocator each
> racing with the reclaim and betting on luck. The last problem was the
> most annoying because it is really hard to tune for.
Michal Hocko April 29, 2020, 9 a.m. UTC | #17
On Wed 29-04-20 10:31:41, peter enderborg wrote:
> On 4/28/20 9:43 AM, Michal Hocko wrote:
> > On Mon 27-04-20 16:35:58, Andrew Morton wrote:
> > [...]
> >> No consumer of GFP_ATOMIC memory should consume an unbounded amount of
> >> it.
> >> Subsystems such as networking will consume a certain amount and
> >> will then start recycling it.  The total amount in-flight will vary
> >> over the longer term as workloads change.  A dynamically tuning
> >> threshold system will need to adapt rapidly enough to sudden load
> >> shifts, which might require unreasonable amounts of headroom.
> > I do agree. __GFP_HIGH/__GFP_ATOMIC are bound by the size of the
> > reserves under memory pressure. Then allocatios start failing very
> > quickly and users have to cope with that, usually by deferring to a
> > sleepable context. Tuning reserves dynamically for heavy reserves
> > consumers would be possible but I am worried that this is far from
> > trivial.
> >
> > We definitely need to understand what is going on here.  Why doesn't
> > kswapd + N*direct reclaimers do not provide enough memory to satisfy
> > both N threads + reserves consumers? How many times those direct
> > reclaimers have to retry?
> 
> Was this not supposed to be avoided with PSI, user-space should
> a fair change to take actions before it goes bad in user-space?

Yes, PSI is certainly a tool to help userspace take action on heavy
reclaim. And I agree that if there is a desire to trigger the oom killer
early as David states elsewhere in the thread then this approach should
be considered.
Michal Hocko April 29, 2020, 9:04 a.m. UTC | #18
On Wed 29-04-20 09:51:39, Vlastimil Babka wrote:
> On 4/28/20 11:48 PM, David Rientjes wrote:
> > On Tue, 28 Apr 2020, Vlastimil Babka wrote:
> > 
> > Yes, order-0 reclaim capture is interesting since the issue being reported 
> > here is userspace going out to lunch because it loops for an unbounded 
> > amount of time trying to get above a watermark where it's allowed to 
> > allocate and other consumers are depleting that resource.
> > 
> > We actually prefer to oom kill earlier rather than being put in a 
> > perpetual state of aggressive reclaim that affects all allocators and the 
> > unbounded nature of those allocations leads to very poor results for 
> > everybody.
> 
> Sure. My vague impression is that your (and similar cloud companies) kind of
> workloads are designed to maximize machine utilization, and overshooting and
> killing something as a result is no big deal. Then you perhaps have more
> probability of hitting this state, and on the other hand, even an occasional
> premature oom kill is not a big deal?
> 
> My concers are workloads not designed in such a way, where premature oom kill
> due to temporary higher reclaim activity together with burst of incoming network
> packets will result in e.g. killing an important database. There, the tradeoff
> looks different.

Completely agreed! The in kernel OOM killer is to deal with situations
when memory is desperately depleted without any sign of a forward
progress. If there is a reclaimable memory then we are not there yet.
If a workload can benefit from early oom killing based on response time
then we have facilities to achieve that (e.g. PSI).
Tetsuo Handa April 29, 2020, 10:45 a.m. UTC | #19
On 2020/04/29 18:04, Michal Hocko wrote:
> Completely agreed! The in kernel OOM killer is to deal with situations
> when memory is desperately depleted without any sign of a forward
> progress. If there is a reclaimable memory then we are not there yet.
> If a workload can benefit from early oom killing based on response time
> then we have facilities to achieve that (e.g. PSI).

Can PSI work even if a userspace process that cannot avoid reclaimable memory
allocations (e.g. page fault, file read) is already stalling? I'm not sure
whether PSI allows responding quickly enough to keep reclaimable memory
allocations from having to reclaim while there is still reclaimable memory...
Michal Hocko April 29, 2020, 11:43 a.m. UTC | #20
On Wed 29-04-20 19:45:07, Tetsuo Handa wrote:
> On 2020/04/29 18:04, Michal Hocko wrote:
> > Completely agreed! The in kernel OOM killer is to deal with situations
> > when memory is desperately depleted without any sign of a forward
> > progress. If there is a reclaimable memory then we are not there yet.
> > If a workload can benefit from early oom killing based on response time
> > then we have facilities to achieve that (e.g. PSI).
> 
> Can PSI work even if userspace process cannot avoid reclaimable memory
> allocations (e.g. page fault, file read) is already stalling?

The userspace itself would have to be careful and use mlock of course.
But collecting the psi information itself should be pretty independent
of memory allocations, as monitoring the system memory state is one of
the main usecases.

> I'm not sure
> whether PSI allows responding quickly enough to "keep reclaimable memory
> allocations not to reclaim" despite there is still reclaimable memory...

PSI is supposed to monitor time spent in the memory allocator (among
other things) and report the tendency. This should be a sufficient
metric to judge that a large part of the userspace is not making forward
progress.
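
For context, the PSI facility referred to above can be consumed from userspace
via pressure-stall triggers (see Documentation/accounting/psi.rst). Below is a
minimal sketch of a monitor that is woken when tasks spend more than 150ms
stalled on memory within a 1s window -- both thresholds are purely illustrative
-- and can then react (log, shed load, kill a low-priority task) before the
kernel OOM killer would:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* "some" = at least one task stalled; 150000us stall per 1000000us window */
	const char trig[] = "some 150000 1000000";
	struct pollfd pfd;
	int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);

	if (fd < 0 || write(fd, trig, strlen(trig) + 1) < 0) {
		perror("/proc/pressure/memory");
		return 1;
	}

	pfd.fd = fd;
	pfd.events = POLLPRI;
	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			break;
		if (pfd.revents & POLLERR)
			break;		/* monitor is no longer valid */
		if (pfd.revents & POLLPRI) {
			/* react here: e.g. drop application caches or kill a victim */
			fprintf(stderr, "memory pressure threshold breached\n");
		}
	}
	close(fd);
	return 0;
}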

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4372,11 +4372,21 @@  should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 					ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
+		unsigned long free;
 		unsigned long min_wmark = min_wmark_pages(zone);
 		bool wmark;
 
+		free = zone_page_state_snapshot(zone, NR_FREE_PAGES);
+		/*
+		 * If this zone is approaching the point where even order-0
+		 * GFP_ATOMIC allocations will fail, stop considering reclaim.
+		 */
+		if (!__zone_watermark_ok(zone, 0, min_wmark, ac_classzone_idx(ac),
+					 alloc_flags | ALLOC_HIGH, free))
+			continue;
+
 		available = reclaimable = zone_reclaimable_pages(zone);
-		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+		available += free;
 
 		/*
 		 * Would the allocation succeed if we reclaimed all