| Message ID | alpine.DEB.2.22.394.2004241347310.70176@chino.kir.corp.google.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon |
On 2020/04/25 5:48, David Rientjes wrote:
> @@ -4372,11 +4372,21 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> 					ac->nodemask) {
> 		unsigned long available;
> 		unsigned long reclaimable;
> +		unsigned long free;
> 		unsigned long min_wmark = min_wmark_pages(zone);
> 		bool wmark;
>
> +		free = zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +		/*
> +		 * If this zone is approaching the point where even order-0
> +		 * GFP_ATOMIC allocations will fail, stop considering reclaim.
> +		 */
> +		if (!__zone_watermark_ok(zone, 0, min_wmark, ac_classzone_idx(ac),
> +					 alloc_flags | ALLOC_HIGH, free))
> +			continue;

This is to trigger the OOM killer more aggressively, isn't it? But where is
the guarantee that this is an allocation request which can trigger the OOM
killer? If this is an allocation request which cannot trigger the OOM killer,
wouldn't this cause premature allocation failures?

> +
> 		available = reclaimable = zone_reclaimable_pages(zone);
> -		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +		available += free;
>
> 		/*
> 		 * Would the allocation succeed if we reclaimed all
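[Editor's note: for readers following Tetsuo's objection, the sketch below is a condensed, non-authoritative paraphrase of the conditions under which the page allocator of that era (~v5.6) would actually reach the OOM killer; the helper name is invented and several checks are omitted. A request that fails any of these tests would hit the new `continue` and simply fail rather than trigger a kill.]

```c
/*
 * Illustrative paraphrase of the gating done by __alloc_pages_slowpath(),
 * __alloc_pages_may_oom() and out_of_memory(); the helper name is made up
 * and this is not the complete set of checks.
 */
static bool may_reach_oom_killer(gfp_t gfp_mask, unsigned int order)
{
	/* GFP_ATOMIC/GFP_NOWAIT cannot direct reclaim, so they never get here. */
	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
		return false;
	/* Callers that asked to fail rather than retry hard. */
	if (gfp_mask & (__GFP_NORETRY | __GFP_RETRY_MAYFAIL))
		return false;
	/* Costly orders fail instead of killing, unless __GFP_NOFAIL. */
	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
		return false;
	/* out_of_memory() bails out for !__GFP_FS (IO-less) requests. */
	if (!(gfp_mask & __GFP_FS))
		return false;
	return true;
}
```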
On Fri, 24 Apr 2020 13:48:06 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > If GFP_ATOMIC allocations will start failing soon because the amount of > free memory is substantially under per-zone min watermarks, it is better > to oom kill a process rather than continue to reclaim. > > This intends to significantly reduce the number of page allocation > failures that are encountered when the demands of user and atomic > allocations overwhelm the ability of reclaim to keep up. We can see this > with a high ingress of networking traffic where memory allocated in irq > context can overwhelm the ability to reclaim fast enough such that user > memory consistently loops. In that case, we have reclaimable memory, and "user memory allocation", I assume? Or maybe "blockable memory allocatoins". > reclaiming is successful, but we've fully depleted memory reserves that > are allowed for non-blockable allocations. > > Commit 400e22499dd9 ("mm: don't warn about allocations which stall for > too long") removed evidence of user allocations stalling because of this, > but the situation can apply anytime we get "page allocation failures" > where reclaim is happening but per-zone min watermarks are starved: > > Node 0 Normal free:87356kB min:221984kB low:416984kB high:611984kB active_anon:123009936kB inactive_anon:67647652kB active_file:429612kB inactive_file:209980kB unevictable:112348kB writepending:260kB present:198180864kB managed:195027624kB mlocked:81756kB kernel_stack:24040kB pagetables:11460kB bounce:0kB free_pcp:940kB local_pcp:96kB free_cma:0kB > lowmem_reserve[]: 0 0 0 0 > Node 1 Normal free:105616kB min:225568kB low:423716kB high:621864kB active_anon:122124196kB inactive_anon:74112696kB active_file:39172kB inactive_file:103696kB unevictable:204480kB writepending:180kB present:201326592kB managed:198174372kB mlocked:204480kB kernel_stack:11328kB pagetables:3680kB bounce:0kB free_pcp:1140kB local_pcp:0kB free_cma:0kB > lowmem_reserve[]: 0 0 0 0 > > Without this patch, there is no guarantee that user memory allocations > will ever be successful when non-blockable allocations overwhelm the > ability to get above per-zone min watermarks. > > This doesn't solve page allocation failures entirely since it's a > preemptive measure based on watermarks that requires concurrent blockable > allocations to trigger the oom kill. To complete solve page allocation > failures, it would be possible to do the same watermark check for non- > blockable allocations and then queue a worker to asynchronously oom kill > if it finds watermarks to be sufficiently low as well. > Well, what's really going on here? Is networking potentially consuming an unbounded amount of memory? If so, then killing a process will just cause networking to consume more memory then hit against the same thing. So presumably the answer is "no, the watermarks are inappropriately set for this workload". So would it not be sensible to dynamically adjust the watermarks in response to this condition? Maintain a larger pool of memory for these allocations? Or possibly push back on networking and tell it to reduce its queue sizes? So that stuff doesn't keep on getting oom-killed?
On 2020/04/26 9:27, Andrew Morton wrote: > Well, what's really going on here? > > Is networking potentially consuming an unbounded amount of memory? If > so, then killing a process will just cause networking to consume more > memory then hit against the same thing. So presumably the answer is > "no, the watermarks are inappropriately set for this workload". Maybe somebody is abusing __GFP_MEMALLOC allocations. David, please include whole context (e.g. which function from which process) instead of only memory stat lines.
On Sat, 25 Apr 2020, Andrew Morton wrote: > > If GFP_ATOMIC allocations will start failing soon because the amount of > > free memory is substantially under per-zone min watermarks, it is better > > to oom kill a process rather than continue to reclaim. > > > > This intends to significantly reduce the number of page allocation > > failures that are encountered when the demands of user and atomic > > allocations overwhelm the ability of reclaim to keep up. We can see this > > with a high ingress of networking traffic where memory allocated in irq > > context can overwhelm the ability to reclaim fast enough such that user > > memory consistently loops. In that case, we have reclaimable memory, and > > "user memory allocation", I assume? Or maybe "blockable memory > allocatoins". > "user memory allocations consistently loop", yeah. Thanks. > > reclaiming is successful, but we've fully depleted memory reserves that > > are allowed for non-blockable allocations. > > > > Commit 400e22499dd9 ("mm: don't warn about allocations which stall for > > too long") removed evidence of user allocations stalling because of this, > > but the situation can apply anytime we get "page allocation failures" > > where reclaim is happening but per-zone min watermarks are starved: > > > > Node 0 Normal free:87356kB min:221984kB low:416984kB high:611984kB active_anon:123009936kB inactive_anon:67647652kB active_file:429612kB inactive_file:209980kB unevictable:112348kB writepending:260kB present:198180864kB managed:195027624kB mlocked:81756kB kernel_stack:24040kB pagetables:11460kB bounce:0kB free_pcp:940kB local_pcp:96kB free_cma:0kB > > lowmem_reserve[]: 0 0 0 0 > > Node 1 Normal free:105616kB min:225568kB low:423716kB high:621864kB active_anon:122124196kB inactive_anon:74112696kB active_file:39172kB inactive_file:103696kB unevictable:204480kB writepending:180kB present:201326592kB managed:198174372kB mlocked:204480kB kernel_stack:11328kB pagetables:3680kB bounce:0kB free_pcp:1140kB local_pcp:0kB free_cma:0kB > > lowmem_reserve[]: 0 0 0 0 > > > > Without this patch, there is no guarantee that user memory allocations > > will ever be successful when non-blockable allocations overwhelm the > > ability to get above per-zone min watermarks. > > > > This doesn't solve page allocation failures entirely since it's a > > preemptive measure based on watermarks that requires concurrent blockable > > allocations to trigger the oom kill. To complete solve page allocation > > failures, it would be possible to do the same watermark check for non- > > blockable allocations and then queue a worker to asynchronously oom kill > > if it finds watermarks to be sufficiently low as well. > > > > Well, what's really going on here? > > Is networking potentially consuming an unbounded amount of memory? If > so, then killing a process will just cause networking to consume more > memory then hit against the same thing. So presumably the answer is > "no, the watermarks are inappropriately set for this workload". > > So would it not be sensible to dynamically adjust the watermarks in > response to this condition? Maintain a larger pool of memory for these > allocations? Or possibly push back on networking and tell it to reduce > its queue sizes? So that stuff doesn't keep on getting oom-killed? > No - that would actually make the problem worse. Today, per-zone min watermarks dictate when user allocations will loop or oom kill. 
should_reclaim_retry() currently loops if reclaim has succeeded in the past
few tries and we should be able to allocate if we are able to reclaim the
amount of memory that we think we can.

The issue is that this supposes that looping to reclaim more will result in
more free memory. That doesn't always happen if there are concurrent memory
allocators.

GFP_ATOMIC allocators can access below these per-zone watermarks. So the
issue is that per-zone free pages stays between ALLOC_HIGH watermarks (the
watermark that GFP_ATOMIC allocators can allocate to) and min watermarks.
We never reclaim enough memory to get back to min watermarks because reclaim
cannot keep up with the amount of GFP_ATOMIC allocations.

In production scenarios, this results in userspace processes going out to
lunch because they are constantly looping in the page allocator reclaiming
only for the benefit of GFP_ATOMIC allocations.

In fact, when we hit ALLOC_HIGH watermarks and we start getting "page
allocation failures" in the kernel log, there is also no guarantee that
kswapd's reclaim will outpace GFP_ATOMIC allocations. Thus, an oom kill is
really the best policy at this point to provide an actual guarantee of net
positive memory freeing.

This isn't a matter of any specific networking stack; the scope of
allocations that can trigger this is the set of all GFP_ATOMIC (or
__GFP_MEMALLOC) allocations in the kernel.

Tetsuo: the specific allocation that triggers a page allocation failure is
not interesting; we have tens of thousands of examples. Each example is
simply the unlucky last GFP_ATOMIC allocation that fails; the interesting
point is the amount of free memory. In other words, when free memory is
below ALLOC_HIGH watermarks, we assume that we have depleted memory
reserves *faster* than when user allocations started to fail. In the
interest of userspace being responsive, we should oom kill here.
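[Editor's note: to make the band David describes concrete, here is a condensed sketch of the watermark arithmetic in __zone_watermark_ok() (mm/page_alloc.c, circa v5.6). The helper name is invented and real details such as highatomic reserves, CMA, lowmem_reserve and order > 0 checks are omitted; treat it as an illustration, not the actual function.]

```c
/*
 * Illustrative sketch only: condensed from __zone_watermark_ok(); the
 * helper name is made up and many checks are omitted.
 */
static bool watermark_ok_sketch(unsigned long min_wmark,
				unsigned int alloc_flags, long free_pages)
{
	long min = min_wmark;

	/* __GFP_HIGH (GFP_ATOMIC) callers may dip halfway below the mark... */
	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;
	/* ...and __GFP_ATOMIC (ALLOC_HARDER) callers a further quarter. */
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;

	/*
	 * Atomic allocations therefore keep succeeding while free_pages sits
	 * well below min_wmark -- exactly the band where blockable
	 * allocations fail the full check and keep looping in reclaim.
	 */
	return free_pages > min;
}
```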
On 2020/04/27 12:12, David Rientjes wrote: > Tetsuo: the specific allocation that triggers a page allocation failure is > not interesting; we have tens of thousands of examples. Each example is > simply the unlucky last GFP_ATOMIC allocation that fails; the interesting > point is the amount of free memory. In other words, when free memory is > below ALLOC_HIGH watermarks, we assume that we have depleted memory > reserves *faster* than when user allocations started to fail. In the > interest of userspace being responsive, we should oom kill here. My interest is, which function (and which process if process context) is [ab]using GFP_ATOMIC (or __GFP_MEMALLOC) allocations enough to hit memory allocation failure. GFP_NOWAIT (or __GFP_NOMEMALLOC) could be used if that allocation can't sleep and can't shortly recover free memory.
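[Editor's note: as an illustration of the GFP_NOWAIT fallback Tetsuo suggests, here is a hypothetical driver-style fragment; the rx_ctx structure and both functions are invented, and refill_work is assumed to be set up elsewhere with INIT_WORK(). GFP_ATOMIC implies __GFP_HIGH, i.e. permission to dip below the min watermark, while GFP_NOWAIT only means "do not sleep", so a caller that can recover later avoids draining the atomic reserves.]

```c
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct rx_ctx {
	struct work_struct refill_work;	/* assumed INIT_WORK()ed at setup time */
	void *buf;
};

static void refill_worker(struct work_struct *work)
{
	struct rx_ctx *ctx = container_of(work, struct rx_ctx, refill_work);

	/* Sleepable context: may enter direct reclaim and will normally succeed. */
	ctx->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
}

static void rx_irq_alloc(struct rx_ctx *ctx)
{
	/* Cannot sleep here, but also do not consume the below-min reserves. */
	ctx->buf = kmalloc(PAGE_SIZE, GFP_NOWAIT | __GFP_NOWARN);
	if (!ctx->buf)
		schedule_work(&ctx->refill_work);	/* recover from a blockable context */
}
```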
On 4/26/20 2:27 AM, Andrew Morton wrote: > On Fri, 24 Apr 2020 13:48:06 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > >> If GFP_ATOMIC allocations will start failing soon because the amount of >> free memory is substantially under per-zone min watermarks, it is better >> to oom kill a process rather than continue to reclaim. >> >> This intends to significantly reduce the number of page allocation >> failures that are encountered when the demands of user and atomic >> allocations overwhelm the ability of reclaim to keep up. We can see this >> with a high ingress of networking traffic where memory allocated in irq >> context can overwhelm the ability to reclaim fast enough such that user >> memory consistently loops. In that case, we have reclaimable memory, and > "user memory allocation", I assume? Or maybe "blockable memory > allocatoins". > >> reclaiming is successful, but we've fully depleted memory reserves that >> are allowed for non-blockable allocations. >> >> Commit 400e22499dd9 ("mm: don't warn about allocations which stall for >> too long") removed evidence of user allocations stalling because of this, >> but the situation can apply anytime we get "page allocation failures" >> where reclaim is happening but per-zone min watermarks are starved: >> >> Node 0 Normal free:87356kB min:221984kB low:416984kB high:611984kB active_anon:123009936kB inactive_anon:67647652kB active_file:429612kB inactive_file:209980kB unevictable:112348kB writepending:260kB present:198180864kB managed:195027624kB mlocked:81756kB kernel_stack:24040kB pagetables:11460kB bounce:0kB free_pcp:940kB local_pcp:96kB free_cma:0kB >> lowmem_reserve[]: 0 0 0 0 >> Node 1 Normal free:105616kB min:225568kB low:423716kB high:621864kB active_anon:122124196kB inactive_anon:74112696kB active_file:39172kB inactive_file:103696kB unevictable:204480kB writepending:180kB present:201326592kB managed:198174372kB mlocked:204480kB kernel_stack:11328kB pagetables:3680kB bounce:0kB free_pcp:1140kB local_pcp:0kB free_cma:0kB >> lowmem_reserve[]: 0 0 0 0 >> >> Without this patch, there is no guarantee that user memory allocations >> will ever be successful when non-blockable allocations overwhelm the >> ability to get above per-zone min watermarks. >> >> This doesn't solve page allocation failures entirely since it's a >> preemptive measure based on watermarks that requires concurrent blockable >> allocations to trigger the oom kill. To complete solve page allocation >> failures, it would be possible to do the same watermark check for non- >> blockable allocations and then queue a worker to asynchronously oom kill >> if it finds watermarks to be sufficiently low as well. >> > Well, what's really going on here? > > Is networking potentially consuming an unbounded amount of memory? If > so, then killing a process will just cause networking to consume more > memory then hit against the same thing. So presumably the answer is > "no, the watermarks are inappropriately set for this workload". > > So would it not be sensible to dynamically adjust the watermarks in > response to this condition? Maintain a larger pool of memory for these > allocations? Or possibly push back on networking and tell it to reduce > its queue sizes? So that stuff doesn't keep on getting oom-killed? > I think I seen similar issues when dma-buf allocate a lot. But that is on older kernels and out of tree. So networking is maybe not the only cause. dma-buf are used a lot for camera stuff in android.
On Fri 24-04-20 13:48:06, David Rientjes wrote: > If GFP_ATOMIC allocations will start failing soon because the amount of > free memory is substantially under per-zone min watermarks, it is better > to oom kill a process rather than continue to reclaim. > > This intends to significantly reduce the number of page allocation > failures that are encountered when the demands of user and atomic > allocations overwhelm the ability of reclaim to keep up. We can see this > with a high ingress of networking traffic where memory allocated in irq > context can overwhelm the ability to reclaim fast enough such that user > memory consistently loops. In that case, we have reclaimable memory, and > reclaiming is successful, but we've fully depleted memory reserves that > are allowed for non-blockable allocations. > > Commit 400e22499dd9 ("mm: don't warn about allocations which stall for > too long") removed evidence of user allocations stalling because of this, > but the situation can apply anytime we get "page allocation failures" > where reclaim is happening but per-zone min watermarks are starved: > > Node 0 Normal free:87356kB min:221984kB low:416984kB high:611984kB active_anon:123009936kB inactive_anon:67647652kB active_file:429612kB inactive_file:209980kB unevictable:112348kB writepending:260kB present:198180864kB managed:195027624kB mlocked:81756kB kernel_stack:24040kB pagetables:11460kB bounce:0kB free_pcp:940kB local_pcp:96kB free_cma:0kB > lowmem_reserve[]: 0 0 0 0 > Node 1 Normal free:105616kB min:225568kB low:423716kB high:621864kB active_anon:122124196kB inactive_anon:74112696kB active_file:39172kB inactive_file:103696kB unevictable:204480kB writepending:180kB present:201326592kB managed:198174372kB mlocked:204480kB kernel_stack:11328kB pagetables:3680kB bounce:0kB free_pcp:1140kB local_pcp:0kB free_cma:0kB > lowmem_reserve[]: 0 0 0 0 > > Without this patch, there is no guarantee that user memory allocations > will ever be successful when non-blockable allocations overwhelm the > ability to get above per-zone min watermarks. We have never had any guarantee and we will not have any after this patch either. The fundamental problem is that direct reclaim doesn't provide any guarantee that the reclaimed memory is going to be used for the reclaimer. You can see the same if the memory demand is higher than the reclaim. GFP_ATOMIC is only different in that aspect that they are not throttled by reclaiming and consume what is availble right away which makes the problem worse. > This doesn't solve page allocation failures entirely since it's a > preemptive measure based on watermarks that requires concurrent blockable > allocations to trigger the oom kill. To complete solve page allocation > failures, it would be possible to do the same watermark check for non- > blockable allocations and then queue a worker to asynchronously oom kill > if it finds watermarks to be sufficiently low as well. I do not think this is the right approach. This patch is also quite dangerous as pointed out by Tetsuo. You are effectively allowing a remote DoS via OOM killer. If the reclaim is making progress then the issue seem to be more on the configuration side I believe. Have you tried to tune watermarks resp. watermark_scale_factor? Another potential problem might be that the kswapd is not making sufficient progress because it is blocked on something. 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/page_alloc.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4372,11 +4372,21 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> 					ac->nodemask) {
> 		unsigned long available;
> 		unsigned long reclaimable;
> +		unsigned long free;
> 		unsigned long min_wmark = min_wmark_pages(zone);
> 		bool wmark;
>
> +		free = zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +		/*
> +		 * If this zone is approaching the point where even order-0
> +		 * GFP_ATOMIC allocations will fail, stop considering reclaim.
> +		 */
> +		if (!__zone_watermark_ok(zone, 0, min_wmark, ac_classzone_idx(ac),
> +					 alloc_flags | ALLOC_HIGH, free))
> +			continue;
> +
> 		available = reclaimable = zone_reclaimable_pages(zone);
> -		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +		available += free;
>
> 		/*
> 		 * Would the allocation succeed if we reclaimed all
On Sun, 26 Apr 2020 20:12:58 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > > > blockable allocations and then queue a worker to asynchronously oom kill > > > if it finds watermarks to be sufficiently low as well. > > > > > > > Well, what's really going on here? > > > > Is networking potentially consuming an unbounded amount of memory? If > > so, then killing a process will just cause networking to consume more > > memory then hit against the same thing. So presumably the answer is > > "no, the watermarks are inappropriately set for this workload". > > > > So would it not be sensible to dynamically adjust the watermarks in > > response to this condition? Maintain a larger pool of memory for these > > allocations? Or possibly push back on networking and tell it to reduce > > its queue sizes? So that stuff doesn't keep on getting oom-killed? > > > > No - that would actually make the problem worse. > > Today, per-zone min watermarks dictate when user allocations will loop or > oom kill. should_reclaim_retry() currently loops if reclaim has succeeded > in the past few tries and we should be able to allocate if we are able to > reclaim the amount of memory that we think we can. > > The issue is that this supposes that looping to reclaim more will result > in more free memory. That doesn't always happen if there are concurrent > memory allocators. > > GFP_ATOMIC allocators can access below these per-zone watermarks. So the > issue is that per-zone free pages stays between ALLOC_HIGH watermarks > (the watermark that GFP_ATOMIC allocators can allocate to) and min > watermarks. We never reclaim enough memory to get back to min watermarks > because reclaim cannot keep up with the amount of GFP_ATOMIC allocations. But there should be an upper bound upon the total amount of in-flight GFP_ATOMIC memory at any point in time? These aren't like pagecache which will take more if we give it more. Setting the various thresholds appropriately should ensure that blockable allocations don't get their memory stolen by GPP_ATOMIC allocations? I took a look at doing a quick-fix for the direct-reclaimers-get-their-stuff-stolen issue about a million years ago. I don't recall where it ended up. It's pretty trivial for the direct reclaimer to free pages into current->reclaimed_pages and to take a look in there on the allocation path, etc. But it's only practical for order-0 pages.
On Mon, 27 Apr 2020, Andrew Morton wrote: > > No - that would actually make the problem worse. > > > > Today, per-zone min watermarks dictate when user allocations will loop or > > oom kill. should_reclaim_retry() currently loops if reclaim has succeeded > > in the past few tries and we should be able to allocate if we are able to > > reclaim the amount of memory that we think we can. > > > > The issue is that this supposes that looping to reclaim more will result > > in more free memory. That doesn't always happen if there are concurrent > > memory allocators. > > > > GFP_ATOMIC allocators can access below these per-zone watermarks. So the > > issue is that per-zone free pages stays between ALLOC_HIGH watermarks > > (the watermark that GFP_ATOMIC allocators can allocate to) and min > > watermarks. We never reclaim enough memory to get back to min watermarks > > because reclaim cannot keep up with the amount of GFP_ATOMIC allocations. > > But there should be an upper bound upon the total amount of in-flight > GFP_ATOMIC memory at any point in time? These aren't like pagecache > which will take more if we give it more. Setting the various > thresholds appropriately should ensure that blockable allocations don't > get their memory stolen by GPP_ATOMIC allocations? > Certainly if that upper bound is defined and enforced somewhere we would not have run into this issue causing all userspace to become completely unresponsive. Do you have links to patches that proposed enforcing this upper bound? It seems like it would have to be generic to __alloc_pages_slowpath() because otherwise multiple different GFP_ATOMIC allocators, all from different sources, couldn't orchestrate their memory allocations amongst themselves to enforce this upper bound. They would need to work together to ensure they don't conspire to cause this depletion. I'd be happy to take a look if there are links to other approaches.
On Mon, 27 Apr 2020 16:03:56 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > On Mon, 27 Apr 2020, Andrew Morton wrote: > > > > No - that would actually make the problem worse. > > > > > > Today, per-zone min watermarks dictate when user allocations will loop or > > > oom kill. should_reclaim_retry() currently loops if reclaim has succeeded > > > in the past few tries and we should be able to allocate if we are able to > > > reclaim the amount of memory that we think we can. > > > > > > The issue is that this supposes that looping to reclaim more will result > > > in more free memory. That doesn't always happen if there are concurrent > > > memory allocators. > > > > > > GFP_ATOMIC allocators can access below these per-zone watermarks. So the > > > issue is that per-zone free pages stays between ALLOC_HIGH watermarks > > > (the watermark that GFP_ATOMIC allocators can allocate to) and min > > > watermarks. We never reclaim enough memory to get back to min watermarks > > > because reclaim cannot keep up with the amount of GFP_ATOMIC allocations. > > > > But there should be an upper bound upon the total amount of in-flight > > GFP_ATOMIC memory at any point in time? These aren't like pagecache > > which will take more if we give it more. Setting the various > > thresholds appropriately should ensure that blockable allocations don't > > get their memory stolen by GPP_ATOMIC allocations? > > > > Certainly if that upper bound is defined and enforced somewhere we would > not have run into this issue causing all userspace to become completely > unresponsive. Do you have links to patches that proposed enforcing this > upper bound? There is no such enforcement and there are no such patches, as I'm sure you know. No consumer of GFP_ATOMIC memory should consume an unbounded amount of it. Subsystems such as networking will consume a certain amount and will then start recycling it. The total amount in-flight will vary over the longer term as workloads change. A dynamically tuning threshold system will need to adapt rapidly enough to sudden load shifts, which might require unreasonable amounts of headroom. Michal asked relevant questions regarding watermark tuning - an ansewr to those would be interesting. To amplify that, is it possible to manually tune this system so that the problem no longer exhibits? If so, then why can't that tuning be performed automatically?
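[Editor's note: the manual tuning Michal and Andrew are asking about comes down to a couple of sysctls. The sketch below simply writes them via /proc/sys; the values are arbitrary examples, not recommendations, and the same effect can be had with sysctl(8). vm.min_free_kbytes raises the min watermark itself (and with it the atomic reserves), while vm.watermark_scale_factor, in units of 1/10,000 of a zone's managed pages, widens the min/low/high gaps so kswapd starts reclaiming earlier.]

```c
#include <stdio.h>

/* Write a string to a sysctl file; returns 0 on success, -1 on error. */
static int write_sysctl(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f) ? -1 : 0;
}

int main(void)
{
	int ret = 0;

	/* Example value only: 200 = 2% of a zone (default is 10, i.e. 0.1%). */
	ret |= write_sysctl("/proc/sys/vm/watermark_scale_factor", "200");
	/* Example value only: a larger min watermark / atomic reserve. */
	ret |= write_sysctl("/proc/sys/vm/min_free_kbytes", "1048576");
	return ret ? 1 : 0;
}
```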
On Mon 27-04-20 16:35:58, Andrew Morton wrote:
[...]
> No consumer of GFP_ATOMIC memory should consume an unbounded amount of
> it. Subsystems such as networking will consume a certain amount and
> will then start recycling it. The total amount in-flight will vary
> over the longer term as workloads change. A dynamically tuning
> threshold system will need to adapt rapidly enough to sudden load
> shifts, which might require unreasonable amounts of headroom.

I do agree. __GFP_HIGH/__GFP_ATOMIC allocations are bound by the size of
the reserves under memory pressure. Then allocations start failing very
quickly and users have to cope with that, usually by deferring to a
sleepable context. Tuning reserves dynamically for heavy reserves
consumers would be possible but I am worried that this is far from trivial.

We definitely need to understand what is going on here. Why don't kswapd +
N direct reclaimers provide enough memory to satisfy both the N threads and
the reserves consumers? How many times do those direct reclaimers have to
retry?

We used to have the allocation stall warning as David mentioned in the
patch description and I have seen it triggering without heavy reserves
consumers (aka reported free pages corresponded to the min watermark).
The underlying problem was usually kswapd being stuck on some FS locks,
direct reclaimers stuck in shrinkers, or a way too overloaded system with
dozens if not hundreds of processes stuck in the page allocator, each
racing with the reclaim and betting on luck. The last problem was the
most annoying because it is really hard to tune for.
On 4/27/20 10:30 PM, Andrew Morton wrote:
> On Sun, 26 Apr 2020 20:12:58 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
>
>> GFP_ATOMIC allocators can access below these per-zone watermarks. So the
>> issue is that per-zone free pages stays between ALLOC_HIGH watermarks
>> (the watermark that GFP_ATOMIC allocators can allocate to) and min
>> watermarks. We never reclaim enough memory to get back to min watermarks
>> because reclaim cannot keep up with the amount of GFP_ATOMIC allocations.
>
> But there should be an upper bound upon the total amount of in-flight
> GFP_ATOMIC memory at any point in time? These aren't like pagecache

If it's a network receive path, then this is effectively bounded by link
speed vs. the ability to deal with the packets quickly and free the buffers.
And the bursts of incoming packets might be out of the admin's control.

With my "enterprise kernel support" hat on, it's annoying enough to explain
GFP_ATOMIC failures (usually high-order) in dmesg every once in a while (the
usual suggestion is to bump min_free_kbytes and stress that unless they are
frequent, there's no actual harm as networking can defer the allocation to
non-atomic context). If there was an OOM kill as a result that could not be
disabled, I can well imagine we would have to revert such a patch in our
kernel due to the DoS (intentional or not) potential.

> which will take more if we give it more. Setting the various
> thresholds appropriately should ensure that blockable allocations don't
> get their memory stolen by GPP_ATOMIC allocations?

I agree with the view that GFP_ATOMIC is only a (perhaps more visible) part
of the problem that there's no fairness guarantee in reclaim, and allocators
can steal from each other. GFP_ATOMIC allocations just have it easier thanks
to lower thresholds.

> I took a look at doing a quick-fix for the
> direct-reclaimers-get-their-stuff-stolen issue about a million years
> ago. I don't recall where it ended up. It's pretty trivial for the
> direct reclaimer to free pages into current->reclaimed_pages and to
> take a look in there on the allocation path, etc. But it's only
> practical for order-0 pages.

FWIW there's already such an approach added to compaction by Mel some time
ago, so order>0 allocations are covered to some extent. But in this case I
imagine that compaction won't even start because order-0 watermarks are too
low.

The order-0 reclaim capture might work though - as a result the GFP_ATOMIC
allocations would more likely fail and defer to their fallback context.
On Tue, 28 Apr 2020, Vlastimil Babka wrote: > > I took a look at doing a quick-fix for the > > direct-reclaimers-get-their-stuff-stolen issue about a million years > > ago. I don't recall where it ended up. It's pretty trivial for the > > direct reclaimer to free pages into current->reclaimed_pages and to > > take a look in there on the allocation path, etc. But it's only > > practical for order-0 pages. > > FWIW there's already such approach added to compaction by Mel some time ago, > so order>0 allocations are covered to some extent. But in this case I imagine > that compaction won't even start because order-0 watermarks are too low. > > The order-0 reclaim capture might work though - as a result the GFP_ATOMIC > allocations would more likely fail and defer to their fallback context. > Yes, order-0 reclaim capture is interesting since the issue being reported here is userspace going out to lunch because it loops for an unbounded amount of time trying to get above a watermark where it's allowed to allocate and other consumers are depleting that resource. We actually prefer to oom kill earlier rather than being put in a perpetual state of aggressive reclaim that affects all allocators and the unbounded nature of those allocations leads to very poor results for everybody. I'm happy to scope this solely to an order-0 reclaim capture. I'm not sure if I'm clear on whether this has been worked on before and patches existed in the past? Somewhat related to what I described in the changelog: we lost the "page allocation stalls" artifacts in the kernel log for 4.15. The commit description references an asynchronous mechanism for getting this information; I don't know where this mechanism currently lives.
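[Editor's note: since the thread keeps returning to it, here is a rough sketch of the order-0 reclaim-capture idea Andrew floated above. All names (captured_page and both helpers) are invented, it assumes a new task_struct member, and it is not based on any existing patch; the point is only that a direct reclaimer keeps one page it just freed for itself instead of handing it back to the buddy lists where a concurrent GFP_ATOMIC burst could steal it.]

```c
/*
 * Sketch only: assumes a hypothetical "struct page *captured_page" member
 * added to task_struct; helper names are invented.
 */

/* Called from direct reclaim where a freed page would go back to the buddy lists. */
static bool reclaim_capture_page(struct page *page)
{
	/* Direct reclaimers run with PF_MEMALLOC set. */
	if ((current->flags & PF_MEMALLOC) && !current->captured_page) {
		current->captured_page = page;
		return true;	/* caller skips freeing this page */
	}
	return false;
}

/* Checked on the allocation path after direct reclaim returns. */
static struct page *reclaim_take_captured(unsigned int order)
{
	struct page *page = current->captured_page;

	if (!page || order != 0)	/* only practical for order-0, as noted above */
		return NULL;
	current->captured_page = NULL;
	return page;
}
```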
On 2020/04/29 6:48, David Rientjes wrote: > Somewhat related to what I described in the changelog: we lost the "page > allocation stalls" artifacts in the kernel log for 4.15. The commit > description references an asynchronous mechanism for getting this > information; I don't know where this mechanism currently lives. The asynchronous mechanism I meant in that commit is https://lkml.kernel.org/r/c9c87635-6e00-5ce7-b05a-966011c8fe3f@I-love.SAKURA.ne.jp . We are still forced to live in a world where we cannot get a clue for stalled allocations. :-(
On 4/28/20 11:48 PM, David Rientjes wrote:
> On Tue, 28 Apr 2020, Vlastimil Babka wrote:
>
> Yes, order-0 reclaim capture is interesting since the issue being reported
> here is userspace going out to lunch because it loops for an unbounded
> amount of time trying to get above a watermark where it's allowed to
> allocate and other consumers are depleting that resource.
>
> We actually prefer to oom kill earlier rather than being put in a
> perpetual state of aggressive reclaim that affects all allocators and the
> unbounded nature of those allocations leads to very poor results for
> everybody.

Sure. My vague impression is that your (and similar cloud companies') kind
of workloads are designed to maximize machine utilization, and overshooting
and killing something as a result is no big deal. Then you perhaps have a
higher probability of hitting this state, and on the other hand, even an
occasional premature oom kill is not a big deal?

My concerns are workloads not designed in such a way, where a premature oom
kill due to temporarily higher reclaim activity together with a burst of
incoming network packets will result in e.g. killing an important database.
There, the tradeoff looks different.

> I'm happy to scope this solely to an order-0 reclaim capture. I'm not
> sure if I'm clear on whether this has been worked on before and patches
> existed in the past?

Andrew mentioned some. I don't recall any, so it might have been before my
time.

> Somewhat related to what I described in the changelog: we lost the "page
> allocation stalls" artifacts in the kernel log for 4.15. The commit
> description references an asynchronous mechanism for getting this
> information; I don't know where this mechanism currently lives.
On 4/28/20 9:43 AM, Michal Hocko wrote:
> On Mon 27-04-20 16:35:58, Andrew Morton wrote:
> [...]
>> No consumer of GFP_ATOMIC memory should consume an unbounded amount of
>> it. Subsystems such as networking will consume a certain amount and
>> will then start recycling it. The total amount in-flight will vary
>> over the longer term as workloads change. A dynamically tuning
>> threshold system will need to adapt rapidly enough to sudden load
>> shifts, which might require unreasonable amounts of headroom.
> I do agree. __GFP_HIGH/__GFP_ATOMIC are bound by the size of the
> reserves under memory pressure. Then allocations start failing very
> quickly and users have to cope with that, usually by deferring to a
> sleepable context. Tuning reserves dynamically for heavy reserves
> consumers would be possible but I am worried that this is far from
> trivial.
>
> We definitely need to understand what is going on here. Why don't
> kswapd + N direct reclaimers provide enough memory to satisfy both
> the N threads and the reserves consumers? How many times do those
> direct reclaimers have to retry?

Wasn't this supposed to be avoided with PSI? User space should have a fair
chance to take action before things go bad in user space.

> We used to have the allocation stall warning as David mentioned in the
> patch description and I have seen it triggering without heavy reserves
> consumers (aka reported free pages corresponded to the min watermark).
> The underlying problem was usually kswapd being stuck on some FS locks,
> direct reclaimers stuck in shrinkers, or a way too overloaded system with
> dozens if not hundreds of processes stuck in the page allocator, each
> racing with the reclaim and betting on luck. The last problem was the
> most annoying because it is really hard to tune for.
On Wed 29-04-20 10:31:41, peter enderborg wrote: > On 4/28/20 9:43 AM, Michal Hocko wrote: > > On Mon 27-04-20 16:35:58, Andrew Morton wrote: > > [...] > >> No consumer of GFP_ATOMIC memory should consume an unbounded amount of > >> it. > >> Subsystems such as networking will consume a certain amount and > >> will then start recycling it. The total amount in-flight will vary > >> over the longer term as workloads change. A dynamically tuning > >> threshold system will need to adapt rapidly enough to sudden load > >> shifts, which might require unreasonable amounts of headroom. > > I do agree. __GFP_HIGH/__GFP_ATOMIC are bound by the size of the > > reserves under memory pressure. Then allocatios start failing very > > quickly and users have to cope with that, usually by deferring to a > > sleepable context. Tuning reserves dynamically for heavy reserves > > consumers would be possible but I am worried that this is far from > > trivial. > > > > We definitely need to understand what is going on here. Why doesn't > > kswapd + N*direct reclaimers do not provide enough memory to satisfy > > both N threads + reserves consumers? How many times those direct > > reclaimers have to retry? > > Was this not supposed to be avoided with PSI, user-space should > a fair change to take actions before it goes bad in user-space? Yes, PSI is certainly a tool to help userspace make actions on heavy reclaim. And I agree that if there is a desire to trigger the oom killer early as David states elsewhere in the thread then this approach should be considered.
On Wed 29-04-20 09:51:39, Vlastimil Babka wrote: > On 4/28/20 11:48 PM, David Rientjes wrote: > > On Tue, 28 Apr 2020, Vlastimil Babka wrote: > > > > Yes, order-0 reclaim capture is interesting since the issue being reported > > here is userspace going out to lunch because it loops for an unbounded > > amount of time trying to get above a watermark where it's allowed to > > allocate and other consumers are depleting that resource. > > > > We actually prefer to oom kill earlier rather than being put in a > > perpetual state of aggressive reclaim that affects all allocators and the > > unbounded nature of those allocations leads to very poor results for > > everybody. > > Sure. My vague impression is that your (and similar cloud companies) kind of > workloads are designed to maximize machine utilization, and overshooting and > killing something as a result is no big deal. Then you perhaps have more > probability of hitting this state, and on the other hand, even an occasional > premature oom kill is not a big deal? > > My concers are workloads not designed in such a way, where premature oom kill > due to temporary higher reclaim activity together with burst of incoming network > packets will result in e.g. killing an important database. There, the tradeoff > looks different. Completely agreed! The in kernel OOM killer is to deal with situations when memory is desperately depleted without any sign of a forward progress. If there is a reclaimable memory then we are not there yet. If a workload can benefit from early oom killing based on response time then we have facilities to achieve that (e.g. PSI).
On 2020/04/29 18:04, Michal Hocko wrote:
> Completely agreed! The in kernel OOM killer is to deal with situations
> when memory is desperately depleted without any sign of a forward
> progress. If there is a reclaimable memory then we are not there yet.
> If a workload can benefit from early oom killing based on response time
> then we have facilities to achieve that (e.g. PSI).

Can PSI work even if the userspace process (which cannot avoid reclaimable
memory allocations, e.g. page faults and file reads) is itself already
stalling? I'm not sure whether PSI allows responding quickly enough to keep
reclaimable memory allocations from having to reclaim while there is still
reclaimable memory...
On Wed 29-04-20 19:45:07, Tetsuo Handa wrote: > On 2020/04/29 18:04, Michal Hocko wrote: > > Completely agreed! The in kernel OOM killer is to deal with situations > > when memory is desperately depleted without any sign of a forward > > progress. If there is a reclaimable memory then we are not there yet. > > If a workload can benefit from early oom killing based on response time > > then we have facilities to achieve that (e.g. PSI). > > Can PSI work even if userspace process cannot avoid reclaimable memory > allocations (e.g. page fault, file read) is already stalling? The userspace itself would have to be careful and use mlock of course. But collecting the psi information itself should be pretty independent on memory allocations as monitoring the system memory state is one of the main usecases. > I'm not sure > whether PSI allows responding quickly enough to "keep reclaimable memory > allocations not to reclaim" despite there is still reclaimable memory... PSI is supposed to monitor time spent in the memory allocator (among other things) and report the tendency. This should be a sufficient metric to judge that a large part of the userspace is not making forward progress.
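[Editor's note: to make Michal's suggestion concrete, below is a minimal userspace sketch of a PSI memory trigger, following the example in Documentation/accounting/psi.rst. It assumes a kernel with PSI triggers (v5.2+) and PSI enabled. The "some 150000 1000000" threshold (150ms of stall within any 1s window) is an arbitrary example, and the policy run on each event, e.g. picking a victim to kill early as oomd/lmkd do, is left out.]

```c
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char trig[] = "some 150000 1000000";	/* thresholds in usec */
	struct pollfd fds;

	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0 || write(fds.fd, trig, strlen(trig) + 1) < 0) {
		perror("psi trigger");
		return 1;
	}
	fds.events = POLLPRI;

	for (;;) {
		if (poll(&fds, 1, -1) < 0)
			break;
		if (fds.revents & POLLERR)
			break;		/* the monitored file went away */
		if (fds.revents & POLLPRI)
			printf("memory pressure event: userspace policy would act here\n");
	}
	return 0;
}
```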
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4372,11 +4372,21 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 					ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
+		unsigned long free;
 		unsigned long min_wmark = min_wmark_pages(zone);
 		bool wmark;
 
+		free = zone_page_state_snapshot(zone, NR_FREE_PAGES);
+		/*
+		 * If this zone is approaching the point where even order-0
+		 * GFP_ATOMIC allocations will fail, stop considering reclaim.
+		 */
+		if (!__zone_watermark_ok(zone, 0, min_wmark, ac_classzone_idx(ac),
+					 alloc_flags | ALLOC_HIGH, free))
+			continue;
+
 		available = reclaimable = zone_reclaimable_pages(zone);
-		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+		available += free;
 
 		/*
 		 * Would the allocation succeed if we reclaimed all
If GFP_ATOMIC allocations will start failing soon because the amount of
free memory is substantially under per-zone min watermarks, it is better
to oom kill a process rather than continue to reclaim.

This intends to significantly reduce the number of page allocation
failures that are encountered when the demands of user and atomic
allocations overwhelm the ability of reclaim to keep up. We can see this
with a high ingress of networking traffic where memory allocated in irq
context can overwhelm the ability to reclaim fast enough such that user
memory allocations consistently loop. In that case, we have reclaimable
memory, and reclaiming is successful, but we've fully depleted memory
reserves that are allowed for non-blockable allocations.

Commit 400e22499dd9 ("mm: don't warn about allocations which stall for
too long") removed evidence of user allocations stalling because of this,
but the situation can apply anytime we get "page allocation failures"
where reclaim is happening but per-zone min watermarks are starved:

Node 0 Normal free:87356kB min:221984kB low:416984kB high:611984kB active_anon:123009936kB inactive_anon:67647652kB active_file:429612kB inactive_file:209980kB unevictable:112348kB writepending:260kB present:198180864kB managed:195027624kB mlocked:81756kB kernel_stack:24040kB pagetables:11460kB bounce:0kB free_pcp:940kB local_pcp:96kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:105616kB min:225568kB low:423716kB high:621864kB active_anon:122124196kB inactive_anon:74112696kB active_file:39172kB inactive_file:103696kB unevictable:204480kB writepending:180kB present:201326592kB managed:198174372kB mlocked:204480kB kernel_stack:11328kB pagetables:3680kB bounce:0kB free_pcp:1140kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0

Without this patch, there is no guarantee that user memory allocations
will ever be successful when non-blockable allocations overwhelm the
ability to get above per-zone min watermarks.

This doesn't solve page allocation failures entirely since it's a
preemptive measure based on watermarks that requires concurrent blockable
allocations to trigger the oom kill. To completely solve page allocation
failures, it would be possible to do the same watermark check for
non-blockable allocations and then queue a worker to asynchronously oom
kill if it finds watermarks to be sufficiently low as well.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)