
[V3,3/3] mm: page_alloc: drain pcp lists before oom kill

Message ID a8e16f7eb295e1843f8edaa1ae1c68325c54c896.1699104759.git.quic_charante@quicinc.com (mailing list archive)
State New
Series mm: page_alloc: fixes for early oom kills

Commit Message

Charan Teja Kalla Nov. 5, 2023, 12:50 p.m. UTC
pcp lists are drained from __alloc_pages_direct_reclaim(), but only if
some progress was made in the reclaim attempt:

struct page *__alloc_pages_direct_reclaim() {
    .....
   *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
   if (unlikely(!(*did_some_progress)))
      goto out;
retry:
    page = get_page_from_freelist();
    if (!page && !drained) {
        drain_all_pages(NULL);
        drained = true;
        goto retry;
    }
out:
}

After the above, the allocation attempt can fall back to
should_reclaim_retry() to decide on reclaim retries. If it too returns
false, the allocation request simply falls back to the oom kill path
without even attempting to drain the pcp pages, which might have helped
the allocation succeed.

A VM running with ~50MB of memory showed the below stats during an OOM
kill:
Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
reserved_highatomic:0KB managed:49152kB free_pcp:460kB

Though an OOM kill is imminent in such a system state, this particular
kill could have been delayed by draining the pcp lists, as pcp + free is
even above the high watermark.

Fix this missing drain of the pcp lists in should_reclaim_retry(),
alongside unreserving the high atomic page blocks, like it is done in
__alloc_pages_direct_reclaim().

Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
 mm/page_alloc.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Charan Teja Kalla Nov. 5, 2023, 12:55 p.m. UTC | #1
Sorry, this is supposed to be PATCH V2 in place of V3. Not sure if I
have to resend it as V2 again.

On 11/5/2023 6:20 PM, Charan Teja Kalla wrote:
> pcp lists are drained from __alloc_pages_direct_reclaim(), only if some
> progress is made in the attempt.
> 
> struct page *__alloc_pages_direct_reclaim() {
>     .....
>    *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
>    if (unlikely(!(*did_some_progress)))
>       goto out;
> retry:
>     page = get_page_from_freelist();
>     if (!page && !drained) {
>         drain_all_pages(NULL);
>         drained = true;
>         goto retry;
>     }
> out:
> }
> 
> After the above, allocation attempt can fallback to
> should_reclaim_retry() to decide reclaim retries. If it too return
> false, allocation request will simply fallback to oom kill path without
> even attempting the draining of the pcp pages that might help the
> allocation attempt to succeed.
> 
> VM system running with ~50MB of memory shown the below stats during OOM
> kill:
> Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
> reserved_highatomic:0KB managed:49152kB free_pcp:460kB
> 
> Though in such system state OOM kill is imminent, but the current kill
> could have been delayed if the pcp is drained as pcp + free is even
> above the high watermark.
> 
> Fix this missing drain of pcp list in should_reclaim_retry() along with
> unreserving the high atomic page blocks, like it is done in
> __alloc_pages_direct_reclaim().
> 
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> ---
>  mm/page_alloc.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b91c99e..8eee292 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3857,8 +3857,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		cond_resched();
>  out:
>  	/* Before OOM, exhaust highatomic_reserve */
> -	if (!ret)
> -		return unreserve_highatomic_pageblock(ac, true);
> +	if (!ret) {
> +		ret =  unreserve_highatomic_pageblock(ac, true);
> +		drain_all_pages(NULL);
> +	}
>  
>  	return ret;
>  }
Michal Hocko Nov. 9, 2023, 10:33 a.m. UTC | #2
On Sun 05-11-23 18:20:50, Charan Teja Kalla wrote:
> pcp lists are drained from __alloc_pages_direct_reclaim(), only if some
> progress is made in the attempt.
> 
> struct page *__alloc_pages_direct_reclaim() {
>     .....
>    *did_some_progress = __perform_reclaim(gfp_mask, order, ac);
>    if (unlikely(!(*did_some_progress)))
>       goto out;
> retry:
>     page = get_page_from_freelist();
>     if (!page && !drained) {
>         drain_all_pages(NULL);
>         drained = true;
>         goto retry;
>     }
> out:
> }
> 
> After the above, allocation attempt can fallback to
> should_reclaim_retry() to decide reclaim retries. If it too return
> false, allocation request will simply fallback to oom kill path without
> even attempting the draining of the pcp pages that might help the
> allocation attempt to succeed.
> 
> VM system running with ~50MB of memory shown the below stats during OOM
> kill:
> Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
> reserved_highatomic:0KB managed:49152kB free_pcp:460kB
> 
> Though in such system state OOM kill is imminent, but the current kill
> could have been delayed if the pcp is drained as pcp + free is even
> above the high watermark.

TBH I am not sure this is really worth it. Does it really reduce the
risk of the OOM in any practical situation?

> Fix this missing drain of pcp list in should_reclaim_retry() along with
> unreserving the high atomic page blocks, like it is done in
> __alloc_pages_direct_reclaim().
> 
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> ---
>  mm/page_alloc.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b91c99e..8eee292 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3857,8 +3857,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		cond_resched();
>  out:
>  	/* Before OOM, exhaust highatomic_reserve */
> -	if (!ret)
> -		return unreserve_highatomic_pageblock(ac, true);
> +	if (!ret) {
> +		ret =  unreserve_highatomic_pageblock(ac, true);
> +		drain_all_pages(NULL);
> +	}
>  
>  	return ret;
>  }
> -- 
> 2.7.4
Charan Teja Kalla Nov. 10, 2023, 4:36 p.m. UTC | #3
Thanks Michal!!

On 11/9/2023 4:03 PM, Michal Hocko wrote:
>> VM system running with ~50MB of memory shown the below stats during OOM
>> kill:
>> Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
>> reserved_highatomic:0KB managed:49152kB free_pcp:460kB
>>
>> Though in such system state OOM kill is imminent, but the current kill
>> could have been delayed if the pcp is drained as pcp + free is even
>> above the high watermark.
> TBH I am not sure this is really worth it. Does it really reduce the
> risk of the OOM in any practical situation?

At least in my particular stress test case it just delayed the OOM, as I
can see that at the time of the OOM kill there are no free pcp pages. My
understanding of OOM is that it should be the last resort, taken only
after doing enough reclaim retries. CMIW here.

This patch just aims to not miss the corner case where we hit OOM
without draining the pcp lists. After draining, some systems may not
need the OOM kill and some may still need it. My case is the latter, so
I am really not sure if we have ever encountered/noticed the former case.

>
Michal Hocko Nov. 14, 2023, 10:48 a.m. UTC | #4
On Fri 10-11-23 22:06:22, Charan Teja Kalla wrote:
> Thanks Michal!!
> 
> On 11/9/2023 4:03 PM, Michal Hocko wrote:
> >> VM system running with ~50MB of memory shown the below stats during OOM
> >> kill:
> >> Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
> >> reserved_highatomic:0KB managed:49152kB free_pcp:460kB
> >>
> >> Though in such system state OOM kill is imminent, but the current kill
> >> could have been delayed if the pcp is drained as pcp + free is even
> >> above the high watermark.
> > TBH I am not sure this is really worth it. Does it really reduce the
> > risk of the OOM in any practical situation?
> 
> At least in my particular stress test case it just delayed the OOM as i
> can see that at the time of OOM kill, there are no free pcp pages. My
> understanding of the OOM is that it should be the last resort and only
> after doing the enough reclaim retries. CMIW here.

Yes it is a last resort but it is a heuristic as well. So the real
question is whether this makes any practical difference outside of
artificial workloads. I do not see anything particularly worrying about
draining the pcp cache, but it should be noted that this won't be 100%
either, as racing freeing of memory will end up on the pcp lists first.
Charan Teja Kalla Nov. 14, 2023, 4:36 p.m. UTC | #5
Thanks Michal!!

On 11/14/2023 4:18 PM, Michal Hocko wrote:
>> At least in my particular stress test case it just delayed the OOM as i
>> can see that at the time of OOM kill, there are no free pcp pages. My
>> understanding of the OOM is that it should be the last resort and only
>> after doing the enough reclaim retries. CMIW here.
> Yes it is a last resort but it is a heuristic as well. So the real
> questoin is whether this makes any practical difference outside of
> artificial workloads. I do not see anything particularly worrying to
> drain the pcp cache but it should be noted that this won't be 100%
> either as racing freeing of memory will end up on pcp lists first.

Okay, I don't have any practical scenario where this helped me in
avoiding the OOM. I will come back if I ever encounter this issue in a
practical scenario.

Also, if you have any comments on [PATCH V2 2/3] mm: page_alloc: correct
high atomic reserve calculations, they would help me.

Thanks.
Michal Hocko Nov. 15, 2023, 2:09 p.m. UTC | #6
On Tue 14-11-23 22:06:45, Charan Teja Kalla wrote:
> Thanks Michal!!
> 
> On 11/14/2023 4:18 PM, Michal Hocko wrote:
> >> At least in my particular stress test case it just delayed the OOM as i
> >> can see that at the time of OOM kill, there are no free pcp pages. My
> >> understanding of the OOM is that it should be the last resort and only
> >> after doing the enough reclaim retries. CMIW here.
> > Yes it is a last resort but it is a heuristic as well. So the real
> > questoin is whether this makes any practical difference outside of
> > artificial workloads. I do not see anything particularly worrying to
> > drain the pcp cache but it should be noted that this won't be 100%
> > either as racing freeing of memory will end up on pcp lists first.
> 
> Okay, I don't have any practical scenario where this helped me in
> avoiding the OOM.  Will comeback If I ever encounter this issue in
> practical scenario.
> 
> Also If you have any comments on [PATCH V2 2/3] mm: page_alloc: correct
> high atomic reserve calculations will help me.

I do not have a strong opinion on that one to be honest. I am not even
sure that reserving a full page block (4MB) on small systems as
presented is really a good use of memory.
Charan Teja Kalla Nov. 16, 2023, 6 a.m. UTC | #7
Thanks Michal.

On 11/15/2023 7:39 PM, Michal Hocko wrote:
>> Also If you have any comments on [PATCH V2 2/3] mm: page_alloc: correct
>> high atomic reserve calculations will help me.
> I do not have a strong opinion on that one to be honest. I am not even
> sure that reserving a full page block (4MB) on small systems as
> presented is really a good use of memory.

Maybe another way to look at that patch is that the comment is really
not reflected in the code. It says, "Limit the number reserved to 1
pageblock or roughly 1% of a zone.", but the current code is making it 2
pageblocks. So, for a 4M block size, it is > 1%.

A second patch, that I will post, would avoid reserving the high atomic
page blocks on small systems -- but how to define the meaning of "small
systems" is unclear. Instead it would let system administrators choose
this through either:
a) a command line param, high_atomic_reserves=off, on by default --
another knob, so admins may really not like this?
b) CONFIG_HIGH_ATOMIC_RESERVES, which, if not defined, will not reserve.

Please let me know if you have any more suggestions here.

Also, I am thinking of requesting Andrew to pick the [PATCH V2 1/3]
patch and taking these discussions to a separate thread.

Thanks,
Charan
Michal Hocko Nov. 16, 2023, 12:55 p.m. UTC | #8
On Thu 16-11-23 11:30:04, Charan Teja Kalla wrote:
> Thanks Michal.
> 
> On 11/15/2023 7:39 PM, Michal Hocko wrote:
> >> Also If you have any comments on [PATCH V2 2/3] mm: page_alloc: correct
> >> high atomic reserve calculations will help me.
> > I do not have a strong opinion on that one to be honest. I am not even
> > sure that reserving a full page block (4MB) on small systems as
> > presented is really a good use of memory.
> 
> May be other way to look at that patch is comment is really not being
> reflected in the code. It says, " Limit the number reserved to 1
> pageblock or roughly 1% of a zone.", but the current code is making it 2
> pageblocks. So, for 4M block size, it is > 1%.
> 
> A second patch, that I will post, like not reserving the high atomic
> page blocks on small systems -- But how to define the meaning of small
> systems is not sure. Instead will let the system administrators chose
> this through either:
> a) command line param, high_atomic_reserves=off, on by default --
> Another knob, so admins may really not like this?
> b) CONFIG_HIGH_ATOMIC_RESERVES, which if not defined, will not reserve.

Please don't! I do not see any admin wanting to care about this at all.
It just takes a lot of understanding of internal MM stuff to make an
educated guess. This should really be auto-tuned. And as I responded in
the other reply, my take would be to reserve a page block only if it
doesn't consume more than 1% of memory, to preserve the existing
behavior yet not overconsume on small systems.
 
> Please lmk If you have any more suggestions here?
> 
> Also, I am thinking to request Andrew to pick [PATCH V2 1/3] patch and
> take these discussions separately in a separate thread.

That makes sense as that is a clear bug fix.
Charan Teja Kalla Nov. 17, 2023, 5:43 a.m. UTC | #9
Thanks Michal!!

On 11/16/2023 6:25 PM, Michal Hocko wrote:
>> May be other way to look at that patch is comment is really not being
>> reflected in the code. It says, " Limit the number reserved to 1
>> pageblock or roughly 1% of a zone.", but the current code is making it 2
>> pageblocks. So, for 4M block size, it is > 1%.
>>
>> A second patch, that I will post, like not reserving the high atomic
>> page blocks on small systems -- But how to define the meaning of small
>> systems is not sure. Instead will let the system administrators chose
>> this through either:
>> a) command line param, high_atomic_reserves=off, on by default --
>> Another knob, so admins may really not like this?
>> b) CONFIG_HIGH_ATOMIC_RESERVES, which if not defined, will not reserve.
> Please don't! I do not see any admin wanting to care about this at all.
> It just takes a lot of understanding of internal MM stuff to make an
> educated guess. This should really be auto-tuned. And as responded in
> other reply my take would be to reserve a page block on if it doesn't
> consume more than 1% of memory to preserve the existing behavior yet not
> overconsume on small systems.

This idea of auto-tuning, by reserving a pageblock only if it doesn't
consume more than 1% of memory, seems cleaner to me. For a pageblock
size of 4MB, the cutoff works out to be 400MB of RAM.

If that is fine, I can post a patch with Suggested-by: you.

>
Zach O'Keefe Jan. 25, 2024, 4:36 p.m. UTC | #10
Thanks for the patch, Charan, and thanks to Yosry for pointing me towards it.

I took a look at data from our fleet, and there are many cases on
high-cpu-count machines where we find multi-GiB worth of data sitting
on pcpu free lists at the time of system oom-kill, when free memory
for the relevant zones are below min watermarks. I.e. clear cases
where this patch could have prevented OOM.

This kind of issue scales with the number of cpus, so presumably this
patch will only become increasingly valuable to datacenters and
desktops alike going forward. Can we revamp it as a standalone patch?

Thanks,
Zach


On Tue, Nov 14, 2023 at 8:37 AM Charan Teja Kalla
<quic_charante@quicinc.com> wrote:
>
> Thanks Michal!!
>
> On 11/14/2023 4:18 PM, Michal Hocko wrote:
> >> At least in my particular stress test case it just delayed the OOM as i
> >> can see that at the time of OOM kill, there are no free pcp pages. My
> >> understanding of the OOM is that it should be the last resort and only
> >> after doing the enough reclaim retries. CMIW here.
> > Yes it is a last resort but it is a heuristic as well. So the real
> > questoin is whether this makes any practical difference outside of
> > artificial workloads. I do not see anything particularly worrying to
> > drain the pcp cache but it should be noted that this won't be 100%
> > either as racing freeing of memory will end up on pcp lists first.
>
> Okay, I don't have any practical scenario where this helped me in
> avoiding the OOM.  Will comeback If I ever encounter this issue in
> practical scenario.
>
> Also If you have any comments on [PATCH V2 2/3] mm: page_alloc: correct
> high atomic reserve calculations will help me.
>
> Thanks.
>
Charan Teja Kalla Jan. 26, 2024, 10:47 a.m. UTC | #11
Hi Michal/Zach,

On 1/25/2024 10:06 PM, Zach O'Keefe wrote:
> Thanks for the patch, Charan, and thanks to Yosry for pointing me towards it.
> 
> I took a look at data from our fleet, and there are many cases on
> high-cpu-count machines where we find multi-GiB worth of data sitting
> on pcpu free lists at the time of system oom-kill, when free memory
> for the relevant zones are below min watermarks. I.e. clear cases
> where this patch could have prevented OOM.
> 
> This kind of issue scales with the number of cpus, so presumably this
> patch will only become increasingly valuable to both datacenters and
> desktops alike going forward. Can we revamp it as a standalone patch?
> 

Glad to see a real-world use case for this. We too have observed OOM
every now and then with a relatively significant PCP cache, but in all
such cases the OOM is imminent.

AFAICS, your use case description looks like a premature OOM scenario
despite a lot of free memory sitting on the pcp lists, where this patch
should've helped.

@Michal: This use case seems to be the practical scenario that you were
asking about below.
On the other concern, of racing freeing of memory ending up on the pcp
lists first -- will that be such a big issue? This patch enables
draining the current pcp lists now, which can avoid the OOM altogether.
If this racing free is a major concern, should that be taken up as a
separate discussion?

I will revamp this as a separate patch if there are no more concerns here.

> Thanks,
> Zach
> 
> 
> On Tue, Nov 14, 2023 at 8:37 AM Charan Teja Kalla
> <quic_charante@quicinc.com> wrote:
>>
>> Thanks Michal!!
>>
>> On 11/14/2023 4:18 PM, Michal Hocko wrote:
>>>> At least in my particular stress test case it just delayed the OOM as i
>>>> can see that at the time of OOM kill, there are no free pcp pages. My
>>>> understanding of the OOM is that it should be the last resort and only
>>>> after doing the enough reclaim retries. CMIW here.
>>> Yes it is a last resort but it is a heuristic as well. So the real
>>> questoin is whether this makes any practical difference outside of
>>> artificial workloads. I do not see anything particularly worrying to
>>> drain the pcp cache but it should be noted that this won't be 100%
>>> either as racing freeing of memory will end up on pcp lists first.
>>
>> Okay, I don't have any practical scenario where this helped me in
>> avoiding the OOM.  Will comeback If I ever encounter this issue in
>> practical scenario.
>>
>> Also If you have any comments on [PATCH V2 2/3] mm: page_alloc: correct
>> high atomic reserve calculations will help me.
>>
>> Thanks.
>>
Michal Hocko Jan. 26, 2024, 10:57 a.m. UTC | #12
On Fri 26-01-24 16:17:04, Charan Teja Kalla wrote:
> Hi Michal/Zach,
> 
> On 1/25/2024 10:06 PM, Zach O'Keefe wrote:
> > Thanks for the patch, Charan, and thanks to Yosry for pointing me towards it.
> > 
> > I took a look at data from our fleet, and there are many cases on
> > high-cpu-count machines where we find multi-GiB worth of data sitting
> > on pcpu free lists at the time of system oom-kill, when free memory
> > for the relevant zones are below min watermarks. I.e. clear cases
> > where this patch could have prevented OOM.
> > 
> > This kind of issue scales with the number of cpus, so presumably this
> > patch will only become increasingly valuable to both datacenters and
> > desktops alike going forward. Can we revamp it as a standalone patch?

Do you have any example OOM reports? There were recent changes to scale
the pcp pages and it would be good to know whether they work reasonably
well even under memory pressure.

I am not objecting to the patch discussed here but it would be really
good to understand the underlying problem and the scale of it.

Thanks!
Zach O'Keefe Jan. 26, 2024, 10:51 p.m. UTC | #13
Hey Michal,

> Do you have any example OOM reports? [..]

Sure, here is one on a 1TiB, 128-physical core machine running a
5.10-based kernel (sorry, it reads pretty awkwardly when wrapped):

---8<---
mytask invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE),
order=0, oom_score_adj=0
<...>
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=sdc,mems_allowed=0-1,global_oom,task_memcg=/sdc,task=mytask,pid=835214,uid=0
Out of memory: Killed process 835214 (mytask) total-vm:787716604kB,
anon-rss:787536152kB, file-rss:64kB, shmem-rss:0kB, UID:0
pgtables:1541224kB oom_score_adj:0, hugetlb-usage:0kB
Mem-Info:
active_anon:320 inactive_anon:198083493 isolated_anon:0
 active_file:128283 inactive_file:290086 isolated_file:0
 unevictable:3525 dirty:15 writeback:0
 slab_reclaimable:35505 slab_unreclaimable:272917
 mapped:46414 shmem:822 pagetables:64085088
 sec_pagetables:0 bounce:0
 kernel_misc_reclaimable:0
 free:325793 free_pcp:263277 free_cma:0
Node 0 active_anon:1112kB inactive_anon:268172556kB
active_file:270992kB inactive_file:254612kB unevictable:12404kB
isolated(anon):0kB isolated(file):0kB mapped:147240kB dirty:52kB
writeback:0kB shmem:304kB shmem_thp:0kB shmem_pmdmapped:0kB
anon_thp:1310720kB writeback_tmp:0kB kernel_stack:32000kB
pagetables:255483108kB sec_pagetables:0kB all_unreclaimable? yes
Node 1 active_anon:168kB inactive_anon:524161416kB
active_file:242140kB inactive_file:905732kB unevictable:1696kB
isolated(anon):0kB isolated(file):0kB mapped:38416kB dirty:8kB
writeback:0kB shmem:2984kB shmem_thp:0kB shmem_pmdmapped:0kB
anon_thp:267732992kB writeback_tmp:0kB kernel_stack:8520kB
pagetables:857244kB sec_pagetables:0kB all_unreclaimable? yes
Node 0 Crash free:72kB min:108kB low:220kB high:332kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:111940kB
active_file:280kB inactive_file:316kB unevictable:0kB writepending:4kB
present:114284kB managed:114196kB mlocked:0kB bounce:0kB
free_pcp:1528kB local_pcp:24kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 0 DMA32 free:66592kB min:2580kB low:5220kB high:7860kB
reserved_highatomic:0KB active_anon:8kB inactive_anon:19456kB
active_file:4kB inactive_file:224kB unevictable:0kB writepending:0kB
present:2643512kB managed:2643512kB mlocked:0kB bounce:0kB
free_pcp:8040kB local_pcp:244kB free_cma:0kB
lowmem_reserve[]: 0 0 16029 16029
Node 0 Normal free:513048kB min:513192kB low:1038700kB high:1564208kB
reserved_highatomic:0KB active_anon:1104kB inactive_anon:268040520kB
active_file:270708kB inactive_file:254072kB unevictable:12404kB
writepending:48kB present:533969920kB managed:525510968kB
mlocked:12344kB bounce:0kB free_pcp:790040kB local_pcp:7060kB
free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:723460kB min:755656kB low:1284080kB high:1812504kB
reserved_highatomic:0KB active_anon:168kB inactive_anon:524161416kB
active_file:242140kB inactive_file:905732kB unevictable:1696kB
writepending:8kB present:536866816kB managed:528427664kB
mlocked:1588kB bounce:0kB free_pcp:253500kB local_pcp:12kB
free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 0 Crash: 0*4kB 0*8kB 1*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB
0*512kB 0*1024kB 0*2048kB 0*4096kB = 16kB
Node 0 DMA32: 80*4kB (UME) 74*8kB (UE) 23*16kB (UME) 21*32kB (UME)
40*64kB (UE) 35*128kB (UME) 3*256kB (UE) 9*512kB (UME) 13*1024kB (UM)
19*2048kB (UME) 0*4096kB = 66592kB
Node 0 Normal: 1999*4kB (UE) 259*8kB (UM) 465*16kB (UM) 114*32kB (UE)
54*64kB (UME) 14*128kB (U) 74*256kB (UME) 128*512kB (UE) 96*1024kB (U)
56*2048kB (U) 46*4096kB (U) = 512292kB
Node 1 Normal: 2280*4kB (UM) 12667*8kB (UM) 8859*16kB (UME) 5221*32kB
(UME) 1631*64kB (UME) 899*128kB (UM) 330*256kB (UME) 0*512kB 0*1024kB
0*2048kB 0*4096kB = 723208kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=2048kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=1048576kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=2048kB
420675 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 268435456kB
Total swap = 268435456kB
---8<---

Node 0/1 Normal free memory is below the respective min watermarks, with
790040kB + 253500kB ~= 1GiB of memory on the pcp lists.

With this patch, the GFP_HIGHUSER_MOVABLE + unrestricted mems_allowed
allocation would have been able to access all that memory, very likely
avoiding the OOM.

> [..] There were recent changes to scale
> the pcp pages and it would be good to know whether they work reasonably
> well even under memory pressure.

I'm not familiar with these changes, but a quick check of recent
activity points to v6.7 commit fa8c4f9a665b ("mm: fix draining remote
pageset"); is this what you are referring to?

Thanks, and have a great day,
Zach



>
> I am not objecting to the patch discussed here but it would be really
> good to understand the underlying problem and the scale of it.
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Jan. 29, 2024, 3:04 p.m. UTC | #14
On Fri 26-01-24 14:51:26, Zach O'Keefe wrote:
[...]
> Node 0 DMA32 free:66592kB min:2580kB low:5220kB high:7860kB
[...]
> free_pcp:8040kB local_pcp:244kB free_cma:0kB
> lowmem_reserve[]: 0 0 16029 16029
> Node 0 Normal free:513048kB min:513192kB low:1038700kB high:1564208kB
[...]
> mlocked:12344kB bounce:0kB free_pcp:790040kB local_pcp:7060kB
[...]
> mlocked:1588kB bounce:0kB free_pcp:253500kB local_pcp:12kB
[...]
> I'm not familiar with these changes, but a quick check of recent
> activity points to v6.7 commit fa8c4f9a665b ("mm: fix draining remote
> pageset") ; is this what you are referring to?

No, but looking at the above discrepancy between free_pcp and local_pcp
would point in that direction for sure. So this is worth checking.
vmstat is a periodic activity and it cannot really deal with bursts
of memory allocations, but it is quite possible that the patch above
will prevent the build-up before it grows that large.

I originally referred to different work though https://lore.kernel.org/all/20231016053002.756205-10-ying.huang@intel.com/T/#m9fdfabaee37db1320bbc678a69d1cdd8391640e0
merged as ca71fe1ad922 ("mm, pcp: avoid to drain PCP when process exit")
and the associated patches.
Zach O'Keefe Feb. 6, 2024, 11:15 p.m. UTC | #15
On Mon, Jan 29, 2024 at 7:04 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 26-01-24 14:51:26, Zach O'Keefe wrote:
> [...]
> > Node 0 DMA32 free:66592kB min:2580kB low:5220kB high:7860kB
> [...]
> > free_pcp:8040kB local_pcp:244kB free_cma:0kB
> > lowmem_reserve[]: 0 0 16029 16029
> > Node 0 Normal free:513048kB min:513192kB low:1038700kB high:1564208kB
> [...]
> > mlocked:12344kB bounce:0kB free_pcp:790040kB local_pcp:7060kB
> [...]
> > mlocked:1588kB bounce:0kB free_pcp:253500kB local_pcp:12kB
> [...]
> > I'm not familiar with these changes, but a quick check of recent
> > activity points to v6.7 commit fa8c4f9a665b ("mm: fix draining remote
> > pageset") ; is this what you are referring to?
>
> No, but looking at above discrepancy between free_pcp and local_pcp
> would point that direction for sure. So this is worth checking.
> vmstat is a periodic activity and it cannot really deal with bursts
> of memory allocations but it is quite possible that the patch above
> will prevent the build up before it grows that large.
>
> I originally referred to different work though https://lore.kernel.org/all/20231016053002.756205-10-ying.huang@intel.com/T/#m9fdfabaee37db1320bbc678a69d1cdd8391640e0
> merged as ca71fe1ad922 ("mm, pcp: avoid to drain PCP when process exit")
> and the associated patches.

Thanks for the response, Michal, and also for the reference here.

It'll take me a while to evaluate how these patches might have helped,
and whether draining the pcp lists would have added anything on top. I
may not get to that right away, but I wanted to thank you for your
response, and to leave this discussion, for the moment, with the ball
in my court to return with findings.

Thanks,
Zach

> --
> Michal Hocko
> SUSE Labs

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b91c99e..8eee292 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3857,8 +3857,10 @@  should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		cond_resched();
 out:
 	/* Before OOM, exhaust highatomic_reserve */
-	if (!ret)
-		return unreserve_highatomic_pageblock(ac, true);
+	if (!ret) {
+		ret =  unreserve_highatomic_pageblock(ac, true);
+		drain_all_pages(NULL);
+	}
 
 	return ret;
 }