
[RESEND] mm:page_alloc.c: lower the order requirement of should_reclaim_retry

Message ID 1663556455-30188-1-git-send-email-quic_zhenhuah@quicinc.com (mailing list archive)
State New
Series [RESEND] mm:page_alloc.c: lower the order requirement of should_reclaim_retry

Commit Message

Zhenhua Huang Sept. 19, 2022, 3 a.m. UTC
When a driver continuously allocates order-3 pages, the system
can hit OOM very easily even though there are lots of reclaimable
pages. A test module was used to reproduce this issue;
several key ftrace events are shown below:

          insmod-6968    [005] ....   321.306007: reclaim_retry_zone: node=0 zone=Normal
order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0
wmark_check=0
          insmod-6968    [005] ....   321.306009: compact_retry: order=3
priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0
max_retries=16 should_retry=1
          insmod-6968    [004] ....   321.308220:
mm_compaction_try_to_compact_pages: order=3 gfp_mask=GFP_KERNEL priority=0
          insmod-6968    [004] ....   321.308964: mm_compaction_end:
zone_start=0x80000 migrate_pfn=0xaa800 free_pfn=0x80800 zone_end=0x940000,
mode=sync status=complete
          insmod-6968    [004] ....   321.308971: reclaim_retry_zone: node=0
zone=Normal   order=3 reclaimable=539830 available=592776 min_wmark=21227
no_progress_loops=0 wmark_check=0
          insmod-6968    [004] ....   321.308973: compact_retry: order=3
priority=COMPACT_PRIO_SYNC_FULL compaction_result=failed retries=0
max_retries=16 should_retry=0

There are ~2GB of reclaimable pages (reclaimable=539988), but the VM decides
not to reclaim any more:
insmod-6968    [005] ....   321.306007: reclaim_retry_zone: node=0 zone=Normal
order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0
wmark_check=0

From the meminfo output at OOM time, there were NO qualifying order >= 3 pages
(CMA pages do not qualify) that could meet should_reclaim_retry's requirement:
Normal  : 24671*4kB (UMEC) 13807*8kB (UMEC) 8214*16kB (UEC) 190*32kB (C)
94*64kB (C) 28*128kB (C) 16*256kB (C) 7*512kB (C) 5*1024kB (C) 7*2048kB (C)
46*4096kB (C) = 571796kB

should_reclaim_retry aborts early because its check requires pages of the requested
order to already be on the free_list. For order-3 pages, that is easily defeated by
fragmentation. Since enough free pages are a prerequisite for compaction, it may not
be right to stop reclaiming while lots of page cache remains. Relax the order
requirement by one to fix this issue.

With the change applied, the meminfo output at the first OOM shows the page
cache was nearly exhausted:

Normal free: 462956kB min:8644kB low:44672kB high:50844kB
reserved_highatomic:4096KB active_anon:48kB inactive_anon:12kB
active_file:508kB inactive_file:552kB unevictable:109016kB writepending:160kB
present:7111680kB managed:6175004kB mlocked:107784kB pagetables:78732kB
bounce:0kB free_pcp:996kB local_pcp:0kB free_cma:376412kB

Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
---
 mm/page_alloc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Michal Hocko Sept. 19, 2022, 8:14 a.m. UTC | #1
On Mon 19-09-22 11:00:55, Zhenhua Huang wrote:
> When a driver was continuously allocating order 3
> pages, it would be very easily OOM even there were lots of reclaimable
> pages. A test module is used to reproduce this issue,
> several key ftrace events are as below:
> 
>           insmod-6968    [005] ....   321.306007: reclaim_retry_zone: node=0 zone=Normal
> order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0
> wmark_check=0
>           insmod-6968    [005] ....   321.306009: compact_retry: order=3
> priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0
> max_retries=16 should_retry=1
>           insmod-6968    [004] ....   321.308220:
> mm_compaction_try_to_compact_pages: order=3 gfp_mask=GFP_KERNEL priority=0
>           insmod-6968    [004] ....   321.308964: mm_compaction_end:
> zone_start=0x80000 migrate_pfn=0xaa800 free_pfn=0x80800 zone_end=0x940000,
> mode=sync status=complete
>           insmod-6968    [004] ....   321.308971: reclaim_retry_zone: node=0
> zone=Normal   order=3 reclaimable=539830 available=592776 min_wmark=21227
> no_progress_loops=0 wmark_check=0
>           insmod-6968    [004] ....   321.308973: compact_retry: order=3
> priority=COMPACT_PRIO_SYNC_FULL compaction_result=failed retries=0
> max_retries=16 should_retry=0
> 
> There're ~2GB reclaimable pages(reclaimable=539988) but VM decides not to
> reclaim any more:
> insmod-6968    [005] ....   321.306007: reclaim_retry_zone: node=0 zone=Normal
> order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0
> wmark_check=0
> 
> From meminfo when oom, there was NO qualified order >= 3 pages(CMA page not qualified)
> can meet should_reclaim_retry's requirement:
> Normal  : 24671*4kB (UMEC) 13807*8kB (UMEC) 8214*16kB (UEC) 190*32kB (C)
> 94*64kB (C) 28*128kB (C) 16*256kB (C) 7*512kB (C) 5*1024kB (C) 7*2048kB (C)
> 46*4096kB (C) = 571796kB
> 
> The reason of should_reclaim_retry early aborting was that is based on having the order
> pages in its free_list. For order 3 pages, that's easily fragmented. Considering enough free
> pages are the fundamental of compaction. It may not be suitable to stop reclaiming
> when lots of page cache there. Relax order by one to fix this issue.

For higher order requests we rely on should_compact_retry, which backs
off based on the compaction feedback. I would recommend looking into why the
compaction fails.

Also this patch doesn't really explain why it should work and honestly
it doesn't really make much sense to me either.

> With the change meminfo output when first OOM showing page cache was nearly
> exhausted:
> 
> Normal free: 462956kB min:8644kB low:44672kB high:50844kB
> reserved_highatomic:4096KB active_anon:48kB inactive_anon:12kB
> active_file:508kB inactive_file:552kB unevictable:109016kB writepending:160kB
> present:7111680kB managed:6175004kB mlocked:107784kB pagetables:78732kB
> bounce:0kB free_pcp:996kB local_pcp:0kB free_cma:376412kB
> 
> Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
> ---
>  mm/page_alloc.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 36b2021..b4ca6d1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4954,8 +4954,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		/*
>  		 * Would the allocation succeed if we reclaimed all
>  		 * reclaimable pages?
> +		 * considering fragmentation, enough free pages are the
> +		 * fundamental of compaction:
> +		 * lower the order requirement by one
>  		 */
> -		wmark = __zone_watermark_ok(zone, order, min_wmark,
> +		wmark = __zone_watermark_ok(zone, order ? order - 1 : 0, min_wmark,
>  				ac->highest_zoneidx, alloc_flags, available);
>  		trace_reclaim_retry_zone(z, order, reclaimable,
>  				available, min_wmark, *no_progress_loops, wmark);
> -- 
> 2.7.4
Zhenhua Huang Sept. 19, 2022, 11:24 a.m. UTC | #2
Thanks Michal for comments!

On 2022/9/19 16:14, Michal Hocko wrote:
> On Mon 19-09-22 11:00:55, Zhenhua Huang wrote:
>> When a driver was continuously allocating order 3
>> pages, it would be very easily OOM even there were lots of reclaimable
>> pages. A test module is used to reproduce this issue,
>> several key ftrace events are as below:
>>
>>            insmod-6968    [005] ....   321.306007: reclaim_retry_zone: node=0 zone=Normal
>> order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0
>> wmark_check=0
>>            insmod-6968    [005] ....   321.306009: compact_retry: order=3
>> priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0
>> max_retries=16 should_retry=1
>>            insmod-6968    [004] ....   321.308220:
>> mm_compaction_try_to_compact_pages: order=3 gfp_mask=GFP_KERNEL priority=0
>>            insmod-6968    [004] ....   321.308964: mm_compaction_end:
>> zone_start=0x80000 migrate_pfn=0xaa800 free_pfn=0x80800 zone_end=0x940000,
>> mode=sync status=complete
>>            insmod-6968    [004] ....   321.308971: reclaim_retry_zone: node=0
>> zone=Normal   order=3 reclaimable=539830 available=592776 min_wmark=21227
>> no_progress_loops=0 wmark_check=0
>>            insmod-6968    [004] ....   321.308973: compact_retry: order=3
>> priority=COMPACT_PRIO_SYNC_FULL compaction_result=failed retries=0
>> max_retries=16 should_retry=0
>>
>> There're ~2GB reclaimable pages(reclaimable=539988) but VM decides not to
>> reclaim any more:
>> insmod-6968    [005] ....   321.306007: reclaim_retry_zone: node=0 zone=Normal
>> order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0
>> wmark_check=0
>>
>> From meminfo when oom, there was NO qualified order >= 3 pages(CMA page not qualified)
>> can meet should_reclaim_retry's requirement:
>> Normal  : 24671*4kB (UMEC) 13807*8kB (UMEC) 8214*16kB (UEC) 190*32kB (C)
>> 94*64kB (C) 28*128kB (C) 16*256kB (C) 7*512kB (C) 5*1024kB (C) 7*2048kB (C)
>> 46*4096kB (C) = 571796kB
>>
>> The reason of should_reclaim_retry early aborting was that is based on having the order
>> pages in its free_list. For order 3 pages, that's easily fragmented. Considering enough free
>> pages are the fundamental of compaction. It may not be suitable to stop reclaiming
>> when lots of page cache there. Relax order by one to fix this issue.
> 
> For the higher order request we rely on should_compact_retry which backs
> on based on the compaction feedback. I would recommend looking why the
> compaction fails.
I think the reason for the compaction failure is that there are not enough
free pages. As the ftrace events showed, free pages (which include CMA) were
only 592856 - 539988 = 52868 pages (reclaimable=539988 available=592856).

There are also restrictions like suitable_migration_target() for free
pages and suitable_migration_source() for movable pages, so the eligible
targets are fewer still.

> 
> Also this patch doesn't really explain why it should work and honestly
> it doesn't really make much sense to me either.
Sorry, my fault. IMO, the reason it should work is, say for this case of
an order-3 allocation: we can perform direct reclaim more times because we
have order-2 pages (the order *lowered* by this change) in the
free_list (8214*16kB (UEC)). The order requirement I lowered is in
should_reclaim_retry -> __zone_watermark_ok:
        for (o = order; o < MAX_ORDER; o++) {
                 struct free_area *area = &z->free_area[o];
...
                 for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
                         if (!free_area_empty(area, mt))
                                 return true;
                 }

The order-2 requirement can be met much more easily, hence the VM has more
chance to return true from should_reclaim_retry.


> 
>> With the change meminfo output when first OOM showing page cache was nearly
>> exhausted:
>>
>> Normal free: 462956kB min:8644kB low:44672kB high:50844kB
>> reserved_highatomic:4096KB active_anon:48kB inactive_anon:12kB
>> active_file:508kB inactive_file:552kB unevictable:109016kB writepending:160kB
>> present:7111680kB managed:6175004kB mlocked:107784kB pagetables:78732kB
>> bounce:0kB free_pcp:996kB local_pcp:0kB free_cma:376412kB
>>
>> Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
>> ---
>>   mm/page_alloc.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 36b2021..b4ca6d1 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4954,8 +4954,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>>   		/*
>>   		 * Would the allocation succeed if we reclaimed all
>>   		 * reclaimable pages?
>> +		 * considering fragmentation, enough free pages are the
>> +		 * fundamental of compaction:
>> +		 * lower the order requirement by one
>>   		 */
>> -		wmark = __zone_watermark_ok(zone, order, min_wmark,
>> +		wmark = __zone_watermark_ok(zone, order ? order - 1 : 0, min_wmark,
>>   				ac->highest_zoneidx, alloc_flags, available);
>>   		trace_reclaim_retry_zone(z, order, reclaimable,
>>   				available, min_wmark, *no_progress_loops, wmark);
>> -- 
>> 2.7.4
> 

BTW, regarding the test case:
I continuously allocated order-3 pages with the GFP_KERNEL flag to
reproduce it. If you have other suggestions/ideas, please let me know.
Thank you.

Thanks,
Zhenhua
Michal Hocko Sept. 19, 2022, 1:28 p.m. UTC | #3
On Mon 19-09-22 19:24:32, Zhenhua Huang wrote:
> Thanks Michal for comments!
> 
> On 2022/9/19 16:14, Michal Hocko wrote:
> > On Mon 19-09-22 11:00:55, Zhenhua Huang wrote:
> > > When a driver was continuously allocating order 3
> > > pages, it would be very easily OOM even there were lots of reclaimable
> > > pages. A test module is used to reproduce this issue,
> > > several key ftrace events are as below:
> > > 
> > >            insmod-6968    [005] ....   321.306007: reclaim_retry_zone: node=0 zone=Normal
> > > order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0
> > > wmark_check=0
> > >            insmod-6968    [005] ....   321.306009: compact_retry: order=3
> > > priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0
> > > max_retries=16 should_retry=1
> > >            insmod-6968    [004] ....   321.308220:
> > > mm_compaction_try_to_compact_pages: order=3 gfp_mask=GFP_KERNEL priority=0
> > >            insmod-6968    [004] ....   321.308964: mm_compaction_end:
> > > zone_start=0x80000 migrate_pfn=0xaa800 free_pfn=0x80800 zone_end=0x940000,
> > > mode=sync status=complete
> > >            insmod-6968    [004] ....   321.308971: reclaim_retry_zone: node=0
> > > zone=Normal   order=3 reclaimable=539830 available=592776 min_wmark=21227
> > > no_progress_loops=0 wmark_check=0
> > >            insmod-6968    [004] ....   321.308973: compact_retry: order=3
> > > priority=COMPACT_PRIO_SYNC_FULL compaction_result=failed retries=0
> > > max_retries=16 should_retry=0
> > > 
> > > There're ~2GB reclaimable pages(reclaimable=539988) but VM decides not to
> > > reclaim any more:
> > > insmod-6968    [005] ....   321.306007: reclaim_retry_zone: node=0 zone=Normal
> > > order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0
> > > wmark_check=0
> > > 
> > > From meminfo when oom, there was NO qualified order >= 3 pages(CMA page not qualified)
> > > can meet should_reclaim_retry's requirement:
> > > Normal  : 24671*4kB (UMEC) 13807*8kB (UMEC) 8214*16kB (UEC) 190*32kB (C)
> > > 94*64kB (C) 28*128kB (C) 16*256kB (C) 7*512kB (C) 5*1024kB (C) 7*2048kB (C)
> > > 46*4096kB (C) = 571796kB
> > > 
> > > The reason of should_reclaim_retry early aborting was that is based on having the order
> > > pages in its free_list. For order 3 pages, that's easily fragmented. Considering enough free
> > > pages are the fundamental of compaction. It may not be suitable to stop reclaiming
> > > when lots of page cache there. Relax order by one to fix this issue.
> > 
> > For the higher order request we rely on should_compact_retry which backs
> > on based on the compaction feedback. I would recommend looking why the
> > compaction fails.
> I think the reason of compaction failure is there're not enough free pages.
> Like in ftrace events showed, free pages(which include CMA) was only 592856
> - 539988 = 52868 pages(reclaimable=539988 available=592856).
> 
> There are some restrictions like suitable_migration_target() for free pages
> and suitable_migration_source() for movable pages. Hence eligible targets is
> fewer.

If the compaction decides the retry is not worth it then either it is
making a wrong call or it doesn't make sense to retry.
 
> > Also this patch doesn't really explain why it should work and honestly
> > it doesn't really make much sense to me either.
> Sorry, my fault. IMO, The reason it should work is, say for this case of
> order 3 allocation: we can perform direct reclaim more times as we have only
> order 2 pages(which *lowered* by this change) in free_list(8214*16kB (UEC)).
> The order requirement which I have lowered is should_reclaim_retry ->
> __zone_watermark_ok:
>        for (o = order; o < MAX_ORDER; o++) {
>                 struct free_area *area = &z->free_area[o];
> ...
>                 for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
>                         if (!free_area_empty(area, mt))
>                                 return true;
>                 }
> 
> Order 2 pages can be more easily met, hence VM has more chance to return
> true from should_reclaim_retry.

This is a wrong approach to the problem because there is no real
guarantee the reclaim round will do anything useful. You should be
really looking at the compaction side of the thing.
Zhenhua Huang Sept. 20, 2022, 9:38 a.m. UTC | #4
Thanks Michal for comments!

On 2022/9/19 21:28, Michal Hocko wrote:
> On Mon 19-09-22 19:24:32, Zhenhua Huang wrote:
>> Thanks Michal for comments!
>>
>> On 2022/9/19 16:14, Michal Hocko wrote:
>>> On Mon 19-09-22 11:00:55, Zhenhua Huang wrote:
>>>> When a driver was continuously allocating order 3
>>>> pages, it would be very easily OOM even there were lots of reclaimable
>>>> pages. A test module is used to reproduce this issue,
>>>> several key ftrace events are as below:
>>>>
>>>>             insmod-6968    [005] ....   321.306007: reclaim_retry_zone: node=0 zone=Normal
>>>> order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0
>>>> wmark_check=0
>>>>             insmod-6968    [005] ....   321.306009: compact_retry: order=3
>>>> priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0
>>>> max_retries=16 should_retry=1
>>>>             insmod-6968    [004] ....   321.308220:
>>>> mm_compaction_try_to_compact_pages: order=3 gfp_mask=GFP_KERNEL priority=0
>>>>             insmod-6968    [004] ....   321.308964: mm_compaction_end:
>>>> zone_start=0x80000 migrate_pfn=0xaa800 free_pfn=0x80800 zone_end=0x940000,
>>>> mode=sync status=complete
>>>>             insmod-6968    [004] ....   321.308971: reclaim_retry_zone: node=0
>>>> zone=Normal   order=3 reclaimable=539830 available=592776 min_wmark=21227
>>>> no_progress_loops=0 wmark_check=0
>>>>             insmod-6968    [004] ....   321.308973: compact_retry: order=3
>>>> priority=COMPACT_PRIO_SYNC_FULL compaction_result=failed retries=0
>>>> max_retries=16 should_retry=0
>>>>
>>>> There're ~2GB reclaimable pages(reclaimable=539988) but VM decides not to
>>>> reclaim any more:
>>>> insmod-6968    [005] ....   321.306007: reclaim_retry_zone: node=0 zone=Normal
>>>> order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0
>>>> wmark_check=0
>>>>
>>>> From meminfo when oom, there was NO qualified order >= 3 pages(CMA page not qualified)
>>>> can meet should_reclaim_retry's requirement:
>>>> Normal  : 24671*4kB (UMEC) 13807*8kB (UMEC) 8214*16kB (UEC) 190*32kB (C)
>>>> 94*64kB (C) 28*128kB (C) 16*256kB (C) 7*512kB (C) 5*1024kB (C) 7*2048kB (C)
>>>> 46*4096kB (C) = 571796kB
>>>>
>>>> The reason of should_reclaim_retry early aborting was that is based on having the order
>>>> pages in its free_list. For order 3 pages, that's easily fragmented. Considering enough free
>>>> pages are the fundamental of compaction. It may not be suitable to stop reclaiming
>>>> when lots of page cache there. Relax order by one to fix this issue.
>>>
>>> For the higher order request we rely on should_compact_retry which backs
>>> on based on the compaction feedback. I would recommend looking why the
>>> compaction fails.
>> I think the reason of compaction failure is there're not enough free pages.
>> Like in ftrace events showed, free pages(which include CMA) was only 592856
>> - 539988 = 52868 pages(reclaimable=539988 available=592856).
>>
>> There are some restrictions like suitable_migration_target() for free pages
>> and suitable_migration_source() for movable pages. Hence eligible targets is
>> fewer.
> 
> If the compaction decides the retry is not worth it then either it is
> making a wrong call or it doesn't make sense to retry.

Yeah.. got it. In this case compaction had already tried its best with the
highest prio, as the ftrace logs show.

>   
>>> Also this patch doesn't really explain why it should work and honestly
>>> it doesn't really make much sense to me either.
>> Sorry, my fault. IMO, The reason it should work is, say for this case of
>> order 3 allocation: we can perform direct reclaim more times as we have only
>> order 2 pages(which *lowered* by this change) in free_list(8214*16kB (UEC)).
>> The order requirement which I have lowered is should_reclaim_retry ->
>> __zone_watermark_ok:
>>         for (o = order; o < MAX_ORDER; o++) {
>>                  struct free_area *area = &z->free_area[o];
>> ...
>>                  for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
>>                          if (!free_area_empty(area, mt))
>>                                  return true;
>>                  }
>>
>> Order 2 pages can be more easily met, hence VM has more chance to return
>> true from should_reclaim_retry.
> 
> This is a wrong approach to the problem because there is no real
> guarantee the reclaim round will do anything useful. You should be
> really looking at the compaction side of the thing.

Thanks Michal for the advice, I'll look at the compaction side also. 
But I have one further question: IMO, shouldn't reclaim (~2GB of LRU pages 
can be reclaimed) be more feasible than compaction (which already tried 
with the highest prio and failed) in this case? Could you please elaborate 
more... it seems I still don't fully understand why it's a wrong approach 
to retry based on the reclaim side.

Thanks,
Zhenhua
Mel Gorman Sept. 20, 2022, 11:02 a.m. UTC | #5
On Tue, Sep 20, 2022 at 05:38:30PM +0800, Zhenhua Huang wrote:
> > > > Also this patch doesn't really explain why it should work and honestly
> > > > it doesn't really make much sense to me either.
> > > Sorry, my fault. IMO, The reason it should work is, say for this case of
> > > order 3 allocation: we can perform direct reclaim more times as we have only
> > > order 2 pages(which *lowered* by this change) in free_list(8214*16kB (UEC)).
> > > The order requirement which I have lowered is should_reclaim_retry ->
> > > __zone_watermark_ok:
> > >         for (o = order; o < MAX_ORDER; o++) {
> > >                  struct free_area *area = &z->free_area[o];
> > > ...
> > >                  for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> > >                          if (!free_area_empty(area, mt))
> > >                                  return true;
> > >                  }
> > > 
> > > Order 2 pages can be more easily met, hence VM has more chance to return
> > > true from should_reclaim_retry.
> > 
> > This is a wrong approach to the problem because there is no real
> > guarantee the reclaim round will do anything useful. You should be
> > really looking at the compaction side of the thing.
> 
> Thanks Michal for the advice, I'll look at from compaction side also. But I
> have one further question, IMO reclaim(~2GB LRU pages can be reclaimed)
> should be more feasible compared to compaction(already tried with highest
> prio and failed) in this case? Could you please elaborate more...it seems I
> still not fully understand why it's a wrong approach to check from reclaim
> phase.
> 

Because it risks major slowdowns due to excessive reclaim. Early support
used "lumpy reclaim" instead of compaction, and it caused major stalls when
trying to allocate THP, with the result that THP was often disabled. The success
rates were great but systems could become unusable for several minutes,
and ultimately this led to compaction and the current backoff logic
of reclaim. Your scenario is similar: you want to aggressively try to
shrink slabs in the hope that an order-3 block of pages gets freed. It might
succeed, but the system grinds to a halt with excessive re-reading of
information from the disk for other use cases.

Your focus likely should be on reclaim and compaction aborting
prematurely because free CMA pages are available at the correct order
but the calling context cannot use CMA pages.

It's strange to hear of a driver that has a strict need for order-3 pages
being available at all times due to a lack of an IOMMU, because that is
going to be fragile. One point of CMA was to carve out a region for such
drivers so they could get the contiguous regions they needed. I believe phone
cameras were an early example. If your driver has strict requirements for
high-order page availability then CMA probably should be configured and
the driver should use CMA.
Michal Hocko Sept. 20, 2022, 11:20 a.m. UTC | #6
On Tue 20-09-22 17:38:30, Zhenhua Huang wrote:
[...]
> > > > Also this patch doesn't really explain why it should work and honestly
> > > > it doesn't really make much sense to me either.
> > > Sorry, my fault. IMO, The reason it should work is, say for this case of
> > > order 3 allocation: we can perform direct reclaim more times as we have only
> > > order 2 pages(which *lowered* by this change) in free_list(8214*16kB (UEC)).
> > > The order requirement which I have lowered is should_reclaim_retry ->
> > > __zone_watermark_ok:
> > >         for (o = order; o < MAX_ORDER; o++) {
> > >                  struct free_area *area = &z->free_area[o];
> > > ...
> > >                  for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> > >                          if (!free_area_empty(area, mt))
> > >                                  return true;
> > >                  }
> > > 
> > > Order 2 pages can be more easily met, hence VM has more chance to return
> > > true from should_reclaim_retry.
> > 
> > This is a wrong approach to the problem because there is no real
> > guarantee the reclaim round will do anything useful. You should be
> > really looking at the compaction side of the thing.
> 
> Thanks Michal for the advice, I'll look at from compaction side also. But I
> have one further question, IMO reclaim(~2GB LRU pages can be reclaimed)
> should be more feasible compared to compaction(already tried with highest
> prio and failed) in this case? Could you please elaborate more...it seems I
> still not fully understand why it's a wrong approach to check from reclaim
> phase.

Mel has explained a large part of what I would say; let me just be more
explicit about one part. Memory reclaim is not order-aware, and that
means the probability of forming higher-order pages is not directly related
to the number of reclaimed pages. You might be lucky and reclaim a THP to
form many order-3 pages, or just reclaim the few pages that would create one
order-3 block, but all of that is very unpredictable, and hence we rely on
compaction. Memory reclaim mostly assists compaction by freeing some spare
memory to do the migration.
Zhenhua Huang Sept. 21, 2022, 8:12 a.m. UTC | #7
Thanks Mel for your detailed comments!

On 2022/9/20 19:02, Mel Gorman wrote:
> On Tue, Sep 20, 2022 at 05:38:30PM +0800, Zhenhua Huang wrote:
>>>>> Also this patch doesn't really explain why it should work and honestly
>>>>> it doesn't really make much sense to me either.
>>>> Sorry, my fault. IMO, The reason it should work is, say for this case of
>>>> order 3 allocation: we can perform direct reclaim more times as we have only
>>>> order 2 pages(which *lowered* by this change) in free_list(8214*16kB (UEC)).
>>>> The order requirement which I have lowered is should_reclaim_retry ->
>>>> __zone_watermark_ok:
>>>>          for (o = order; o < MAX_ORDER; o++) {
>>>>                   struct free_area *area = &z->free_area[o];
>>>> ...
>>>>                   for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
>>>>                           if (!free_area_empty(area, mt))
>>>>                                   return true;
>>>>                   }
>>>>
>>>> Order 2 pages can be more easily met, hence VM has more chance to return
>>>> true from should_reclaim_retry.
>>>
>>> This is a wrong approach to the problem because there is no real
>>> guarantee the reclaim round will do anything useful. You should be
>>> really looking at the compaction side of the thing.
>>
>> Thanks Michal for the advice, I'll look at from compaction side also. But I
>> have one further question, IMO reclaim(~2GB LRU pages can be reclaimed)
>> should be more feasible compared to compaction(already tried with highest
>> prio and failed) in this case? Could you please elaborate more...it seems I
>> still not fully understand why it's a wrong approach to check from reclaim
>> phase.
>>
> 
> Because it risks major slowdowns due to excessive reclaim. Early support
> used "lumpy reclaim" instead of compaction and resulted in major stalls when
> trying to allocate THP resulting in THP often being disabled. The success
> rates were great but systems could become unusable for several minutes
> and ultimately this resulted in compaction and the current backoff logic
> of reclaim. Your scenario is similar, you want to aggressively trying to
> shrink slabs in case an order-3 block of pages gets freed. It might succeed
> but the system grinds to a halt with excessive re-reading of information
> from the disk for other use cases.

Thanks, I've also noticed that. Continuous reclaim obviously lengthened the 
time to OOM, as I observed.
> 
> Your focus likely should be on reclaim and compaction aborting
> prematurely because free CMA pages are available at the correct order
> but the calling context cannot use CMA pages.
> 
> It's strange to hear of a driver that has a strict need for order-3 pages
> being available at all times due to a lack of an IOMMU because that is
> going to be fragile. One point of CMA was to carve out a region for such
> drivers so they could the contiguous regions they needed. I believe phone
> cameras were an early example. If your driver has strict requirements for
> high-order page availability then CMA probably should be configured and
> the driver should use CMA.
> 

Your point is to avoid allocating order-3 unless it's really needed.
Got it, thanks.

Thanks,
Zhenhua
Zhenhua Huang Sept. 21, 2022, 8:18 a.m. UTC | #8
Thanks Michal again!

On 2022/9/20 19:20, Michal Hocko wrote:
> On Tue 20-09-22 17:38:30, Zhenhua Huang wrote:
> [...]
>>>>> Also this patch doesn't really explain why it should work and honestly
>>>>> it doesn't really make much sense to me either.
>>>> Sorry, my fault. IMO, The reason it should work is, say for this case of
>>>> order 3 allocation: we can perform direct reclaim more times as we have only
>>>> order 2 pages(which *lowered* by this change) in free_list(8214*16kB (UEC)).
>>>> The order requirement which I have lowered is should_reclaim_retry ->
>>>> __zone_watermark_ok:
>>>>          for (o = order; o < MAX_ORDER; o++) {
>>>>                   struct free_area *area = &z->free_area[o];
>>>> ...
>>>>                   for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
>>>>                           if (!free_area_empty(area, mt))
>>>>                                   return true;
>>>>                   }
>>>>
>>>> Order 2 pages can be more easily met, hence VM has more chance to return
>>>> true from should_reclaim_retry.
>>>
>>> This is a wrong approach to the problem because there is no real
>>> guarantee the reclaim round will do anything useful. You should be
>>> really looking at the compaction side of the thing.
>>
>> Thanks Michal for the advice, I'll look at from compaction side also. But I
>> have one further question, IMO reclaim(~2GB LRU pages can be reclaimed)
>> should be more feasible compared to compaction(already tried with highest
>> prio and failed) in this case? Could you please elaborate more...it seems I
>> still not fully understand why it's a wrong approach to check from reclaim
>> phase.
> 
> Mel, has explained a large part I would go with let me just be more
> explicit about one part. Memory reclaim is not order aware and that
> means that probability of higher order pages are not directly related to
> the number of reclaimed pages. You might be lucky and reclaim a THP to
> form many order-3 pages or just reclaim those few that would create one
> order-3 but all that is very unpredictable and hence we rely on the
> compaction. Memory reclaim mostly assists compaction to allow some spare
> memory to do the migration.

Thanks for kindly explaining reclaim vs. compaction for high-order 
allocations, Michal. Now I agree with Mel and you: compaction 
should play the more important role in this.. or we have to use CMA instead.

Thanks,
Zhenhua

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 36b2021..b4ca6d1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4954,8 +4954,11 @@  should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		/*
 		 * Would the allocation succeed if we reclaimed all
 		 * reclaimable pages?
+		 * considering fragmentation, enough free pages are the
+		 * fundamental of compaction:
+		 * lower the order requirement by one
 		 */
-		wmark = __zone_watermark_ok(zone, order, min_wmark,
+		wmark = __zone_watermark_ok(zone, order ? order - 1 : 0, min_wmark,
 				ac->highest_zoneidx, alloc_flags, available);
 		trace_reclaim_retry_zone(z, order, reclaimable,
 				available, min_wmark, *no_progress_loops, wmark);