Message ID | 1663556455-30188-1-git-send-email-quic_zhenhuah@quicinc.com (mailing list archive) |
---|---|
State | New |
Series | [RESEND] mm:page_alloc.c: lower the order requirement of should_reclaim_retry |
On Mon 19-09-22 11:00:55, Zhenhua Huang wrote:
> When a driver was continuously allocating order-3 pages, it could very easily OOM even though there were lots of reclaimable pages. A test module was used to reproduce this issue; several key ftrace events are below:
>
> insmod-6968 [005] .... 321.306007: reclaim_retry_zone: node=0 zone=Normal order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0 wmark_check=0
> insmod-6968 [005] .... 321.306009: compact_retry: order=3 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0 max_retries=16 should_retry=1
> insmod-6968 [004] .... 321.308220: mm_compaction_try_to_compact_pages: order=3 gfp_mask=GFP_KERNEL priority=0
> insmod-6968 [004] .... 321.308964: mm_compaction_end: zone_start=0x80000 migrate_pfn=0xaa800 free_pfn=0x80800 zone_end=0x940000, mode=sync status=complete
> insmod-6968 [004] .... 321.308971: reclaim_retry_zone: node=0 zone=Normal order=3 reclaimable=539830 available=592776 min_wmark=21227 no_progress_loops=0 wmark_check=0
> insmod-6968 [004] .... 321.308973: compact_retry: order=3 priority=COMPACT_PRIO_SYNC_FULL compaction_result=failed retries=0 max_retries=16 should_retry=0
>
> There are ~2GB of reclaimable pages (reclaimable=539988) but the VM decides not to reclaim any more:
> insmod-6968 [005] .... 321.306007: reclaim_retry_zone: node=0 zone=Normal order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0 wmark_check=0
>
> From meminfo at OOM time, there were NO qualified order >= 3 pages (CMA pages not qualified) that could meet should_reclaim_retry's requirement:
> Normal : 24671*4kB (UMEC) 13807*8kB (UMEC) 8214*16kB (UEC) 190*32kB (C) 94*64kB (C) 28*128kB (C) 16*256kB (C) 7*512kB (C) 5*1024kB (C) 7*2048kB (C) 46*4096kB (C) = 571796kB
>
> should_reclaim_retry aborts early because it requires free pages of the requested order on the free_list, and order-3 pages are easily fragmented away. Since enough free pages are the foundation of compaction, it may not be suitable to stop reclaiming while there is still a lot of page cache. Relax the order requirement by one to fix this issue.

For the higher order requests we rely on should_compact_retry, which backs off based on the compaction feedback. I would recommend looking into why the compaction fails.

Also this patch doesn't really explain why it should work and honestly it doesn't really make much sense to me either.

> With the change, the meminfo output at the first OOM shows the page cache was nearly exhausted:
>
> Normal free: 462956kB min:8644kB low:44672kB high:50844kB reserved_highatomic:4096KB active_anon:48kB inactive_anon:12kB active_file:508kB inactive_file:552kB unevictable:109016kB writepending:160kB present:7111680kB managed:6175004kB mlocked:107784kB pagetables:78732kB bounce:0kB free_pcp:996kB local_pcp:0kB free_cma:376412kB
>
> Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
> ---
>  mm/page_alloc.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 36b2021..b4ca6d1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4954,8 +4954,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		/*
>  		 * Would the allocation succeed if we reclaimed all
>  		 * reclaimable pages?
> +		 * considering fragmentation, enough free pages are the
> +		 * fundamental of compaction:
> +		 * lower the order requirement by one
>  		 */
> -		wmark = __zone_watermark_ok(zone, order, min_wmark,
> +		wmark = __zone_watermark_ok(zone, order ? order - 1 : 0, min_wmark,
>  				ac->highest_zoneidx, alloc_flags, available);
>  		trace_reclaim_retry_zone(z, order, reclaimable,
>  				available, min_wmark, *no_progress_loops, wmark);
> --
> 2.7.4
Thanks Michal for comments!

On 2022/9/19 16:14, Michal Hocko wrote:
> On Mon 19-09-22 11:00:55, Zhenhua Huang wrote:
[...]
> For the higher order requests we rely on should_compact_retry, which backs off based on the compaction feedback. I would recommend looking into why the compaction fails.

I think the reason for the compaction failure is that there are not enough free pages. As the ftrace events show, the free pages (including CMA) amounted to only 592856 - 539988 = 52868 pages (reclaimable=539988 available=592856).

There are also restrictions such as suitable_migration_target() for free pages and suitable_migration_source() for movable pages, so the eligible targets are fewer still.

> Also this patch doesn't really explain why it should work and honestly it doesn't really make much sense to me either.

Sorry, my fault. IMO, the reason it should work is, say for this case of an order-3 allocation: we can perform direct reclaim more times because, with the order *lowered* by this change, only order-2 pages are needed in the free_list (8214*16kB (UEC)). The order requirement I have lowered is in should_reclaim_retry -> __zone_watermark_ok:

	for (o = order; o < MAX_ORDER; o++) {
		struct free_area *area = &z->free_area[o];
		...
		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
			if (!free_area_empty(area, mt))
				return true;
		}

The order-2 requirement can be met much more easily, hence the VM has more chances to return true from should_reclaim_retry.

[...]

BTW, regarding the test case: I used order 3 with the GFP_KERNEL flag and allocated continuously to reproduce it. If you have other suggestions/ideas, please let me know. Thank you.

Thanks,
Zhenhua
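For reference, a minimal sketch of such a reproducer. Nothing beyond "allocate order-3 pages with GFP_KERNEL continuously" is taken from the thread; the module and function names are made up for illustration, and this is not the author's actual test module:

/*
 * Illustrative order-3 GFP_KERNEL allocation stress module (a sketch, not
 * the reproducer used in the report above).
 */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/gfp.h>
#include <linux/sched.h>

static int __init order3_stress_init(void)
{
	unsigned long count = 0;

	/* keep asking for order-3 (8 contiguous pages) until the allocator fails */
	for (;;) {
		struct page *page = alloc_pages(GFP_KERNEL, 3);

		if (!page)
			break;
		count++;		/* pages are deliberately leaked for the test */
		cond_resched();
	}
	pr_info("order-3 stress: %lu allocations before failure\n", count);
	return 0;
}

static void __exit order3_stress_exit(void)
{
	/* leaked pages are not tracked; reboot after the test */
}

module_init(order3_stress_init);
module_exit(order3_stress_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Illustrative order-3 GFP_KERNEL allocation stress");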
On Mon 19-09-22 19:24:32, Zhenhua Huang wrote:
> Thanks Michal for comments!
>
> On 2022/9/19 16:14, Michal Hocko wrote:
[...]
> I think the reason for the compaction failure is that there are not enough free pages. As the ftrace events show, the free pages (including CMA) amounted to only 592856 - 539988 = 52868 pages (reclaimable=539988 available=592856).
>
> There are also restrictions such as suitable_migration_target() for free pages and suitable_migration_source() for movable pages, so the eligible targets are fewer still.

If the compaction decides the retry is not worth it then either it is making a wrong call or it doesn't make sense to retry.

> > Also this patch doesn't really explain why it should work and honestly it doesn't really make much sense to me either.
> Sorry, my fault. IMO, the reason it should work is, say for this case of an order-3 allocation: we can perform direct reclaim more times because, with the order *lowered* by this change, only order-2 pages are needed in the free_list (8214*16kB (UEC)). The order requirement I have lowered is in should_reclaim_retry -> __zone_watermark_ok:
[...]
> The order-2 requirement can be met much more easily, hence the VM has more chances to return true from should_reclaim_retry.

This is a wrong approach to the problem because there is no real guarantee the reclaim round will do anything useful. You should really be looking at the compaction side of this.
Thanks Michal for comments!

On 2022/9/19 21:28, Michal Hocko wrote:
> On Mon 19-09-22 19:24:32, Zhenhua Huang wrote:
[...]
> If the compaction decides the retry is not worth it then either it is making a wrong call or it doesn't make sense to retry.

Yeah, got it. In this case compaction had already tried its best at the highest priority, as the ftrace logs show.

> This is a wrong approach to the problem because there is no real guarantee the reclaim round will do anything useful. You should really be looking at the compaction side of this.

Thanks Michal for the advice, I'll look at the compaction side as well. But I have one further question: IMO reclaim (~2GB of LRU pages can be reclaimed) should be more feasible than compaction (which already tried at the highest priority and failed) in this case? Could you please elaborate more... it seems I still don't fully understand why it's the wrong approach to retry from the reclaim side.

Thanks,
Zhenhua
On Tue, Sep 20, 2022 at 05:38:30PM +0800, Zhenhua Huang wrote:
[...]
> Thanks Michal for the advice, I'll look at the compaction side as well. But I have one further question: IMO reclaim (~2GB of LRU pages can be reclaimed) should be more feasible than compaction (which already tried at the highest priority and failed) in this case? Could you please elaborate more... it seems I still don't fully understand why it's the wrong approach to retry from the reclaim side.

Because it risks major slowdowns due to excessive reclaim. Early support used "lumpy reclaim" instead of compaction and caused major stalls when trying to allocate THP, which resulted in THP often being disabled. The success rates were great but systems could become unusable for several minutes, and ultimately this led to compaction and the current backoff logic of reclaim. Your scenario is similar: you want to aggressively try to shrink slabs in the hope that an order-3 block of pages gets freed. It might succeed, but the system grinds to a halt with excessive re-reading of information from the disk for other use cases.

Your focus likely should be on reclaim and compaction aborting prematurely because free CMA pages are available at the correct order but the calling context cannot use CMA pages.

It's strange to hear of a driver that has a strict need for order-3 pages being available at all times due to the lack of an IOMMU, because that is going to be fragile. One point of CMA was to carve out a region for such drivers so they could get the contiguous regions they needed. I believe phone cameras were an early example. If your driver has strict requirements for high-order page availability then CMA probably should be configured and the driver should use CMA.
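A minimal sketch of the CMA-backed approach Mel suggests, assuming the driver has a DMA-capable struct device and the platform reserves a CMA region (for example via "cma=" on the kernel command line or a reserved-memory node); the function and buffer names here are hypothetical, not from the original driver:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/mm.h>

#define EXAMPLE_BUF_SIZE	(8 * PAGE_SIZE)	/* an order-3 sized buffer */

/*
 * Allocate a physically contiguous buffer.  With a CMA region configured
 * and no IOMMU, dma_alloc_coherent() can be satisfied from the CMA area,
 * so the driver no longer depends on order-3 pages surviving in the buddy
 * free lists.
 */
static void *example_alloc_buf(struct device *dev, dma_addr_t *dma_handle)
{
	return dma_alloc_coherent(dev, EXAMPLE_BUF_SIZE, dma_handle, GFP_KERNEL);
}

static void example_free_buf(struct device *dev, void *cpu_addr,
			     dma_addr_t dma_handle)
{
	dma_free_coherent(dev, EXAMPLE_BUF_SIZE, cpu_addr, dma_handle);
}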
On Tue 20-09-22 17:38:30, Zhenhua Huang wrote:
[...]
> Thanks Michal for the advice, I'll look at the compaction side as well. But I have one further question: IMO reclaim (~2GB of LRU pages can be reclaimed) should be more feasible than compaction (which already tried at the highest priority and failed) in this case? Could you please elaborate more... it seems I still don't fully understand why it's the wrong approach to retry from the reclaim side.

Mel has explained a large part of what I would say, so let me just be more explicit about one point. Memory reclaim is not order aware, which means the probability of forming higher-order pages is not directly related to the number of reclaimed pages. You might be lucky and reclaim a THP that forms many order-3 pages, or reclaim just the few pages that would create one order-3 page, but all of that is very unpredictable, and hence we rely on compaction. Memory reclaim mostly assists compaction by freeing some spare memory for the migration.
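To make the "reclaim is not order aware" point concrete, here is a small standalone user-space simulation (purely illustrative, not kernel code and not from the thread): it marks pages free at random positions, the way reclaim frees whatever happens to be reclaimable, and then counts how many naturally aligned order-3 (8-page) blocks ended up entirely free.

#include <stdio.h>
#include <stdlib.h>

#define PAGES	(1 << 20)	/* simulate 4GB worth of 4KiB pages */
#define ORDER	3
#define BLOCK	(1 << ORDER)	/* 8 pages per order-3 block */

int main(void)
{
	static unsigned char free_map[PAGES];	/* 1 = page is free */
	long freed = 0, blocks = 0;

	srand(1);
	/* "reclaim" ~5% of the pages at random positions */
	while (freed < PAGES / 20) {
		int p = rand() % PAGES;

		if (!free_map[p]) {
			free_map[p] = 1;
			freed++;
		}
	}

	/* count aligned 8-page runs that are completely free */
	for (int b = 0; b < PAGES; b += BLOCK) {
		int all_free = 1;

		for (int i = 0; i < BLOCK; i++)
			all_free &= free_map[b + i];
		blocks += all_free;
	}

	printf("freed %ld pages, complete order-%d blocks: %ld\n",
	       freed, ORDER, blocks);
	return 0;
}

With roughly 5% of pages freed at random, the expected number of complete 8-page blocks is essentially zero, which mirrors the report above: tens of thousands of free pages, yet no usable order >= 3 page outside CMA.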
Thanks Mel for your detailed comments!

On 2022/9/20 19:02, Mel Gorman wrote:
> On Tue, Sep 20, 2022 at 05:38:30PM +0800, Zhenhua Huang wrote:
[...]
> Because it risks major slowdowns due to excessive reclaim. Early support used "lumpy reclaim" instead of compaction and caused major stalls when trying to allocate THP, which resulted in THP often being disabled. The success rates were great but systems could become unusable for several minutes, and ultimately this led to compaction and the current backoff logic of reclaim. Your scenario is similar: you want to aggressively try to shrink slabs in the hope that an order-3 block of pages gets freed. It might succeed, but the system grinds to a halt with excessive re-reading of information from the disk for other use cases.

Thanks, I've also noticed that. Continuous reclaim clearly lengthened the time to OOM in my tests.

> Your focus likely should be on reclaim and compaction aborting prematurely because free CMA pages are available at the correct order but the calling context cannot use CMA pages.
>
> It's strange to hear of a driver that has a strict need for order-3 pages being available at all times due to the lack of an IOMMU, because that is going to be fragile. One point of CMA was to carve out a region for such drivers so they could get the contiguous regions they needed. I believe phone cameras were an early example. If your driver has strict requirements for high-order page availability then CMA probably should be configured and the driver should use CMA.

Your point is to avoid allocating order-3 pages unless that's really needed. Got it, thanks.

Thanks,
Zhenhua
Thanks Michal again!

On 2022/9/20 19:20, Michal Hocko wrote:
> On Tue 20-09-22 17:38:30, Zhenhua Huang wrote:
[...]
> Mel has explained a large part of what I would say, so let me just be more explicit about one point. Memory reclaim is not order aware, which means the probability of forming higher-order pages is not directly related to the number of reclaimed pages. You might be lucky and reclaim a THP that forms many order-3 pages, or reclaim just the few pages that would create one order-3 page, but all of that is very unpredictable, and hence we rely on compaction. Memory reclaim mostly assists compaction by freeing some spare memory for the migration.

Thanks for explaining reclaim vs. compaction for high-order allocations, Michal. I now agree with you and Mel: compaction should play the more important role here, or we have to use CMA instead.

Thanks,
Zhenhua
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 36b2021..b4ca6d1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4954,8 +4954,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		/*
 		 * Would the allocation succeed if we reclaimed all
 		 * reclaimable pages?
+		 * considering fragmentation, enough free pages are the
+		 * fundamental of compaction:
+		 * lower the order requirement by one
 		 */
-		wmark = __zone_watermark_ok(zone, order, min_wmark,
+		wmark = __zone_watermark_ok(zone, order ? order - 1 : 0, min_wmark,
 				ac->highest_zoneidx, alloc_flags, available);
 		trace_reclaim_retry_zone(z, order, reclaimable,
 				available, min_wmark, *no_progress_loops, wmark);
When a driver was continuously allocating order-3 pages, it could very easily OOM even though there were lots of reclaimable pages. A test module was used to reproduce this issue; several key ftrace events are below:

insmod-6968 [005] .... 321.306007: reclaim_retry_zone: node=0 zone=Normal order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0 wmark_check=0
insmod-6968 [005] .... 321.306009: compact_retry: order=3 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0 max_retries=16 should_retry=1
insmod-6968 [004] .... 321.308220: mm_compaction_try_to_compact_pages: order=3 gfp_mask=GFP_KERNEL priority=0
insmod-6968 [004] .... 321.308964: mm_compaction_end: zone_start=0x80000 migrate_pfn=0xaa800 free_pfn=0x80800 zone_end=0x940000, mode=sync status=complete
insmod-6968 [004] .... 321.308971: reclaim_retry_zone: node=0 zone=Normal order=3 reclaimable=539830 available=592776 min_wmark=21227 no_progress_loops=0 wmark_check=0
insmod-6968 [004] .... 321.308973: compact_retry: order=3 priority=COMPACT_PRIO_SYNC_FULL compaction_result=failed retries=0 max_retries=16 should_retry=0

There are ~2GB of reclaimable pages (reclaimable=539988) but the VM decides not to reclaim any more:

insmod-6968 [005] .... 321.306007: reclaim_retry_zone: node=0 zone=Normal order=3 reclaimable=539988 available=592856 min_wmark=21227 no_progress_loops=0 wmark_check=0

From meminfo at OOM time, there were NO qualified order >= 3 pages (CMA pages not qualified) that could meet should_reclaim_retry's requirement:

Normal : 24671*4kB (UMEC) 13807*8kB (UMEC) 8214*16kB (UEC) 190*32kB (C) 94*64kB (C) 28*128kB (C) 16*256kB (C) 7*512kB (C) 5*1024kB (C) 7*2048kB (C) 46*4096kB (C) = 571796kB

should_reclaim_retry aborts early because it requires free pages of the requested order on the free_list, and order-3 pages are easily fragmented away. Since enough free pages are the foundation of compaction, it may not be suitable to stop reclaiming while there is still a lot of page cache. Relax the order requirement by one to fix this issue.

With the change, the meminfo output at the first OOM shows the page cache was nearly exhausted:

Normal free: 462956kB min:8644kB low:44672kB high:50844kB reserved_highatomic:4096KB active_anon:48kB inactive_anon:12kB active_file:508kB inactive_file:552kB unevictable:109016kB writepending:160kB present:7111680kB managed:6175004kB mlocked:107784kB pagetables:78732kB bounce:0kB free_pcp:996kB local_pcp:0kB free_cma:376412kB

Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
---
 mm/page_alloc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)