Message ID | 1736929894-19228-1-git-send-email-yangge1116@126.com (mailing list archive) |
---|---|
State | New |
Series | mm: compaction: use the actual allocation context to determine the watermarks for costly order during async memory compaction |

On 1/15/25 09:31, yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
>
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
>
> Long term GUP cannot allocate memory from CMA area, so a maximum of
> 16 GB of no-CMA memory on a NUMA node can be used as virtual machine
> memory. There is 16GB of free CMA memory on a NUMA node, which is
> sufficient to pass the order-0 watermark check, causing the
> __compaction_suitable() function to consistently return true.
>
> For costly allocations, if the __compaction_suitable() function always
> returns true, it causes the __alloc_pages_slowpath() function to fail
> to exit at the appropriate point. This prevents timely fallback to
> allocating memory on other nodes, ultimately resulting in excessively
> long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>     if (compact_result == COMPACT_SKIPPED ||
>         compact_result == COMPACT_DEFERRED)
>         goto nopage; // should exit __alloc_pages_slowpath() from here
>
> We could use the real unmovable allocation context to have
> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't
> pass the order-0 check anymore once the non-CMA part is exhausted. There
> is some risk that in some different scenario the compaction could in
> fact migrate pages from the exhausted non-CMA part of the zone to the
> CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY
> allocations should be affected in the immediate "goto nopage" when
> compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY
> anyway and won't fail without trying to compact-migrate the non-CMA
> pageblocks into CMA pageblocks first, so it should be fine.
>
> After this fix, it only takes a few tens of seconds to start a 32GB
> virtual machine with device passthrough functionality.

So did you verify it works? I just realized there might be still cases it
won't help. There might be enough free order-0 pages in the non-CMA
pageblocks (so the additional check will not stop us) but fragmented and
impossible to compact due to unmovable pages. Then we won't avoid your
issue, right?

> Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
> Signed-off-by: yangge <yangge1116@126.com>

In case it really helps reliably:

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Some nits below:

> ---
>  mm/compaction.c | 31 +++++++++++++++++++++++++++----
>  1 file changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 07bd227..9032bb6 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2490,7 +2490,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>   */
>  static enum compact_result
>  compaction_suit_allocation_order(struct zone *zone, unsigned int order,
> -				 int highest_zoneidx, unsigned int alloc_flags)
> +				 int highest_zoneidx, unsigned int alloc_flags,
> +				 bool async)
>  {
>  	unsigned long watermark;
>
> @@ -2499,6 +2500,25 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>  			      alloc_flags))
>  		return COMPACT_SUCCESS;
>
> +	/*
> +	 * For costly orders, during the async memory compaction process, use the
> +	 * actual allocation context to determine the watermarks. There's some risk
> +	 * that in some different scenario the compaction could in fact migrate
> +	 * pages from the exhausted non-CMA part of the zone to the CMA part and
> +	 * succeed, and we'll skip it instead. But only __GFP_NORETRY allocations
> +	 * should be affected in the immediate "goto nopage" when compaction is
> +	 * skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't
> +	 * fail without trying to compact-migrate the non-CMA pageblocks into CMA
> +	 * pageblocks first, so it should be fine.

I think it's explaining too much about why not do this than why do this. How
about:

For unmovable allocations (without ALLOC_CMA), check if there is enough free
memory in the non-CMA pageblocks. Otherwise compaction could form the
high-order page in CMA pageblocks, which would not help the allocation to
succeed. However, limit the check to costly order async compaction (such as
opportunistic THP attempts) because there is the possibility that compaction
would migrate pages from non-CMA to CMA pageblock.

> +	 */
> +	if (order > PAGE_ALLOC_COSTLY_ORDER && async) {

We could also check for !(alloc_flags & ALLOC_CMA) here to avoid the
watermark check in the normal THP allocation case (not from pinned gup),
because then it just repeats the watermark check that was done above.

> +		watermark = low_wmark_pages(zone) + compact_gap(order);
> +		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
> +					   alloc_flags & ALLOC_CMA,

And then here we can just pass 0.

> +					   zone_page_state(zone, NR_FREE_PAGES)))
> +			return COMPACT_SKIPPED;
> +	}
> +
>  	if (!compaction_suitable(zone, order, highest_zoneidx))
>  		return COMPACT_SKIPPED;
>
> @@ -2534,7 +2554,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>  	if (!is_via_compact_memory(cc->order)) {
>  		ret = compaction_suit_allocation_order(cc->zone, cc->order,
>  						       cc->highest_zoneidx,
> -						       cc->alloc_flags);
> +						       cc->alloc_flags,
> +						       cc->mode == MIGRATE_ASYNC);
>  		if (ret != COMPACT_CONTINUE)
>  			return ret;
>  	}
> @@ -3037,7 +3058,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
>
>  		ret = compaction_suit_allocation_order(zone,
>  				pgdat->kcompactd_max_order,
> -				highest_zoneidx, ALLOC_WMARK_MIN);
> +				highest_zoneidx, ALLOC_WMARK_MIN,
> +				0);

It's bool, so false instead of 0.

>  		if (ret == COMPACT_CONTINUE)
>  			return true;
>  	}
> @@ -3078,7 +3100,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>  			continue;
>
>  		ret = compaction_suit_allocation_order(zone,
> -				cc.order, zoneid, ALLOC_WMARK_MIN);
> +				cc.order, zoneid, ALLOC_WMARK_MIN,
> +				cc.mode == MIGRATE_ASYNC);

We could also just pass false here as kcompactd uses MIGRATE_SYNC_LIGHT and
has no real alloc_context.

>  		if (ret != COMPACT_CONTINUE)
>  			continue;
>
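
To make the watermark arithmetic discussed above concrete, here is a minimal userspace sketch, not kernel code. It assumes that the order-0 check boils down to "usable free pages > watermark", that free CMA pages are subtracted from the usable count when the caller may not use CMA, and that compact_gap(order) is 2 << order. The names toy_zone, toy_watermark_ok(), toy_compact_gap() and TOY_ALLOC_CMA are hypothetical stand-ins, and lowmem reserves and high-atomic reserves are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit; it does not match the kernel's ALLOC_CMA value. */
#define TOY_ALLOC_CMA 0x1u

/* Simplified stand-in for the per-zone counters the check reads. */
struct toy_zone {
	unsigned long free_pages;	/* all free pages, CMA included */
	unsigned long free_cma_pages;	/* free pages inside CMA pageblocks */
	unsigned long low_wmark;	/* low watermark, in pages */
};

/* Assumed to mirror compact_gap(): room for twice the requested block. */
static unsigned long toy_compact_gap(unsigned int order)
{
	return 2UL << order;
}

/*
 * Order-0 watermark check. Pages the caller cannot use are subtracted
 * before comparing against the mark, so a caller without TOY_ALLOC_CMA
 * does not get credit for free CMA pages.
 */
static bool toy_watermark_ok(const struct toy_zone *z, unsigned long mark,
			     unsigned int alloc_flags)
{
	unsigned long usable = z->free_pages;

	/* Free CMA pages are a subset of all free pages. */
	assert(z->free_cma_pages <= z->free_pages);

	if (!(alloc_flags & TOY_ALLOC_CMA))
		usable -= z->free_cma_pages;

	return usable > mark;
}
```

With 4 KiB pages, 16GB of free CMA is 4194304 pages. If only 1000 non-CMA pages remain free and the low watermark is taken to be 16384 pages (an invented value), an order-9 request gives a mark of 16384 + 1024 = 17408 pages. The check passes when CMA pages are counted and fails when they are subtracted, which is the difference between the existing __compaction_suitable() behaviour described in the commit message and the added check.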

On 2025/1/15 17:56, Vlastimil Babka wrote:
> On 1/15/25 09:31, yangge1116@126.com wrote:
>> After this fix, it only takes a few tens of seconds to start a 32GB
>> virtual machine with device passthrough functionality.
>
> So did you verify it works?

After multiple tests, it has been confirmed to work properly. Thank you.

> I just realized there might be still cases it
> won't help. There might be enough free order-0 pages in the non-CMA
> pageblocks (so the additional check will not stop us) but fragmented and
> impossible to compact due to unmovable pages. Then we won't avoid your
> issue, right?
>

The pages that are pinned are mostly Transparent Huge Pages (THP).
Therefore, it is not common to find free order-0 pages in non-CMA
pageblocks that are fragmented and impossible to compact due to the
presence of unmovable pages. This patch can resolve my issue.

diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..9032bb6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2490,7 +2490,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
  */
 static enum compact_result
 compaction_suit_allocation_order(struct zone *zone, unsigned int order,
-				 int highest_zoneidx, unsigned int alloc_flags)
+				 int highest_zoneidx, unsigned int alloc_flags,
+				 bool async)
 {
 	unsigned long watermark;
 
@@ -2499,6 +2500,25 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
 			      alloc_flags))
 		return COMPACT_SUCCESS;
 
+	/*
+	 * For costly orders, during the async memory compaction process, use the
+	 * actual allocation context to determine the watermarks. There's some risk
+	 * that in some different scenario the compaction could in fact migrate
+	 * pages from the exhausted non-CMA part of the zone to the CMA part and
+	 * succeed, and we'll skip it instead. But only __GFP_NORETRY allocations
+	 * should be affected in the immediate "goto nopage" when compaction is
+	 * skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't
+	 * fail without trying to compact-migrate the non-CMA pageblocks into CMA
+	 * pageblocks first, so it should be fine.
+	 */
+	if (order > PAGE_ALLOC_COSTLY_ORDER && async) {
+		watermark = low_wmark_pages(zone) + compact_gap(order);
+		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
+					   alloc_flags & ALLOC_CMA,
+					   zone_page_state(zone, NR_FREE_PAGES)))
+			return COMPACT_SKIPPED;
+	}
+
 	if (!compaction_suitable(zone, order, highest_zoneidx))
 		return COMPACT_SKIPPED;
 
@@ -2534,7 +2554,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
 	if (!is_via_compact_memory(cc->order)) {
 		ret = compaction_suit_allocation_order(cc->zone, cc->order,
 						       cc->highest_zoneidx,
-						       cc->alloc_flags);
+						       cc->alloc_flags,
+						       cc->mode == MIGRATE_ASYNC);
 		if (ret != COMPACT_CONTINUE)
 			return ret;
 	}
@@ -3037,7 +3058,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 
 		ret = compaction_suit_allocation_order(zone,
 				pgdat->kcompactd_max_order,
-				highest_zoneidx, ALLOC_WMARK_MIN);
+				highest_zoneidx, ALLOC_WMARK_MIN,
+				0);
 		if (ret == COMPACT_CONTINUE)
 			return true;
 	}
@@ -3078,7 +3100,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 			continue;
 
 		ret = compaction_suit_allocation_order(zone,
-				cc.order, zoneid, ALLOC_WMARK_MIN);
+				cc.order, zoneid, ALLOC_WMARK_MIN,
+				cc.mode == MIGRATE_ASYNC);
 		if (ret != COMPACT_CONTINUE)
 			continue;
 
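
The gating that the review suggests (run the extra check only for costly-order, async requests that lack ALLOC_CMA, and evaluate it with CMA pages excluded) can be modelled in userspace as follows. This is a simplified sketch under stated assumptions, not the posted patch and not a later revision of it: toy_zone, toy_order0_ok(), toy_suit_allocation_order(), TOY_ALLOC_CMA and the toy_result values are hypothetical names, the early COMPACT_SUCCESS path is left out, and the second check assumes that compaction_suitable() always counts CMA pages as usable.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ALLOC_CMA		0x1u	/* hypothetical flag bit */
#define TOY_COSTLY_ORDER	3	/* value of PAGE_ALLOC_COSTLY_ORDER */

enum toy_result { TOY_SKIPPED, TOY_CONTINUE };

struct toy_zone {
	unsigned long free_pages;	/* all free pages, CMA included */
	unsigned long free_cma_pages;	/* free pages inside CMA pageblocks */
	unsigned long low_wmark;	/* low watermark, in pages */
};

/* Order-0 check; free CMA pages count only when cma_usable is true. */
static bool toy_order0_ok(const struct toy_zone *z, unsigned long mark,
			  bool cma_usable)
{
	unsigned long usable = z->free_pages;

	assert(z->free_cma_pages <= z->free_pages);
	if (!cma_usable)
		usable -= z->free_cma_pages;
	return usable > mark;
}

static enum toy_result toy_suit_allocation_order(const struct toy_zone *z,
						 unsigned int order,
						 unsigned int alloc_flags,
						 bool async)
{
	/* low watermark plus the assumed compact_gap() of 2 << order */
	unsigned long mark = z->low_wmark + (2UL << order);

	/*
	 * Extra check, limited to costly-order async compaction on behalf
	 * of a caller that cannot use CMA: skip if the non-CMA part alone
	 * is below the mark.
	 */
	if (order > TOY_COSTLY_ORDER && async &&
	    !(alloc_flags & TOY_ALLOC_CMA)) {
		if (!toy_order0_ok(z, mark, false))
			return TOY_SKIPPED;
	}

	/* Generic suitability check, which counts CMA pages as usable. */
	if (!toy_order0_ok(z, mark, true))
		return TOY_SKIPPED;

	return TOY_CONTINUE;
}
```

For a zone with 16GB of free CMA and an exhausted non-CMA part, only the order-9, async, non-CMA combination is skipped in this model. Sync compaction, CMA-capable callers and non-costly orders still proceed, which matches the commit message's argument that only the immediate "goto nopage" case is affected.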