mm: compaction: use the actual allocation context to determine the watermarks for costly order during async memory compaction

Message ID 1736929894-19228-1-git-send-email-yangge1116@126.com (mailing list archive)
State New
Series mm: compaction: use the actual allocation context to determine the watermarks for costly order during async memory compaction

Commit Message

Ge Yang Jan. 15, 2025, 8:31 a.m. UTC
From: yangge <yangge1116@126.com>

There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
of memory. I have configured 16GB of CMA memory on each NUMA node,
and starting a 32GB virtual machine with device passthrough is
extremely slow, taking almost an hour.

Long-term GUP cannot allocate memory from the CMA area, so a maximum of
16GB of non-CMA memory on a NUMA node can be used as virtual machine
memory. There is 16GB of free CMA memory on a NUMA node, which is
sufficient to pass the order-0 watermark check, causing the
__compaction_suitable() function to consistently return true.

For costly allocations, if the __compaction_suitable() function always
returns true, it causes the __alloc_pages_slowpath() function to fail
to exit at the appropriate point. This prevents timely fallback to
allocating memory on other nodes, ultimately resulting in excessively
long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
        goto nopage; // should exit __alloc_pages_slowpath() from here
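
A minimal userspace sketch of that exit condition, assuming (as the next
paragraph states) that the immediate "goto nopage" only concerns costly-order
requests carrying __GFP_NORETRY. The enum, the MODEL_ prefix and
slowpath_gives_up_early() are hypothetical names for illustration, not kernel
symbols:

```c
#include <stdbool.h>

/* Subset of the kernel's enum compact_result, redeclared for this model. */
enum compact_result_model {
	MODEL_COMPACT_SKIPPED,
	MODEL_COMPACT_DEFERRED,
	MODEL_COMPACT_CONTINUE,
};

/* Same value as the kernel's PAGE_ALLOC_COSTLY_ORDER. */
#define MODEL_COSTLY_ORDER 3

/*
 * Returns true when the slowpath would stop retrying on this node and let
 * the caller fall back elsewhere. If compaction never reports SKIPPED or
 * DEFERRED, this never returns true and the allocation keeps retrying.
 */
static bool slowpath_gives_up_early(unsigned int order, bool gfp_noretry,
				    enum compact_result_model result)
{
	bool costly = order > MODEL_COSTLY_ORDER;

	if (!costly || !gfp_noretry)
		return false;

	return result == MODEL_COMPACT_SKIPPED ||
	       result == MODEL_COMPACT_DEFERRED;
}
```

With an order-9 (THP-sized) request, the model gives up only when compaction
reports SKIPPED or DEFERRED, which is the result the patch aims to produce
once the non-CMA memory is exhausted.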

We could use the real unmovable allocation context to have
__zone_watermark_unusable_free() subtract CMA pages, and thus we won't
pass the order-0 check anymore once the non-CMA part is exhausted. There
is some risk that in some different scenario the compaction could in
fact migrate pages from the exhausted non-CMA part of the zone to the
CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY
allocations should be affected in the immediate "goto nopage" when
compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY
anyway and won't fail without trying to compact-migrate the non-CMA
pageblocks into CMA pageblocks first, so it should be fine.
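
The effect of subtracting CMA pages can be illustrated with a simplified
userspace model of the order-0 check. This is only a sketch: struct
zone_model and order0_watermark_ok() are hypothetical, lowmem reserves and
other reserves are ignored, and the low watermark value used in the note
below is made up. Only the "2UL << order" gap mirrors compact_gap().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a zone's free-page counters. */
struct zone_model {
	unsigned long free_pages;     /* all free pages, CMA included */
	unsigned long free_cma_pages; /* free pages inside CMA pageblocks */
	unsigned long low_wmark;      /* low watermark, in pages */
};

/* Mirrors compact_gap(): twice the size of the request, in pages. */
static unsigned long compact_gap_model(unsigned int order)
{
	return 2UL << order;
}

/*
 * Order-0 check against low watermark + compaction gap. When the
 * allocation may not use CMA, free CMA pages count as unusable.
 */
static bool order0_watermark_ok(const struct zone_model *z,
				unsigned int order, bool may_use_cma)
{
	unsigned long usable = z->free_pages;

	assert(z->free_cma_pages <= z->free_pages);
	if (!may_use_cma)
		usable -= z->free_cma_pages;

	return usable > z->low_wmark + compact_gap_model(order);
}
```

For a zone with 16GB of free CMA memory (4194304 pages of 4KB) and only
1000 free non-CMA pages, the check passes when CMA pages are counted and
fails when they are subtracted, assuming a low watermark of 16384 pages.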

After this fix, it only takes a few tens of seconds to start a 32GB
virtual machine with device passthrough functionality.

Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
Signed-off-by: yangge <yangge1116@126.com>
---
 mm/compaction.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

Comments

Vlastimil Babka Jan. 15, 2025, 9:56 a.m. UTC | #1
On 1/15/25 09:31, yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
> 
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
> 
> Long-term GUP cannot allocate memory from the CMA area, so a maximum of
> 16GB of non-CMA memory on a NUMA node can be used as virtual machine
> memory. There is 16GB of free CMA memory on a NUMA node, which is
> sufficient to pass the order-0 watermark check, causing the
> __compaction_suitable() function to consistently return true.
> 
> For costly allocations, if the __compaction_suitable() function always
> returns true, it causes the __alloc_pages_slowpath() function to fail
> to exit at the appropriate point. This prevents timely fallback to
> allocating memory on other nodes, ultimately resulting in excessively
> long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>     if (compact_result == COMPACT_SKIPPED ||
>         compact_result == COMPACT_DEFERRED)
>         goto nopage; // should exit __alloc_pages_slowpath() from here
> 
> We could use the real unmovable allocation context to have
> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't
> pass the order-0 check anymore once the non-CMA part is exhausted. There
> is some risk that in some different scenario the compaction could in
> fact migrate pages from the exhausted non-CMA part of the zone to the
> CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY
> allocations should be affected in the immediate "goto nopage" when
> compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY
> anyway and won't fail without trying to compact-migrate the non-CMA
> pageblocks into CMA pageblocks first, so it should be fine.
> 
> After this fix, it only takes a few tens of seconds to start a 32GB
> virtual machine with device passthrough functionality.

So did you verify it works? I just realized there might still be cases where
it won't help. There might be enough free order-0 pages in the non-CMA
pageblocks (so the additional check will not stop us) but fragmented and
impossible to compact due to unmovable pages. Then we won't avoid your
issue, right?

> Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
> Signed-off-by: yangge <yangge1116@126.com>

In case it really helps reliably:

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Some nits below:

> ---
>  mm/compaction.c | 31 +++++++++++++++++++++++++++----
>  1 file changed, 27 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 07bd227..9032bb6 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2490,7 +2490,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>   */
>  static enum compact_result
>  compaction_suit_allocation_order(struct zone *zone, unsigned int order,
> -				 int highest_zoneidx, unsigned int alloc_flags)
> +				 int highest_zoneidx, unsigned int alloc_flags,
> +				 bool async)
>  {
>  	unsigned long watermark;
>  
> @@ -2499,6 +2500,25 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>  			      alloc_flags))
>  		return COMPACT_SUCCESS;
>  
> +	/*
> +	 * For costly orders, during the async memory compaction process, use the
> +	 * actual allocation context to determine the watermarks. There's some risk
> +	 * that in some different scenario the compaction could in fact migrate
> +	 * pages from the exhausted non-CMA part of the zone to the CMA part and
> +	 * succeed, and we'll skip it instead. But only __GFP_NORETRY allocations
> +	 * should be affected in the immediate "goto nopage" when compaction is
> +	 * skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't
> +	 * fail without trying to compact-migrate the non-CMA pageblocks into CMA
> +	 * pageblocks first, so it should be fine.

I think it explains more about why not to do this than why to do it. How
about:

For unmovable allocations (without ALLOC_CMA), check if there is enough free
memory in the non-CMA pageblocks. Otherwise compaction could form the
high-order page in CMA pageblocks, which would not help the allocation to
succeed. However, limit the check to costly order async compaction (such as
opportunistic THP attempts) because there is the possibility that compaction
would migrate pages from non-CMA to CMA pageblock.

> +	 */
> +	if (order > PAGE_ALLOC_COSTLY_ORDER && async) {

We could also check for !(alloc_flags & ALLOC_CMA) here to avoid the
watermark check in the normal THP allocation case (not from pinned gup),
because then it just repeats the watermark check that was done above.

> +		watermark = low_wmark_pages(zone) + compact_gap(order);
> +		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
> +					   alloc_flags & ALLOC_CMA,

And then here we can just pass 0.

> +					   zone_page_state(zone, NR_FREE_PAGES)))
> +			return COMPACT_SKIPPED;
> +	}
> +
>  	if (!compaction_suitable(zone, order, highest_zoneidx))
>  		return COMPACT_SKIPPED;
>  
> @@ -2534,7 +2554,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>  	if (!is_via_compact_memory(cc->order)) {
>  		ret = compaction_suit_allocation_order(cc->zone, cc->order,
>  						       cc->highest_zoneidx,
> -						       cc->alloc_flags);
> +						       cc->alloc_flags,
> +						       cc->mode == MIGRATE_ASYNC);
>  		if (ret != COMPACT_CONTINUE)
>  			return ret;
>  	}
> @@ -3037,7 +3058,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
>  
>  		ret = compaction_suit_allocation_order(zone,
>  				pgdat->kcompactd_max_order,
> -				highest_zoneidx, ALLOC_WMARK_MIN);
> +				highest_zoneidx, ALLOC_WMARK_MIN,
> +				0);

It's bool, so false instead of 0.

>  		if (ret == COMPACT_CONTINUE)
>  			return true;
>  	}
> @@ -3078,7 +3100,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>  			continue;
>  
>  		ret = compaction_suit_allocation_order(zone,
> -				cc.order, zoneid, ALLOC_WMARK_MIN);
> +				cc.order, zoneid, ALLOC_WMARK_MIN,
> +				cc.mode == MIGRATE_ASYNC);

We could also just pass false here as kcompactd uses MIGRATE_SYNC_LIGHT and
has no real alloc_context.

>  		if (ret != COMPACT_CONTINUE)
>  			continue;
>
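
Taken together, the suggestions above for the new check (gate it on
!(alloc_flags & ALLOC_CMA), then pass 0 as the flags so that CMA pages are
always subtracted) can be modeled in userspace roughly as follows. This is a
sketch under assumptions, not the revised patch: struct zone_counts,
extra_async_check() and the MODEL_ names are hypothetical, the bit value
chosen for MODEL_ALLOC_CMA is an assumption, and lowmem reserves are ignored.

```c
#include <stdbool.h>

#define MODEL_ALLOC_CMA    0x80u /* stand-in for ALLOC_CMA; bit value assumed */
#define MODEL_COSTLY_ORDER 3     /* same value as PAGE_ALLOC_COSTLY_ORDER */

enum model_result { MODEL_SKIPPED, MODEL_CONTINUE };

/* Hypothetical, simplified per-zone counters, all in pages. */
struct zone_counts {
	unsigned long free_pages;     /* all free pages, CMA included */
	unsigned long free_cma_pages; /* free pages inside CMA pageblocks */
	unsigned long low_wmark;      /* low watermark */
};

static enum model_result extra_async_check(const struct zone_counts *z,
					   unsigned int order,
					   unsigned int alloc_flags,
					   bool async)
{
	/* Only costly-order async compaction for non-CMA allocations. */
	if (order > MODEL_COSTLY_ORDER && async &&
	    !(alloc_flags & MODEL_ALLOC_CMA)) {
		unsigned long watermark = z->low_wmark + (2UL << order);
		unsigned long non_cma_free = 0;

		/* Equivalent of passing 0 as the flags: CMA pages do not count. */
		if (z->free_pages > z->free_cma_pages)
			non_cma_free = z->free_pages - z->free_cma_pages;

		if (non_cma_free <= watermark)
			return MODEL_SKIPPED;
	}

	return MODEL_CONTINUE;
}
```

In this model an allocation that may use CMA never enters the extra check,
which reflects the point that it would only repeat the watermark check
already done earlier in the function.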
Ge Yang Jan. 16, 2025, 1:33 a.m. UTC | #2
On 2025/1/15 17:56, Vlastimil Babka wrote:
> On 1/15/25 09:31, yangge1116@126.com wrote:
>> From: yangge <yangge1116@126.com>
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> Long-term GUP cannot allocate memory from the CMA area, so a maximum of
>> 16GB of non-CMA memory on a NUMA node can be used as virtual machine
>> memory. There is 16GB of free CMA memory on a NUMA node, which is
>> sufficient to pass the order-0 watermark check, causing the
>> __compaction_suitable() function to consistently return true.
>>
>> For costly allocations, if the __compaction_suitable() function always
>> returns true, it causes the __alloc_pages_slowpath() function to fail
>> to exit at the appropriate point. This prevents timely fallback to
>> allocating memory on other nodes, ultimately resulting in excessively
>> long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>      if (compact_result == COMPACT_SKIPPED ||
>>          compact_result == COMPACT_DEFERRED)
>>          goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> We could use the real unmovable allocation context to have
>> __zone_watermark_unusable_free() subtract CMA pages, and thus we won't
>> pass the order-0 check anymore once the non-CMA part is exhausted. There
>> is some risk that in some different scenario the compaction could in
>> fact migrate pages from the exhausted non-CMA part of the zone to the
>> CMA part and succeed, and we'll skip it instead. But only __GFP_NORETRY
>> allocations should be affected in the immediate "goto nopage" when
>> compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY
>> anyway and won't fail without trying to compact-migrate the non-CMA
>> pageblocks into CMA pageblocks first, so it should be fine.
>>
>> After this fix, it only takes a few tens of seconds to start a 32GB
>> virtual machine with device passthrough functionality.
> 
> So did you verify it works?

After multiple tests, it has been confirmed to work properly. Thank you.

> I just realized there might still be cases where it
> won't help. There might be enough free order-0 pages in the non-CMA
> pageblocks (so the additional check will not stop us) but fragmented and
> impossible to compact due to unmovable pages. Then we won't avoid your
> issue, right?

The pages that are pinned are mostly Transparent Huge Pages (THP).
Therefore, it is not common to find free order-0 pages in non-CMA
pageblocks that are fragmented and impossible to compact due to the
presence of unmovable pages. This patch can resolve my issue.

>> Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
>> Signed-off-by: yangge <yangge1116@126.com>
> 
> In case it really helps reliably:
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Some nits below:
> 
>> ---
>>   mm/compaction.c | 31 +++++++++++++++++++++++++++----
>>   1 file changed, 27 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 07bd227..9032bb6 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -2490,7 +2490,8 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>>    */
>>   static enum compact_result
>>   compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>> -				 int highest_zoneidx, unsigned int alloc_flags)
>> +				 int highest_zoneidx, unsigned int alloc_flags,
>> +				 bool async)
>>   {
>>   	unsigned long watermark;
>>   
>> @@ -2499,6 +2500,25 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
>>   			      alloc_flags))
>>   		return COMPACT_SUCCESS;
>>   
>> +	/*
>> +	 * For costly orders, during the async memory compaction process, use the
>> +	 * actual allocation context to determine the watermarks. There's some risk
>> +	 * that in some different scenario the compaction could in fact migrate
>> +	 * pages from the exhausted non-CMA part of the zone to the CMA part and
>> +	 * succeed, and we'll skip it instead. But only __GFP_NORETRY allocations
>> +	 * should be affected in the immediate "goto nopage" when compaction is
>> +	 * skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't
>> +	 * fail without trying to compact-migrate the non-CMA pageblocks into CMA
>> +	 * pageblocks first, so it should be fine.
> 
> I think it explains more about why not to do this than why to do it. How
> about:
> 
> For unmovable allocations (without ALLOC_CMA), check if there is enough free
> memory in the non-CMA pageblocks. Otherwise compaction could form the
> high-order page in CMA pageblocks, which would not help the allocation to
> succeed. However, limit the check to costly order async compaction (such as
> opportunistic THP attempts) because there is the possibility that compaction
> would migrate pages from non-CMA to CMA pageblock.
> 
>> +	 */
>> +	if (order > PAGE_ALLOC_COSTLY_ORDER && async) {
> 
> We could also check for !(alloc_flags & ALLOC_CMA) here to avoid the
> watermark check in the normal THP allocation case (not from pinned gup),
> because then it just repeats the watermark check that was done above.
> 
>> +		watermark = low_wmark_pages(zone) + compact_gap(order);
>> +		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
>> +					   alloc_flags & ALLOC_CMA,
> 
> And then here we can just pass 0.
> 
>> +					   zone_page_state(zone, NR_FREE_PAGES)))
>> +			return COMPACT_SKIPPED;
>> +	}
>> +
>>   	if (!compaction_suitable(zone, order, highest_zoneidx))
>>   		return COMPACT_SKIPPED;
>>   
>> @@ -2534,7 +2554,8 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>>   	if (!is_via_compact_memory(cc->order)) {
>>   		ret = compaction_suit_allocation_order(cc->zone, cc->order,
>>   						       cc->highest_zoneidx,
>> -						       cc->alloc_flags);
>> +						       cc->alloc_flags,
>> +						       cc->mode == MIGRATE_ASYNC);
>>   		if (ret != COMPACT_CONTINUE)
>>   			return ret;
>>   	}
>> @@ -3037,7 +3058,8 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
>>   
>>   		ret = compaction_suit_allocation_order(zone,
>>   				pgdat->kcompactd_max_order,
>> -				highest_zoneidx, ALLOC_WMARK_MIN);
>> +				highest_zoneidx, ALLOC_WMARK_MIN,
>> +				0);
> 
> It's bool, so false instead of 0.
> 
>>   		if (ret == COMPACT_CONTINUE)
>>   			return true;
>>   	}
>> @@ -3078,7 +3100,8 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>>   			continue;
>>   
>>   		ret = compaction_suit_allocation_order(zone,
>> -				cc.order, zoneid, ALLOC_WMARK_MIN);
>> +				cc.order, zoneid, ALLOC_WMARK_MIN,
>> +				cc.mode == MIGRATE_ASYNC);
> 
> We could also just pass false here as kcompactd uses MIGRATE_SYNC_LIGHT and
> has no real alloc_context.
> 
>>   		if (ret != COMPACT_CONTINUE)
>>   			continue;
>>

Patch

diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..9032bb6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2490,7 +2490,8 @@  bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
  */
 static enum compact_result
 compaction_suit_allocation_order(struct zone *zone, unsigned int order,
-				 int highest_zoneidx, unsigned int alloc_flags)
+				 int highest_zoneidx, unsigned int alloc_flags,
+				 bool async)
 {
 	unsigned long watermark;
 
@@ -2499,6 +2500,25 @@  compaction_suit_allocation_order(struct zone *zone, unsigned int order,
 			      alloc_flags))
 		return COMPACT_SUCCESS;
 
+	/*
+	 * For costly orders, during the async memory compaction process, use the
+	 * actual allocation context to determine the watermarks. There's some risk
+	 * that in some different scenario the compaction could in fact migrate
+	 * pages from the exhausted non-CMA part of the zone to the CMA part and
+	 * succeed, and we'll skip it instead. But only __GFP_NORETRY allocations
+	 * should be affected in the immediate "goto nopage" when compaction is
+	 * skipped, others will attempt with DEF_COMPACT_PRIORITY anyway and won't
+	 * fail without trying to compact-migrate the non-CMA pageblocks into CMA
+	 * pageblocks first, so it should be fine.
+	 */
+	if (order > PAGE_ALLOC_COSTLY_ORDER && async) {
+		watermark = low_wmark_pages(zone) + compact_gap(order);
+		if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
+					   alloc_flags & ALLOC_CMA,
+					   zone_page_state(zone, NR_FREE_PAGES)))
+			return COMPACT_SKIPPED;
+	}
+
 	if (!compaction_suitable(zone, order, highest_zoneidx))
 		return COMPACT_SKIPPED;
 
@@ -2534,7 +2554,8 @@  compact_zone(struct compact_control *cc, struct capture_control *capc)
 	if (!is_via_compact_memory(cc->order)) {
 		ret = compaction_suit_allocation_order(cc->zone, cc->order,
 						       cc->highest_zoneidx,
-						       cc->alloc_flags);
+						       cc->alloc_flags,
+						       cc->mode == MIGRATE_ASYNC);
 		if (ret != COMPACT_CONTINUE)
 			return ret;
 	}
@@ -3037,7 +3058,8 @@  static bool kcompactd_node_suitable(pg_data_t *pgdat)
 
 		ret = compaction_suit_allocation_order(zone,
 				pgdat->kcompactd_max_order,
-				highest_zoneidx, ALLOC_WMARK_MIN);
+				highest_zoneidx, ALLOC_WMARK_MIN,
+				0);
 		if (ret == COMPACT_CONTINUE)
 			return true;
 	}
@@ -3078,7 +3100,8 @@  static void kcompactd_do_work(pg_data_t *pgdat)
 			continue;
 
 		ret = compaction_suit_allocation_order(zone,
-				cc.order, zoneid, ALLOC_WMARK_MIN);
+				cc.order, zoneid, ALLOC_WMARK_MIN,
+				cc.mode == MIGRATE_ASYNC);
 		if (ret != COMPACT_CONTINUE)
 			continue;