
[V7] mm, compaction: don't use ALLOC_CMA for unmovable allocations

Message ID 1734436004-1212-1-git-send-email-yangge1116@126.com (mailing list archive)
State New
Series [V7] mm, compaction: don't use ALLOC_CMA for unmovable allocations

Commit Message

Ge Yang Dec. 17, 2024, 11:46 a.m. UTC
From: yangge <yangge1116@126.com>

Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
in __compaction_suitable()") allowed compaction to proceed when the
free pages required for compaction reside in CMA pageblocks, it is
possible for __compaction_suitable() to always return true, which is
not acceptable in some cases.

There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
of memory. I have configured 16GB of CMA memory on each NUMA node,
and starting a 32GB virtual machine with device passthrough is
extremely slow, taking almost an hour.

During the start-up of the virtual machine, it calls
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
Long-term GUP cannot allocate memory from the CMA area, so at most
16GB of non-CMA memory on a NUMA node can be used as virtual machine
memory. Since there is 16GB of free CMA memory on the NUMA node, the
order-0 watermark for compaction is always met, so
__compaction_suitable() always returns true, even though the node
cannot allocate non-CMA memory for the virtual machine.

For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() can't exit at the appropriate
place, resulting in excessively long virtual machine startup times.
Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
        goto nopage; // should exit __alloc_pages_slowpath() from here
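
The following sketch illustrates the free page accounting behind that
order-0 watermark check. It paraphrases __zone_watermark_ok() rather
than quoting it; the helper name and the ALLOC_CMA value are only
illustrative:

#include <stdbool.h>

#define ALLOC_CMA 0x80	/* illustrative; the real flag is in mm/internal.h */

/*
 * Without ALLOC_CMA, free CMA pages are treated as unusable and
 * subtracted before the comparison; with ALLOC_CMA they count as
 * usable migration targets.
 */
static bool watermark_ok_sketch(unsigned long free_pages,
				unsigned long free_cma_pages,
				unsigned long watermark,
				unsigned int alloc_flags)
{
	unsigned long usable = free_pages;

	if (!(alloc_flags & ALLOC_CMA))
		usable -= free_cma_pages;

	return usable > watermark;
}

With the numbers above (16GB of free CMA, nearly all non-CMA memory
pinned, a watermark of at most a few hundred megabytes), the check
passes with ALLOC_CMA and fails without it, which is why
__compaction_suitable() keeps reporting the zone as suitable.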

Other unmovable allocations, like dma_buf, which can be large in a
Linux system, are also unable to allocate memory from CMA, and these
allocations suffer from the same problem described above. In order
to fall back to a remote node quickly, remove ALLOC_CMA from both
__compaction_suitable() and __isolate_free_page() for unmovable
allocations. After this fix, starting a 32GB virtual machine with
device passthrough takes only a few seconds.

Fixes: 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()")
Cc: <stable@vger.kernel.org>
Signed-off-by: yangge <yangge1116@126.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---

V7:
-- fix the changelog and code documentation

V6:
-- update cc->alloc_flags to keep the original logic

V5:
- add 'alloc_flags' parameter for __isolate_free_page()
- remove the 'use_cma' variable

V4:
- enrich the commit log description

V3:
- fix build errors
- add ALLOC_CMA both in should_continue_reclaim() and compaction_ready()

V2:
- use 'cc->alloc_flags' to determine whether 'ALLOC_CMA' is needed
- enrich the commit log description

 include/linux/compaction.h |  6 ++++--
 mm/compaction.c            | 26 +++++++++++++++-----------
 mm/internal.h              |  3 ++-
 mm/page_alloc.c            |  7 +++++--
 mm/page_isolation.c        |  3 ++-
 mm/page_reporting.c        |  2 +-
 mm/vmscan.c                |  4 ++--
 7 files changed, 31 insertions(+), 20 deletions(-)

Comments

Johannes Weiner Dec. 17, 2024, 3:55 p.m. UTC | #1
Hello Yangge,

On Tue, Dec 17, 2024 at 07:46:44PM +0800, yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
> 
> Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
> in __compaction_suitable()") allow compaction to proceed when free
> pages required for compaction reside in the CMA pageblocks, it's
> possible that __compaction_suitable() always returns true, and in
> some cases, it's not acceptable.
> 
> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> of memory. I have configured 16GB of CMA memory on each NUMA node,
> and starting a 32GB virtual machine with device passthrough is
> extremely slow, taking almost an hour.
> 
> During the start-up of the virtual machine, it will call
> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> Long term GUP cannot allocate memory from CMA area, so a maximum
> of 16 GB of no-CMA memory on a NUMA node can be used as virtual
> machine memory. Since there is 16G of free CMA memory on the NUMA
> node, watermark for order-0 always be met for compaction, so
> __compaction_suitable() always returns true, even if the node is
> unable to allocate non-CMA memory for the virtual machine.
> 
> For costly allocations, because __compaction_suitable() always
> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> place, resulting in excessively long virtual machine startup times.
> Call trace:
> __alloc_pages_slowpath
>     if (compact_result == COMPACT_SKIPPED ||
>         compact_result == COMPACT_DEFERRED)
>         goto nopage; // should exit __alloc_pages_slowpath() from here
> 
> Other unmovable alloctions, like dma_buf, which can be large in a
> Linux system, are also unable to allocate memory from CMA, and these
> allocations suffer from the same problems described above. In order
> to quickly fall back to remote node, we should remove ALLOC_CMA both
> in __compaction_suitable() and __isolate_free_page() for unmovable
> alloctions. After this fix, starting a 32GB virtual machine with
> device passthrough takes only a few seconds.

The symptom is obviously bad, but I don't understand this fix.

The reason we do ALLOC_CMA is that, even for unmovable allocations,
you can create space in non-CMA space by moving migratable pages over
to CMA space. This is not a property we want to lose. But I also don't
see how it would interfere with your scenario.

There is the compaction_suitable() check in should_compact_retry(),
but that only applies when COMPACT_SKIPPED. IOW, it should only happen
when compaction_suitable() just now returned false. IOW, a race
condition. Which is why it's also not subject to limited retries.

What's the exact condition that traps the allocator inside the loop?
Ge Yang Dec. 18, 2024, 2:15 a.m. UTC | #2
On 2024/12/17 23:55, Johannes Weiner wrote:
> Hello Yangge,
> 
> On Tue, Dec 17, 2024 at 07:46:44PM +0800, yangge1116@126.com wrote:
>> From: yangge <yangge1116@126.com>
>>
>> Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
>> in __compaction_suitable()") allow compaction to proceed when free
>> pages required for compaction reside in the CMA pageblocks, it's
>> possible that __compaction_suitable() always returns true, and in
>> some cases, it's not acceptable.
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> During the start-up of the virtual machine, it will call
>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>> Long term GUP cannot allocate memory from CMA area, so a maximum
>> of 16 GB of no-CMA memory on a NUMA node can be used as virtual
>> machine memory. Since there is 16G of free CMA memory on the NUMA
>> node, watermark for order-0 always be met for compaction, so
>> __compaction_suitable() always returns true, even if the node is
>> unable to allocate non-CMA memory for the virtual machine.
>>
>> For costly allocations, because __compaction_suitable() always
>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>> place, resulting in excessively long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>      if (compact_result == COMPACT_SKIPPED ||
>>          compact_result == COMPACT_DEFERRED)
>>          goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> Other unmovable alloctions, like dma_buf, which can be large in a
>> Linux system, are also unable to allocate memory from CMA, and these
>> allocations suffer from the same problems described above. In order
>> to quickly fall back to remote node, we should remove ALLOC_CMA both
>> in __compaction_suitable() and __isolate_free_page() for unmovable
>> alloctions. After this fix, starting a 32GB virtual machine with
>> device passthrough takes only a few seconds.
> 
> The symptom is obviously bad, but I don't understand this fix.
> 
> The reason we do ALLOC_CMA is that, even for unmovable allocations,
> you can create space in non-CMA space by moving migratable pages over
> to CMA space. This is not a property we want to lose. But I also don't
> see how it would interfere with your scenario.

The __alloc_pages_slowpath() function is expected to exit at place 1,
but because __compaction_suitable() always returns true, it ends up
exiting at place 2 instead. This ultimately leads to a significantly
longer execution time for __alloc_pages_slowpath().

Call trace:
  __alloc_pages_slowpath
       if (compact_result == COMPACT_SKIPPED ||
          compact_result == COMPACT_DEFERRED)
           goto nopage; // place 1
       __alloc_pages_direct_reclaim() // Reclaim is very expensive
       __alloc_pages_direct_compact()
       if (gfp_mask & __GFP_NORETRY)
           goto nopage; // place 2

Every time memory allocation goes through the above slower process, it 
ultimately leads to significantly longer virtual machine startup times.
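
To put the cost in perspective (rough numbers; assuming 4KB base pages
and 2MB THPs, with the 32GB guest from the report):

   32GB guest / 2MB per THP  ~=  16384 order-9 allocation attempts.
   Once the local node's ~16GB of non-CMA memory has been pinned, every
   remaining attempt falls into the slow path above, paying for direct
   reclaim (and direct compaction) before bailing out at place 2 under
   __GFP_NORETRY.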

> 
> There is the compaction_suitable() check in should_compact_retry(),
> but that only applies when COMPACT_SKIPPED. IOW, it should only happen
> when compaction_suitable() just now returned false. IOW, a race
> condition. Which is why it's also not subject to limited retries.
> 
> What's the exact condition that traps the allocator inside the loop?
The should_compact_retry() function was not executed; the slowdown here
was mainly due to the execution of __alloc_pages_direct_reclaim().
Johannes Weiner Dec. 18, 2024, 3:29 a.m. UTC | #3
On Wed, Dec 18, 2024 at 10:15:06AM +0800, Ge Yang wrote:
> 
> 
> > On 2024/12/17 23:55, Johannes Weiner wrote:
> > Hello Yangge,
> > 
> > On Tue, Dec 17, 2024 at 07:46:44PM +0800, yangge1116@126.com wrote:
> >> From: yangge <yangge1116@126.com>
> >>
> >> Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
> >> in __compaction_suitable()") allow compaction to proceed when free
> >> pages required for compaction reside in the CMA pageblocks, it's
> >> possible that __compaction_suitable() always returns true, and in
> >> some cases, it's not acceptable.
> >>
> >> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
> >> of memory. I have configured 16GB of CMA memory on each NUMA node,
> >> and starting a 32GB virtual machine with device passthrough is
> >> extremely slow, taking almost an hour.
> >>
> >> During the start-up of the virtual machine, it will call
> >> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
> >> Long term GUP cannot allocate memory from CMA area, so a maximum
> >> of 16 GB of no-CMA memory on a NUMA node can be used as virtual
> >> machine memory. Since there is 16G of free CMA memory on the NUMA
> >> node, watermark for order-0 always be met for compaction, so
> >> __compaction_suitable() always returns true, even if the node is
> >> unable to allocate non-CMA memory for the virtual machine.
> >>
> >> For costly allocations, because __compaction_suitable() always
> >> returns true, __alloc_pages_slowpath() can't exit at the appropriate
> >> place, resulting in excessively long virtual machine startup times.
> >> Call trace:
> >> __alloc_pages_slowpath
> >>      if (compact_result == COMPACT_SKIPPED ||
> >>          compact_result == COMPACT_DEFERRED)
> >>          goto nopage; // should exit __alloc_pages_slowpath() from here
> >>
> >> Other unmovable alloctions, like dma_buf, which can be large in a
> >> Linux system, are also unable to allocate memory from CMA, and these
> >> allocations suffer from the same problems described above. In order
> >> to quickly fall back to remote node, we should remove ALLOC_CMA both
> >> in __compaction_suitable() and __isolate_free_page() for unmovable
> >> alloctions. After this fix, starting a 32GB virtual machine with
> >> device passthrough takes only a few seconds.
> > 
> > The symptom is obviously bad, but I don't understand this fix.
> > 
> > The reason we do ALLOC_CMA is that, even for unmovable allocations,
> > you can create space in non-CMA space by moving migratable pages over
> > to CMA space. This is not a property we want to lose. But I also don't
> > see how it would interfere with your scenario.
> 
> The __alloc_pages_slowpath() function was originally intended to exit at 
> place 1, but due to __compaction_suitable() always returning true, it 
> results in __alloc_pages_slowpath() exiting at place 2 instead. This 
> ultimately leads to a significantly longer execution time for 
> __alloc_pages_slowpath().
> 
> Call trace:
>   __alloc_pages_slowpath
>        if (compact_result == COMPACT_SKIPPED ||
>           compact_result == COMPACT_DEFERRED)
>            goto nopage; // place 1
>        __alloc_pages_direct_reclaim() // Reclaim is very expensive
>        __alloc_pages_direct_compact()
>        if (gfp_mask & __GFP_NORETRY)
>            goto nopage; // place 2
> 
> Every time memory allocation goes through the above slower process, it 
> ultimately leads to significantly longer virtual machine startup times.

I still don't follow. Why do you want the allocation to fail?

The changelog says this is in order to fall back quickly to other
nodes. But there is a full node walk in get_page_from_freelist()
before the allocator even engages reclaim. There is something missing
from the story still.

But regardless - surely you can see that we can't make the allocator
generally weaker on large requests just because they happen to be
optional in your specific case?

> > There is the compaction_suitable() check in should_compact_retry(),
> > but that only applies when COMPACT_SKIPPED. IOW, it should only happen
> > when compaction_suitable() just now returned false. IOW, a race
> > condition. Which is why it's also not subject to limited retries.
> > 
> > What's the exact condition that traps the allocator inside the loop?
> The should_compact_retry() function was not executed, and the slow here 
> was mainly due to the execution of __alloc_pages_direct_reclaim().

Ok.
Ge Yang Dec. 18, 2024, 3:56 a.m. UTC | #4
On 2024/12/18 11:29, Johannes Weiner wrote:
> On Wed, Dec 18, 2024 at 10:15:06AM +0800, Ge Yang wrote:
>>
>>
>> On 2024/12/17 23:55, Johannes Weiner wrote:
>>> Hello Yangge,
>>>
>>> On Tue, Dec 17, 2024 at 07:46:44PM +0800, yangge1116@126.com wrote:
>>>> From: yangge <yangge1116@126.com>
>>>>
>>>> Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
>>>> in __compaction_suitable()") allow compaction to proceed when free
>>>> pages required for compaction reside in the CMA pageblocks, it's
>>>> possible that __compaction_suitable() always returns true, and in
>>>> some cases, it's not acceptable.
>>>>
>>>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>>>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>>>> and starting a 32GB virtual machine with device passthrough is
>>>> extremely slow, taking almost an hour.
>>>>
>>>> During the start-up of the virtual machine, it will call
>>>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>>>> Long term GUP cannot allocate memory from CMA area, so a maximum
>>>> of 16 GB of no-CMA memory on a NUMA node can be used as virtual
>>>> machine memory. Since there is 16G of free CMA memory on the NUMA
>>>> node, watermark for order-0 always be met for compaction, so
>>>> __compaction_suitable() always returns true, even if the node is
>>>> unable to allocate non-CMA memory for the virtual machine.
>>>>
>>>> For costly allocations, because __compaction_suitable() always
>>>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>>>> place, resulting in excessively long virtual machine startup times.
>>>> Call trace:
>>>> __alloc_pages_slowpath
>>>>       if (compact_result == COMPACT_SKIPPED ||
>>>>           compact_result == COMPACT_DEFERRED)
>>>>           goto nopage; // should exit __alloc_pages_slowpath() from here
>>>>
>>>> Other unmovable alloctions, like dma_buf, which can be large in a
>>>> Linux system, are also unable to allocate memory from CMA, and these
>>>> allocations suffer from the same problems described above. In order
>>>> to quickly fall back to remote node, we should remove ALLOC_CMA both
>>>> in __compaction_suitable() and __isolate_free_page() for unmovable
>>>> alloctions. After this fix, starting a 32GB virtual machine with
>>>> device passthrough takes only a few seconds.
>>>
>>> The symptom is obviously bad, but I don't understand this fix.
>>>
>>> The reason we do ALLOC_CMA is that, even for unmovable allocations,
>>> you can create space in non-CMA space by moving migratable pages over
>>> to CMA space. This is not a property we want to lose. But I also don't
>>> see how it would interfere with your scenario.
>>
>> The __alloc_pages_slowpath() function was originally intended to exit at
>> place 1, but due to __compaction_suitable() always returning true, it
>> results in __alloc_pages_slowpath() exiting at place 2 instead. This
>> ultimately leads to a significantly longer execution time for
>> __alloc_pages_slowpath().
>>
>> Call trace:
>>    __alloc_pages_slowpath
>>         if (compact_result == COMPACT_SKIPPED ||
>>            compact_result == COMPACT_DEFERRED)
>>             goto nopage; // place 1
>>         __alloc_pages_direct_reclaim() // Reclaim is very expensive
>>         __alloc_pages_direct_compact()
>>         if (gfp_mask & __GFP_NORETRY)
>>             goto nopage; // place 2
>>
>> Every time memory allocation goes through the above slower process, it
>> ultimately leads to significantly longer virtual machine startup times.
> 
> I still don't follow. Why do you want the allocation to fail?
>
pin_user_pages_remote(..., FOLL_LONGTERM, ...) first attempts to allocate
THP only on the local node, and then falls back to remote NUMA nodes if
the local allocation fails. For details, see alloc_pages_mpol().

static struct page *alloc_pages_mpol()
{
     /* 1. Try to allocate the THP on the local node only. */
     page = __alloc_frozen_pages_noprof(__GFP_THISNODE, ...);

     if (page || !(gfp & __GFP_DIRECT_RECLAIM))
         return page;

     /* 2. Fall back to remote NUMA nodes. */
     page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask);
}
> The changelog says this is in order to fall back quickly to other
> nodes. But there is a full node walk in get_page_from_freelist()
> before the allocator even engages reclaim. There is something missing
> from the story still.
> 
> But regardless - surely you can see that we can't make the allocator
> generally weaker on large requests just because they happen to be
> optional in your specific case?

First, try to allocate THP on the local node as much as possible, and
then fall back to a remote node if the local allocation fails. This is
the default memory allocation strategy when starting virtual machines.

>>> There is the compaction_suitable() check in should_compact_retry(),
>>> but that only applies when COMPACT_SKIPPED. IOW, it should only happen
>>> when compaction_suitable() just now returned false. IOW, a race
>>> condition. Which is why it's also not subject to limited retries.
>>>
>>> What's the exact condition that traps the allocator inside the loop?
>> The should_compact_retry() function was not executed, and the slow here
>> was mainly due to the execution of __alloc_pages_direct_reclaim().
> 
> Ok.
Ge Yang Dec. 18, 2024, 4 a.m. UTC | #5
On 2024/12/18 11:29, Johannes Weiner wrote:
> On Wed, Dec 18, 2024 at 10:15:06AM +0800, Ge Yang wrote:
>>
>>
>> On 2024/12/17 23:55, Johannes Weiner wrote:
>>> Hello Yangge,
>>>
>>> On Tue, Dec 17, 2024 at 07:46:44PM +0800, yangge1116@126.com wrote:
>>>> From: yangge <yangge1116@126.com>
>>>>
>>>> Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
>>>> in __compaction_suitable()") allow compaction to proceed when free
>>>> pages required for compaction reside in the CMA pageblocks, it's
>>>> possible that __compaction_suitable() always returns true, and in
>>>> some cases, it's not acceptable.
>>>>
>>>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>>>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>>>> and starting a 32GB virtual machine with device passthrough is
>>>> extremely slow, taking almost an hour.
>>>>
>>>> During the start-up of the virtual machine, it will call
>>>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>>>> Long term GUP cannot allocate memory from CMA area, so a maximum
>>>> of 16 GB of no-CMA memory on a NUMA node can be used as virtual
>>>> machine memory. Since there is 16G of free CMA memory on the NUMA
>>>> node, watermark for order-0 always be met for compaction, so
>>>> __compaction_suitable() always returns true, even if the node is
>>>> unable to allocate non-CMA memory for the virtual machine.
>>>>
>>>> For costly allocations, because __compaction_suitable() always
>>>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>>>> place, resulting in excessively long virtual machine startup times.
>>>> Call trace:
>>>> __alloc_pages_slowpath
>>>>       if (compact_result == COMPACT_SKIPPED ||
>>>>           compact_result == COMPACT_DEFERRED)
>>>>           goto nopage; // should exit __alloc_pages_slowpath() from here
>>>>
>>>> Other unmovable alloctions, like dma_buf, which can be large in a
>>>> Linux system, are also unable to allocate memory from CMA, and these
>>>> allocations suffer from the same problems described above. In order
>>>> to quickly fall back to remote node, we should remove ALLOC_CMA both
>>>> in __compaction_suitable() and __isolate_free_page() for unmovable
>>>> alloctions. After this fix, starting a 32GB virtual machine with
>>>> device passthrough takes only a few seconds.
>>>
>>> The symptom is obviously bad, but I don't understand this fix.
>>>
>>> The reason we do ALLOC_CMA is that, even for unmovable allocations,
>>> you can create space in non-CMA space by moving migratable pages over
>>> to CMA space. This is not a property we want to lose. But I also don't
>>> see how it would interfere with your scenario.
>>
>> The __alloc_pages_slowpath() function was originally intended to exit at
>> place 1, but due to __compaction_suitable() always returning true, it
>> results in __alloc_pages_slowpath() exiting at place 2 instead. This
>> ultimately leads to a significantly longer execution time for
>> __alloc_pages_slowpath().
>>
>> Call trace:
>>    __alloc_pages_slowpath
>>         if (compact_result == COMPACT_SKIPPED ||
>>            compact_result == COMPACT_DEFERRED)
>>             goto nopage; // place 1
>>         __alloc_pages_direct_reclaim() // Reclaim is very expensive
>>         __alloc_pages_direct_compact()
>>         if (gfp_mask & __GFP_NORETRY)
>>             goto nopage; // place 2
>>
>> Every time memory allocation goes through the above slower process, it
>> ultimately leads to significantly longer virtual machine startup times.
> 
> I still don't follow. Why do you want the allocation to fail?
> 
pin_user_pages_remote(..., FOLL_LONGTERM, ...) first attempts to allocate
THP only on the local node, and then falls back to remote NUMA nodes if
the local allocation fails. For details, see alloc_pages_mpol().

static struct page *alloc_pages_mpol()
{
     /* 1. Try to allocate the THP on the local node only. */
     page = __alloc_frozen_pages_noprof(__GFP_THISNODE, ...);

     if (page || !(gfp & __GFP_DIRECT_RECLAIM))
         return page;

     /* 2. Fall back to remote NUMA nodes. */
     page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask);
}

> The changelog says this is in order to fall back quickly to other
> nodes. But there is a full node walk in get_page_from_freelist()
> before the allocator even engages reclaim. There is something missing
> from the story still.
> 
> But regardless - surely you can see that we can't make the allocator
> generally weaker on large requests just because they happen to be
> optional in your specific case?
> 

First, try to allocate THP on the local node as much as possible, and
then fall back to a remote node if the local allocation fails. This is
the default memory allocation strategy when starting virtual machines.

>>> There is the compaction_suitable() check in should_compact_retry(),
>>> but that only applies when COMPACT_SKIPPED. IOW, it should only happen
>>> when compaction_suitable() just now returned false. IOW, a race
>>> condition. Which is why it's also not subject to limited retries.
>>>
>>> What's the exact condition that traps the allocator inside the loop?
>> The should_compact_retry() function was not executed, and the slow here
>> was mainly due to the execution of __alloc_pages_direct_reclaim().
> 
> Ok.
Baolin Wang Dec. 18, 2024, 7:57 a.m. UTC | #6
On 2024/12/17 23:55, Johannes Weiner wrote:
> Hello Yangge,
> 
> On Tue, Dec 17, 2024 at 07:46:44PM +0800, yangge1116@126.com wrote:
>> From: yangge <yangge1116@126.com>
>>
>> Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
>> in __compaction_suitable()") allow compaction to proceed when free
>> pages required for compaction reside in the CMA pageblocks, it's
>> possible that __compaction_suitable() always returns true, and in
>> some cases, it's not acceptable.
>>
>> There are 4 NUMA nodes on my machine, and each NUMA node has 32GB
>> of memory. I have configured 16GB of CMA memory on each NUMA node,
>> and starting a 32GB virtual machine with device passthrough is
>> extremely slow, taking almost an hour.
>>
>> During the start-up of the virtual machine, it will call
>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate memory.
>> Long term GUP cannot allocate memory from CMA area, so a maximum
>> of 16 GB of no-CMA memory on a NUMA node can be used as virtual
>> machine memory. Since there is 16G of free CMA memory on the NUMA
>> node, watermark for order-0 always be met for compaction, so
>> __compaction_suitable() always returns true, even if the node is
>> unable to allocate non-CMA memory for the virtual machine.
>>
>> For costly allocations, because __compaction_suitable() always
>> returns true, __alloc_pages_slowpath() can't exit at the appropriate
>> place, resulting in excessively long virtual machine startup times.
>> Call trace:
>> __alloc_pages_slowpath
>>      if (compact_result == COMPACT_SKIPPED ||
>>          compact_result == COMPACT_DEFERRED)
>>          goto nopage; // should exit __alloc_pages_slowpath() from here
>>
>> Other unmovable alloctions, like dma_buf, which can be large in a
>> Linux system, are also unable to allocate memory from CMA, and these
>> allocations suffer from the same problems described above. In order
>> to quickly fall back to remote node, we should remove ALLOC_CMA both
>> in __compaction_suitable() and __isolate_free_page() for unmovable
>> alloctions. After this fix, starting a 32GB virtual machine with
>> device passthrough takes only a few seconds.
> 
> The symptom is obviously bad, but I don't understand this fix.
> 
> The reason we do ALLOC_CMA is that, even for unmovable allocations,
> you can create space in non-CMA space by moving migratable pages over
> to CMA space. This is not a property we want to lose. But I also don't

Good point. I missed that and I need to withdraw my reviewed tag. Thanks.

Patch

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index e947764..b4c3ac3 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -90,7 +90,8 @@  extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 		struct page **page);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern bool compaction_suitable(struct zone *zone, int order,
-					       int highest_zoneidx);
+					       int highest_zoneidx,
+					       unsigned int alloc_flags);
 
 extern void compaction_defer_reset(struct zone *zone, int order,
 				bool alloc_success);
@@ -108,7 +109,8 @@  static inline void reset_isolation_suitable(pg_data_t *pgdat)
 }
 
 static inline bool compaction_suitable(struct zone *zone, int order,
-						      int highest_zoneidx)
+						      int highest_zoneidx,
+						      unsigned int alloc_flags)
 {
 	return false;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..223f2da 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -655,7 +655,7 @@  static unsigned long isolate_freepages_block(struct compact_control *cc,
 
 		/* Found a free page, will break it into order-0 pages */
 		order = buddy_order(page);
-		isolated = __isolate_free_page(page, order);
+		isolated = __isolate_free_page(page, order, cc->alloc_flags);
 		if (!isolated)
 			break;
 		set_page_private(page, order);
@@ -1634,7 +1634,7 @@  static void fast_isolate_freepages(struct compact_control *cc)
 
 		/* Isolate the page if available */
 		if (page) {
-			if (__isolate_free_page(page, order)) {
+			if (__isolate_free_page(page, order, cc->alloc_flags)) {
 				set_page_private(page, order);
 				nr_isolated = 1 << order;
 				nr_scanned += nr_isolated - 1;
@@ -2381,6 +2381,7 @@  static enum compact_result compact_finished(struct compact_control *cc)
 
 static bool __compaction_suitable(struct zone *zone, int order,
 				  int highest_zoneidx,
+				  unsigned int alloc_flags,
 				  unsigned long wmark_target)
 {
 	unsigned long watermark;
@@ -2395,25 +2396,26 @@  static bool __compaction_suitable(struct zone *zone, int order,
 	 * even if compaction succeeds.
 	 * For costly orders, we require low watermark instead of min for
 	 * compaction to proceed to increase its chances.
-	 * ALLOC_CMA is used, as pages in CMA pageblocks are considered
-	 * suitable migration targets
+	 * In addition to unmovable allocations, ALLOC_CMA is used, as pages in
+	 * CMA pageblocks are considered suitable migration targets
 	 */
 	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
 				low_wmark_pages(zone) : min_wmark_pages(zone);
 	watermark += compact_gap(order);
 	return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
-				   ALLOC_CMA, wmark_target);
+				   alloc_flags & ALLOC_CMA, wmark_target);
 }
 
 /*
  * compaction_suitable: Is this suitable to run compaction on this zone now?
  */
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx,
+				   unsigned int alloc_flags)
 {
 	enum compact_result compact_result;
 	bool suitable;
 
-	suitable = __compaction_suitable(zone, order, highest_zoneidx,
+	suitable = __compaction_suitable(zone, order, highest_zoneidx, alloc_flags,
 					 zone_page_state(zone, NR_FREE_PAGES));
 	/*
 	 * fragmentation index determines if allocation failures are due to
@@ -2474,7 +2476,7 @@  bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 		available = zone_reclaimable_pages(zone) / order;
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 		if (__compaction_suitable(zone, order, ac->highest_zoneidx,
-					  available))
+					  alloc_flags, available))
 			return true;
 	}
 
@@ -2499,7 +2501,7 @@  compaction_suit_allocation_order(struct zone *zone, unsigned int order,
 			      alloc_flags))
 		return COMPACT_SUCCESS;
 
-	if (!compaction_suitable(zone, order, highest_zoneidx))
+	if (!compaction_suitable(zone, order, highest_zoneidx, alloc_flags))
 		return COMPACT_SKIPPED;
 
 	return COMPACT_CONTINUE;
@@ -2893,6 +2895,7 @@  static int compact_node(pg_data_t *pgdat, bool proactive)
 	struct compact_control cc = {
 		.order = -1,
 		.mode = proactive ? MIGRATE_SYNC_LIGHT : MIGRATE_SYNC,
+		.alloc_flags = ALLOC_CMA,
 		.ignore_skip_hint = true,
 		.whole_zone = true,
 		.gfp_mask = GFP_KERNEL,
@@ -3037,7 +3040,7 @@  static bool kcompactd_node_suitable(pg_data_t *pgdat)
 
 		ret = compaction_suit_allocation_order(zone,
 				pgdat->kcompactd_max_order,
-				highest_zoneidx, ALLOC_WMARK_MIN);
+				highest_zoneidx, ALLOC_CMA | ALLOC_WMARK_MIN);
 		if (ret == COMPACT_CONTINUE)
 			return true;
 	}
@@ -3058,6 +3061,7 @@  static void kcompactd_do_work(pg_data_t *pgdat)
 		.search_order = pgdat->kcompactd_max_order,
 		.highest_zoneidx = pgdat->kcompactd_highest_zoneidx,
 		.mode = MIGRATE_SYNC_LIGHT,
+		.alloc_flags = ALLOC_CMA | ALLOC_WMARK_MIN,
 		.ignore_skip_hint = false,
 		.gfp_mask = GFP_KERNEL,
 	};
@@ -3078,7 +3082,7 @@  static void kcompactd_do_work(pg_data_t *pgdat)
 			continue;
 
 		ret = compaction_suit_allocation_order(zone,
-				cc.order, zoneid, ALLOC_WMARK_MIN);
+				cc.order, zoneid, cc.alloc_flags);
 		if (ret != COMPACT_CONTINUE)
 			continue;
 
diff --git a/mm/internal.h b/mm/internal.h
index 3922788..6d257c8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -662,7 +662,8 @@  static inline void clear_zone_contiguous(struct zone *zone)
 	zone->contiguous = false;
 }
 
-extern int __isolate_free_page(struct page *page, unsigned int order);
+extern int __isolate_free_page(struct page *page, unsigned int order,
+				    unsigned int alloc_flags);
 extern void __putback_isolated_page(struct page *page, unsigned int order,
 				    int mt);
 extern void memblock_free_pages(struct page *page, unsigned long pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dde19db..1bfdca3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2809,7 +2809,8 @@  void split_page(struct page *page, unsigned int order)
 }
 EXPORT_SYMBOL_GPL(split_page);
 
-int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order,
+				   unsigned int alloc_flags)
 {
 	struct zone *zone = page_zone(page);
 	int mt = get_pageblock_migratetype(page);
@@ -2823,7 +2824,8 @@  int __isolate_free_page(struct page *page, unsigned int order)
 		 * exists.
 		 */
 		watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
-		if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+		if (!zone_watermark_ok(zone, 0, watermark, 0,
+			    alloc_flags & ALLOC_CMA))
 			return 0;
 	}
 
@@ -6454,6 +6456,7 @@  int alloc_contig_range_noprof(unsigned long start, unsigned long end,
 		.order = -1,
 		.zone = page_zone(pfn_to_page(start)),
 		.mode = MIGRATE_SYNC,
+		.alloc_flags = ALLOC_CMA,
 		.ignore_skip_hint = true,
 		.no_set_skip_hint = true,
 		.alloc_contig = true,
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index c608e9d..a1f2c79 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -229,7 +229,8 @@  static void unset_migratetype_isolate(struct page *page, int migratetype)
 			buddy = find_buddy_page_pfn(page, page_to_pfn(page),
 						    order, NULL);
 			if (buddy && !is_migrate_isolate_page(buddy)) {
-				isolated_page = !!__isolate_free_page(page, order);
+				isolated_page = !!__isolate_free_page(page, order,
+						    ALLOC_CMA);
 				/*
 				 * Isolating a free page in an isolated pageblock
 				 * is expected to always work as watermarks don't
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index e4c428e..fd3813b 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -198,7 +198,7 @@  page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
 
 		/* Attempt to pull page from list and place in scatterlist */
 		if (*offset) {
-			if (!__isolate_free_page(page, order)) {
+			if (!__isolate_free_page(page, order, ALLOC_CMA)) {
 				next = page;
 				break;
 			}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e03a61..33f5b46 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5815,7 +5815,7 @@  static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 				      sc->reclaim_idx, 0))
 			return false;
 
-		if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+		if (compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
 			return false;
 	}
 
@@ -6043,7 +6043,7 @@  static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 		return true;
 
 	/* Compaction cannot yet proceed. Do reclaim. */
-	if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
+	if (!compaction_suitable(zone, sc->order, sc->reclaim_idx, ALLOC_CMA))
 		return false;
 
 	/*