
mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page

Message ID 1717492460-19457-1-git-send-email-yangge1116@126.com (mailing list archive)
State New
Series mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page

Commit Message

Ge Yang June 4, 2024, 9:14 a.m. UTC
From: yangge <yangge1116@126.com>

Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations"), the THP-sized PCP list no longer
differentiates the migration type of its pages, so it is possible to
get a CMA page from the list. In some cases this is not acceptable;
for example, an allocation with the PF_MEMALLOC_PIN flag must not
return a CMA page.

This patch forbids non-CMA THP-sized allocations from using the
THP-sized PCP list, avoiding the issue above.

Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
Signed-off-by: yangge <yangge1116@126.com>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
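
[Editor's note] The fix keys off ALLOC_CMA, so it may help to see why that
flag identifies allocations that are allowed to take CMA pages. Below is a
minimal, standalone C model of that logic — a sketch, not kernel source: the
flag values are placeholders, and the two helpers only mimic the behaviour of
the kernel's current_gfp_context() and gfp_to_alloc_flags_cma(), where
PF_MEMALLOC_PIN strips __GFP_MOVABLE so a pinned allocation never gets
ALLOC_CMA.

/*
 * Standalone model of the allocator's CMA gating (not kernel source).
 * Flag values are placeholders chosen for this example.
 */
#include <stdio.h>

#define __GFP_MOVABLE	0x08u
#define PF_MEMALLOC_PIN	0x10000000u
#define ALLOC_CMA	0x80u

/* models current_gfp_context(): pinned tasks lose __GFP_MOVABLE */
static unsigned int current_gfp_context(unsigned int gfp, unsigned int pflags)
{
	if (pflags & PF_MEMALLOC_PIN)
		gfp &= ~__GFP_MOVABLE;
	return gfp;
}

/* models gfp_to_alloc_flags_cma(): only movable requests may use CMA */
static unsigned int gfp_to_alloc_flags_cma(unsigned int gfp)
{
	return (gfp & __GFP_MOVABLE) ? ALLOC_CMA : 0;
}

int main(void)
{
	unsigned int gfp = __GFP_MOVABLE;

	/* ordinary movable allocation: CMA is allowed */
	printf("movable: ALLOC_CMA=%d\n",
	       !!(gfp_to_alloc_flags_cma(current_gfp_context(gfp, 0))));
	/* long-term pin: __GFP_MOVABLE is stripped, CMA is forbidden */
	printf("pinned:  ALLOC_CMA=%d\n",
	       !!(gfp_to_alloc_flags_cma(current_gfp_context(gfp, PF_MEMALLOC_PIN))));
	return 0;
}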

Comments

Baolin Wang June 4, 2024, 12:01 p.m. UTC | #1
Cc Johannes, Zi and Vlastimil.

On 2024/6/4 17:14, yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
> 
> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> THP-sized allocations") no longer differentiates the migration type
> of pages in THP-sized PCP list, it's possible to get a CMA page from
> the list, in some cases, it's not acceptable, for example, allocating
> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
> 
> The patch forbids allocating non-CMA THP-sized page from THP-sized
> PCP list to avoid the issue above.
> 
> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
> Signed-off-by: yangge <yangge1116@126.com>
> ---
>   mm/page_alloc.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e22ce5..0bdf471 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>   	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>   
>   	if (likely(pcp_allowed_order(order))) {
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> +						order != HPAGE_PMD_ORDER) {

It seems you will also miss non-CMA THPs from the PCP, so I wonder if we
can add a migratetype comparison in __rmqueue_pcplist() and, if the page
is not suitable, fall back to the buddy allocator?

> +			page = rmqueue_pcplist(preferred_zone, zone, order,
> +						migratetype, alloc_flags);
> +			if (likely(page))
> +				goto out;
> +		}
> +#else
>   		page = rmqueue_pcplist(preferred_zone, zone, order,
>   				       migratetype, alloc_flags);
>   		if (likely(page))
>   			goto out;
> +#endif
>   	}
>   
>   	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
Ge Yang June 4, 2024, 12:36 p.m. UTC | #2
On 2024/6/4 8:01 PM, Baolin Wang wrote:
> Cc Johannes, Zi and Vlastimil.
> 
> On 2024/6/4 17:14, yangge1116@126.com wrote:
>> From: yangge <yangge1116@126.com>
>>
>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>> THP-sized allocations") no longer differentiates the migration type
>> of pages in THP-sized PCP list, it's possible to get a CMA page from
>> the list, in some cases, it's not acceptable, for example, allocating
>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>
>> The patch forbids allocating non-CMA THP-sized page from THP-sized
>> PCP list to avoid the issue above.
>>
>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for 
>> THP-sized allocations")
>> Signed-off-by: yangge <yangge1116@126.com>
>> ---
>>   mm/page_alloc.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 2e22ce5..0bdf471 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>>       WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>       if (likely(pcp_allowed_order(order))) {
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +        if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>> +                        order != HPAGE_PMD_ORDER) {
> 
> Seems you will also miss the non-CMA THP from the PCP, so I wonder if we 
> can add a migratetype comparison in __rmqueue_pcplist(), and if it's not 
> suitable, then fallback to buddy?

Yes, we may miss some non-CMA THPs in the PCP. But if we add a
migratetype comparison in __rmqueue_pcplist(), we may need to compare
many times because of the PCP batch.
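
[Editor's note] To make the batching concern concrete, here is a small
standalone model (an illustrative assumption, not kernel code): each PCP
refill brings in a batch of pages, and a strict per-page filter may have to
test every CMA entry it meets before finding a usable page.

/* Toy model of per-page CMA checks across a PCP batch (not kernel source). */
#include <stdbool.h>
#include <stdio.h>

#define BATCH 8

struct pcp_page { bool is_cma; };

/* walk the batch, skipping CMA entries for a pinned (non-movable) request */
static int take_non_cma(const struct pcp_page *list, int count)
{
	for (int i = 0; i < count; i++) {	/* up to 'count' comparisons */
		if (!list[i].is_cma)
			return i;		/* found a usable page */
	}
	return -1;				/* fall back to buddy */
}

int main(void)
{
	struct pcp_page batch[BATCH] = {
		{true}, {true}, {true}, {false},  /* CMA-heavy front of the list */
	};
	int idx = take_non_cma(batch, BATCH);

	printf("usable page at index %d after %d checks\n", idx, idx + 1);
	return 0;
}
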
Baolin Wang June 6, 2024, 3:06 a.m. UTC | #3
On 2024/6/4 20:36, yangge1116 wrote:
> 
> 
> On 2024/6/4 8:01 PM, Baolin Wang wrote:
>> Cc Johannes, Zi and Vlastimil.
>>
>> On 2024/6/4 17:14, yangge1116@126.com wrote:
>>> From: yangge <yangge1116@126.com>
>>>
>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>> THP-sized allocations") no longer differentiates the migration type
>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
>>> the list, in some cases, it's not acceptable, for example, allocating
>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>
>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
>>> PCP list to avoid the issue above.
>>>
>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for 
>>> THP-sized allocations")
>>> Signed-off-by: yangge <yangge1116@126.com>
>>> ---
>>>   mm/page_alloc.c | 10 ++++++++++
>>>   1 file changed, 10 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 2e22ce5..0bdf471 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone 
>>> *preferred_zone,
>>>       WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>       if (likely(pcp_allowed_order(order))) {
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +        if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>> +                        order != HPAGE_PMD_ORDER) {
>>
>> Seems you will also miss the non-CMA THP from the PCP, so I wonder if 
>> we can add a migratetype comparison in __rmqueue_pcplist(), and if 
>> it's not suitable, then fallback to buddy?
> 
> Yes, we may miss some non-CMA THPs in the PCP. But, if add a migratetype 
> comparison in __rmqueue_pcplist(), we may need to compare many times 
> because of pcp batch.

I mean we can compare just once, focusing on CMA pages.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3734fe7e67c0..960a3b5744d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2973,6 +2973,11 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
                 }

                 page = list_first_entry(list, struct page, pcp_list);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+               if (order == HPAGE_PMD_ORDER && !is_migrate_movable(migratetype) &&
+                   is_migrate_cma(get_pageblock_migratetype(page)))
+                       return NULL;
+#endif
                 list_del(&page->pcp_list);
                 pcp->count -= 1 << order;
         } while (check_new_pages(page, order));
Ge Yang June 6, 2024, 9:10 a.m. UTC | #4
On 2024/6/6 11:06 AM, Baolin Wang wrote:
> 
> 
> On 2024/6/4 20:36, yangge1116 wrote:
>>
>>
>> On 2024/6/4 8:01 PM, Baolin Wang wrote:
>>> Cc Johannes, Zi and Vlastimil.
>>>
>>> On 2024/6/4 17:14, yangge1116@126.com wrote:
>>>> From: yangge <yangge1116@126.com>
>>>>
>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>> THP-sized allocations") no longer differentiates the migration type
>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
>>>> the list, in some cases, it's not acceptable, for example, allocating
>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>>
>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
>>>> PCP list to avoid the issue above.
>>>>
>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for 
>>>> THP-sized allocations")
>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>> ---
>>>>   mm/page_alloc.c | 10 ++++++++++
>>>>   1 file changed, 10 insertions(+)
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 2e22ce5..0bdf471 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone 
>>>> *preferred_zone,
>>>>       WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>       if (likely(pcp_allowed_order(order))) {
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +        if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>> +                        order != HPAGE_PMD_ORDER) {
>>>
>>> Seems you will also miss the non-CMA THP from the PCP, so I wonder if 
>>> we can add a migratetype comparison in __rmqueue_pcplist(), and if 
>>> it's not suitable, then fallback to buddy?
>>
>> Yes, we may miss some non-CMA THPs in the PCP. But, if add a 
>> migratetype comparison in __rmqueue_pcplist(), we may need to compare 
>> many times because of pcp batch.
> 
> I mean we can only compare once, focusing on CMA pages.

The pcp_list may contain both CMA and non-CMA pages. Why compare only
once? That just gives one extra chance of using the pcp_list.


> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3734fe7e67c0..960a3b5744d8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2973,6 +2973,11 @@ struct page *__rmqueue_pcplist(struct zone *zone, 
> unsigned int order,
>                  }
> 
>                  page = list_first_entry(list, struct page, pcp_list);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +               if (order == HPAGE_PMD_ORDER && 
> !is_migrate_movable(migratetype) &&
> +                   is_migrate_cma(get_pageblock_migratetype(page)))
> +                       return NULL;
> +#endif
>                  list_del(&page->pcp_list);
>                  pcp->count -= 1 << order;
>          } while (check_new_pages(page, order));
Barry Song June 17, 2024, 10:26 a.m. UTC | #5
On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>
> From: yangge <yangge1116@126.com>
>
> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> THP-sized allocations") no longer differentiates the migration type
> of pages in THP-sized PCP list, it's possible to get a CMA page from
> the list, in some cases, it's not acceptable, for example, allocating
> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>
> The patch forbids allocating non-CMA THP-sized page from THP-sized
> PCP list to avoid the issue above.

Could you please describe the impact on users in the commit log?
Is it possible that some CMA memory might be used by non-movable
allocation requests? If so, will CMA somehow become unable to
migrate, causing cma_alloc() to fail?

>
> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
> Signed-off-by: yangge <yangge1116@126.com>
> ---
>  mm/page_alloc.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e22ce5..0bdf471 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>         WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>
>         if (likely(pcp_allowed_order(order))) {
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> +                                               order != HPAGE_PMD_ORDER) {
> +                       page = rmqueue_pcplist(preferred_zone, zone, order,
> +                                               migratetype, alloc_flags);
> +                       if (likely(page))
> +                               goto out;
> +               }

This does not seem ideal, because non-CMA THPs get no chance to use the
PCP. But it still seems better than causing CMA allocation failures.

Is there a possible approach to avoid adding CMA THPs into the PCP from
the very beginning? Otherwise, we might need a separate PCP for CMA.

> +#else
>                 page = rmqueue_pcplist(preferred_zone, zone, order,
>                                        migratetype, alloc_flags);
>                 if (likely(page))
>                         goto out;
> +#endif
>         }
>
>         page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> --
> 2.7.4

Thanks
Barry
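
[Editor's note] As one concrete reading of Barry's last question, here is a
standalone model (an illustrative assumption, not an existing patch) of
filtering at free time: route THP-sized CMA pages straight back to the buddy
when they are freed, so the single THP PCP list never mixes CMA and non-CMA
pages in the first place.

/* Standalone model of a free-time CMA filter (not kernel source). */
#include <stdbool.h>
#include <stdio.h>

#define HPAGE_PMD_ORDER 9	/* assumes 2 MiB THP with 4 KiB base pages */

enum dest { DEST_PCP, DEST_BUDDY };

struct page_model { bool cma; int order; };

static enum dest free_destination(const struct page_model *p)
{
	/* CMA THPs bypass the PCP so pinned allocations never see them */
	if (p->order == HPAGE_PMD_ORDER && p->cma)
		return DEST_BUDDY;
	return DEST_PCP;
}

int main(void)
{
	struct page_model cma_thp = { .cma = true,  .order = HPAGE_PMD_ORDER };
	struct page_model reg_thp = { .cma = false, .order = HPAGE_PMD_ORDER };

	printf("CMA THP   -> %s\n",
	       free_destination(&cma_thp) == DEST_BUDDY ? "buddy" : "pcp");
	printf("plain THP -> %s\n",
	       free_destination(&reg_thp) == DEST_BUDDY ? "buddy" : "pcp");
	return 0;
}
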
Barry Song June 17, 2024, 10:43 a.m. UTC | #6
On Thu, Jun 6, 2024 at 3:07 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2024/6/4 20:36, yangge1116 wrote:
> >
> >
> > On 2024/6/4 8:01 PM, Baolin Wang wrote:
> >> Cc Johannes, Zi and Vlastimil.
> >>
> >> On 2024/6/4 17:14, yangge1116@126.com wrote:
> >>> From: yangge <yangge1116@126.com>
> >>>
> >>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>> THP-sized allocations") no longer differentiates the migration type
> >>> of pages in THP-sized PCP list, it's possible to get a CMA page from
> >>> the list, in some cases, it's not acceptable, for example, allocating
> >>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
> >>>
> >>> The patch forbids allocating non-CMA THP-sized page from THP-sized
> >>> PCP list to avoid the issue above.
> >>>
> >>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>> THP-sized allocations")
> >>> Signed-off-by: yangge <yangge1116@126.com>
> >>> ---
> >>>   mm/page_alloc.c | 10 ++++++++++
> >>>   1 file changed, 10 insertions(+)
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 2e22ce5..0bdf471 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
> >>> *preferred_zone,
> >>>       WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >>>       if (likely(pcp_allowed_order(order))) {
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +        if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> >>> +                        order != HPAGE_PMD_ORDER) {
> >>
> >> Seems you will also miss the non-CMA THP from the PCP, so I wonder if
> >> we can add a migratetype comparison in __rmqueue_pcplist(), and if
> >> it's not suitable, then fallback to buddy?
> >
> > Yes, we may miss some non-CMA THPs in the PCP. But, if add a migratetype
> > comparison in __rmqueue_pcplist(), we may need to compare many times
> > because of pcp batch.
>
> I mean we can only compare once, focusing on CMA pages.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3734fe7e67c0..960a3b5744d8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2973,6 +2973,11 @@ struct page *__rmqueue_pcplist(struct zone *zone,
> unsigned int order,
>                  }
>
>                  page = list_first_entry(list, struct page, pcp_list);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +               if (order == HPAGE_PMD_ORDER &&
> !is_migrate_movable(migratetype) &&
> +                   is_migrate_cma(get_pageblock_migratetype(page)))
> +                       return NULL;
> +#endif

This doesn't seem ideal either. It's possible that the PCP still has many
non-CMA folios, but due to bad luck, the first entry is "always" CMA. In
this case, allocations with is_migrate_movable(migratetype) == false will
always lose the chance to use the PCP. It also appears to incur a PCP
spin lock/unlock.

I don't see an ideal solution unless we bring back the CMA PCP :-)

>                  list_del(&page->pcp_list);
>                  pcp->count -= 1 << order;
>          } while (check_new_pages(page, order));
>
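
[Editor's note] A standalone model of the bad-luck case Barry describes
(illustration only, not kernel code): with one CMA page parked at the head of
the list, a check limited to the first entry falls back to the buddy on every
non-movable attempt, even though usable pages sit right behind it.

/* Toy model of the "CMA page stuck at the list head" failure mode. */
#include <stdbool.h>
#include <stdio.h>

struct pcp_page { bool is_cma; };

/* the one-shot check: look only at the list head */
static int one_shot_take(const struct pcp_page *list, int count)
{
	if (count && list[0].is_cma)
		return -1;		/* "return NULL": fall back to buddy */
	return count ? 0 : -1;
}

int main(void)
{
	/* one unlucky CMA page at the head, non-CMA pages behind it */
	struct pcp_page list[4] = { {true}, {false}, {false}, {false} };

	/* no movable allocation removes the head, so every attempt fails */
	for (int attempt = 0; attempt < 3; attempt++)
		printf("attempt %d: %s\n", attempt,
		       one_shot_take(list, 4) < 0 ? "buddy fallback" : "pcp hit");
	return 0;
}
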
Baolin Wang June 17, 2024, 11:36 a.m. UTC | #7
On 2024/6/17 18:43, Barry Song wrote:
> On Thu, Jun 6, 2024 at 3:07 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2024/6/4 20:36, yangge1116 wrote:
>>>
>>>
>>> On 2024/6/4 8:01 PM, Baolin Wang wrote:
>>>> Cc Johannes, Zi and Vlastimil.
>>>>
>>>> On 2024/6/4 17:14, yangge1116@126.com wrote:
>>>>> From: yangge <yangge1116@126.com>
>>>>>
>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
>>>>> the list, in some cases, it's not acceptable, for example, allocating
>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>>>
>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
>>>>> PCP list to avoid the issue above.
>>>>>
>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>> THP-sized allocations")
>>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>>> ---
>>>>>    mm/page_alloc.c | 10 ++++++++++
>>>>>    1 file changed, 10 insertions(+)
>>>>>
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index 2e22ce5..0bdf471 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
>>>>> *preferred_zone,
>>>>>        WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>        if (likely(pcp_allowed_order(order))) {
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +        if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>>> +                        order != HPAGE_PMD_ORDER) {
>>>>
>>>> Seems you will also miss the non-CMA THP from the PCP, so I wonder if
>>>> we can add a migratetype comparison in __rmqueue_pcplist(), and if
>>>> it's not suitable, then fallback to buddy?
>>>
>>> Yes, we may miss some non-CMA THPs in the PCP. But, if add a migratetype
>>> comparison in __rmqueue_pcplist(), we may need to compare many times
>>> because of pcp batch.
>>
>> I mean we can only compare once, focusing on CMA pages.
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3734fe7e67c0..960a3b5744d8 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2973,6 +2973,11 @@ struct page *__rmqueue_pcplist(struct zone *zone,
>> unsigned int order,
>>                   }
>>
>>                   page = list_first_entry(list, struct page, pcp_list);
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +               if (order == HPAGE_PMD_ORDER &&
>> !is_migrate_movable(migratetype) &&
>> +                   is_migrate_cma(get_pageblock_migratetype(page)))
>> +                       return NULL;
>> +#endif
> 
> This doesn't seem ideal either. It's possible that the PCP still has many
> non-CMA folios, but due to bad luck, the first entry is "always" CMA.
> In this case,
> allocations with is_migrate_movable(migratetype) == false will always lose the
> chance to use the PCP.   It also appears to incur a PCP spin lock/unlock.

Yes, just some ideas to mitigate the issue...

> 
> I don't see an ideal solution unless we bring back the CMA PCP :-)

I tend to agree, and the overhead of adding a CMA PCP seems acceptable?
Barry Song June 17, 2024, 11:55 a.m. UTC | #8
On Mon, Jun 17, 2024 at 7:36 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2024/6/17 18:43, Barry Song wrote:
> > On Thu, Jun 6, 2024 at 3:07 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 2024/6/4 20:36, yangge1116 wrote:
> >>>
> >>>
> >>> On 2024/6/4 8:01 PM, Baolin Wang wrote:
> >>>> Cc Johannes, Zi and Vlastimil.
> >>>>
> >>>> On 2024/6/4 17:14, yangge1116@126.com wrote:
> >>>>> From: yangge <yangge1116@126.com>
> >>>>>
> >>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>> THP-sized allocations") no longer differentiates the migration type
> >>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
> >>>>> the list, in some cases, it's not acceptable, for example, allocating
> >>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
> >>>>>
> >>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
> >>>>> PCP list to avoid the issue above.
> >>>>>
> >>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>> THP-sized allocations")
> >>>>> Signed-off-by: yangge <yangge1116@126.com>
> >>>>> ---
> >>>>>    mm/page_alloc.c | 10 ++++++++++
> >>>>>    1 file changed, 10 insertions(+)
> >>>>>
> >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>> index 2e22ce5..0bdf471 100644
> >>>>> --- a/mm/page_alloc.c
> >>>>> +++ b/mm/page_alloc.c
> >>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
> >>>>> *preferred_zone,
> >>>>>        WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >>>>>        if (likely(pcp_allowed_order(order))) {
> >>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>>> +        if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> >>>>> +                        order != HPAGE_PMD_ORDER) {
> >>>>
> >>>> Seems you will also miss the non-CMA THP from the PCP, so I wonder if
> >>>> we can add a migratetype comparison in __rmqueue_pcplist(), and if
> >>>> it's not suitable, then fallback to buddy?
> >>>
> >>> Yes, we may miss some non-CMA THPs in the PCP. But, if add a migratetype
> >>> comparison in __rmqueue_pcplist(), we may need to compare many times
> >>> because of pcp batch.
> >>
> >> I mean we can only compare once, focusing on CMA pages.
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 3734fe7e67c0..960a3b5744d8 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -2973,6 +2973,11 @@ struct page *__rmqueue_pcplist(struct zone *zone,
> >> unsigned int order,
> >>                   }
> >>
> >>                   page = list_first_entry(list, struct page, pcp_list);
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> +               if (order == HPAGE_PMD_ORDER &&
> >> !is_migrate_movable(migratetype) &&
> >> +                   is_migrate_cma(get_pageblock_migratetype(page)))
> >> +                       return NULL;
> >> +#endif
> >
> > This doesn't seem ideal either. It's possible that the PCP still has many
> > non-CMA folios, but due to bad luck, the first entry is "always" CMA.
> > In this case,
> > allocations with is_migrate_movable(migratetype) == false will always lose the
> > chance to use the PCP.   It also appears to incur a PCP spin lock/unlock.
>
> Yes, just some ideas to to mitigate the issue...
>
> >
> > I don't see an ideal solution unless we bring back the CMA PCP :-)
>
> Tend to agree, and adding a CMA PCP seems the overhead can be acceptable?

Yes, probably. Hi Ge,

Could we printk the size of struct per_cpu_pages before and after adding
1 to NR_PCP_LISTS? Does it grow by one cacheline?

struct per_cpu_pages {
	spinlock_t lock;	/* Protects lists field */
	int count;		/* number of pages in the list */
	int high;		/* high watermark, emptying needed */
	int high_min;		/* min high watermark */
	int high_max;		/* max high watermark */
	int batch;		/* chunk size for buddy add/remove */
	u8 flags;		/* protected by pcp->lock */
	u8 alloc_factor;	/* batch scaling factor during allocate */
#ifdef CONFIG_NUMA
	u8 expire;		/* When 0, remote pagesets are drained */
#endif
	short free_count;	/* consecutive free count */

	/* Lists of pages, one per migrate type stored on the pcp-lists */
	struct list_head lists[NR_PCP_LISTS];
} ____cacheline_aligned_in_smp;
Ge Yang June 17, 2024, 12:47 p.m. UTC | #9
On 2024/6/17 6:26 PM, Barry Song wrote:
> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>>
>> From: yangge <yangge1116@126.com>
>>
>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>> THP-sized allocations") no longer differentiates the migration type
>> of pages in THP-sized PCP list, it's possible to get a CMA page from
>> the list, in some cases, it's not acceptable, for example, allocating
>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>
>> The patch forbids allocating non-CMA THP-sized page from THP-sized
>> PCP list to avoid the issue above.
> 
> Could you please describe the impact on users in the commit log?

If a large amount of CMA memory is configured in the system (for
example, CMA memory accounts for 50% of system memory), starting a
virtual machine with device passthrough will get stuck.

While starting the virtual machine, it calls pin_user_pages_remote(...,
FOLL_LONGTERM, ...) to pin memory. If a page is in a CMA area,
pin_user_pages_remote() will migrate the page from the CMA area to a
non-CMA area because of the FOLL_LONGTERM flag. If non-movable
allocation requests return CMA memory, pin_user_pages_remote() will
enter an endless loop.

backtrace:
pin_user_pages_remote
----__gup_longterm_locked // the endless loop happens in this function
--------__get_user_pages_locked
--------check_and_migrate_movable_pages // the check always fails, so it keeps migrating
------------migrate_longterm_unpinnable_pages
----------------alloc_migration_target // non-movable allocation

> Is it possible that some CMA memory might be used by non-movable
> allocation requests? 

Yes.


> If so, will CMA somehow become unable to migrate, causing cma_alloc() to fail?


No, it will cause an endless loop in __gup_longterm_locked(). If
non-movable allocation requests return CMA memory,
migrate_longterm_unpinnable_pages() will migrate a CMA page to another
CMA page, which is useless and causes the endless loop in
__gup_longterm_locked().

backtrace:
pin_user_pages_remote
----__gup_longterm_locked // the endless loop happens in this function
--------__get_user_pages_locked
--------check_and_migrate_movable_pages // the check always fails, so it keeps migrating
------------migrate_longterm_unpinnable_pages
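
[Editor's note] For illustration, a standalone model of the livelock just
described (an assumption-based sketch, not kernel code, capped at ten
iterations so it terminates): while the migration target allocation can
itself return CMA memory, the re-check in the pin path never passes.

/* Standalone model of the longterm-pin migration livelock. */
#include <stdbool.h>
#include <stdio.h>

static bool buddy_returns_cma;	/* true models the pre-fix behaviour */

/* models alloc_migration_target() for a non-movable request */
static bool alloc_target_is_cma(void)
{
	return buddy_returns_cma;
}

/* models the retry loop in __gup_longterm_locked() */
static int gup_longterm(void)
{
	bool page_in_cma = true;	/* the page we want to pin */
	int tries = 0;

	while (page_in_cma) {
		if (++tries > 10)
			return -1;	/* would spin forever */
		page_in_cma = alloc_target_is_cma(); /* migrate, then re-check */
	}
	return tries;
}

int main(void)
{
	buddy_returns_cma = true;
	printf("pre-fix:  %d\n", gup_longterm());	/* -1: livelock */
	buddy_returns_cma = false;
	printf("post-fix: %d try\n", gup_longterm());	/* terminates */
	return 0;
}
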





>>
>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
>> Signed-off-by: yangge <yangge1116@126.com>
>> ---
>>   mm/page_alloc.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 2e22ce5..0bdf471 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>>          WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>
>>          if (likely(pcp_allowed_order(order))) {
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>> +                                               order != HPAGE_PMD_ORDER) {
>> +                       page = rmqueue_pcplist(preferred_zone, zone, order,
>> +                                               migratetype, alloc_flags);
>> +                       if (likely(page))
>> +                               goto out;
>> +               }
> 
> This seems not ideal, because non-CMA THP gets no chance to use PCP. But it
> still seems better than causing the failure of CMA allocation.
> 
> Is there a possible approach to avoiding adding CMA THP into pcp from the first
> beginning? Otherwise, we might need a separate PCP for CMA.
> 
>> +#else
>>                  page = rmqueue_pcplist(preferred_zone, zone, order,
>>                                         migratetype, alloc_flags);
>>                  if (likely(page))
>>                          goto out;
>> +#endif
>>          }
>>
>>          page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>> --
>> 2.7.4
> 
> Thanks
> Barry
>
Ge Yang June 18, 2024, 1:34 a.m. UTC | #10
On 2024/6/17 8:47 PM, yangge1116 wrote:
> 
> 
> On 2024/6/17 6:26 PM, Barry Song wrote:
>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>>>
>>> From: yangge <yangge1116@126.com>
>>>
>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>> THP-sized allocations") no longer differentiates the migration type
>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
>>> the list, in some cases, it's not acceptable, for example, allocating
>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>
>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
>>> PCP list to avoid the issue above.
>>
>> Could you please describe the impact on users in the commit log?
> 
> If a large number of CMA memory are configured in the system (for 
> example, the CMA memory accounts for 50% of the system memory), starting 
> virtual machine with device passthrough will get stuck.
> 
> During starting virtual machine, it will call pin_user_pages_remote(..., 
> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area, 
> pin_user_pages_remote() will migrate the page from CMA area to non-CMA 
> area because of FOLL_LONGTERM flag. If non-movable allocation requests 
> return CMA memory, pin_user_pages_remote() will enter endless loops.
> 
> backtrace:
> pin_user_pages_remote
> ----__gup_longterm_locked //cause endless loops in this function
> --------__get_user_pages_locked
> --------check_and_migrate_movable_pages //always check fail and continue 
> to migrate
> ------------migrate_longterm_unpinnable_pages
> ----------------alloc_migration_target // non-movable allocation
> 
>> Is it possible that some CMA memory might be used by non-movable
>> allocation requests? 
> 
> Yes.
> 
> 
>> If so, will CMA somehow become unable to migrate, causing cma_alloc() 
>> to fail?
> 
> 
> No, it will cause endless loops in __gup_longterm_locked(). If 
> non-movable allocation requests return CMA memory, 
> migrate_longterm_unpinnable_pages() will migrate a CMA page to another 
> CMA page, which is useless and cause endless loops in 
> __gup_longterm_locked().
> 
> backtrace:
> pin_user_pages_remote
> ----__gup_longterm_locked //cause endless loops in this function
> --------__get_user_pages_locked
> --------check_and_migrate_movable_pages //always check fail and continue 
> to migrate
> ------------migrate_longterm_unpinnable_pages
> 
> 
> 
> 
> 
>>>
>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for 
>>> THP-sized allocations")
>>> Signed-off-by: yangge <yangge1116@126.com>
>>> ---
>>>   mm/page_alloc.c | 10 ++++++++++
>>>   1 file changed, 10 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 2e22ce5..0bdf471 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone 
>>> *preferred_zone,
>>>          WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>
>>>          if (likely(pcp_allowed_order(order))) {
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & 
>>> ALLOC_CMA ||
>>> +                                               order != 
>>> HPAGE_PMD_ORDER) {
>>> +                       page = rmqueue_pcplist(preferred_zone, zone, 
>>> order,
>>> +                                               migratetype, 
>>> alloc_flags);
>>> +                       if (likely(page))
>>> +                               goto out;
>>> +               }
>>
>> This seems not ideal, because non-CMA THP gets no chance to use PCP. 
>> But it
>> still seems better than causing the failure of CMA allocation.
>>
>> Is there a possible approach to avoiding adding CMA THP into pcp from 
>> the first
>> beginning? Otherwise, we might need a separate PCP for CMA.
>>

The vast majority of THP-sized allocations are GFP_MOVABLE, so avoiding
adding CMA THPs into the PCP may incur a slight performance penalty.

Commit 1d91df85f399 takes a similar filtering approach, which I mainly
referred to.


>>> +#else
>>>                  page = rmqueue_pcplist(preferred_zone, zone, order,
>>>                                         migratetype, alloc_flags);
>>>                  if (likely(page))
>>>                          goto out;
>>> +#endif
>>>          }
>>>
>>>          page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>> -- 
>>> 2.7.4
>>
>> Thanks
>> Barry
>>
Barry Song June 18, 2024, 1:55 a.m. UTC | #11
On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote:
>
>
>
> On 2024/6/17 8:47 PM, yangge1116 wrote:
> >
> >
> > On 2024/6/17 6:26 PM, Barry Song wrote:
> >> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
> >>>
> >>> From: yangge <yangge1116@126.com>
> >>>
> >>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>> THP-sized allocations") no longer differentiates the migration type
> >>> of pages in THP-sized PCP list, it's possible to get a CMA page from
> >>> the list, in some cases, it's not acceptable, for example, allocating
> >>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
> >>>
> >>> The patch forbids allocating non-CMA THP-sized page from THP-sized
> >>> PCP list to avoid the issue above.
> >>
> >> Could you please describe the impact on users in the commit log?
> >
> > If a large number of CMA memory are configured in the system (for
> > example, the CMA memory accounts for 50% of the system memory), starting
> > virtual machine with device passthrough will get stuck.
> >
> > During starting virtual machine, it will call pin_user_pages_remote(...,
> > FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
> > pin_user_pages_remote() will migrate the page from CMA area to non-CMA
> > area because of FOLL_LONGTERM flag. If non-movable allocation requests
> > return CMA memory, pin_user_pages_remote() will enter endless loops.
> >
> > backtrace:
> > pin_user_pages_remote
> > ----__gup_longterm_locked //cause endless loops in this function
> > --------__get_user_pages_locked
> > --------check_and_migrate_movable_pages //always check fail and continue
> > to migrate
> > ------------migrate_longterm_unpinnable_pages
> > ----------------alloc_migration_target // non-movable allocation
> >
> >> Is it possible that some CMA memory might be used by non-movable
> >> allocation requests?
> >
> > Yes.
> >
> >
> >> If so, will CMA somehow become unable to migrate, causing cma_alloc()
> >> to fail?
> >
> >
> > No, it will cause endless loops in __gup_longterm_locked(). If
> > non-movable allocation requests return CMA memory,
> > migrate_longterm_unpinnable_pages() will migrate a CMA page to another
> > CMA page, which is useless and cause endless loops in
> > __gup_longterm_locked().

This is only one perspective. We also need to consider the impact on CMA
itself. For example, when CMA is borrowed by THP and we need to reclaim
it through cma_alloc() or dma_alloc_coherent(), we must move those pages
out to ensure CMA's users can retrieve that contiguous memory.

Currently, CMA's memory is occupied by non-movable pages, meaning we
can't relocate them. As a result, cma_alloc() is more likely to fail.

> >
> > backtrace:
> > pin_user_pages_remote
> > ----__gup_longterm_locked //cause endless loops in this function
> > --------__get_user_pages_locked
> > --------check_and_migrate_movable_pages //always check fail and continue
> > to migrate
> > ------------migrate_longterm_unpinnable_pages
> >
> >
> >
> >
> >
> >>>
> >>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>> THP-sized allocations")
> >>> Signed-off-by: yangge <yangge1116@126.com>
> >>> ---
> >>>   mm/page_alloc.c | 10 ++++++++++
> >>>   1 file changed, 10 insertions(+)
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 2e22ce5..0bdf471 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
> >>> *preferred_zone,
> >>>          WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >>>
> >>>          if (likely(pcp_allowed_order(order))) {
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
> >>> ALLOC_CMA ||
> >>> +                                               order !=
> >>> HPAGE_PMD_ORDER) {
> >>> +                       page = rmqueue_pcplist(preferred_zone, zone,
> >>> order,
> >>> +                                               migratetype,
> >>> alloc_flags);
> >>> +                       if (likely(page))
> >>> +                               goto out;
> >>> +               }
> >>
> >> This seems not ideal, because non-CMA THP gets no chance to use PCP.
> >> But it
> >> still seems better than causing the failure of CMA allocation.
> >>
> >> Is there a possible approach to avoiding adding CMA THP into pcp from
> >> the first
> >> beginning? Otherwise, we might need a separate PCP for CMA.
> >>
>
> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
> adding CMA THP into pcp may incur a slight performance penalty.
>

But the majority of movable pages aren't CMA, right? Do we have an
estimate of the cost of adding back a CMA THP PCP? Would per_cpu_pages
grow by a new cacheline, which is exactly what having only one PCP list
for THP was meant to avoid [1]?

[1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/


> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
> refer to it.
>
>
> >>> +#else
> >>>                  page = rmqueue_pcplist(preferred_zone, zone, order,
> >>>                                         migratetype, alloc_flags);
> >>>                  if (likely(page))
> >>>                          goto out;
> >>> +#endif
> >>>          }
> >>>
> >>>          page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> >>> --
> >>> 2.7.4
> >>
> >> Thanks
> >> Barry
> >>
>
>
Ge Yang June 18, 2024, 3:31 a.m. UTC | #12
On 2024/6/18 9:55 AM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote:
>>
>>
>>
>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>
>>>
>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>>>>>
>>>>> From: yangge <yangge1116@126.com>
>>>>>
>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
>>>>> the list, in some cases, it's not acceptable, for example, allocating
>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>>>
>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
>>>>> PCP list to avoid the issue above.
>>>>
>>>> Could you please describe the impact on users in the commit log?
>>>
>>> If a large number of CMA memory are configured in the system (for
>>> example, the CMA memory accounts for 50% of the system memory), starting
>>> virtual machine with device passthrough will get stuck.
>>>
>>> During starting virtual machine, it will call pin_user_pages_remote(...,
>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
>>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA
>>> area because of FOLL_LONGTERM flag. If non-movable allocation requests
>>> return CMA memory, pin_user_pages_remote() will enter endless loops.
>>>
>>> backtrace:
>>> pin_user_pages_remote
>>> ----__gup_longterm_locked //cause endless loops in this function
>>> --------__get_user_pages_locked
>>> --------check_and_migrate_movable_pages //always check fail and continue
>>> to migrate
>>> ------------migrate_longterm_unpinnable_pages
>>> ----------------alloc_migration_target // non-movable allocation
>>>
>>>> Is it possible that some CMA memory might be used by non-movable
>>>> allocation requests?
>>>
>>> Yes.
>>>
>>>
>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
>>>> to fail?
>>>
>>>
>>> No, it will cause endless loops in __gup_longterm_locked(). If
>>> non-movable allocation requests return CMA memory,
>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
>>> CMA page, which is useless and cause endless loops in
>>> __gup_longterm_locked().
> 
> This is only one perspective. We also need to consider the impact on
> CMA itself. For example,
> when CMA is borrowed by THP, and we need to reclaim it through
> cma_alloc() or dma_alloc_coherent(),
> we must move those pages out to ensure CMA's users can retrieve that
> contiguous memory.
> 
> Currently, CMA's memory is occupied by non-movable pages, meaning we
> can't relocate them.
> As a result, cma_alloc() is more likely to fail.
> 
>>>
>>> backtrace:
>>> pin_user_pages_remote
>>> ----__gup_longterm_locked //cause endless loops in this function
>>> --------__get_user_pages_locked
>>> --------check_and_migrate_movable_pages //always check fail and continue
>>> to migrate
>>> ------------migrate_longterm_unpinnable_pages
>>>
>>>
>>>
>>>
>>>
>>>>>
>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>> THP-sized allocations")
>>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>>> ---
>>>>>    mm/page_alloc.c | 10 ++++++++++
>>>>>    1 file changed, 10 insertions(+)
>>>>>
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index 2e22ce5..0bdf471 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
>>>>> *preferred_zone,
>>>>>           WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>
>>>>>           if (likely(pcp_allowed_order(order))) {
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
>>>>> ALLOC_CMA ||
>>>>> +                                               order !=
>>>>> HPAGE_PMD_ORDER) {
>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone,
>>>>> order,
>>>>> +                                               migratetype,
>>>>> alloc_flags);
>>>>> +                       if (likely(page))
>>>>> +                               goto out;
>>>>> +               }
>>>>
>>>> This seems not ideal, because non-CMA THP gets no chance to use PCP.
>>>> But it
>>>> still seems better than causing the failure of CMA allocation.
>>>>
>>>> Is there a possible approach to avoiding adding CMA THP into pcp from
>>>> the first
>>>> beginning? Otherwise, we might need a separate PCP for CMA.
>>>>
>>
>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
>> adding CMA THP into pcp may incur a slight performance penalty.
>>
> 
> But the majority of movable pages aren't CMA, right?

> Do we have an estimate for
> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which
> the original intention for THP was to avoid by having only one PCP[1]?
> 
> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
> 

The size of struct per_cpu_pages is 256 bytes in the current code, which
contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations").
crash> struct per_cpu_pages
struct per_cpu_pages {
     spinlock_t lock;
     int count;
     int high;
     int high_min;
     int high_max;
     int batch;
     u8 flags;
     u8 alloc_factor;
     u8 expire;
     short free_count;
     struct list_head lists[13];
}
SIZE: 256

After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP
list for THP-sized allocations"), the size of struct per_cpu_pages is
272 bytes.
crash> struct per_cpu_pages
struct per_cpu_pages {
     spinlock_t lock;
     int count;
     int high;
     int high_min;
     int high_max;
     int batch;
     u8 flags;
     u8 alloc_factor;
     u8 expire;
     short free_count;
     struct list_head lists[15];
}
SIZE: 272

It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations") saves one cacheline.
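
[Editor's note] A quick back-of-envelope check of those numbers (a model
under assumptions: 4-byte spinlock_t, 16-byte list_head, 8-byte alignment for
the list array, 64-byte cachelines): 13 lists give 240 bytes of payload,
padded to 256 by ____cacheline_aligned_in_smp (4 cachelines), while 15 lists
give 272 bytes, spilling into a 5th cacheline.

/* Back-of-envelope size check for struct per_cpu_pages (a model). */
#include <stdio.h>

#define CACHELINE 64

static unsigned int pcp_size(unsigned int nr_lists)
{
	unsigned int hdr = 6 * 4	/* lock, count, high, high_min,
					 * high_max, batch */
			 + 3 * 1 + 2;	/* flags, alloc_factor, expire,
					 * free_count */
	hdr = (hdr + 7) & ~7u;		/* pad so list_head is 8-aligned */
	return hdr + nr_lists * 16;	/* plus the pcp lists */
}

int main(void)
{
	/* 13 lists (one THP list) vs 15 lists (per-migratetype THP lists) */
	for (unsigned int n = 13; n <= 15; n += 2) {
		unsigned int sz = pcp_size(n);
		printf("%u lists: %u bytes, %u cachelines\n", n, sz,
		       (sz + CACHELINE - 1) / CACHELINE);
	}
	return 0;
}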

> 
>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
>> refer to it.
>>
>>
>>>>> +#else
>>>>>                   page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>                                          migratetype, alloc_flags);
>>>>>                   if (likely(page))
>>>>>                           goto out;
>>>>> +#endif
>>>>>           }
>>>>>
>>>>>           page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>> --
>>>>> 2.7.4
>>>>
>>>> Thanks
>>>> Barry
>>>>
>>
>>
Ge Yang June 18, 2024, 3:31 a.m. UTC | #13
On 2024/6/17 7:55 PM, Barry Song wrote:
> On Mon, Jun 17, 2024 at 7:36 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2024/6/17 18:43, Barry Song wrote:
>>> On Thu, Jun 6, 2024 at 3:07 PM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/6/4 20:36, yangge1116 wrote:
>>>>>
>>>>>
>>>>> On 2024/6/4 8:01 PM, Baolin Wang wrote:
>>>>>> Cc Johannes, Zi and Vlastimil.
>>>>>>
>>>>>> On 2024/6/4 17:14, yangge1116@126.com wrote:
>>>>>>> From: yangge <yangge1116@126.com>
>>>>>>>
>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
>>>>>>> the list, in some cases, it's not acceptable, for example, allocating
>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>>>>>
>>>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
>>>>>>> PCP list to avoid the issue above.
>>>>>>>
>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations")
>>>>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>>>>> ---
>>>>>>>     mm/page_alloc.c | 10 ++++++++++
>>>>>>>     1 file changed, 10 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
>>>>>>> *preferred_zone,
>>>>>>>         WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>         if (likely(pcp_allowed_order(order))) {
>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>> +        if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>>>>> +                        order != HPAGE_PMD_ORDER) {
>>>>>>
>>>>>> Seems you will also miss the non-CMA THP from the PCP, so I wonder if
>>>>>> we can add a migratetype comparison in __rmqueue_pcplist(), and if
>>>>>> it's not suitable, then fallback to buddy?
>>>>>
>>>>> Yes, we may miss some non-CMA THPs in the PCP. But, if add a migratetype
>>>>> comparison in __rmqueue_pcplist(), we may need to compare many times
>>>>> because of pcp batch.
>>>>
>>>> I mean we can only compare once, focusing on CMA pages.
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 3734fe7e67c0..960a3b5744d8 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -2973,6 +2973,11 @@ struct page *__rmqueue_pcplist(struct zone *zone,
>>>> unsigned int order,
>>>>                    }
>>>>
>>>>                    page = list_first_entry(list, struct page, pcp_list);
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +               if (order == HPAGE_PMD_ORDER &&
>>>> !is_migrate_movable(migratetype) &&
>>>> +                   is_migrate_cma(get_pageblock_migratetype(page)))
>>>> +                       return NULL;
>>>> +#endif
>>>
>>> This doesn't seem ideal either. It's possible that the PCP still has many
>>> non-CMA folios, but due to bad luck, the first entry is "always" CMA.
>>> In this case,
>>> allocations with is_migrate_movable(migratetype) == false will always lose the
>>> chance to use the PCP.   It also appears to incur a PCP spin lock/unlock.
>>
>> Yes, just some ideas to to mitigate the issue...
>>
>>>
>>> I don't see an ideal solution unless we bring back the CMA PCP :-)
>>
>> Tend to agree, and adding a CMA PCP seems the overhead can be acceptable?
> 
> yes. probably. Hi Ge,
> 
> Could we printk the size before and after adding 1 to NR_PCP_LISTS?
> Does it increase one cacheline?
> 
> struct per_cpu_pages {
> spinlock_t lock; /* Protects lists field */
> int count; /* number of pages in the list */
> int high; /* high watermark, emptying needed */
> int high_min; /* min high watermark */
> int high_max; /* max high watermark */
> int batch; /* chunk size for buddy add/remove */
> u8 flags; /* protected by pcp->lock */
> u8 alloc_factor; /* batch scaling factor during allocate */
> #ifdef CONFIG_NUMA
> u8 expire; /* When 0, remote pagesets are drained */
> #endif
> short free_count; /* consecutive free count */
> 
> /* Lists of pages, one per migrate type stored on the pcp-lists */
> struct list_head lists[NR_PCP_LISTS];
> } ____cacheline_aligned_in_smp;
> 

OK.

The size of struct per_cpu_pages is 256 bytes in the current code, which
contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations").
crash> struct per_cpu_pages
struct per_cpu_pages {
     spinlock_t lock;
     int count;
     int high;
     int high_min;
     int high_max;
     int batch;
     u8 flags;
     u8 alloc_factor;
     u8 expire;
     short free_count;
     struct list_head lists[13];
}
SIZE: 256

After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP
list for THP-sized allocations"), the size of struct per_cpu_pages is
272 bytes.
crash> struct per_cpu_pages
struct per_cpu_pages {
     spinlock_t lock;
     int count;
     int high;
     int high_min;
     int high_max;
     int batch;
     u8 flags;
     u8 alloc_factor;
     u8 expire;
     short free_count;
     struct list_head lists[15];
}
SIZE: 272

It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations") saves one cacheline.
Ge Yang June 18, 2024, 3:40 a.m. UTC | #14
On 2024/6/18 9:55 AM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote:
>>
>>
>>
>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>
>>>
>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>>>>>
>>>>> From: yangge <yangge1116@126.com>
>>>>>
>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
>>>>> the list, in some cases, it's not acceptable, for example, allocating
>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>>>
>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
>>>>> PCP list to avoid the issue above.
>>>>
>>>> Could you please describe the impact on users in the commit log?
>>>
>>> If a large number of CMA memory are configured in the system (for
>>> example, the CMA memory accounts for 50% of the system memory), starting
>>> virtual machine with device passthrough will get stuck.
>>>
>>> During starting virtual machine, it will call pin_user_pages_remote(...,
>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
>>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA
>>> area because of FOLL_LONGTERM flag. If non-movable allocation requests
>>> return CMA memory, pin_user_pages_remote() will enter endless loops.
>>>
>>> backtrace:
>>> pin_user_pages_remote
>>> ----__gup_longterm_locked //cause endless loops in this function
>>> --------__get_user_pages_locked
>>> --------check_and_migrate_movable_pages //always check fail and continue
>>> to migrate
>>> ------------migrate_longterm_unpinnable_pages
>>> ----------------alloc_migration_target // non-movable allocation
>>>
>>>> Is it possible that some CMA memory might be used by non-movable
>>>> allocation requests?
>>>
>>> Yes.
>>>
>>>
>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
>>>> to fail?
>>>
>>>
>>> No, it will cause endless loops in __gup_longterm_locked(). If
>>> non-movable allocation requests return CMA memory,
>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
>>> CMA page, which is useless and cause endless loops in
>>> __gup_longterm_locked().
> 
> This is only one perspective. We also need to consider the impact on
> CMA itself. For example,
> when CMA is borrowed by THP, and we need to reclaim it through
> cma_alloc() or dma_alloc_coherent(),
> we must move those pages out to ensure CMA's users can retrieve that
> contiguous memory.
> 
> Currently, CMA's memory is occupied by non-movable pages, meaning we
> can't relocate them.
> As a result, cma_alloc() is more likely to fail.
> 
>>>
>>> backtrace:
>>> pin_user_pages_remote
>>> ----__gup_longterm_locked //cause endless loops in this function
>>> --------__get_user_pages_locked
>>> --------check_and_migrate_movable_pages //always check fail and continue
>>> to migrate
>>> ------------migrate_longterm_unpinnable_pages
>>>
>>>
>>>
>>>
>>>
>>>>>
>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>> THP-sized allocations")
>>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>>> ---
>>>>>    mm/page_alloc.c | 10 ++++++++++
>>>>>    1 file changed, 10 insertions(+)
>>>>>
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index 2e22ce5..0bdf471 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
>>>>> *preferred_zone,
>>>>>           WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>
>>>>>           if (likely(pcp_allowed_order(order))) {
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
>>>>> ALLOC_CMA ||
>>>>> +                                               order !=
>>>>> HPAGE_PMD_ORDER) {
>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone,
>>>>> order,
>>>>> +                                               migratetype,
>>>>> alloc_flags);
>>>>> +                       if (likely(page))
>>>>> +                               goto out;
>>>>> +               }
>>>>
>>>> This seems not ideal, because non-CMA THP gets no chance to use PCP.
>>>> But it
>>>> still seems better than causing the failure of CMA allocation.
>>>>
>>>> Is there a possible approach to avoiding adding CMA THP into pcp from
>>>> the first
>>>> beginning? Otherwise, we might need a separate PCP for CMA.
>>>>
>>
>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
>> adding CMA THP into pcp may incur a slight performance penalty.
>>
> 
> But the majority of movable pages aren't CMA, right?

Yes.

> Do we have an estimate for
> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which
> the original intention for THP was to avoid by having only one PCP[1]?
> 
> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
> 
> 
>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
>> refer to it.
>>
>>
>>>>> +#else
>>>>>                   page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>                                          migratetype, alloc_flags);
>>>>>                   if (likely(page))
>>>>>                           goto out;
>>>>> +#endif
>>>>>           }
>>>>>
>>>>>           page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>> --
>>>>> 2.7.4
>>>>
>>>> Thanks
>>>> Barry
>>>>
>>
>>
Barry Song June 18, 2024, 4:10 a.m. UTC | #15
On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote:
>
>
>
> On 2024/6/18 9:55 AM, Barry Song wrote:
> > On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote:
> >>
> >>
> >>
> >> On 2024/6/17 8:47 PM, yangge1116 wrote:
> >>>
> >>>
> >>> On 2024/6/17 6:26 PM, Barry Song wrote:
> >>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
> >>>>>
> >>>>> From: yangge <yangge1116@126.com>
> >>>>>
> >>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>> THP-sized allocations") no longer differentiates the migration type
> >>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
> >>>>> the list, in some cases, it's not acceptable, for example, allocating
> >>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
> >>>>>
> >>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
> >>>>> PCP list to avoid the issue above.
> >>>>
> >>>> Could you please describe the impact on users in the commit log?
> >>>
> >>> If a large number of CMA memory are configured in the system (for
> >>> example, the CMA memory accounts for 50% of the system memory), starting
> >>> virtual machine with device passthrough will get stuck.
> >>>
> >>> During starting virtual machine, it will call pin_user_pages_remote(...,
> >>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
> >>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA
> >>> area because of FOLL_LONGTERM flag. If non-movable allocation requests
> >>> return CMA memory, pin_user_pages_remote() will enter endless loops.
> >>>
> >>> backtrace:
> >>> pin_user_pages_remote
> >>> ----__gup_longterm_locked //cause endless loops in this function
> >>> --------__get_user_pages_locked
> >>> --------check_and_migrate_movable_pages //always check fail and continue
> >>> to migrate
> >>> ------------migrate_longterm_unpinnable_pages
> >>> ----------------alloc_migration_target // non-movable allocation
> >>>
> >>>> Is it possible that some CMA memory might be used by non-movable
> >>>> allocation requests?
> >>>
> >>> Yes.
> >>>
> >>>
> >>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
> >>>> to fail?
> >>>
> >>>
> >>> No, it will cause endless loops in __gup_longterm_locked(). If
> >>> non-movable allocation requests return CMA memory,
> >>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
> >>> CMA page, which is useless and cause endless loops in
> >>> __gup_longterm_locked().
> >
> > This is only one perspective. We also need to consider the impact on
> > CMA itself. For example,
> > when CMA is borrowed by THP, and we need to reclaim it through
> > cma_alloc() or dma_alloc_coherent(),
> > we must move those pages out to ensure CMA's users can retrieve that
> > contiguous memory.
> >
> > Currently, CMA's memory is occupied by non-movable pages, meaning we
> > can't relocate them.
> > As a result, cma_alloc() is more likely to fail.
> >
> >>>
> >>> backtrace:
> >>> pin_user_pages_remote
> >>> ----__gup_longterm_locked //cause endless loops in this function
> >>> --------__get_user_pages_locked
> >>> --------check_and_migrate_movable_pages //always check fail and continue
> >>> to migrate
> >>> ------------migrate_longterm_unpinnable_pages
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>>>
> >>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>> THP-sized allocations")
> >>>>> Signed-off-by: yangge <yangge1116@126.com>
> >>>>> ---
> >>>>>    mm/page_alloc.c | 10 ++++++++++
> >>>>>    1 file changed, 10 insertions(+)
> >>>>>
> >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>> index 2e22ce5..0bdf471 100644
> >>>>> --- a/mm/page_alloc.c
> >>>>> +++ b/mm/page_alloc.c
> >>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
> >>>>> *preferred_zone,
> >>>>>           WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >>>>>
> >>>>>           if (likely(pcp_allowed_order(order))) {
> >>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
> >>>>> ALLOC_CMA ||
> >>>>> +                                               order !=
> >>>>> HPAGE_PMD_ORDER) {
> >>>>> +                       page = rmqueue_pcplist(preferred_zone, zone,
> >>>>> order,
> >>>>> +                                               migratetype,
> >>>>> alloc_flags);
> >>>>> +                       if (likely(page))
> >>>>> +                               goto out;
> >>>>> +               }
> >>>>
> >>>> This seems not ideal, because non-CMA THP gets no chance to use PCP.
> >>>> But it
> >>>> still seems better than causing the failure of CMA allocation.
> >>>>
> >>>> Is there a possible approach to avoiding adding CMA THP into pcp from
> >>>> the first
> >>>> beginning? Otherwise, we might need a separate PCP for CMA.
> >>>>
> >>
> >> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
> >> adding CMA THP into pcp may incur a slight performance penalty.
> >>
> >
> > But the majority of movable pages aren't CMA, right?
>
> > Do we have an estimate for
> > adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which
> > the original intention for THP was to avoid by having only one PCP[1]?
> >
> > [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
> >
>
> The size of struct per_cpu_pages is 256 bytes in current code containing
> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
> allocations").
> crash> struct per_cpu_pages
> struct per_cpu_pages {
>      spinlock_t lock;
>      int count;
>      int high;
>      int high_min;
>      int high_max;
>      int batch;
>      u8 flags;
>      u8 alloc_factor;
>      u8 expire;
>      short free_count;
>      struct list_head lists[13];
> }
> SIZE: 256
>
> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
> for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes.
> crash> struct per_cpu_pages
> struct per_cpu_pages {
>      spinlock_t lock;
>      int count;
>      int high;
>      int high_min;
>      int high_max;
>      int batch;
>      u8 flags;
>      u8 alloc_factor;
>      u8 expire;
>      short free_count;
>      struct list_head lists[15];
> }
> SIZE: 272
>
> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> THP-sized allocations") decrease one cacheline.

The proposal is not to revert the patch but to add one CMA PCP list,
so it would be "struct list_head lists[14]"; in that case, would the size
still be 256?


>
> >
> >> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
> >> refer to it.
> >>
> >>
> >>>>> +#else
> >>>>>                   page = rmqueue_pcplist(preferred_zone, zone, order,
> >>>>>                                          migratetype, alloc_flags);
> >>>>>                   if (likely(page))
> >>>>>                           goto out;
> >>>>> +#endif
> >>>>>           }
> >>>>>
> >>>>>           page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> >>>>> --
> >>>>> 2.7.4
> >>>>
> >>>> Thanks
> >>>> Barry
> >>>>
> >>
> >>
>
Ge Yang June 18, 2024, 5:49 a.m. UTC | #16
On 2024/6/18 12:10 PM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote:
>>
>>
>>
>> On 2024/6/18 9:55 AM, Barry Song wrote:
>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>>>
>>>>>
>>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>>>>>>>
>>>>>>> From: yangge <yangge1116@126.com>
>>>>>>>
>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
>>>>>>> the list, in some cases, it's not acceptable, for example, allocating
>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>>>>>
>>>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
>>>>>>> PCP list to avoid the issue above.
>>>>>>
>>>>>> Could you please describe the impact on users in the commit log?
>>>>>
>>>>> If a large number of CMA memory are configured in the system (for
>>>>> example, the CMA memory accounts for 50% of the system memory), starting
>>>>> virtual machine with device passthrough will get stuck.
>>>>>
>>>>> During starting virtual machine, it will call pin_user_pages_remote(...,
>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
>>>>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA
>>>>> area because of FOLL_LONGTERM flag. If non-movable allocation requests
>>>>> return CMA memory, pin_user_pages_remote() will enter endless loops.
>>>>>
>>>>> backtrace:
>>>>> pin_user_pages_remote
>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>> --------__get_user_pages_locked
>>>>> --------check_and_migrate_movable_pages //always check fail and continue
>>>>> to migrate
>>>>> ------------migrate_longterm_unpinnable_pages
>>>>> ----------------alloc_migration_target // non-movable allocation
>>>>>
>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>> allocation requests?
>>>>>
>>>>> Yes.
>>>>>
>>>>>
>>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
>>>>>> to fail?
>>>>>
>>>>>
>>>>> No, it will cause endless loops in __gup_longterm_locked(). If
>>>>> non-movable allocation requests return CMA memory,
>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
>>>>> CMA page, which is useless and cause endless loops in
>>>>> __gup_longterm_locked().
>>>
>>> This is only one perspective. We also need to consider the impact on
>>> CMA itself. For example,
>>> when CMA is borrowed by THP, and we need to reclaim it through
>>> cma_alloc() or dma_alloc_coherent(),
>>> we must move those pages out to ensure CMA's users can retrieve that
>>> contiguous memory.
>>>
>>> Currently, CMA's memory is occupied by non-movable pages, meaning we
>>> can't relocate them.
>>> As a result, cma_alloc() is more likely to fail.
>>>
>>>>>
>>>>> backtrace:
>>>>> pin_user_pages_remote
>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>> --------__get_user_pages_locked
>>>>> --------check_and_migrate_movable_pages //always check fail and continue
>>>>> to migrate
>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations")
>>>>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>>>>> ---
>>>>>>>     mm/page_alloc.c | 10 ++++++++++
>>>>>>>     1 file changed, 10 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
>>>>>>> *preferred_zone,
>>>>>>>            WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>
>>>>>>>            if (likely(pcp_allowed_order(order))) {
>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
>>>>>>> ALLOC_CMA ||
>>>>>>> +                                               order !=
>>>>>>> HPAGE_PMD_ORDER) {
>>>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone,
>>>>>>> order,
>>>>>>> +                                               migratetype,
>>>>>>> alloc_flags);
>>>>>>> +                       if (likely(page))
>>>>>>> +                               goto out;
>>>>>>> +               }
>>>>>>
>>>>>> This seems not ideal, because non-CMA THP gets no chance to use PCP.
>>>>>> But it
>>>>>> still seems better than causing the failure of CMA allocation.
>>>>>>
>>>>>> Is there a possible approach to avoiding adding CMA THP into pcp from
>>>>>> the first
>>>>>> beginning? Otherwise, we might need a separate PCP for CMA.
>>>>>>
>>>>
>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
>>>> adding CMA THP into pcp may incur a slight performance penalty.
>>>>
>>>
>>> But the majority of movable pages aren't CMA, right?
>>
>>> Do we have an estimate for
>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which
>>> the original intention for THP was to avoid by having only one PCP[1]?
>>>
>>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>>
>>
>> The size of struct per_cpu_pages is 256 bytes in current code containing
>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
>> allocations").
>> crash> struct per_cpu_pages
>> struct per_cpu_pages {
>>       spinlock_t lock;
>>       int count;
>>       int high;
>>       int high_min;
>>       int high_max;
>>       int batch;
>>       u8 flags;
>>       u8 alloc_factor;
>>       u8 expire;
>>       short free_count;
>>       struct list_head lists[13];
>> }
>> SIZE: 256
>>
>> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
>> for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes.
>> crash> struct per_cpu_pages
>> struct per_cpu_pages {
>>       spinlock_t lock;
>>       int count;
>>       int high;
>>       int high_min;
>>       int high_max;
>>       int batch;
>>       u8 flags;
>>       u8 alloc_factor;
>>       u8 expire;
>>       short free_count;
>>       struct list_head lists[15];
>> }
>> SIZE: 272
>>
>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>> THP-sized allocations") decrease one cacheline.
> 
> The proposal is not to revert the patch but to add one CMA PCP list,
> so it would be "struct list_head lists[14]"; in that case, would the size
> still be 256?
> 

Yes, if we only add one CMA PCP list, the size is still 256 bytes. Adding
one CMA PCP list seems more reasonable.
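
For concreteness, here is a small userspace sketch of that size arithmetic.
It is illustrative only: the 32-byte fixed-field overhead, the 16-byte
struct list_head, and the 64-byte cacheline are assumptions inferred from
the struct dumps quoted above, not values taken from a real build.

#include <stdio.h>
#include <stddef.h>

int main(void)
{
	/* Assumed sizes for a 64-bit build without lock debugging. */
	const size_t list_head = 16;	/* two pointers */
	const size_t fixed = 32;	/* lock ... free_count, padded */
	const size_t cacheline = 64;

	for (int nr_lists = 13; nr_lists <= 15; nr_lists++) {
		size_t raw = fixed + nr_lists * list_head;
		size_t aligned = (raw + cacheline - 1) & ~(cacheline - 1);

		printf("%2d lists: %3zu bytes raw, %3zu aligned\n",
		       nr_lists, raw, aligned);
	}
	/* 13 lists -> 240 raw, 256 aligned (4 cachelines);
	 * 14 lists -> 256 raw, exactly 4 cachelines, no growth;
	 * 15 lists -> 272 raw, spilling into a 5th cacheline.
	 * So one extra THP list fits in the existing footprint. */
	return 0;
}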

> 
>>
>>>
>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
>>>> refer to it.
>>>>
>>>>
>>>>>>> +#else
>>>>>>>                    page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>                                           migratetype, alloc_flags);
>>>>>>>                    if (likely(page))
>>>>>>>                            goto out;
>>>>>>> +#endif
>>>>>>>            }
>>>>>>>
>>>>>>>            page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>
>>>>>> Thanks
>>>>>> Barry
>>>>>>
>>>>
>>>>
>>
Ge Yang June 18, 2024, 6:55 a.m. UTC | #17
On 2024/6/18 12:10 PM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote:
>>
>>
>>
>> On 2024/6/18 9:55 AM, Barry Song wrote:
>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>>>
>>>>>
>>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>>>>>>>
>>>>>>> From: yangge <yangge1116@126.com>
>>>>>>>
>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
>>>>>>> the list, in some cases, it's not acceptable, for example, allocating
>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>>>>>
>>>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
>>>>>>> PCP list to avoid the issue above.
>>>>>>
>>>>>> Could you please describe the impact on users in the commit log?
>>>>>
>>>>> If a large number of CMA memory are configured in the system (for
>>>>> example, the CMA memory accounts for 50% of the system memory), starting
>>>>> virtual machine with device passthrough will get stuck.
>>>>>
>>>>> During starting virtual machine, it will call pin_user_pages_remote(...,
>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
>>>>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA
>>>>> area because of FOLL_LONGTERM flag. If non-movable allocation requests
>>>>> return CMA memory, pin_user_pages_remote() will enter endless loops.
>>>>>
>>>>> backtrace:
>>>>> pin_user_pages_remote
>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>> --------__get_user_pages_locked
>>>>> --------check_and_migrate_movable_pages //always check fail and continue
>>>>> to migrate
>>>>> ------------migrate_longterm_unpinnable_pages
>>>>> ----------------alloc_migration_target // non-movable allocation
>>>>>
>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>> allocation requests?
>>>>>
>>>>> Yes.
>>>>>
>>>>>
>>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
>>>>>> to fail?
>>>>>
>>>>>
>>>>> No, it will cause endless loops in __gup_longterm_locked(). If
>>>>> non-movable allocation requests return CMA memory,
>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
>>>>> CMA page, which is useless and cause endless loops in
>>>>> __gup_longterm_locked().
>>>
>>> This is only one perspective. We also need to consider the impact on
>>> CMA itself. For example,
>>> when CMA is borrowed by THP, and we need to reclaim it through
>>> cma_alloc() or dma_alloc_coherent(),
>>> we must move those pages out to ensure CMA's users can retrieve that
>>> contiguous memory.
>>>
>>> Currently, CMA's memory is occupied by non-movable pages, meaning we
>>> can't relocate them.
>>> As a result, cma_alloc() is more likely to fail.
>>>
>>>>>
>>>>> backtrace:
>>>>> pin_user_pages_remote
>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>> --------__get_user_pages_locked
>>>>> --------check_and_migrate_movable_pages //always check fail and continue
>>>>> to migrate
>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations")
>>>>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>>>>> ---
>>>>>>>     mm/page_alloc.c | 10 ++++++++++
>>>>>>>     1 file changed, 10 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
>>>>>>> *preferred_zone,
>>>>>>>            WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>
>>>>>>>            if (likely(pcp_allowed_order(order))) {
>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
>>>>>>> ALLOC_CMA ||
>>>>>>> +                                               order !=
>>>>>>> HPAGE_PMD_ORDER) {
>>>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone,
>>>>>>> order,
>>>>>>> +                                               migratetype,
>>>>>>> alloc_flags);
>>>>>>> +                       if (likely(page))
>>>>>>> +                               goto out;
>>>>>>> +               }
>>>>>>
>>>>>> This seems not ideal, because non-CMA THP gets no chance to use PCP.
>>>>>> But it
>>>>>> still seems better than causing the failure of CMA allocation.
>>>>>>
>>>>>> Is there a possible approach to avoiding adding CMA THP into pcp from
>>>>>> the first
>>>>>> beginning? Otherwise, we might need a separate PCP for CMA.
>>>>>>
>>>>
>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
>>>> adding CMA THP into pcp may incur a slight performance penalty.
>>>>
>>>
>>> But the majority of movable pages aren't CMA, right?
>>
>>> Do we have an estimate for
>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which
>>> the original intention for THP was to avoid by having only one PCP[1]?
>>>
>>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>>
>>
>> The size of struct per_cpu_pages is 256 bytes in current code containing
>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
>> allocations").
>> crash> struct per_cpu_pages
>> struct per_cpu_pages {
>>       spinlock_t lock;
>>       int count;
>>       int high;
>>       int high_min;
>>       int high_max;
>>       int batch;
>>       u8 flags;
>>       u8 alloc_factor;
>>       u8 expire;
>>       short free_count;
>>       struct list_head lists[13];
>> }
>> SIZE: 256
>>
>> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
>> for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes.
>> crash> struct per_cpu_pages
>> struct per_cpu_pages {
>>       spinlock_t lock;
>>       int count;
>>       int high;
>>       int high_min;
>>       int high_max;
>>       int batch;
>>       u8 flags;
>>       u8 alloc_factor;
>>       u8 expire;
>>       short free_count;
>>       struct list_head lists[15];
>> }
>> SIZE: 272
>>
>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>> THP-sized allocations") decrease one cacheline.
> 
> The proposal is not to revert the patch but to add one CMA PCP list,
> so it would be "struct list_head lists[14]"; in that case, would the size
> still be 256?
> 

Yes, the size is still 256 bytes. If we add one PCP list, we will have two
PCP lists for THP: one used by MIGRATE_UNMOVABLE, and the other used by
MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right?

> 
>>
>>>
>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
>>>> refer to it.
>>>>
>>>>
>>>>>>> +#else
>>>>>>>                    page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>                                           migratetype, alloc_flags);
>>>>>>>                    if (likely(page))
>>>>>>>                            goto out;
>>>>>>> +#endif
>>>>>>>            }
>>>>>>>
>>>>>>>            page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>
>>>>>> Thanks
>>>>>> Barry
>>>>>>
>>>>
>>>>
>>
Barry Song June 18, 2024, 6:58 a.m. UTC | #18
On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@126.com> wrote:
>
>
>
> On 2024/6/18 12:10 PM, Barry Song wrote:
> > On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote:
> >>
> >>
> >>
> >> On 2024/6/18 9:55 AM, Barry Song wrote:
> >>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
> >>>>>
> >>>>>
> >>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
> >>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
> >>>>>>>
> >>>>>>> From: yangge <yangge1116@126.com>
> >>>>>>>
> >>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>>>> THP-sized allocations") no longer differentiates the migration type
> >>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
> >>>>>>> the list, in some cases, it's not acceptable, for example, allocating
> >>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
> >>>>>>>
> >>>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
> >>>>>>> PCP list to avoid the issue above.
> >>>>>>
> >>>>>> Could you please describe the impact on users in the commit log?
> >>>>>
> >>>>> If a large number of CMA memory are configured in the system (for
> >>>>> example, the CMA memory accounts for 50% of the system memory), starting
> >>>>> virtual machine with device passthrough will get stuck.
> >>>>>
> >>>>> During starting virtual machine, it will call pin_user_pages_remote(...,
> >>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
> >>>>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA
> >>>>> area because of FOLL_LONGTERM flag. If non-movable allocation requests
> >>>>> return CMA memory, pin_user_pages_remote() will enter endless loops.
> >>>>>
> >>>>> backtrace:
> >>>>> pin_user_pages_remote
> >>>>> ----__gup_longterm_locked //cause endless loops in this function
> >>>>> --------__get_user_pages_locked
> >>>>> --------check_and_migrate_movable_pages //always check fail and continue
> >>>>> to migrate
> >>>>> ------------migrate_longterm_unpinnable_pages
> >>>>> ----------------alloc_migration_target // non-movable allocation
> >>>>>
> >>>>>> Is it possible that some CMA memory might be used by non-movable
> >>>>>> allocation requests?
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>>
> >>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
> >>>>>> to fail?
> >>>>>
> >>>>>
> >>>>> No, it will cause endless loops in __gup_longterm_locked(). If
> >>>>> non-movable allocation requests return CMA memory,
> >>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
> >>>>> CMA page, which is useless and cause endless loops in
> >>>>> __gup_longterm_locked().
> >>>
> >>> This is only one perspective. We also need to consider the impact on
> >>> CMA itself. For example,
> >>> when CMA is borrowed by THP, and we need to reclaim it through
> >>> cma_alloc() or dma_alloc_coherent(),
> >>> we must move those pages out to ensure CMA's users can retrieve that
> >>> contiguous memory.
> >>>
> >>> Currently, CMA's memory is occupied by non-movable pages, meaning we
> >>> can't relocate them.
> >>> As a result, cma_alloc() is more likely to fail.
> >>>
> >>>>>
> >>>>> backtrace:
> >>>>> pin_user_pages_remote
> >>>>> ----__gup_longterm_locked //cause endless loops in this function
> >>>>> --------__get_user_pages_locked
> >>>>> --------check_and_migrate_movable_pages //always check fail and continue
> >>>>> to migrate
> >>>>> ------------migrate_longterm_unpinnable_pages
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>>>> THP-sized allocations")
> >>>>>>> Signed-off-by: yangge <yangge1116@126.com>
> >>>>>>> ---
> >>>>>>>     mm/page_alloc.c | 10 ++++++++++
> >>>>>>>     1 file changed, 10 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>>>> index 2e22ce5..0bdf471 100644
> >>>>>>> --- a/mm/page_alloc.c
> >>>>>>> +++ b/mm/page_alloc.c
> >>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
> >>>>>>> *preferred_zone,
> >>>>>>>            WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >>>>>>>
> >>>>>>>            if (likely(pcp_allowed_order(order))) {
> >>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
> >>>>>>> ALLOC_CMA ||
> >>>>>>> +                                               order !=
> >>>>>>> HPAGE_PMD_ORDER) {
> >>>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone,
> >>>>>>> order,
> >>>>>>> +                                               migratetype,
> >>>>>>> alloc_flags);
> >>>>>>> +                       if (likely(page))
> >>>>>>> +                               goto out;
> >>>>>>> +               }
> >>>>>>
> >>>>>> This seems not ideal, because non-CMA THP gets no chance to use PCP.
> >>>>>> But it
> >>>>>> still seems better than causing the failure of CMA allocation.
> >>>>>>
> >>>>>> Is there a possible approach to avoiding adding CMA THP into pcp from
> >>>>>> the first
> >>>>>> beginning? Otherwise, we might need a separate PCP for CMA.
> >>>>>>
> >>>>
> >>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
> >>>> adding CMA THP into pcp may incur a slight performance penalty.
> >>>>
> >>>
> >>> But the majority of movable pages aren't CMA, right?
> >>
> >>> Do we have an estimate for
> >>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which
> >>> the original intention for THP was to avoid by having only one PCP[1]?
> >>>
> >>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
> >>>
> >>
> >> The size of struct per_cpu_pages is 256 bytes in current code containing
> >> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
> >> allocations").
> >> crash> struct per_cpu_pages
> >> struct per_cpu_pages {
> >>       spinlock_t lock;
> >>       int count;
> >>       int high;
> >>       int high_min;
> >>       int high_max;
> >>       int batch;
> >>       u8 flags;
> >>       u8 alloc_factor;
> >>       u8 expire;
> >>       short free_count;
> >>       struct list_head lists[13];
> >> }
> >> SIZE: 256
> >>
> >> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
> >> for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes.
> >> crash> struct per_cpu_pages
> >> struct per_cpu_pages {
> >>       spinlock_t lock;
> >>       int count;
> >>       int high;
> >>       int high_min;
> >>       int high_max;
> >>       int batch;
> >>       u8 flags;
> >>       u8 alloc_factor;
> >>       u8 expire;
> >>       short free_count;
> >>       struct list_head lists[15];
> >> }
> >> SIZE: 272
> >>
> >> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >> THP-sized allocations") decrease one cacheline.
> >
> > The proposal is not to revert the patch but to add one CMA PCP list,
> > so it would be "struct list_head lists[14]"; in that case, would the size
> > still be 256?
> >
>
> Yes, the size is still 256 bytes. If we add one PCP list, we will have two
> PCP lists for THP: one used by MIGRATE_UNMOVABLE, and the other used by
> MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right?

I am not quite sure about MIGRATE_RECLAIMABLE, since we want CMA to be
used only by movable allocations. So the split might be:
MOVABLE and NON-MOVABLE.

>
> >
> >>
> >>>
> >>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
> >>>> refer to it.
> >>>>
> >>>>
> >>>>>>> +#else
> >>>>>>>                    page = rmqueue_pcplist(preferred_zone, zone, order,
> >>>>>>>                                           migratetype, alloc_flags);
> >>>>>>>                    if (likely(page))
> >>>>>>>                            goto out;
> >>>>>>> +#endif
> >>>>>>>            }
> >>>>>>>
> >>>>>>>            page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> >>>>>>> --
> >>>>>>> 2.7.4
> >>>>>>
> >>>>>> Thanks
> >>>>>> Barry
> >>>>>>
> >>>>
> >>>>
> >>
>
Ge Yang June 18, 2024, 7:51 a.m. UTC | #19
On 2024/6/18 2:58 PM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@126.com> wrote:
>>
>>
>>
>> On 2024/6/18 12:10 PM, Barry Song wrote:
>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/6/18 9:55 AM, Barry Song wrote:
>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>>>>>>>>>
>>>>>>>>> From: yangge <yangge1116@126.com>
>>>>>>>>>
>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
>>>>>>>>> the list, in some cases, it's not acceptable, for example, allocating
>>>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>>>>>>>
>>>>>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
>>>>>>>>> PCP list to avoid the issue above.
>>>>>>>>
>>>>>>>> Could you please describe the impact on users in the commit log?
>>>>>>>
>>>>>>> If a large number of CMA memory are configured in the system (for
>>>>>>> example, the CMA memory accounts for 50% of the system memory), starting
>>>>>>> virtual machine with device passthrough will get stuck.
>>>>>>>
>>>>>>> During starting virtual machine, it will call pin_user_pages_remote(...,
>>>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
>>>>>>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA
>>>>>>> area because of FOLL_LONGTERM flag. If non-movable allocation requests
>>>>>>> return CMA memory, pin_user_pages_remote() will enter endless loops.
>>>>>>>
>>>>>>> backtrace:
>>>>>>> pin_user_pages_remote
>>>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>>>> --------__get_user_pages_locked
>>>>>>> --------check_and_migrate_movable_pages //always check fail and continue
>>>>>>> to migrate
>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>> ----------------alloc_migration_target // non-movable allocation
>>>>>>>
>>>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>>>> allocation requests?
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>
>>>>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
>>>>>>>> to fail?
>>>>>>>
>>>>>>>
>>>>>>> No, it will cause endless loops in __gup_longterm_locked(). If
>>>>>>> non-movable allocation requests return CMA memory,
>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
>>>>>>> CMA page, which is useless and cause endless loops in
>>>>>>> __gup_longterm_locked().
>>>>>
>>>>> This is only one perspective. We also need to consider the impact on
>>>>> CMA itself. For example,
>>>>> when CMA is borrowed by THP, and we need to reclaim it through
>>>>> cma_alloc() or dma_alloc_coherent(),
>>>>> we must move those pages out to ensure CMA's users can retrieve that
>>>>> contiguous memory.
>>>>>
>>>>> Currently, CMA's memory is occupied by non-movable pages, meaning we
>>>>> can't relocate them.
>>>>> As a result, cma_alloc() is more likely to fail.
>>>>>
>>>>>>>
>>>>>>> backtrace:
>>>>>>> pin_user_pages_remote
>>>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>>>> --------__get_user_pages_locked
>>>>>>> --------check_and_migrate_movable_pages //always check fail and continue
>>>>>>> to migrate
>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>>>> THP-sized allocations")
>>>>>>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>>>>>>> ---
>>>>>>>>>      mm/page_alloc.c | 10 ++++++++++
>>>>>>>>>      1 file changed, 10 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
>>>>>>>>> *preferred_zone,
>>>>>>>>>             WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>>>
>>>>>>>>>             if (likely(pcp_allowed_order(order))) {
>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
>>>>>>>>> ALLOC_CMA ||
>>>>>>>>> +                                               order !=
>>>>>>>>> HPAGE_PMD_ORDER) {
>>>>>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone,
>>>>>>>>> order,
>>>>>>>>> +                                               migratetype,
>>>>>>>>> alloc_flags);
>>>>>>>>> +                       if (likely(page))
>>>>>>>>> +                               goto out;
>>>>>>>>> +               }
>>>>>>>>
>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use PCP.
>>>>>>>> But it
>>>>>>>> still seems better than causing the failure of CMA allocation.
>>>>>>>>
>>>>>>>> Is there a possible approach to avoiding adding CMA THP into pcp from
>>>>>>>> the first
>>>>>>>> beginning? Otherwise, we might need a separate PCP for CMA.
>>>>>>>>
>>>>>>
>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
>>>>>> adding CMA THP into pcp may incur a slight performance penalty.
>>>>>>
>>>>>
>>>>> But the majority of movable pages aren't CMA, right?
>>>>
>>>>> Do we have an estimate for
>>>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which
>>>>> the original intention for THP was to avoid by having only one PCP[1]?
>>>>>
>>>>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>>>>
>>>>
>>>> The size of struct per_cpu_pages is 256 bytes in current code containing
>>>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
>>>> allocations").
>>>> crash> struct per_cpu_pages
>>>> struct per_cpu_pages {
>>>>        spinlock_t lock;
>>>>        int count;
>>>>        int high;
>>>>        int high_min;
>>>>        int high_max;
>>>>        int batch;
>>>>        u8 flags;
>>>>        u8 alloc_factor;
>>>>        u8 expire;
>>>>        short free_count;
>>>>        struct list_head lists[13];
>>>> }
>>>> SIZE: 256
>>>>
>>>> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
>>>> for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes.
>>>> crash> struct per_cpu_pages
>>>> struct per_cpu_pages {
>>>>        spinlock_t lock;
>>>>        int count;
>>>>        int high;
>>>>        int high_min;
>>>>        int high_max;
>>>>        int batch;
>>>>        u8 flags;
>>>>        u8 alloc_factor;
>>>>        u8 expire;
>>>>        short free_count;
>>>>        struct list_head lists[15];
>>>> }
>>>> SIZE: 272
>>>>
>>>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>> THP-sized allocations") decrease one cacheline.
>>>
>>> The proposal is not to revert the patch but to add one CMA PCP list,
>>> so it would be "struct list_head lists[14]"; in that case, would the size
>>> still be 256?
>>>
>>
>> Yes, the size is still 256 bytes. If we add one PCP list, we will have two
>> PCP lists for THP: one used by MIGRATE_UNMOVABLE, and the other used by
>> MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right?
> 
> I am not quite sure about MIGRATE_RECLAIMABLE, since we want CMA to be
> used only by movable allocations. So the split might be:
> MOVABLE and NON-MOVABLE.

Having one PCP list for unmovable pages and the other for movable pages
seems feasible. The unmovable PCP list would contain MIGRATE_UNMOVABLE and
MIGRATE_RECLAIMABLE pages, and the movable PCP list would contain
MIGRATE_MOVABLE pages.

> 
>>
>>>
>>>>
>>>>>
>>>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
>>>>>> refer to it.
>>>>>>
>>>>>>
>>>>>>>>> +#else
>>>>>>>>>                     page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>>                                            migratetype, alloc_flags);
>>>>>>>>>                     if (likely(page))
>>>>>>>>>                             goto out;
>>>>>>>>> +#endif
>>>>>>>>>             }
>>>>>>>>>
>>>>>>>>>             page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>>>>>> --
>>>>>>>>> 2.7.4
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Barry
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>
Ge Yang June 19, 2024, 5:34 a.m. UTC | #20
On 2024/6/18 15:51, yangge1116 wrote:
> 
> 
> On 2024/6/18 2:58 PM, Barry Song wrote:
>> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@126.com> wrote:
>>>
>>>
>>>
>>> On 2024/6/18 12:10 PM, Barry Song wrote:
>>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/6/18 9:55 AM, Barry Song wrote:
>>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> 
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>>>>>>>>>>
>>>>>>>>>> From: yangge <yangge1116@126.com>
>>>>>>>>>>
>>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP 
>>>>>>>>>> list for
>>>>>>>>>> THP-sized allocations") no longer differentiates the migration 
>>>>>>>>>> type
>>>>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA 
>>>>>>>>>> page from
>>>>>>>>>> the list, in some cases, it's not acceptable, for example, 
>>>>>>>>>> allocating
>>>>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>>>>>>>>
>>>>>>>>>> The patch forbids allocating non-CMA THP-sized page from 
>>>>>>>>>> THP-sized
>>>>>>>>>> PCP list to avoid the issue above.
>>>>>>>>>
>>>>>>>>> Could you please describe the impact on users in the commit log?
>>>>>>>>
>>>>>>>> If a large number of CMA memory are configured in the system (for
>>>>>>>> example, the CMA memory accounts for 50% of the system memory), 
>>>>>>>> starting
>>>>>>>> virtual machine with device passthrough will get stuck.
>>>>>>>>
>>>>>>>> During starting virtual machine, it will call 
>>>>>>>> pin_user_pages_remote(...,
>>>>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
>>>>>>>> pin_user_pages_remote() will migrate the page from CMA area to 
>>>>>>>> non-CMA
>>>>>>>> area because of FOLL_LONGTERM flag. If non-movable allocation 
>>>>>>>> requests
>>>>>>>> return CMA memory, pin_user_pages_remote() will enter endless 
>>>>>>>> loops.
>>>>>>>>
>>>>>>>> backtrace:
>>>>>>>> pin_user_pages_remote
>>>>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>>>>> --------__get_user_pages_locked
>>>>>>>> --------check_and_migrate_movable_pages //always check fail and 
>>>>>>>> continue
>>>>>>>> to migrate
>>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>> ----------------alloc_migration_target // non-movable allocation
>>>>>>>>
>>>>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>>>>> allocation requests?
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>
>>>>>>>>> If so, will CMA somehow become unable to migrate, causing 
>>>>>>>>> cma_alloc()
>>>>>>>>> to fail?
>>>>>>>>
>>>>>>>>
>>>>>>>> No, it will cause endless loops in __gup_longterm_locked(). If
>>>>>>>> non-movable allocation requests return CMA memory,
>>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to 
>>>>>>>> another
>>>>>>>> CMA page, which is useless and cause endless loops in
>>>>>>>> __gup_longterm_locked().
>>>>>>
>>>>>> This is only one perspective. We also need to consider the impact on
>>>>>> CMA itself. For example,
>>>>>> when CMA is borrowed by THP, and we need to reclaim it through
>>>>>> cma_alloc() or dma_alloc_coherent(),
>>>>>> we must move those pages out to ensure CMA's users can retrieve that
>>>>>> contiguous memory.
>>>>>>
>>>>>> Currently, CMA's memory is occupied by non-movable pages, meaning we
>>>>>> can't relocate them.
>>>>>> As a result, cma_alloc() is more likely to fail.
>>>>>>
>>>>>>>>
>>>>>>>> backtrace:
>>>>>>>> pin_user_pages_remote
>>>>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>>>>> --------__get_user_pages_locked
>>>>>>>> --------check_and_migrate_movable_pages //always check fail and 
>>>>>>>> continue
>>>>>>>> to migrate
>>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>>>>> THP-sized allocations")
>>>>>>>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>>>>>>>> ---
>>>>>>>>>>      mm/page_alloc.c | 10 ++++++++++
>>>>>>>>>>      1 file changed, 10 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
>>>>>>>>>> *preferred_zone,
>>>>>>>>>>             WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order 
>>>>>>>>>> > 1));
>>>>>>>>>>
>>>>>>>>>>             if (likely(pcp_allowed_order(order))) {
>>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
>>>>>>>>>> ALLOC_CMA ||
>>>>>>>>>> +                                               order !=
>>>>>>>>>> HPAGE_PMD_ORDER) {
>>>>>>>>>> +                       page = rmqueue_pcplist(preferred_zone, 
>>>>>>>>>> zone,
>>>>>>>>>> order,
>>>>>>>>>> +                                               migratetype,
>>>>>>>>>> alloc_flags);
>>>>>>>>>> +                       if (likely(page))
>>>>>>>>>> +                               goto out;
>>>>>>>>>> +               }
>>>>>>>>>
>>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use 
>>>>>>>>> PCP.
>>>>>>>>> But it
>>>>>>>>> still seems better than causing the failure of CMA allocation.
>>>>>>>>>
>>>>>>>>> Is there a possible approach to avoiding adding CMA THP into 
>>>>>>>>> pcp from
>>>>>>>>> the first
>>>>>>>>> beginning? Otherwise, we might need a separate PCP for CMA.
>>>>>>>>>
>>>>>>>
>>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
>>>>>>> adding CMA THP into pcp may incur a slight performance penalty.
>>>>>>>
>>>>>>
>>>>>> But the majority of movable pages aren't CMA, right?
>>>>>
>>>>>> Do we have an estimate for
>>>>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new 
>>>>>> cacheline, which
>>>>>> the original intention for THP was to avoid by having only one 
>>>>>> PCP[1]?
>>>>>>
>>>>>> [1] 
>>>>>> https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>>>>>
>>>>>
>>>>> The size of struct per_cpu_pages is 256 bytes in current code 
>>>>> containing
>>>>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for 
>>>>> THP-sized
>>>>> allocations").
>>>>> crash> struct per_cpu_pages
>>>>> struct per_cpu_pages {
>>>>>        spinlock_t lock;
>>>>>        int count;
>>>>>        int high;
>>>>>        int high_min;
>>>>>        int high_max;
>>>>>        int batch;
>>>>>        u8 flags;
>>>>>        u8 alloc_factor;
>>>>>        u8 expire;
>>>>>        short free_count;
>>>>>        struct list_head lists[13];
>>>>> }
>>>>> SIZE: 256
>>>>>
>>>>> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP 
>>>>> list
>>>>> for THP-sized allocations"), the size of struct per_cpu_pages is 
>>>>> 272 bytes.
>>>>> crash> struct per_cpu_pages
>>>>> struct per_cpu_pages {
>>>>>        spinlock_t lock;
>>>>>        int count;
>>>>>        int high;
>>>>>        int high_min;
>>>>>        int high_max;
>>>>>        int batch;
>>>>>        u8 flags;
>>>>>        u8 alloc_factor;
>>>>>        u8 expire;
>>>>>        short free_count;
>>>>>        struct list_head lists[15];
>>>>> }
>>>>> SIZE: 272
>>>>>
>>>>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>> THP-sized allocations") decrease one cacheline.
>>>>
>>>> The proposal is not to revert the patch but to add one CMA PCP list,
>>>> so it would be "struct list_head lists[14]"; in that case, would the size
>>>> still be 256?
>>>>
>>>
>>> Yes, the size is still 256 bytes. If we add one PCP list, we will have two
>>> PCP lists for THP: one used by MIGRATE_UNMOVABLE, and the other used by
>>> MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right?
>>
>> I am not quite sure about MIGRATE_RECLAIMABLE, since we want CMA to be
>> used only by movable allocations. So the split might be:
>> MOVABLE and NON-MOVABLE.
> 
> Having one PCP list for unmovable pages and the other for movable pages
> seems feasible. The unmovable PCP list would contain MIGRATE_UNMOVABLE and
> MIGRATE_RECLAIMABLE pages, and the movable PCP list would contain
> MIGRATE_MOVABLE pages.
> 

Is the following modification feasible?

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_PCP_THP 1
+#define NR_PCP_THP 2
+#define PCP_THP_MOVABLE 0
+#define PCP_THP_UNMOVABLE 1
  #else
  #define NR_PCP_THP 0
  #endif

  static inline unsigned int order_to_pindex(int migratetype, int order)
  {
+       int pcp_type = migratetype;
+
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
         if (order > PAGE_ALLOC_COSTLY_ORDER) {
                 VM_BUG_ON(order != HPAGE_PMD_ORDER);
-               return NR_LOWORDER_PCP_LISTS;
+
+               if (migratetype != MIGRATE_MOVABLE)
+                       pcp_type = PCP_THP_UNMOVABLE;
+               else
+                       pcp_type = PCP_THP_MOVABLE;
+
+               return NR_LOWORDER_PCP_LISTS + pcp_type;
         }
  #else
         VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
  #endif

-       return (MIGRATE_PCPTYPES * order) + migratetype;
+       return (MIGRATE_PCPTYPES * order) + pcp_type;
  }



@@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex)
         int order = pindex / MIGRATE_PCPTYPES;

  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-       if (pindex == NR_LOWORDER_PCP_LISTS)
+       if (order > PAGE_ALLOC_COSTLY_ORDER)
                 order = HPAGE_PMD_ORDER;
  #else
         VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
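
As a quick sanity check of that mapping, here is a self-contained userspace
sketch of the proposed order_to_pindex()/pindex_to_order() pair. The
MIGRATE_* and order constants below are hypothetical stand-ins (typical
x86-64 values) rather than the kernel's definitions; only the index mapping
itself is the point:

#include <assert.h>
#include <stdio.h>

#define MIGRATE_UNMOVABLE	0
#define MIGRATE_MOVABLE		1
#define MIGRATE_RECLAIMABLE	2
#define MIGRATE_PCPTYPES	3
#define PAGE_ALLOC_COSTLY_ORDER	3
#define HPAGE_PMD_ORDER		9
#define NR_LOWORDER_PCP_LISTS	(MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define PCP_THP_MOVABLE		0
#define PCP_THP_UNMOVABLE	1

static unsigned int order_to_pindex(int migratetype, int order)
{
	int pcp_type = migratetype;

	if (order > PAGE_ALLOC_COSTLY_ORDER) {
		/* THP order: collapse to two lists, movable vs. the rest. */
		pcp_type = (migratetype == MIGRATE_MOVABLE) ?
				PCP_THP_MOVABLE : PCP_THP_UNMOVABLE;
		return NR_LOWORDER_PCP_LISTS + pcp_type;
	}
	return (MIGRATE_PCPTYPES * order) + pcp_type;
}

static int pindex_to_order(unsigned int pindex)
{
	int order = pindex / MIGRATE_PCPTYPES;

	if (order > PAGE_ALLOC_COSTLY_ORDER)
		order = HPAGE_PMD_ORDER;
	return order;
}

int main(void)
{
	/* THP allocations land on one of the two trailing lists... */
	assert(order_to_pindex(MIGRATE_MOVABLE, HPAGE_PMD_ORDER) == 12);
	assert(order_to_pindex(MIGRATE_UNMOVABLE, HPAGE_PMD_ORDER) == 13);
	assert(order_to_pindex(MIGRATE_RECLAIMABLE, HPAGE_PMD_ORDER) == 13);
	/* ...and both trailing pindexes map back to the THP order. */
	assert(pindex_to_order(12) == HPAGE_PMD_ORDER);
	assert(pindex_to_order(13) == HPAGE_PMD_ORDER);
	puts("mapping OK");
	return 0;
}

The useful property is that a PF_MEMALLOC_PIN (non-movable) request and a
movable request can no longer hand each other pages through a shared THP
list, which is exactly what let a CMA page leak to a pinned allocation.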



>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
>>>>>>> refer to it.
>>>>>>>
>>>>>>>
>>>>>>>>>> +#else
>>>>>>>>>>                     page = rmqueue_pcplist(preferred_zone, 
>>>>>>>>>> zone, order,
>>>>>>>>>>                                            migratetype, 
>>>>>>>>>> alloc_flags);
>>>>>>>>>>                     if (likely(page))
>>>>>>>>>>                             goto out;
>>>>>>>>>> +#endif
>>>>>>>>>>             }
>>>>>>>>>>
>>>>>>>>>>             page = rmqueue_buddy(preferred_zone, zone, order, 
>>>>>>>>>> alloc_flags,
>>>>>>>>>> -- 
>>>>>>>>>> 2.7.4
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Barry
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>
Barry Song June 19, 2024, 8:20 a.m. UTC | #21
On Wed, Jun 19, 2024 at 5:35 PM Ge Yang <yangge1116@126.com> wrote:
>
>
>
> On 2024/6/18 15:51, yangge1116 wrote:
> >
> >
> > On 2024/6/18 2:58 PM, Barry Song wrote:
> >> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@126.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2024/6/18 12:10 PM, Barry Song wrote:
> >>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2024/6/18 9:55 AM, Barry Song wrote:
> >>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
> >>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> From: yangge <yangge1116@126.com>
> >>>>>>>>>>
> >>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP
> >>>>>>>>>> list for
> >>>>>>>>>> THP-sized allocations") no longer differentiates the migration
> >>>>>>>>>> type
> >>>>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA
> >>>>>>>>>> page from
> >>>>>>>>>> the list, in some cases, it's not acceptable, for example,
> >>>>>>>>>> allocating
> >>>>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
> >>>>>>>>>>
> >>>>>>>>>> The patch forbids allocating non-CMA THP-sized page from
> >>>>>>>>>> THP-sized
> >>>>>>>>>> PCP list to avoid the issue above.
> >>>>>>>>>
> >>>>>>>>> Could you please describe the impact on users in the commit log?
> >>>>>>>>
> >>>>>>>> If a large number of CMA memory are configured in the system (for
> >>>>>>>> example, the CMA memory accounts for 50% of the system memory),
> >>>>>>>> starting
> >>>>>>>> virtual machine with device passthrough will get stuck.
> >>>>>>>>
> >>>>>>>> During starting virtual machine, it will call
> >>>>>>>> pin_user_pages_remote(...,
> >>>>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
> >>>>>>>> pin_user_pages_remote() will migrate the page from CMA area to
> >>>>>>>> non-CMA
> >>>>>>>> area because of FOLL_LONGTERM flag. If non-movable allocation
> >>>>>>>> requests
> >>>>>>>> return CMA memory, pin_user_pages_remote() will enter endless
> >>>>>>>> loops.
> >>>>>>>>
> >>>>>>>> backtrace:
> >>>>>>>> pin_user_pages_remote
> >>>>>>>> ----__gup_longterm_locked //cause endless loops in this function
> >>>>>>>> --------__get_user_pages_locked
> >>>>>>>> --------check_and_migrate_movable_pages //always check fail and
> >>>>>>>> continue
> >>>>>>>> to migrate
> >>>>>>>> ------------migrate_longterm_unpinnable_pages
> >>>>>>>> ----------------alloc_migration_target // non-movable allocation
> >>>>>>>>
> >>>>>>>>> Is it possible that some CMA memory might be used by non-movable
> >>>>>>>>> allocation requests?
> >>>>>>>>
> >>>>>>>> Yes.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> If so, will CMA somehow become unable to migrate, causing
> >>>>>>>>> cma_alloc()
> >>>>>>>>> to fail?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> No, it will cause endless loops in __gup_longterm_locked(). If
> >>>>>>>> non-movable allocation requests return CMA memory,
> >>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to
> >>>>>>>> another
> >>>>>>>> CMA page, which is useless and cause endless loops in
> >>>>>>>> __gup_longterm_locked().
> >>>>>>
> >>>>>> This is only one perspective. We also need to consider the impact on
> >>>>>> CMA itself. For example,
> >>>>>> when CMA is borrowed by THP, and we need to reclaim it through
> >>>>>> cma_alloc() or dma_alloc_coherent(),
> >>>>>> we must move those pages out to ensure CMA's users can retrieve that
> >>>>>> contiguous memory.
> >>>>>>
> >>>>>> Currently, CMA's memory is occupied by non-movable pages, meaning we
> >>>>>> can't relocate them.
> >>>>>> As a result, cma_alloc() is more likely to fail.
> >>>>>>
> >>>>>>>>
> >>>>>>>> backtrace:
> >>>>>>>> pin_user_pages_remote
> >>>>>>>> ----__gup_longterm_locked //cause endless loops in this function
> >>>>>>>> --------__get_user_pages_locked
> >>>>>>>> --------check_and_migrate_movable_pages //always check fail and
> >>>>>>>> continue
> >>>>>>>> to migrate
> >>>>>>>> ------------migrate_longterm_unpinnable_pages
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>>>>>>> THP-sized allocations")
> >>>>>>>>>> Signed-off-by: yangge <yangge1116@126.com>
> >>>>>>>>>> ---
> >>>>>>>>>>      mm/page_alloc.c | 10 ++++++++++
> >>>>>>>>>>      1 file changed, 10 insertions(+)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>>>>>>> index 2e22ce5..0bdf471 100644
> >>>>>>>>>> --- a/mm/page_alloc.c
> >>>>>>>>>> +++ b/mm/page_alloc.c
> >>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
> >>>>>>>>>> *preferred_zone,
> >>>>>>>>>>             WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order
> >>>>>>>>>> > 1));
> >>>>>>>>>>
> >>>>>>>>>>             if (likely(pcp_allowed_order(order))) {
> >>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>>>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
> >>>>>>>>>> ALLOC_CMA ||
> >>>>>>>>>> +                                               order !=
> >>>>>>>>>> HPAGE_PMD_ORDER) {
> >>>>>>>>>> +                       page = rmqueue_pcplist(preferred_zone,
> >>>>>>>>>> zone,
> >>>>>>>>>> order,
> >>>>>>>>>> +                                               migratetype,
> >>>>>>>>>> alloc_flags);
> >>>>>>>>>> +                       if (likely(page))
> >>>>>>>>>> +                               goto out;
> >>>>>>>>>> +               }
> >>>>>>>>>
> >>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use
> >>>>>>>>> PCP.
> >>>>>>>>> But it
> >>>>>>>>> still seems better than causing the failure of CMA allocation.
> >>>>>>>>>
> >>>>>>>>> Is there a possible approach to avoiding adding CMA THP into
> >>>>>>>>> pcp from
> >>>>>>>>> the first
> >>>>>>>>> beginning? Otherwise, we might need a separate PCP for CMA.
> >>>>>>>>>
> >>>>>>>
> >>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
> >>>>>>> adding CMA THP into pcp may incur a slight performance penalty.
> >>>>>>>
> >>>>>>
> >>>>>> But the majority of movable pages aren't CMA, right?
> >>>>>
> >>>>>> Do we have an estimate for
> >>>>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new
> >>>>>> cacheline, which
> >>>>>> the original intention for THP was to avoid by having only one
> >>>>>> PCP[1]?
> >>>>>>
> >>>>>> [1]
> >>>>>> https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
> >>>>>>
> >>>>>
> >>>>> The size of struct per_cpu_pages is 256 bytes in current code
> >>>>> containing
> >>>>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>> THP-sized
> >>>>> allocations").
> >>>>> crash> struct per_cpu_pages
> >>>>> struct per_cpu_pages {
> >>>>>        spinlock_t lock;
> >>>>>        int count;
> >>>>>        int high;
> >>>>>        int high_min;
> >>>>>        int high_max;
> >>>>>        int batch;
> >>>>>        u8 flags;
> >>>>>        u8 alloc_factor;
> >>>>>        u8 expire;
> >>>>>        short free_count;
> >>>>>        struct list_head lists[13];
> >>>>> }
> >>>>> SIZE: 256
> >>>>>
> >>>>> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP
> >>>>> list
> >>>>> for THP-sized allocations"), the size of struct per_cpu_pages is
> >>>>> 272 bytes.
> >>>>> crash> struct per_cpu_pages
> >>>>> struct per_cpu_pages {
> >>>>>        spinlock_t lock;
> >>>>>        int count;
> >>>>>        int high;
> >>>>>        int high_min;
> >>>>>        int high_max;
> >>>>>        int batch;
> >>>>>        u8 flags;
> >>>>>        u8 alloc_factor;
> >>>>>        u8 expire;
> >>>>>        short free_count;
> >>>>>        struct list_head lists[15];
> >>>>> }
> >>>>> SIZE: 272
> >>>>>
> >>>>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>> THP-sized allocations") decrease one cacheline.
> >>>>
> >>>> The proposal is not to revert the patch but to add one CMA PCP list,
> >>>> so it would be "struct list_head lists[14]"; in that case, would the size
> >>>> still be 256?
> >>>>
> >>>
> >>> Yes, the size is still 256 bytes. If we add one PCP list, we will have two
> >>> PCP lists for THP: one used by MIGRATE_UNMOVABLE, and the other used by
> >>> MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right?
> >>
> >> I am not quite sure about MIGRATE_RECLAIMABLE, since we want CMA to be
> >> used only by movable allocations. So the split might be:
> >> MOVABLE and NON-MOVABLE.
> >
> > Having one PCP list for unmovable pages and the other for movable pages
> > seems feasible. The unmovable PCP list would contain MIGRATE_UNMOVABLE and
> > MIGRATE_RECLAIMABLE pages, and the movable PCP list would contain
> > MIGRATE_MOVABLE pages.
> >
>
> Is the following modification feasible?
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define NR_PCP_THP 1
> +#define NR_PCP_THP 2
> +#define PCP_THP_MOVABLE 0
> +#define PCP_THP_UNMOVABLE 1
>   #else
>   #define NR_PCP_THP 0
>   #endif
>
>   static inline unsigned int order_to_pindex(int migratetype, int order)
>   {
> +       int pcp_type = migratetype;
> +
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>          if (order > PAGE_ALLOC_COSTLY_ORDER) {
>                  VM_BUG_ON(order != HPAGE_PMD_ORDER);
> -               return NR_LOWORDER_PCP_LISTS;
> +
> +               if (migratetype != MIGRATE_MOVABLE)
> +                       pcp_type = PCP_THP_UNMOVABLE;
> +               else
> +                       pcp_type = PCP_THP_MOVABLE;
> +
> +               return NR_LOWORDER_PCP_LISTS + pcp_type;
>          }
>   #else
>          VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
>   #endif
>
> -       return (MIGRATE_PCPTYPES * order) + migratetype;
> +       return (MIGRATE_PCPTYPES * order) + pcp_type;
>   }
>

A minimal change might be the following; then you can drop most of the new code.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 120a317d0938..cfe1e0625e38 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -588,6 +588,7 @@ static void bad_page(struct page *page, const char *reason)

 static inline unsigned int order_to_pindex(int migratetype, int order)
 {
+       bool __maybe_unused movable;
 #ifdef CONFIG_CMA
        /*
         * We shouldn't get here for MIGRATE_CMA if those pages don't
@@ -600,7 +601,8 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        if (order > PAGE_ALLOC_COSTLY_ORDER) {
                VM_BUG_ON(order != pageblock_order);
-               return NR_LOWORDER_PCP_LISTS;
+               movable = migratetype == MIGRATE_MOVABLE;
+               return NR_LOWORDER_PCP_LISTS + movable;
        }
 #else
        VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
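
The bool trick relies on plain C integer promotion: "movable" evaluates
to 0 or 1, so non-movable THP pages keep pindex NR_LOWORDER_PCP_LISTS
and movable ones land on the list right after it. A standalone
userspace sketch of the mapping (the constants are assumptions about a
typical config, i.e. MIGRATE_PCPTYPES == 3 and PAGE_ALLOC_COSTLY_ORDER
== 3, with illustrative migratetype values; this is not the kernel
code):

#include <stdbool.h>
#include <stdio.h>

#define MIGRATE_PCPTYPES        3  /* assumed: unmovable, movable, reclaimable */
#define PAGE_ALLOC_COSTLY_ORDER 3  /* assumed */
#define NR_LOWORDER_PCP_LISTS   (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define MIGRATE_UNMOVABLE       0  /* illustrative */
#define MIGRATE_MOVABLE         1  /* illustrative */

/* Mirrors the proposed order_to_pindex() tail for THP-sized orders. */
static unsigned int thp_pindex(int migratetype)
{
	bool movable = migratetype == MIGRATE_MOVABLE;

	return NR_LOWORDER_PCP_LISTS + movable; /* bool promotes to 0 or 1 */
}

int main(void)
{
	printf("unmovable THP -> pindex %u\n", thp_pindex(MIGRATE_UNMOVABLE)); /* 12 */
	printf("movable THP   -> pindex %u\n", thp_pindex(MIGRATE_MOVABLE));   /* 13 */
	return 0;
}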


>
>
> @@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex)
>          int order = pindex / MIGRATE_PCPTYPES;
>
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -       if (pindex == NR_LOWORDER_PCP_LISTS)
> +       if (order > PAGE_ALLOC_COSTLY_ORDER)
>                  order = HPAGE_PMD_ORDER;
>   #else
>          VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
>
>
>
> >>
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>> Commit 1d91df85f399 takes a similar filtering approach, and I
> >>>>>>> mainly referred to it.
> >>>>>>>
> >>>>>>>
> >>>>>>>>>> +#else
> >>>>>>>>>>                     page = rmqueue_pcplist(preferred_zone, zone, order,
> >>>>>>>>>>                                            migratetype, alloc_flags);
> >>>>>>>>>>                     if (likely(page))
> >>>>>>>>>>                             goto out;
> >>>>>>>>>> +#endif
> >>>>>>>>>>             }
> >>>>>>>>>>
> >>>>>>>>>>             page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> >>>>>>>>>> --
> >>>>>>>>>> 2.7.4
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> Barry
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>
>
Ge Yang June 19, 2024, 8:35 a.m. UTC | #22
On 2024/6/19 16:20, Barry Song wrote:
> On Wed, Jun 19, 2024 at 5:35 PM Ge Yang <yangge1116@126.com> wrote:
>>
>>
>>
>> On 2024/6/18 15:51, yangge1116 wrote:
>>>
>>>
>>> On 2024/6/18 2:58 p.m., Barry Song wrote:
>>>> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@126.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/6/18 12:10 p.m., Barry Song wrote:
>>>>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2024/6/18 9:55 a.m., Barry Song wrote:
>>>>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/6/17 8:47 p.m., yangge1116 wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/6/17 6:26 p.m., Barry Song wrote:
>>>>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> From: yangge <yangge1116@126.com>
>>>>>>>>>>>>
>>>>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP
>>>>>>>>>>>> list for
>>>>>>>>>>>> THP-sized allocations") no longer differentiates the migration
>>>>>>>>>>>> type
>>>>>>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA
>>>>>>>>>>>> page from
>>>>>>>>>>>> the list, in some cases, it's not acceptable, for example,
>>>>>>>>>>>> allocating
>>>>>>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
>>>>>>>>>>>>
>>>>>>>>>>>> The patch forbids allocating non-CMA THP-sized page from
>>>>>>>>>>>> THP-sized
>>>>>>>>>>>> PCP list to avoid the issue above.
>>>>>>>>>>>
>>>>>>>>>>> Could you please describe the impact on users in the commit log?
>>>>>>>>>>
>>>>>>>>>> If a large amount of CMA memory is configured in the system (for
>>>>>>>>>> example, CMA memory accounts for 50% of the system memory), starting
>>>>>>>>>> a virtual machine with device passthrough will get stuck.
>>>>>>>>>>
>>>>>>>>>> While starting the virtual machine, it calls
>>>>>>>>>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory. If a
>>>>>>>>>> page is in a CMA area, pin_user_pages_remote() will migrate it from
>>>>>>>>>> the CMA area to a non-CMA area because of the FOLL_LONGTERM flag. If
>>>>>>>>>> non-movable allocation requests return CMA memory,
>>>>>>>>>> pin_user_pages_remote() will enter an endless loop.
>>>>>>>>>>
>>>>>>>>>> backtrace:
>>>>>>>>>> pin_user_pages_remote
>>>>>>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>>>>>>> --------__get_user_pages_locked
>>>>>>>>>> --------check_and_migrate_movable_pages //always check fail and
>>>>>>>>>> continue
>>>>>>>>>> to migrate
>>>>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>>>> ----------------alloc_migration_target // non-movable allocation
>>>>>>>>>>
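To make the livelock concrete, here is a minimal userspace model of the
retry logic (the names are illustrative stand-ins, not the kernel
functions; it only shows that when the migration target allocation can
itself return a CMA page, the check never succeeds):

#include <stdbool.h>
#include <stdio.h>

/* Illustrative model, not kernel code: the page we try to longterm-pin
 * sits in CMA, and the supposedly non-movable migration target
 * allocation wrongly returns CMA pages too (the bug this patch fixes). */
static bool alloc_migration_target_is_cma(void)
{
	return true; /* THP PCP list hands back a CMA page despite !ALLOC_CMA */
}

int main(void)
{
	bool page_is_cma = true;
	int attempts = 0;

	/* check_and_migrate_movable_pages() retries while any page to be
	 * pinned is CMA; every retry migrates to yet another CMA page,
	 * so no progress is ever made. */
	while (page_is_cma) {
		page_is_cma = alloc_migration_target_is_cma();
		if (++attempts >= 5) {
			printf("still looping after %d attempts\n", attempts);
			return 1; /* the real code has no such bail-out */
		}
	}
	printf("pinned after %d attempts\n", attempts);
	return 0;
}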
>>>>>>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>>>>>>> allocation requests?
>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> If so, will CMA somehow become unable to migrate, causing
>>>>>>>>>>> cma_alloc() to fail?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> No, it will cause endless loops in __gup_longterm_locked(). If
>>>>>>>>>> non-movable allocation requests return CMA memory,
>>>>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to
>>>>>>>>>> another CMA page, which is useless and causes endless loops in
>>>>>>>>>> __gup_longterm_locked().
>>>>>>>>
>>>>>>>> This is only one perspective. We also need to consider the impact on
>>>>>>>> CMA itself. For example, when CMA is borrowed by THP and we need to
>>>>>>>> reclaim it through cma_alloc() or dma_alloc_coherent(), we must move
>>>>>>>> those pages out to ensure CMA's users can retrieve that contiguous
>>>>>>>> memory. Currently, CMA's memory is occupied by non-movable pages,
>>>>>>>> meaning we can't relocate them. As a result, cma_alloc() is more
>>>>>>>> likely to fail.
>>>>>>>> As a result, cma_alloc() is more likely to fail.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> backtrace:
>>>>>>>>>> pin_user_pages_remote
>>>>>>>>>> ----__gup_longterm_locked //cause endless loops in this function
>>>>>>>>>> --------__get_user_pages_locked
>>>>>>>>>> --------check_and_migrate_movable_pages //always check fail and
>>>>>>>>>> continue
>>>>>>>>>> to migrate
>>>>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>>>>>>> THP-sized allocations")
>>>>>>>>>>>> Signed-off-by: yangge <yangge1116@126.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>       mm/page_alloc.c | 10 ++++++++++
>>>>>>>>>>>>       1 file changed, 10 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>>>>>>>>>>>>              WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>>>>>>
>>>>>>>>>>>>              if (likely(pcp_allowed_order(order))) {
>>>>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>>>>>>>>>> +                                               order != HPAGE_PMD_ORDER) {
>>>>>>>>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>>>>> +                                               migratetype, alloc_flags);
>>>>>>>>>>>> +                       if (likely(page))
>>>>>>>>>>>> +                               goto out;
>>>>>>>>>>>> +               }
>>>>>>>>>>>
>>>>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use
>>>>>>>>>>> the PCP. But it still seems better than causing the failure of CMA
>>>>>>>>>>> allocation.
>>>>>>>>>>>
>>>>>>>>>>> Is there a possible approach to avoiding adding CMA THP into the
>>>>>>>>>>> pcp from the very beginning? Otherwise, we might need a separate
>>>>>>>>>>> PCP for CMA.
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, so
>>>>>>>>> avoiding adding CMA THP into the pcp may incur a slight performance
>>>>>>>>> penalty.
>>>>>>>>>
>>>>>>>>
>>>>>>>> But the majority of movable pages aren't CMA, right?
>>>>>>>
>>>>>>>> Do we have an estimate for
>>>>>>>> adding back a CMA THP PCP? Will per_cpu_pages spill into a new
>>>>>>>> cacheline, which the original THP change was meant to avoid by
>>>>>>>> having only one PCP list[1]?
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>>>>>>>
>>>>>>>
>>>>>>> The size of struct per_cpu_pages is 256 bytes in the current code,
>>>>>>> which contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP
>>>>>>> list for THP-sized allocations").
>>>>>>> crash> struct per_cpu_pages
>>>>>>> struct per_cpu_pages {
>>>>>>>         spinlock_t lock;
>>>>>>>         int count;
>>>>>>>         int high;
>>>>>>>         int high_min;
>>>>>>>         int high_max;
>>>>>>>         int batch;
>>>>>>>         u8 flags;
>>>>>>>         u8 alloc_factor;
>>>>>>>         u8 expire;
>>>>>>>         short free_count;
>>>>>>>         struct list_head lists[13];
>>>>>>> }
>>>>>>> SIZE: 256
>>>>>>>
>>>>>>> After reverting commit 5d0a661d808f ("mm/page_alloc: use only one
>>>>>>> PCP list for THP-sized allocations"), the size of struct
>>>>>>> per_cpu_pages is 272 bytes.
>>>>>>> crash> struct per_cpu_pages
>>>>>>> struct per_cpu_pages {
>>>>>>>         spinlock_t lock;
>>>>>>>         int count;
>>>>>>>         int high;
>>>>>>>         int high_min;
>>>>>>>         int high_max;
>>>>>>>         int batch;
>>>>>>>         u8 flags;
>>>>>>>         u8 alloc_factor;
>>>>>>>         u8 expire;
>>>>>>>         short free_count;
>>>>>>>         struct list_head lists[15];
>>>>>>> }
>>>>>>> SIZE: 272
>>>>>>>
>>>>>>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations") saves one cacheline: 256 bytes fits exactly
>>>>>>> in four 64-byte cachelines, while 272 bytes spills into a fifth.
>>>>>>
>>>>>> The proposal is not to revert the patch but to add one CMA PCP list,
>>>>>> so it would be "struct list_head lists[14]"; in this case, is the
>>>>>> size still 256?
>>>>>>
>>>>>
>>>>> Yes, the size is still 256. If we add one PCP list, we will have 2 PCP
>>>>> lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other
>>>>> PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that
>>>>> right?
>>>>
>>>> I am not quite sure about MIGRATE_RECLAIMABLE, as we want CMA to be
>>>> used only by movable allocations.
>>>> So it might be:
>>>> MOVABLE and NON-MOVABLE.
>>>
>>> One PCP list is used by UNMOVABLE pages, and the other PCP list is
>>> used by MOVABLE pages, which seems feasible. The UNMOVABLE PCP list
>>> contains MIGRATE_UNMOVABLE pages and MIGRATE_RECLAIMABLE pages, and
>>> the MOVABLE PCP list contains MIGRATE_MOVABLE pages.
>>>
>>
>> Is the following modification feasible?
>>
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -#define NR_PCP_THP 1
>> +#define NR_PCP_THP 2
>> +#define PCP_THP_MOVABLE 0
>> +#define PCP_THP_UNMOVABLE 1
>>    #else
>>    #define NR_PCP_THP 0
>>    #endif
>>
>>    static inline unsigned int order_to_pindex(int migratetype, int order)
>>    {
>> +       int pcp_type = migratetype;
>> +
>>    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>           if (order > PAGE_ALLOC_COSTLY_ORDER) {
>>                   VM_BUG_ON(order != HPAGE_PMD_ORDER);
>> -               return NR_LOWORDER_PCP_LISTS;
>> +
>> +               if (migratetype != MIGRATE_MOVABLE)
>> +                       pcp_type = PCP_THP_UNMOVABLE;
>> +               else
>> +                       pcp_type = PCP_THP_MOVABLE;
>> +
>> +               return NR_LOWORDER_PCP_LISTS + pcp_type;
>>           }
>>    #else
>>           VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
>>    #endif
>>
>> -       return (MIGRATE_PCPTYPES * order) + migratetype;
>> +       return (MIGRATE_PCPTYPES * order) + pcp_type;
>>    }
>>
> 
> A minimal change might be the following; then you can drop most of the new code.
> 

Thanks, I will prepare the V2.


> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 120a317d0938..cfe1e0625e38 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -588,6 +588,7 @@ static void bad_page(struct page *page, const char *reason)
> 
>   static inline unsigned int order_to_pindex(int migratetype, int order)
>   {
> +       bool __maybe_unused movable;
>   #ifdef CONFIG_CMA
>          /*
>           * We shouldn't get here for MIGRATE_CMA if those pages don't
> @@ -600,7 +601,8 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>          if (order > PAGE_ALLOC_COSTLY_ORDER) {
>                  VM_BUG_ON(order != pageblock_order);
> -               return NR_LOWORDER_PCP_LISTS;
> +               movable = migratetype == MIGRATE_MOVABLE;
> +               return NR_LOWORDER_PCP_LISTS + movable;
>          }
>   #else
>          VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
> 
> 
>>
>>
>> @@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex)
>>           int order = pindex / MIGRATE_PCPTYPES;
>>
>>    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -       if (pindex == NR_LOWORDER_PCP_LISTS)
>> +       if (order > PAGE_ALLOC_COSTLY_ORDER)
>>                   order = HPAGE_PMD_ORDER;
>>    #else
>>           VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
>>
>>
>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Commit 1d91df85f399 takes a similar filtering approach, and I
>>>>>>>>> mainly referred to it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> +#else
>>>>>>>>>>>>                      page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>>>>>                                             migratetype, alloc_flags);
>>>>>>>>>>>>                      if (likely(page))
>>>>>>>>>>>>                              goto out;
>>>>>>>>>>>> +#endif
>>>>>>>>>>>>              }
>>>>>>>>>>>>              page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>>>>>>>>> alloc_flags,
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.7.4
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Barry
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5..0bdf471 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2987,10 +2987,20 @@  struct page *rmqueue(struct zone *preferred_zone,
 	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
 
 	if (likely(pcp_allowed_order(order))) {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
+						order != HPAGE_PMD_ORDER) {
+			page = rmqueue_pcplist(preferred_zone, zone, order,
+						migratetype, alloc_flags);
+			if (likely(page))
+				goto out;
+		}
+#else
 		page = rmqueue_pcplist(preferred_zone, zone, order,
 				       migratetype, alloc_flags);
 		if (likely(page))
 			goto out;
+#endif
 	}
 
 	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
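
The net effect of the gate above: with CONFIG_CMA enabled, only a
non-CMA THP-sized request bypasses the PCP list and falls through to
rmqueue_buddy(); everything else keeps the PCP fast path. A standalone
sketch of the condition (the ALLOC_CMA bit value is an illustrative
stand-in, and HPAGE_PMD_ORDER == 9 assumes 4K base pages with 2M THP):

#include <stdbool.h>
#include <stdio.h>

#define ALLOC_CMA       0x80 /* illustrative stand-in for the kernel flag */
#define HPAGE_PMD_ORDER 9    /* assumes 4K base pages, 2M PMD */
#define CONFIG_CMA      1    /* modeling a CONFIG_CMA=y build */

/* Mirrors the gate added to rmqueue() above. */
static bool may_try_pcplist(unsigned int alloc_flags, unsigned int order)
{
	return !CONFIG_CMA || (alloc_flags & ALLOC_CMA) ||
	       order != HPAGE_PMD_ORDER;
}

int main(void)
{
	printf("THP, !ALLOC_CMA -> pcp %d (buddy only)\n",
	       may_try_pcplist(0, HPAGE_PMD_ORDER));         /* 0 */
	printf("THP,  ALLOC_CMA -> pcp %d\n",
	       may_try_pcplist(ALLOC_CMA, HPAGE_PMD_ORDER)); /* 1 */
	printf("order-0, !ALLOC_CMA -> pcp %d\n",
	       may_try_pcplist(0, 0));                       /* 1 */
	return 0;
}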