Message ID | 1717492460-19457-1-git-send-email-yangge1116@126.com (mailing list archive)
---|---
State | New
Series | mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page
Cc Johannes, Zi and Vlastimil.

On 2024/6/4 17:14, yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
>
> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> THP-sized allocations") no longer differentiates the migration type
> of pages in the THP-sized PCP list, it's possible to get a CMA page
> from the list. In some cases this is not acceptable; for example, an
> allocation of a non-CMA page under the PF_MEMALLOC_PIN flag can return
> a CMA page.
>
> The patch forbids allocating a non-CMA THP-sized page from the
> THP-sized PCP list to avoid the issue above.
>
> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
> Signed-off-by: yangge <yangge1116@126.com>
> ---
>  mm/page_alloc.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e22ce5..0bdf471 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>  	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>
>  	if (likely(pcp_allowed_order(order))) {
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> +				order != HPAGE_PMD_ORDER) {

Seems you will also miss the non-CMA THP from the PCP, so I wonder if we
can add a migratetype comparison in __rmqueue_pcplist(), and if it's not
suitable, then fall back to buddy?

> +			page = rmqueue_pcplist(preferred_zone, zone, order,
> +					migratetype, alloc_flags);
> +			if (likely(page))
> +				goto out;
> +		}
> +#else
>  		page = rmqueue_pcplist(preferred_zone, zone, order,
>  				migratetype, alloc_flags);
>  		if (likely(page))
>  			goto out;
> +#endif
>  	}
>
>  	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
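For context on why a PF_MEMALLOC_PIN allocation must never be handed a CMA
page: the pinning path strips __GFP_MOVABLE from the gfp mask, and only
movable requests may set ALLOC_CMA. The buddy path (rmqueue_buddy()) honors
ALLOC_CMA, but the shared THP PCP list never looks at it, which is the hole
the patch plugs. A paraphrased sketch of the two mainline helpers involved
(simplified; exact code varies by kernel version):

/* Paraphrased from include/linux/sched/mm.h: a long-term-pin context
 * (PF_MEMALLOC_PIN) demotes movable allocations so they cannot land in
 * CMA pageblocks. */
static inline gfp_t current_gfp_context(gfp_t flags)
{
	unsigned int pflags = READ_ONCE(current->flags);

	if (unlikely(pflags & PF_MEMALLOC_PIN))
		flags &= ~__GFP_MOVABLE;
	return flags;
}

/* Paraphrased from mm/page_alloc.c: only movable requests are allowed
 * to dip into CMA pageblocks. */
static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
						  unsigned int alloc_flags)
{
#ifdef CONFIG_CMA
	if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
		alloc_flags |= ALLOC_CMA;
#endif
	return alloc_flags;
}

So a pinned allocation arrives at rmqueue() without ALLOC_CMA, and the
proposed "alloc_flags & ALLOC_CMA" test above routes it around the PCP list.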
On 2024/6/4 8:01 PM, Baolin Wang wrote:
> Cc Johannes, Zi and Vlastimil.
>
> On 2024/6/4 17:14, yangge1116@126.com wrote:
>> From: yangge <yangge1116@126.com>
>>
>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>> THP-sized allocations") no longer differentiates the migration type
>> of pages in the THP-sized PCP list, it's possible to get a CMA page
>> from the list. [...]
>>
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>> +				order != HPAGE_PMD_ORDER) {
>
> Seems you will also miss the non-CMA THP from the PCP, so I wonder if we
> can add a migratetype comparison in __rmqueue_pcplist(), and if it's not
> suitable, then fall back to buddy?

Yes, we may miss some non-CMA THPs in the PCP. But if we add a migratetype
comparison in __rmqueue_pcplist(), we may need to compare many times
because of the pcp batch.
On 2024/6/4 20:36, yangge1116 wrote:
> On 2024/6/4 8:01 PM, Baolin Wang wrote:
>> Cc Johannes, Zi and Vlastimil.
>>
>> On 2024/6/4 17:14, yangge1116@126.com wrote:
>>> [...]
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>> +				order != HPAGE_PMD_ORDER) {
>>
>> Seems you will also miss the non-CMA THP from the PCP, so I wonder if
>> we can add a migratetype comparison in __rmqueue_pcplist(), and if
>> it's not suitable, then fall back to buddy?
>
> Yes, we may miss some non-CMA THPs in the PCP. But if we add a
> migratetype comparison in __rmqueue_pcplist(), we may need to compare
> many times because of the pcp batch.

I mean we could compare only once, focusing on CMA pages:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3734fe7e67c0..960a3b5744d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2973,6 +2973,11 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 		}
 
 		page = list_first_entry(list, struct page, pcp_list);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (order == HPAGE_PMD_ORDER && !is_migrate_movable(migratetype) &&
+		    is_migrate_cma(get_pageblock_migratetype(page)))
+			return NULL;
+#endif
 		list_del(&page->pcp_list);
 		pcp->count -= 1 << order;
 	} while (check_new_pages(page, order));
On 2024/6/6 11:06 AM, Baolin Wang wrote:
> On 2024/6/4 20:36, yangge1116 wrote:
>> [...]
>> Yes, we may miss some non-CMA THPs in the PCP. But if we add a
>> migratetype comparison in __rmqueue_pcplist(), we may need to compare
>> many times because of the pcp batch.
>
> I mean we could compare only once, focusing on CMA pages.

The pcp_list may contain both CMA and non-CMA pages, so why compare only
once? That just adds one more chance of using the pcp_list.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3734fe7e67c0..960a3b5744d8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2973,6 +2973,11 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  		}
>
>  		page = list_first_entry(list, struct page, pcp_list);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +		if (order == HPAGE_PMD_ORDER && !is_migrate_movable(migratetype) &&
> +		    is_migrate_cma(get_pageblock_migratetype(page)))
> +			return NULL;
> +#endif
>  		list_del(&page->pcp_list);
>  		pcp->count -= 1 << order;
>  	} while (check_new_pages(page, order));
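For reference on the "pcp batch" point: when the PCP list is empty,
__rmqueue_pcplist() refills it with a whole batch of pages from the buddy
allocator in one go, so a per-entry migratetype check could have to reject
or walk past many freshly bulk-filled CMA pages. An abridged sketch of the
mainline function being patched above (paraphrased; the batch computation
differs across kernel versions):

static inline
struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
			int migratetype, unsigned int alloc_flags,
			struct per_cpu_pages *pcp, struct list_head *list)
{
	struct page *page;

	do {
		if (list_empty(list)) {
			/* Refill: pull up to a batch of pages from buddy. */
			int batch = READ_ONCE(pcp->batch);
			int alloced = rmqueue_bulk(zone, order, batch, list,
						   migratetype, alloc_flags);

			pcp->count += alloced << order;
			if (unlikely(list_empty(list)))
				return NULL;
		}

		/*
		 * Pop the head entry. Nothing here re-checks the pageblock
		 * migratetype, which is why a CMA THP can leak out to a
		 * pinned (non-movable) allocation.
		 */
		page = list_first_entry(list, struct page, pcp_list);
		list_del(&page->pcp_list);
		pcp->count -= 1 << order;
	} while (check_new_pages(page, order));

	return page;
}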
On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>
> From: yangge <yangge1116@126.com>
>
> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> THP-sized allocations") no longer differentiates the migration type
> of pages in the THP-sized PCP list, it's possible to get a CMA page
> from the list. [...]
>
> The patch forbids allocating a non-CMA THP-sized page from the
> THP-sized PCP list to avoid the issue above.

Could you please describe the impact on users in the commit log? Is it
possible that some CMA memory might be used by non-movable allocation
requests? If so, will CMA somehow become unable to migrate, causing
cma_alloc() to fail?

> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
> Signed-off-by: yangge <yangge1116@126.com>
> ---
> [...]
>  	if (likely(pcp_allowed_order(order))) {
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> +				order != HPAGE_PMD_ORDER) {
> +			page = rmqueue_pcplist(preferred_zone, zone, order,
> +					migratetype, alloc_flags);
> +			if (likely(page))
> +				goto out;
> +		}

This seems not ideal, because non-CMA THP gets no chance to use the PCP.
But it still seems better than causing the failure of CMA allocation.

Is there a possible approach to avoiding adding CMA THPs into the pcp
from the very beginning? Otherwise, we might need a separate PCP for CMA.

> +#else
>  		page = rmqueue_pcplist(preferred_zone, zone, order,
>  				migratetype, alloc_flags);
>  		if (likely(page))
>  			goto out;
> +#endif
>  	}
>
>  	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> --
> 2.7.4

Thanks
Barry
On Thu, Jun 6, 2024 at 3:07 PM Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>
> On 2024/6/4 20:36, yangge1116 wrote:
>> [...]
>> Yes, we may miss some non-CMA THPs in the PCP. But if we add a
>> migratetype comparison in __rmqueue_pcplist(), we may need to compare
>> many times because of the pcp batch.
>
> I mean we could compare only once, focusing on CMA pages:
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3734fe7e67c0..960a3b5744d8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2973,6 +2973,11 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  		}
>
>  		page = list_first_entry(list, struct page, pcp_list);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +		if (order == HPAGE_PMD_ORDER && !is_migrate_movable(migratetype) &&
> +		    is_migrate_cma(get_pageblock_migratetype(page)))
> +			return NULL;
> +#endif

This doesn't seem ideal either. It's possible that the PCP still has many
non-CMA folios, but due to bad luck, the first entry is "always" CMA. In
this case, allocations with is_migrate_movable(migratetype) == false will
always lose the chance to use the PCP. It also appears to incur a PCP
spin lock/unlock.

I don't see an ideal solution unless we bring back the CMA PCP :-)

>  		list_del(&page->pcp_list);
>  		pcp->count -= 1 << order;
>  	} while (check_new_pages(page, order));
On 2024/6/17 18:43, Barry Song wrote:
> On Thu, Jun 6, 2024 at 3:07 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>> [...]
>> I mean we could compare only once, focusing on CMA pages:
>> [...]
>
> This doesn't seem ideal either. It's possible that the PCP still has
> many non-CMA folios, but due to bad luck, the first entry is "always"
> CMA. In this case, allocations with is_migrate_movable(migratetype) ==
> false will always lose the chance to use the PCP. It also appears to
> incur a PCP spin lock/unlock.

Yes, those were just some ideas to mitigate the issue...

> I don't see an ideal solution unless we bring back the CMA PCP :-)

Tend to agree, and if we add a CMA PCP, it seems the overhead can be
acceptable?
On Mon, Jun 17, 2024 at 7:36 PM Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>
> On 2024/6/17 18:43, Barry Song wrote:
>> [...]
>> I don't see an ideal solution unless we bring back the CMA PCP :-)
>
> Tend to agree, and if we add a CMA PCP, it seems the overhead can be
> acceptable?

Yes, probably. Hi Ge,

Could we printk the size of struct per_cpu_pages before and after adding
1 to NR_PCP_LISTS? Does it grow by one cacheline?

struct per_cpu_pages {
	spinlock_t lock;	/* Protects lists field */
	int count;		/* number of pages in the list */
	int high;		/* high watermark, emptying needed */
	int high_min;		/* min high watermark */
	int high_max;		/* max high watermark */
	int batch;		/* chunk size for buddy add/remove */
	u8 flags;		/* protected by pcp->lock */
	u8 alloc_factor;	/* batch scaling factor during allocate */
#ifdef CONFIG_NUMA
	u8 expire;		/* When 0, remote pagesets are drained */
#endif
	short free_count;	/* consecutive free count */

	/* Lists of pages, one per migrate type stored on the pcp-lists */
	struct list_head lists[NR_PCP_LISTS];
} ____cacheline_aligned_in_smp;
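For reference, the list count in question comes from these mainline
definitions (include/linux/mmzone.h around the time of this thread); the
arithmetic explains the lists[13] and lists[15] figures that appear later
in the thread:

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* One shared THP list since commit 5d0a661d808f */
#define NR_PCP_THP 1
#else
#define NR_PCP_THP 0
#endif
/* MIGRATE_PCPTYPES (3) lists for each of orders 0..PAGE_ALLOC_COSTLY_ORDER (3) */
#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)	/* 12 + 1 = 13 */

On a 64-bit build each struct list_head is 16 bytes, so lists[13] occupies
208 bytes. Restoring one THP list per PCP migratetype (three lists, as
before 5d0a661d808f) gives lists[15], while a single extra CMA THP list
would give lists[14]; the crash outputs later in the thread put concrete
struct sizes on these layouts.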
On 2024/6/17 6:26 PM, Barry Song wrote:
> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote:
>> [...]
>> The patch forbids allocating a non-CMA THP-sized page from the
>> THP-sized PCP list to avoid the issue above.
>
> Could you please describe the impact on users in the commit log?

If a large amount of CMA memory is configured in the system (for example,
CMA memory accounts for 50% of system memory), starting a virtual machine
with device passthrough will get stuck.

While starting the virtual machine, it calls pin_user_pages_remote(...,
FOLL_LONGTERM, ...) to pin memory. If a page is in a CMA area,
pin_user_pages_remote() will migrate the page from the CMA area to a
non-CMA area because of the FOLL_LONGTERM flag. If non-movable allocation
requests return CMA memory, pin_user_pages_remote() enters an endless
loop:

backtrace:
pin_user_pages_remote
----__gup_longterm_locked                // endless loop in this function
--------__get_user_pages_locked
--------check_and_migrate_movable_pages  // the check always fails, so
                                         // migration is retried forever
------------migrate_longterm_unpinnable_pages
----------------alloc_migration_target   // non-movable allocation

> Is it possible that some CMA memory might be used by non-movable
> allocation requests?

Yes.

> If so, will CMA somehow become unable to migrate, causing cma_alloc()
> to fail?

No, it causes the endless loop in __gup_longterm_locked() shown above. If
non-movable allocation requests return CMA memory,
migrate_longterm_unpinnable_pages() will migrate a CMA page to another
CMA page, which is useless, so the loop never terminates.

> [...]
>
> This seems not ideal, because non-CMA THP gets no chance to use the PCP.
> But it still seems better than causing the failure of CMA allocation.
>
> Is there a possible approach to avoiding adding CMA THPs into the pcp
> from the very beginning? Otherwise, we might need a separate PCP for CMA.
>
> [...]
>
> Thanks
> Barry
On 2024/6/17 8:47 PM, yangge1116 wrote:
> On 2024/6/17 6:26 PM, Barry Song wrote:
>> [...]
>> This seems not ideal, because non-CMA THP gets no chance to use the
>> PCP. But it still seems better than causing the failure of CMA
>> allocation.
>>
>> Is there a possible approach to avoiding adding CMA THPs into the pcp
>> from the very beginning? Otherwise, we might need a separate PCP for
>> CMA.

The vast majority of THP-sized allocations are GFP_MOVABLE, so keeping
CMA THPs out of the pcp entirely may incur a slight performance penalty.

Commit 1d91df85f399 takes a similar filtering approach, and I mainly
referred to it.
On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote:
>
> On 2024/6/17 8:47 PM, yangge1116 wrote:
>> [...]
>> No, it causes the endless loop in __gup_longterm_locked() shown above.
>> If non-movable allocation requests return CMA memory,
>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
>> CMA page, which is useless, so the loop never terminates.

This is only one perspective. We also need to consider the impact on CMA
itself. For example, when CMA is borrowed by THP, and we need to reclaim
it through cma_alloc() or dma_alloc_coherent(), we must move those pages
out to ensure CMA's users can retrieve that contiguous memory. If CMA's
memory is occupied by non-movable pages, we can't relocate them; as a
result, cma_alloc() is more likely to fail.

> [...]
>> The vast majority of THP-sized allocations are GFP_MOVABLE, so keeping
>> CMA THPs out of the pcp entirely may incur a slight performance
>> penalty.

But the majority of movable pages aren't CMA, right? Do we have an
estimate for adding back a CMA THP PCP? Will per_cpu_pages grow by a new
cacheline, which the original intention for THP was to avoid by having
only one PCP list [1]?

[1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/

>> Commit 1d91df85f399 takes a similar filtering approach, and I mainly
>> referred to it.
On 2024/6/18 9:55 AM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote:
>> [...]
>> The vast majority of THP-sized allocations are GFP_MOVABLE, so keeping
>> CMA THPs out of the pcp entirely may incur a slight performance
>> penalty.
>
> But the majority of movable pages aren't CMA, right? Do we have an
> estimate for adding back a CMA THP PCP? Will per_cpu_pages grow by a
> new cacheline, which the original intention for THP was to avoid by
> having only one PCP list [1]?
>
> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/

The size of struct per_cpu_pages is 256 bytes in the current code, which
contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations"):

crash> struct per_cpu_pages
struct per_cpu_pages {
    spinlock_t lock;
    int count;
    int high;
    int high_min;
    int high_max;
    int batch;
    u8 flags;
    u8 alloc_factor;
    u8 expire;
    short free_count;
    struct list_head lists[13];
}
SIZE: 256

After reverting commit 5d0a661d808f, the size of struct per_cpu_pages is
272 bytes:

crash> struct per_cpu_pages
struct per_cpu_pages {
    spinlock_t lock;
    int count;
    int high;
    int high_min;
    int high_max;
    int batch;
    u8 flags;
    u8 alloc_factor;
    u8 expire;
    short free_count;
    struct list_head lists[15];
}
SIZE: 272

So commit 5d0a661d808f seems to save one cacheline.
On 2024/6/17 7:55 PM, Barry Song wrote:
> On Mon, Jun 17, 2024 at 7:36 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>> [...]
>>> I don't see an ideal solution unless we bring back the CMA PCP :-)
>>
>> Tend to agree, and if we add a CMA PCP, it seems the overhead can be
>> acceptable?
>
> Yes, probably. Hi Ge,
>
> Could we printk the size of struct per_cpu_pages before and after
> adding 1 to NR_PCP_LISTS? Does it grow by one cacheline?
>
> struct per_cpu_pages {
> 	[...]
> 	/* Lists of pages, one per migrate type stored on the pcp-lists */
> 	struct list_head lists[NR_PCP_LISTS];
> } ____cacheline_aligned_in_smp;

OK. As measured with crash (full dumps elsewhere in the thread): the size
of struct per_cpu_pages is 256 bytes in the current code, which contains
commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
allocations"), with struct list_head lists[13]. After reverting commit
5d0a661d808f, the size is 272 bytes, with struct list_head lists[15]. So
commit 5d0a661d808f seems to save one cacheline.
On 2024/6/18 9:55 AM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote:
>> [...]
>> The vast majority of THP-sized allocations are GFP_MOVABLE, so keeping
>> CMA THPs out of the pcp entirely may incur a slight performance
>> penalty.
>
> But the majority of movable pages aren't CMA, right?

Yes.

> Do we have an estimate for adding back a CMA THP PCP? Will
> per_cpu_pages grow by a new cacheline, which the original intention for
> THP was to avoid by having only one PCP list [1]?
>
> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>
>> Commit 1d91df85f399 takes a similar filtering approach, and I mainly
>> referred to it.
On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote:
>
> On 2024/6/18 9:55 AM, Barry Song wrote:
>> [...]
>> Do we have an estimate for adding back a CMA THP PCP? Will
>> per_cpu_pages grow by a new cacheline, which the original intention
>> for THP was to avoid by having only one PCP list [1]?
>>
>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>
> The size of struct per_cpu_pages is 256 bytes in the current code,
> which contains commit 5d0a661d808f, with struct list_head lists[13].
> After reverting commit 5d0a661d808f, the size is 272 bytes, with
> struct list_head lists[15]. So commit 5d0a661d808f seems to save one
> cacheline.

The proposal is not to revert the patch but to add one CMA pcp list, so
it would be "struct list_head lists[14]"; in that case, is the size
still 256?
On 2024/6/18 12:10 PM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote:
>> [...]
>> The size of struct per_cpu_pages is 256 bytes in the current code,
>> which contains commit 5d0a661d808f, with struct list_head lists[13].
>> After reverting commit 5d0a661d808f, the size is 272 bytes, with
>> struct list_head lists[15]. So commit 5d0a661d808f seems to save one
>> cacheline.
>
> The proposal is not to revert the patch but to add one CMA pcp list, so
> it would be "struct list_head lists[14]"; in that case, is the size
> still 256?

Yes, if we only add one CMA pcp list, the size is still 256 bytes. Adding
one CMA pcp list seems more reasonable.
在 2024/6/18 下午12:10, Barry Song 写道: > On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote: >> >> >> >> 在 2024/6/18 上午9:55, Barry Song 写道: >>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote: >>>> >>>> >>>> >>>> 在 2024/6/17 下午8:47, yangge1116 写道: >>>>> >>>>> >>>>> 在 2024/6/17 下午6:26, Barry Song 写道: >>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote: >>>>>>> >>>>>>> From: yangge <yangge1116@126.com> >>>>>>> >>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for >>>>>>> THP-sized allocations") no longer differentiates the migration type >>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from >>>>>>> the list, in some cases, it's not acceptable, for example, allocating >>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page. >>>>>>> >>>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized >>>>>>> PCP list to avoid the issue above. >>>>>> >>>>>> Could you please describe the impact on users in the commit log? >>>>> >>>>> If a large number of CMA memory are configured in the system (for >>>>> example, the CMA memory accounts for 50% of the system memory), starting >>>>> virtual machine with device passthrough will get stuck. >>>>> >>>>> During starting virtual machine, it will call pin_user_pages_remote(..., >>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area, >>>>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA >>>>> area because of FOLL_LONGTERM flag. If non-movable allocation requests >>>>> return CMA memory, pin_user_pages_remote() will enter endless loops. >>>>> >>>>> backtrace: >>>>> pin_user_pages_remote >>>>> ----__gup_longterm_locked //cause endless loops in this function >>>>> --------__get_user_pages_locked >>>>> --------check_and_migrate_movable_pages //always check fail and continue >>>>> to migrate >>>>> ------------migrate_longterm_unpinnable_pages >>>>> ----------------alloc_migration_target // non-movable allocation >>>>> >>>>>> Is it possible that some CMA memory might be used by non-movable >>>>>> allocation requests? >>>>> >>>>> Yes. >>>>> >>>>> >>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc() >>>>>> to fail? >>>>> >>>>> >>>>> No, it will cause endless loops in __gup_longterm_locked(). If >>>>> non-movable allocation requests return CMA memory, >>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another >>>>> CMA page, which is useless and cause endless loops in >>>>> __gup_longterm_locked(). >>> >>> This is only one perspective. We also need to consider the impact on >>> CMA itself. For example, >>> when CMA is borrowed by THP, and we need to reclaim it through >>> cma_alloc() or dma_alloc_coherent(), >>> we must move those pages out to ensure CMA's users can retrieve that >>> contiguous memory. >>> >>> Currently, CMA's memory is occupied by non-movable pages, meaning we >>> can't relocate them. >>> As a result, cma_alloc() is more likely to fail. 
>>> >>>>> >>>>> backtrace: >>>>> pin_user_pages_remote >>>>> ----__gup_longterm_locked //cause endless loops in this function >>>>> --------__get_user_pages_locked >>>>> --------check_and_migrate_movable_pages //always check fail and continue >>>>> to migrate >>>>> ------------migrate_longterm_unpinnable_pages >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>>> >>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for >>>>>>> THP-sized allocations") >>>>>>> Signed-off-by: yangge <yangge1116@126.com> >>>>>>> --- >>>>>>> mm/page_alloc.c | 10 ++++++++++ >>>>>>> 1 file changed, 10 insertions(+) >>>>>>> >>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>>>> index 2e22ce5..0bdf471 100644 >>>>>>> --- a/mm/page_alloc.c >>>>>>> +++ b/mm/page_alloc.c >>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone >>>>>>> *preferred_zone, >>>>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); >>>>>>> >>>>>>> if (likely(pcp_allowed_order(order))) { >>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >>>>>>> + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & >>>>>>> ALLOC_CMA || >>>>>>> + order != >>>>>>> HPAGE_PMD_ORDER) { >>>>>>> + page = rmqueue_pcplist(preferred_zone, zone, >>>>>>> order, >>>>>>> + migratetype, >>>>>>> alloc_flags); >>>>>>> + if (likely(page)) >>>>>>> + goto out; >>>>>>> + } >>>>>> >>>>>> This seems not ideal, because non-CMA THP gets no chance to use PCP. >>>>>> But it >>>>>> still seems better than causing the failure of CMA allocation. >>>>>> >>>>>> Is there a possible approach to avoiding adding CMA THP into pcp from >>>>>> the first >>>>>> beginning? Otherwise, we might need a separate PCP for CMA. >>>>>> >>>> >>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding >>>> adding CMA THP into pcp may incur a slight performance penalty. >>>> >>> >>> But the majority of movable pages aren't CMA, right? >> >>> Do we have an estimate for >>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which >>> the original intention for THP was to avoid by having only one PCP[1]? >>> >>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/ >>> >> >> The size of struct per_cpu_pages is 256 bytes in current code containing >> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized >> allocations"). >> crash> struct per_cpu_pages >> struct per_cpu_pages { >> spinlock_t lock; >> int count; >> int high; >> int high_min; >> int high_max; >> int batch; >> u8 flags; >> u8 alloc_factor; >> u8 expire; >> short free_count; >> struct list_head lists[13]; >> } >> SIZE: 256 >> >> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP list >> for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes. >> crash> struct per_cpu_pages >> struct per_cpu_pages { >> spinlock_t lock; >> int count; >> int high; >> int high_min; >> int high_max; >> int batch; >> u8 flags; >> u8 alloc_factor; >> u8 expire; >> short free_count; >> struct list_head lists[15]; >> } >> SIZE: 272 >> >> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for >> THP-sized allocations") decrease one cacheline. > > the proposal is not reverting the patch but adding one CMA pcp. > so it is "struct list_head lists[14]"; in this case, the size is still > 256? > Yes, the size is still 256. If add one PCP list, we will have 2 PCP lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right? 
> >> >>> >>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly >>>> refer to it. >>>> >>>> >>>>>>> +#else >>>>>>> page = rmqueue_pcplist(preferred_zone, zone, order, >>>>>>> migratetype, alloc_flags); >>>>>>> if (likely(page)) >>>>>>> goto out; >>>>>>> +#endif >>>>>>> } >>>>>>> >>>>>>> page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags, >>>>>>> -- >>>>>>> 2.7.4 >>>>>> >>>>>> Thanks >>>>>> Barry >>>>>> >>>> >>>> >>
On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@126.com> wrote: > > > > 在 2024/6/18 下午12:10, Barry Song 写道: > > On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote: > >> > >> > >> > >> 在 2024/6/18 上午9:55, Barry Song 写道: > >>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote: > >>>> > >>>> > >>>> > >>>> 在 2024/6/17 下午8:47, yangge1116 写道: > >>>>> > >>>>> > >>>>> 在 2024/6/17 下午6:26, Barry Song 写道: > >>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote: > >>>>>>> > >>>>>>> From: yangge <yangge1116@126.com> > >>>>>>> > >>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >>>>>>> THP-sized allocations") no longer differentiates the migration type > >>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from > >>>>>>> the list, in some cases, it's not acceptable, for example, allocating > >>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page. > >>>>>>> > >>>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized > >>>>>>> PCP list to avoid the issue above. > >>>>>> > >>>>>> Could you please describe the impact on users in the commit log? > >>>>> > >>>>> If a large number of CMA memory are configured in the system (for > >>>>> example, the CMA memory accounts for 50% of the system memory), starting > >>>>> virtual machine with device passthrough will get stuck. > >>>>> > >>>>> During starting virtual machine, it will call pin_user_pages_remote(..., > >>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area, > >>>>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA > >>>>> area because of FOLL_LONGTERM flag. If non-movable allocation requests > >>>>> return CMA memory, pin_user_pages_remote() will enter endless loops. > >>>>> > >>>>> backtrace: > >>>>> pin_user_pages_remote > >>>>> ----__gup_longterm_locked //cause endless loops in this function > >>>>> --------__get_user_pages_locked > >>>>> --------check_and_migrate_movable_pages //always check fail and continue > >>>>> to migrate > >>>>> ------------migrate_longterm_unpinnable_pages > >>>>> ----------------alloc_migration_target // non-movable allocation > >>>>> > >>>>>> Is it possible that some CMA memory might be used by non-movable > >>>>>> allocation requests? > >>>>> > >>>>> Yes. > >>>>> > >>>>> > >>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc() > >>>>>> to fail? > >>>>> > >>>>> > >>>>> No, it will cause endless loops in __gup_longterm_locked(). If > >>>>> non-movable allocation requests return CMA memory, > >>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another > >>>>> CMA page, which is useless and cause endless loops in > >>>>> __gup_longterm_locked(). > >>> > >>> This is only one perspective. We also need to consider the impact on > >>> CMA itself. For example, > >>> when CMA is borrowed by THP, and we need to reclaim it through > >>> cma_alloc() or dma_alloc_coherent(), > >>> we must move those pages out to ensure CMA's users can retrieve that > >>> contiguous memory. > >>> > >>> Currently, CMA's memory is occupied by non-movable pages, meaning we > >>> can't relocate them. > >>> As a result, cma_alloc() is more likely to fail. 
> >>> > >>>>> > >>>>> backtrace: > >>>>> pin_user_pages_remote > >>>>> ----__gup_longterm_locked //cause endless loops in this function > >>>>> --------__get_user_pages_locked > >>>>> --------check_and_migrate_movable_pages //always check fail and continue > >>>>> to migrate > >>>>> ------------migrate_longterm_unpinnable_pages > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>>>> > >>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >>>>>>> THP-sized allocations") > >>>>>>> Signed-off-by: yangge <yangge1116@126.com> > >>>>>>> --- > >>>>>>> mm/page_alloc.c | 10 ++++++++++ > >>>>>>> 1 file changed, 10 insertions(+) > >>>>>>> > >>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >>>>>>> index 2e22ce5..0bdf471 100644 > >>>>>>> --- a/mm/page_alloc.c > >>>>>>> +++ b/mm/page_alloc.c > >>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone > >>>>>>> *preferred_zone, > >>>>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); > >>>>>>> > >>>>>>> if (likely(pcp_allowed_order(order))) { > >>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > >>>>>>> + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & > >>>>>>> ALLOC_CMA || > >>>>>>> + order != > >>>>>>> HPAGE_PMD_ORDER) { > >>>>>>> + page = rmqueue_pcplist(preferred_zone, zone, > >>>>>>> order, > >>>>>>> + migratetype, > >>>>>>> alloc_flags); > >>>>>>> + if (likely(page)) > >>>>>>> + goto out; > >>>>>>> + } > >>>>>> > >>>>>> This seems not ideal, because non-CMA THP gets no chance to use PCP. > >>>>>> But it > >>>>>> still seems better than causing the failure of CMA allocation. > >>>>>> > >>>>>> Is there a possible approach to avoiding adding CMA THP into pcp from > >>>>>> the first > >>>>>> beginning? Otherwise, we might need a separate PCP for CMA. > >>>>>> > >>>> > >>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding > >>>> adding CMA THP into pcp may incur a slight performance penalty. > >>>> > >>> > >>> But the majority of movable pages aren't CMA, right? > >> > >>> Do we have an estimate for > >>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which > >>> the original intention for THP was to avoid by having only one PCP[1]? > >>> > >>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/ > >>> > >> > >> The size of struct per_cpu_pages is 256 bytes in current code containing > >> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized > >> allocations"). > >> crash> struct per_cpu_pages > >> struct per_cpu_pages { > >> spinlock_t lock; > >> int count; > >> int high; > >> int high_min; > >> int high_max; > >> int batch; > >> u8 flags; > >> u8 alloc_factor; > >> u8 expire; > >> short free_count; > >> struct list_head lists[13]; > >> } > >> SIZE: 256 > >> > >> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP list > >> for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes. > >> crash> struct per_cpu_pages > >> struct per_cpu_pages { > >> spinlock_t lock; > >> int count; > >> int high; > >> int high_min; > >> int high_max; > >> int batch; > >> u8 flags; > >> u8 alloc_factor; > >> u8 expire; > >> short free_count; > >> struct list_head lists[15]; > >> } > >> SIZE: 272 > >> > >> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >> THP-sized allocations") decrease one cacheline. > > > > the proposal is not reverting the patch but adding one CMA pcp. > > so it is "struct list_head lists[14]"; in this case, the size is still > > 256? 
> > > > Yes, the size is still 256. If add one PCP list, we will have 2 PCP > lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other > PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right? I am not quite sure about MIGRATE_RECLAIMABLE, as we want CMA to be used only by movable allocations. So it might be: MOVABLE and NON-MOVABLE. > > > > >> > >>> > >>>> > >>>>> > >>>>>> > >>>>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly > >>>>>>> refer to it. > >>>>>>> > >>>>>>> > >>>>>>>>>> +#else > >>>>>>>>>> page = rmqueue_pcplist(preferred_zone, zone, order, > >>>>>>>>>> migratetype, alloc_flags); > >>>>>>>>>> if (likely(page)) > >>>>>>>>>> goto out; > >>>>>>>>>> +#endif > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> page = rmqueue_buddy(preferred_zone, zone, order, > >>>>>>>>>> alloc_flags, > >>>>>>>>>> -- > >>>>>>>>>> 2.7.4 > >>>>>>>>> > >>>>>>>>> Thanks > >>>>>>>>> Barry > >>>>>>>>> > >>>> > >>>> > >> >
在 2024/6/18 下午2:58, Barry Song 写道: > On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@126.com> wrote: >> >> >> >> 在 2024/6/18 下午12:10, Barry Song 写道: >>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote: >>>> >>>> >>>> >>>> 在 2024/6/18 上午9:55, Barry Song 写道: >>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> 在 2024/6/17 下午8:47, yangge1116 写道: >>>>>>> >>>>>>> >>>>>>> 在 2024/6/17 下午6:26, Barry Song 写道: >>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote: >>>>>>>>> >>>>>>>>> From: yangge <yangge1116@126.com> >>>>>>>>> >>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for >>>>>>>>> THP-sized allocations") no longer differentiates the migration type >>>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from >>>>>>>>> the list, in some cases, it's not acceptable, for example, allocating >>>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page. >>>>>>>>> >>>>>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized >>>>>>>>> PCP list to avoid the issue above. >>>>>>>> >>>>>>>> Could you please describe the impact on users in the commit log? >>>>>>> >>>>>>> If a large number of CMA memory are configured in the system (for >>>>>>> example, the CMA memory accounts for 50% of the system memory), starting >>>>>>> virtual machine with device passthrough will get stuck. >>>>>>> >>>>>>> During starting virtual machine, it will call pin_user_pages_remote(..., >>>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area, >>>>>>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA >>>>>>> area because of FOLL_LONGTERM flag. If non-movable allocation requests >>>>>>> return CMA memory, pin_user_pages_remote() will enter endless loops. >>>>>>> >>>>>>> backtrace: >>>>>>> pin_user_pages_remote >>>>>>> ----__gup_longterm_locked //cause endless loops in this function >>>>>>> --------__get_user_pages_locked >>>>>>> --------check_and_migrate_movable_pages //always check fail and continue >>>>>>> to migrate >>>>>>> ------------migrate_longterm_unpinnable_pages >>>>>>> ----------------alloc_migration_target // non-movable allocation >>>>>>> >>>>>>>> Is it possible that some CMA memory might be used by non-movable >>>>>>>> allocation requests? >>>>>>> >>>>>>> Yes. >>>>>>> >>>>>>> >>>>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc() >>>>>>>> to fail? >>>>>>> >>>>>>> >>>>>>> No, it will cause endless loops in __gup_longterm_locked(). If >>>>>>> non-movable allocation requests return CMA memory, >>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another >>>>>>> CMA page, which is useless and cause endless loops in >>>>>>> __gup_longterm_locked(). >>>>> >>>>> This is only one perspective. We also need to consider the impact on >>>>> CMA itself. For example, >>>>> when CMA is borrowed by THP, and we need to reclaim it through >>>>> cma_alloc() or dma_alloc_coherent(), >>>>> we must move those pages out to ensure CMA's users can retrieve that >>>>> contiguous memory. >>>>> >>>>> Currently, CMA's memory is occupied by non-movable pages, meaning we >>>>> can't relocate them. >>>>> As a result, cma_alloc() is more likely to fail. 
>>>>> >>>>>>> >>>>>>> backtrace: >>>>>>> pin_user_pages_remote >>>>>>> ----__gup_longterm_locked //cause endless loops in this function >>>>>>> --------__get_user_pages_locked >>>>>>> --------check_and_migrate_movable_pages //always check fail and continue >>>>>>> to migrate >>>>>>> ------------migrate_longterm_unpinnable_pages >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> >>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for >>>>>>>>> THP-sized allocations") >>>>>>>>> Signed-off-by: yangge <yangge1116@126.com> >>>>>>>>> --- >>>>>>>>> mm/page_alloc.c | 10 ++++++++++ >>>>>>>>> 1 file changed, 10 insertions(+) >>>>>>>>> >>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>>>>>> index 2e22ce5..0bdf471 100644 >>>>>>>>> --- a/mm/page_alloc.c >>>>>>>>> +++ b/mm/page_alloc.c >>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone >>>>>>>>> *preferred_zone, >>>>>>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); >>>>>>>>> >>>>>>>>> if (likely(pcp_allowed_order(order))) { >>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >>>>>>>>> + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & >>>>>>>>> ALLOC_CMA || >>>>>>>>> + order != >>>>>>>>> HPAGE_PMD_ORDER) { >>>>>>>>> + page = rmqueue_pcplist(preferred_zone, zone, >>>>>>>>> order, >>>>>>>>> + migratetype, >>>>>>>>> alloc_flags); >>>>>>>>> + if (likely(page)) >>>>>>>>> + goto out; >>>>>>>>> + } >>>>>>>> >>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use PCP. >>>>>>>> But it >>>>>>>> still seems better than causing the failure of CMA allocation. >>>>>>>> >>>>>>>> Is there a possible approach to avoiding adding CMA THP into pcp from >>>>>>>> the first >>>>>>>> beginning? Otherwise, we might need a separate PCP for CMA. >>>>>>>> >>>>>> >>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding >>>>>> adding CMA THP into pcp may incur a slight performance penalty. >>>>>> >>>>> >>>>> But the majority of movable pages aren't CMA, right? >>>> >>>>> Do we have an estimate for >>>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which >>>>> the original intention for THP was to avoid by having only one PCP[1]? >>>>> >>>>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/ >>>>> >>>> >>>> The size of struct per_cpu_pages is 256 bytes in current code containing >>>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized >>>> allocations"). >>>> crash> struct per_cpu_pages >>>> struct per_cpu_pages { >>>> spinlock_t lock; >>>> int count; >>>> int high; >>>> int high_min; >>>> int high_max; >>>> int batch; >>>> u8 flags; >>>> u8 alloc_factor; >>>> u8 expire; >>>> short free_count; >>>> struct list_head lists[13]; >>>> } >>>> SIZE: 256 >>>> >>>> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP list >>>> for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes. >>>> crash> struct per_cpu_pages >>>> struct per_cpu_pages { >>>> spinlock_t lock; >>>> int count; >>>> int high; >>>> int high_min; >>>> int high_max; >>>> int batch; >>>> u8 flags; >>>> u8 alloc_factor; >>>> u8 expire; >>>> short free_count; >>>> struct list_head lists[15]; >>>> } >>>> SIZE: 272 >>>> >>>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for >>>> THP-sized allocations") decrease one cacheline. >>> >>> the proposal is not reverting the patch but adding one CMA pcp. >>> so it is "struct list_head lists[14]"; in this case, the size is still >>> 256? 
>>> >> >> Yes, the size is still 256. If add one PCP list, we will have 2 PCP >> lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other >> PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right? > > I am not quite sure about MIGRATE_RECLAIMABLE, as we want > CMA to be used only by movable allocations. > So it might be: > MOVABLE and NON-MOVABLE. Using one PCP list for UNMOVABLE pages and the other for MOVABLE pages seems feasible. The UNMOVABLE PCP list would hold MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE pages, and the MOVABLE PCP list would hold MIGRATE_MOVABLE pages. > >> >>> >>>> >>>>> >>>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly >>>>>> refer to it. >>>>>> >>>>>> >>>>>>>>> +#else >>>>>>>>> page = rmqueue_pcplist(preferred_zone, zone, order, >>>>>>>>> migratetype, alloc_flags); >>>>>>>>> if (likely(page)) >>>>>>>>> goto out; >>>>>>>>> +#endif >>>>>>>>> } >>>>>>>>> >>>>>>>>> page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags, >>>>>>>>> -- >>>>>>>>> 2.7.4 >>>>>>>> >>>>>>>> Thanks >>>>>>>> Barry >>>>>>>> >>>>>> >>>>>> >>>> >>
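What makes the MOVABLE/NON-MOVABLE split sufficient is the free side: MIGRATE_CMA sits outside MIGRATE_PCPTYPES, and the PCP free path folds such pages into the movable bucket, so freed CMA THPs can only ever land on the movable list. A rough sketch of that selection, paraphrased from memory of the free_unref_page() handling (the helper name and PCP_THP_* constants are made up here; check the actual code):

#define MIGRATE_UNMOVABLE	0
#define MIGRATE_MOVABLE		1
#define MIGRATE_RECLAIMABLE	2
#define MIGRATE_PCPTYPES	3
#define PCP_THP_MOVABLE		0
#define PCP_THP_UNMOVABLE	1

/*
 * Sketch only: which of the two proposed THP PCP lists a freed page
 * would land on, given its pageblock migratetype.  MIGRATE_CMA (and
 * MIGRATE_HIGHATOMIC) are >= MIGRATE_PCPTYPES, and the free path
 * already treats them as movable for PCP purposes.
 */
static int thp_free_pcp_list(int migratetype)
{
	if (migratetype >= MIGRATE_PCPTYPES)	/* e.g. MIGRATE_CMA */
		migratetype = MIGRATE_MOVABLE;

	return migratetype == MIGRATE_MOVABLE ? PCP_THP_MOVABLE
					      : PCP_THP_UNMOVABLE;
}

With that invariant, a PF_MEMALLOC_PIN (non-ALLOC_CMA) request served from the non-movable THP list can never receive a CMA page, which is the bug being fixed.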
在 2024/6/18 15:51, yangge1116 写道: > > > 在 2024/6/18 下午2:58, Barry Song 写道: >> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@126.com> wrote: >>> >>> >>> >>> 在 2024/6/18 下午12:10, Barry Song 写道: >>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote: >>>>> >>>>> >>>>> >>>>> 在 2024/6/18 上午9:55, Barry Song 写道: >>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> 在 2024/6/17 下午8:47, yangge1116 写道: >>>>>>>> >>>>>>>> >>>>>>>> 在 2024/6/17 下午6:26, Barry Song 写道: >>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote: >>>>>>>>>> >>>>>>>>>> From: yangge <yangge1116@126.com> >>>>>>>>>> >>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP >>>>>>>>>> list for >>>>>>>>>> THP-sized allocations") no longer differentiates the migration >>>>>>>>>> type >>>>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA >>>>>>>>>> page from >>>>>>>>>> the list, in some cases, it's not acceptable, for example, >>>>>>>>>> allocating >>>>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page. >>>>>>>>>> >>>>>>>>>> The patch forbids allocating non-CMA THP-sized page from >>>>>>>>>> THP-sized >>>>>>>>>> PCP list to avoid the issue above. >>>>>>>>> >>>>>>>>> Could you please describe the impact on users in the commit log? >>>>>>>> >>>>>>>> If a large number of CMA memory are configured in the system (for >>>>>>>> example, the CMA memory accounts for 50% of the system memory), >>>>>>>> starting >>>>>>>> virtual machine with device passthrough will get stuck. >>>>>>>> >>>>>>>> During starting virtual machine, it will call >>>>>>>> pin_user_pages_remote(..., >>>>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area, >>>>>>>> pin_user_pages_remote() will migrate the page from CMA area to >>>>>>>> non-CMA >>>>>>>> area because of FOLL_LONGTERM flag. If non-movable allocation >>>>>>>> requests >>>>>>>> return CMA memory, pin_user_pages_remote() will enter endless >>>>>>>> loops. >>>>>>>> >>>>>>>> backtrace: >>>>>>>> pin_user_pages_remote >>>>>>>> ----__gup_longterm_locked //cause endless loops in this function >>>>>>>> --------__get_user_pages_locked >>>>>>>> --------check_and_migrate_movable_pages //always check fail and >>>>>>>> continue >>>>>>>> to migrate >>>>>>>> ------------migrate_longterm_unpinnable_pages >>>>>>>> ----------------alloc_migration_target // non-movable allocation >>>>>>>> >>>>>>>>> Is it possible that some CMA memory might be used by non-movable >>>>>>>>> allocation requests? >>>>>>>> >>>>>>>> Yes. >>>>>>>> >>>>>>>> >>>>>>>>> If so, will CMA somehow become unable to migrate, causing >>>>>>>>> cma_alloc() >>>>>>>>> to fail? >>>>>>>> >>>>>>>> >>>>>>>> No, it will cause endless loops in __gup_longterm_locked(). If >>>>>>>> non-movable allocation requests return CMA memory, >>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to >>>>>>>> another >>>>>>>> CMA page, which is useless and cause endless loops in >>>>>>>> __gup_longterm_locked(). >>>>>> >>>>>> This is only one perspective. We also need to consider the impact on >>>>>> CMA itself. For example, >>>>>> when CMA is borrowed by THP, and we need to reclaim it through >>>>>> cma_alloc() or dma_alloc_coherent(), >>>>>> we must move those pages out to ensure CMA's users can retrieve that >>>>>> contiguous memory. >>>>>> >>>>>> Currently, CMA's memory is occupied by non-movable pages, meaning we >>>>>> can't relocate them. >>>>>> As a result, cma_alloc() is more likely to fail. 
>>>>>> >>>>>>>> >>>>>>>> backtrace: >>>>>>>> pin_user_pages_remote >>>>>>>> ----__gup_longterm_locked //cause endless loops in this function >>>>>>>> --------__get_user_pages_locked >>>>>>>> --------check_and_migrate_movable_pages //always check fail and >>>>>>>> continue >>>>>>>> to migrate >>>>>>>> ------------migrate_longterm_unpinnable_pages >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> >>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for >>>>>>>>>> THP-sized allocations") >>>>>>>>>> Signed-off-by: yangge <yangge1116@126.com> >>>>>>>>>> --- >>>>>>>>>> mm/page_alloc.c | 10 ++++++++++ >>>>>>>>>> 1 file changed, 10 insertions(+) >>>>>>>>>> >>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>>>>>>> index 2e22ce5..0bdf471 100644 >>>>>>>>>> --- a/mm/page_alloc.c >>>>>>>>>> +++ b/mm/page_alloc.c >>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone >>>>>>>>>> *preferred_zone, >>>>>>>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order >>>>>>>>>> > 1)); >>>>>>>>>> >>>>>>>>>> if (likely(pcp_allowed_order(order))) { >>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >>>>>>>>>> + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & >>>>>>>>>> ALLOC_CMA || >>>>>>>>>> + order != >>>>>>>>>> HPAGE_PMD_ORDER) { >>>>>>>>>> + page = rmqueue_pcplist(preferred_zone, >>>>>>>>>> zone, >>>>>>>>>> order, >>>>>>>>>> + migratetype, >>>>>>>>>> alloc_flags); >>>>>>>>>> + if (likely(page)) >>>>>>>>>> + goto out; >>>>>>>>>> + } >>>>>>>>> >>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use >>>>>>>>> PCP. >>>>>>>>> But it >>>>>>>>> still seems better than causing the failure of CMA allocation. >>>>>>>>> >>>>>>>>> Is there a possible approach to avoiding adding CMA THP into >>>>>>>>> pcp from >>>>>>>>> the first >>>>>>>>> beginning? Otherwise, we might need a separate PCP for CMA. >>>>>>>>> >>>>>>> >>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding >>>>>>> adding CMA THP into pcp may incur a slight performance penalty. >>>>>>> >>>>>> >>>>>> But the majority of movable pages aren't CMA, right? >>>>> >>>>>> Do we have an estimate for >>>>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new >>>>>> cacheline, which >>>>>> the original intention for THP was to avoid by having only one >>>>>> PCP[1]? >>>>>> >>>>>> [1] >>>>>> https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/ >>>>>> >>>>> >>>>> The size of struct per_cpu_pages is 256 bytes in current code >>>>> containing >>>>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for >>>>> THP-sized >>>>> allocations"). >>>>> crash> struct per_cpu_pages >>>>> struct per_cpu_pages { >>>>> spinlock_t lock; >>>>> int count; >>>>> int high; >>>>> int high_min; >>>>> int high_max; >>>>> int batch; >>>>> u8 flags; >>>>> u8 alloc_factor; >>>>> u8 expire; >>>>> short free_count; >>>>> struct list_head lists[13]; >>>>> } >>>>> SIZE: 256 >>>>> >>>>> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP >>>>> list >>>>> for THP-sized allocations"), the size of struct per_cpu_pages is >>>>> 272 bytes. 
>>>>> crash> struct per_cpu_pages >>>>> struct per_cpu_pages { >>>>> spinlock_t lock; >>>>> int count; >>>>> int high; >>>>> int high_min; >>>>> int high_max; >>>>> int batch; >>>>> u8 flags; >>>>> u8 alloc_factor; >>>>> u8 expire; >>>>> short free_count; >>>>> struct list_head lists[15]; >>>>> } >>>>> SIZE: 272 >>>>> >>>>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for >>>>> THP-sized allocations") decrease one cacheline. >>>> >>>> the proposal is not reverting the patch but adding one CMA pcp. >>>> so it is "struct list_head lists[14]"; in this case, the size is still >>>> 256? >>>> >>> >>> Yes, the size is still 256. If add one PCP list, we will have 2 PCP >>> lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other >>> PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that >>> right? >> >> I am not quite sure about MIGRATE_RECLAIMABLE, as we want >> CMA to be used only by movable allocations. >> So it might be: >> MOVABLE and NON-MOVABLE. > > Using one PCP list for UNMOVABLE pages and the other for MOVABLE pages > seems feasible. The UNMOVABLE PCP list would hold MIGRATE_UNMOVABLE and > MIGRATE_RECLAIMABLE pages, and the MOVABLE PCP list would hold > MIGRATE_MOVABLE pages. >
Is the following modification feasible?

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_PCP_THP 1
+#define NR_PCP_THP 2
+#define PCP_THP_MOVABLE 0
+#define PCP_THP_UNMOVABLE 1
 #else
 #define NR_PCP_THP 0
 #endif

 static inline unsigned int order_to_pindex(int migratetype, int order)
 {
+	int pcp_type = migratetype;
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (order > PAGE_ALLOC_COSTLY_ORDER) {
 		VM_BUG_ON(order != HPAGE_PMD_ORDER);
-		return NR_LOWORDER_PCP_LISTS;
+
+		if (migratetype != MIGRATE_MOVABLE)
+			pcp_type = PCP_THP_UNMOVABLE;
+		else
+			pcp_type = PCP_THP_MOVABLE;
+
+		return NR_LOWORDER_PCP_LISTS + pcp_type;
 	}
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
 #endif

-	return (MIGRATE_PCPTYPES * order) + migratetype;
+	return (MIGRATE_PCPTYPES * order) + pcp_type;
 }

@@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex)
 	int order = pindex / MIGRATE_PCPTYPES;

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (pindex == NR_LOWORDER_PCP_LISTS)
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
 		order = HPAGE_PMD_ORDER;
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);

>> >>> >>>> >>>>> >>>>>> >>>>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly >>>>>>> refer to it. >>>>>>> >>>>>>> >>>>>>>>>> +#else >>>>>>>>>> page = rmqueue_pcplist(preferred_zone, >>>>>>>>>> zone, order, >>>>>>>>>> migratetype, >>>>>>>>>> alloc_flags); >>>>>>>>>> if (likely(page)) >>>>>>>>>> goto out; >>>>>>>>>> +#endif >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> page = rmqueue_buddy(preferred_zone, zone, order, >>>>>>>>>> alloc_flags, >>>>>>>>>> -- >>>>>>>>>> 2.7.4 >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Barry >>>>>>>>> >>>>>>> >>>>>>> >>>>> >>>
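One property worth checking in the proposal above is that order_to_pindex() and pindex_to_order() remain inverses for both THP lists. Below is a standalone userspace check with the kernel constants hard-coded to common x86-64 values (MIGRATE_PCPTYPES = 3, PAGE_ALLOC_COSTLY_ORDER = 3, HPAGE_PMD_ORDER = 9, hence NR_LOWORDER_PCP_LISTS = 12); it mirrors the proposed helpers, not the kernel's actual code:

#include <assert.h>
#include <stdio.h>

#define MIGRATE_UNMOVABLE	0
#define MIGRATE_MOVABLE		1
#define MIGRATE_RECLAIMABLE	2
#define MIGRATE_PCPTYPES	3
#define PAGE_ALLOC_COSTLY_ORDER	3
#define HPAGE_PMD_ORDER		9
#define NR_LOWORDER_PCP_LISTS	(MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define PCP_THP_MOVABLE		0
#define PCP_THP_UNMOVABLE	1

/* Mirrors the proposed order_to_pindex() above. */
static unsigned int order_to_pindex(int migratetype, int order)
{
	int pcp_type = migratetype;

	if (order > PAGE_ALLOC_COSTLY_ORDER) {
		pcp_type = migratetype == MIGRATE_MOVABLE ?
			   PCP_THP_MOVABLE : PCP_THP_UNMOVABLE;
		return NR_LOWORDER_PCP_LISTS + pcp_type;
	}
	return MIGRATE_PCPTYPES * order + pcp_type;
}

/* Mirrors the proposed pindex_to_order(): any pindex past the
 * low-order lists maps back to the THP order. */
static int pindex_to_order(unsigned int pindex)
{
	int order = pindex / MIGRATE_PCPTYPES;

	if (order > PAGE_ALLOC_COSTLY_ORDER)
		order = HPAGE_PMD_ORDER;
	return order;
}

int main(void)
{
	for (int mt = MIGRATE_UNMOVABLE; mt < MIGRATE_PCPTYPES; mt++) {
		for (int order = 0; order <= PAGE_ALLOC_COSTLY_ORDER; order++)
			assert(pindex_to_order(order_to_pindex(mt, order)) == order);
		assert(pindex_to_order(order_to_pindex(mt, HPAGE_PMD_ORDER)) ==
		       HPAGE_PMD_ORDER);
	}
	/* Movable THPs -> pindex 12, unmovable/reclaimable -> pindex 13. */
	printf("movable THP pindex: %u, non-movable THP pindex: %u\n",
	       order_to_pindex(MIGRATE_MOVABLE, HPAGE_PMD_ORDER),
	       order_to_pindex(MIGRATE_UNMOVABLE, HPAGE_PMD_ORDER));
	return 0;
}

Which THP list gets which index is arbitrary, as long as the allocation and free paths agree; the minimal variant further down puts the movable list at NR_LOWORDER_PCP_LISTS + 1 instead, and the same check passes.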
On Wed, Jun 19, 2024 at 5:35 PM Ge Yang <yangge1116@126.com> wrote: > > > > 在 2024/6/18 15:51, yangge1116 写道: > > > > > > 在 2024/6/18 下午2:58, Barry Song 写道: > >> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@126.com> wrote: > >>> > >>> > >>> > >>> 在 2024/6/18 下午12:10, Barry Song 写道: > >>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote: > >>>>> > >>>>> > >>>>> > >>>>> 在 2024/6/18 上午9:55, Barry Song 写道: > >>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> > >>>>>> wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> 在 2024/6/17 下午8:47, yangge1116 写道: > >>>>>>>> > >>>>>>>> > >>>>>>>> 在 2024/6/17 下午6:26, Barry Song 写道: > >>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote: > >>>>>>>>>> > >>>>>>>>>> From: yangge <yangge1116@126.com> > >>>>>>>>>> > >>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP > >>>>>>>>>> list for > >>>>>>>>>> THP-sized allocations") no longer differentiates the migration > >>>>>>>>>> type > >>>>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA > >>>>>>>>>> page from > >>>>>>>>>> the list, in some cases, it's not acceptable, for example, > >>>>>>>>>> allocating > >>>>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page. > >>>>>>>>>> > >>>>>>>>>> The patch forbids allocating non-CMA THP-sized page from > >>>>>>>>>> THP-sized > >>>>>>>>>> PCP list to avoid the issue above. > >>>>>>>>> > >>>>>>>>> Could you please describe the impact on users in the commit log? > >>>>>>>> > >>>>>>>> If a large number of CMA memory are configured in the system (for > >>>>>>>> example, the CMA memory accounts for 50% of the system memory), > >>>>>>>> starting > >>>>>>>> virtual machine with device passthrough will get stuck. > >>>>>>>> > >>>>>>>> During starting virtual machine, it will call > >>>>>>>> pin_user_pages_remote(..., > >>>>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area, > >>>>>>>> pin_user_pages_remote() will migrate the page from CMA area to > >>>>>>>> non-CMA > >>>>>>>> area because of FOLL_LONGTERM flag. If non-movable allocation > >>>>>>>> requests > >>>>>>>> return CMA memory, pin_user_pages_remote() will enter endless > >>>>>>>> loops. > >>>>>>>> > >>>>>>>> backtrace: > >>>>>>>> pin_user_pages_remote > >>>>>>>> ----__gup_longterm_locked //cause endless loops in this function > >>>>>>>> --------__get_user_pages_locked > >>>>>>>> --------check_and_migrate_movable_pages //always check fail and > >>>>>>>> continue > >>>>>>>> to migrate > >>>>>>>> ------------migrate_longterm_unpinnable_pages > >>>>>>>> ----------------alloc_migration_target // non-movable allocation > >>>>>>>> > >>>>>>>>> Is it possible that some CMA memory might be used by non-movable > >>>>>>>>> allocation requests? > >>>>>>>> > >>>>>>>> Yes. > >>>>>>>> > >>>>>>>> > >>>>>>>>> If so, will CMA somehow become unable to migrate, causing > >>>>>>>>> cma_alloc() > >>>>>>>>> to fail? > >>>>>>>> > >>>>>>>> > >>>>>>>> No, it will cause endless loops in __gup_longterm_locked(). If > >>>>>>>> non-movable allocation requests return CMA memory, > >>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to > >>>>>>>> another > >>>>>>>> CMA page, which is useless and cause endless loops in > >>>>>>>> __gup_longterm_locked(). > >>>>>> > >>>>>> This is only one perspective. We also need to consider the impact on > >>>>>> CMA itself. 
For example, > >>>>>> when CMA is borrowed by THP, and we need to reclaim it through > >>>>>> cma_alloc() or dma_alloc_coherent(), > >>>>>> we must move those pages out to ensure CMA's users can retrieve that > >>>>>> contiguous memory. > >>>>>> > >>>>>> Currently, CMA's memory is occupied by non-movable pages, meaning we > >>>>>> can't relocate them. > >>>>>> As a result, cma_alloc() is more likely to fail. > >>>>>> > >>>>>>>> > >>>>>>>> backtrace: > >>>>>>>> pin_user_pages_remote > >>>>>>>> ----__gup_longterm_locked //cause endless loops in this function > >>>>>>>> --------__get_user_pages_locked > >>>>>>>> --------check_and_migrate_movable_pages //always check fail and > >>>>>>>> continue > >>>>>>>> to migrate > >>>>>>>> ------------migrate_longterm_unpinnable_pages > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >>>>>>>>>> THP-sized allocations") > >>>>>>>>>> Signed-off-by: yangge <yangge1116@126.com> > >>>>>>>>>> --- > >>>>>>>>>> mm/page_alloc.c | 10 ++++++++++ > >>>>>>>>>> 1 file changed, 10 insertions(+) > >>>>>>>>>> > >>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >>>>>>>>>> index 2e22ce5..0bdf471 100644 > >>>>>>>>>> --- a/mm/page_alloc.c > >>>>>>>>>> +++ b/mm/page_alloc.c > >>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone > >>>>>>>>>> *preferred_zone, > >>>>>>>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > >>>>>>>>>> > 1)); > >>>>>>>>>> > >>>>>>>>>> if (likely(pcp_allowed_order(order))) { > >>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > >>>>>>>>>> + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & > >>>>>>>>>> ALLOC_CMA || > >>>>>>>>>> + order != > >>>>>>>>>> HPAGE_PMD_ORDER) { > >>>>>>>>>> + page = rmqueue_pcplist(preferred_zone, > >>>>>>>>>> zone, > >>>>>>>>>> order, > >>>>>>>>>> + migratetype, > >>>>>>>>>> alloc_flags); > >>>>>>>>>> + if (likely(page)) > >>>>>>>>>> + goto out; > >>>>>>>>>> + } > >>>>>>>>> > >>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use > >>>>>>>>> PCP. > >>>>>>>>> But it > >>>>>>>>> still seems better than causing the failure of CMA allocation. > >>>>>>>>> > >>>>>>>>> Is there a possible approach to avoiding adding CMA THP into > >>>>>>>>> pcp from > >>>>>>>>> the first > >>>>>>>>> beginning? Otherwise, we might need a separate PCP for CMA. > >>>>>>>>> > >>>>>>> > >>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding > >>>>>>> adding CMA THP into pcp may incur a slight performance penalty. > >>>>>>> > >>>>>> > >>>>>> But the majority of movable pages aren't CMA, right? > >>>>> > >>>>>> Do we have an estimate for > >>>>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new > >>>>>> cacheline, which > >>>>>> the original intention for THP was to avoid by having only one > >>>>>> PCP[1]? > >>>>>> > >>>>>> [1] > >>>>>> https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/ > >>>>>> > >>>>> > >>>>> The size of struct per_cpu_pages is 256 bytes in current code > >>>>> containing > >>>>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >>>>> THP-sized > >>>>> allocations"). 
> >>>>> crash> struct per_cpu_pages > >>>>> struct per_cpu_pages { > >>>>> spinlock_t lock; > >>>>> int count; > >>>>> int high; > >>>>> int high_min; > >>>>> int high_max; > >>>>> int batch; > >>>>> u8 flags; > >>>>> u8 alloc_factor; > >>>>> u8 expire; > >>>>> short free_count; > >>>>> struct list_head lists[13]; > >>>>> } > >>>>> SIZE: 256 > >>>>> > >>>>> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP > >>>>> list > >>>>> for THP-sized allocations"), the size of struct per_cpu_pages is > >>>>> 272 bytes. > >>>>> crash> struct per_cpu_pages > >>>>> struct per_cpu_pages { > >>>>> spinlock_t lock; > >>>>> int count; > >>>>> int high; > >>>>> int high_min; > >>>>> int high_max; > >>>>> int batch; > >>>>> u8 flags; > >>>>> u8 alloc_factor; > >>>>> u8 expire; > >>>>> short free_count; > >>>>> struct list_head lists[15]; > >>>>> } > >>>>> SIZE: 272 > >>>>> > >>>>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >>>>> THP-sized allocations") decrease one cacheline. > >>>> > >>>> the proposal is not reverting the patch but adding one CMA pcp. > >>>> so it is "struct list_head lists[14]"; in this case, the size is still > >>>> 256? > >>>> > >>> > >>> Yes, the size is still 256. If add one PCP list, we will have 2 PCP > >>> lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other > >>> PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that > >>> right? > >> > >> i am not quite sure about MIGRATE_RECLAIMABLE as we want to > >> CMA is only used by movable. > >> So it might be: > >> MOVABLE and NON-MOVABLE. > > > > One PCP list is used by UNMOVABLE pages, and the other PCP list is used > > by MOVABLE pages, seems it is feasible. UNMOVABLE PCP list contains > > MIGRATE_UNMOVABLE pages and MIGRATE_RECLAIMABLE pages, and MOVABLE PCP > > list contains MIGRATE_MOVABLE pages. > > > > Is the following modification feasiable? > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > -#define NR_PCP_THP 1 > +#define NR_PCP_THP 2 > +#define PCP_THP_MOVABLE 0 > +#define PCP_THP_UNMOVABLE 1 > #else > #define NR_PCP_THP 0 > #endif > > static inline unsigned int order_to_pindex(int migratetype, int order) > { > + int pcp_type = migratetype; > + > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > if (order > PAGE_ALLOC_COSTLY_ORDER) { > VM_BUG_ON(order != HPAGE_PMD_ORDER); > - return NR_LOWORDER_PCP_LISTS; > + > + if (migratetype != MIGRATE_MOVABLE) > + pcp_type = PCP_THP_UNMOVABLE; > + else > + pcp_type = PCP_THP_MOVABLE; > + > + return NR_LOWORDER_PCP_LISTS + pcp_type; > } > #else > VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); > #endif > > - return (MIGRATE_PCPTYPES * order) + migratetype; > + return (MIGRATE_PCPTYPES * order) + pcp_type; > } > a minimum change might be, then you can drop most new code. 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 120a317d0938..cfe1e0625e38 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -588,6 +588,7 @@ static void bad_page(struct page *page, const char *reason)
 
 static inline unsigned int order_to_pindex(int migratetype, int order)
 {
+	bool __maybe_unused movable;
 #ifdef CONFIG_CMA
 	/*
 	 * We shouldn't get here for MIGRATE_CMA if those pages don't
@@ -600,7 +601,8 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (order > PAGE_ALLOC_COSTLY_ORDER) {
 		VM_BUG_ON(order != pageblock_order);
-		return NR_LOWORDER_PCP_LISTS;
+		movable = migratetype == MIGRATE_MOVABLE;
+		return NR_LOWORDER_PCP_LISTS + movable;
 	}
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
> > >> >> @@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex) >> int order = pindex / MIGRATE_PCPTYPES; >> >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >> - if (pindex == NR_LOWORDER_PCP_LISTS) >> + if (order > PAGE_ALLOC_COSTLY_ORDER) >> order = HPAGE_PMD_ORDER; >> #else >> VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); >> >> >> >>>> >>>>> >>>>>> >>>>>>> >>>>>>>> >>>>>>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly >>>>>>>>> refer to it. >>>>>>>>> >>>>>>>>> >>>>>>>>>>>> +#else >>>>>>>>>>>> page = rmqueue_pcplist(preferred_zone, >>>>>>>>>>>> zone, order, >>>>>>>>>>>> migratetype, >>>>>>>>>>>> alloc_flags); >>>>>>>>>>>> if (likely(page)) >>>>>>>>>>>> goto out; >>>>>>>>>>>> +#endif >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> page = rmqueue_buddy(preferred_zone, zone, order, >>>>>>>>>>>> alloc_flags, >>>>>>>>>>>> -- >>>>>>>>>>>> 2.7.4 >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Barry >>>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>
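A side note on the sketch above: the __maybe_unused annotation is what keeps a !CONFIG_TRANSPARENT_HUGEPAGE build quiet, since the only use of movable then compiles away. A reduced illustration, with CONFIG_EXAMPLE standing in for the real config option and made-up index constants:

#include <stdbool.h>

/* The kernel's __maybe_unused expands to this attribute. */
#define __maybe_unused __attribute__((unused))

static unsigned int pick_pindex(int migratetype, int order)
{
	bool __maybe_unused movable;

#ifdef CONFIG_EXAMPLE	/* stands in for CONFIG_TRANSPARENT_HUGEPAGE */
	if (order > 3) {
		movable = (migratetype == 1);	/* MIGRATE_MOVABLE */
		return 12 + movable;		/* NR_LOWORDER_PCP_LISTS + movable */
	}
#endif
	return 3 * order + migratetype;
}

With two lists selected purely by pindex, the ALLOC_CMA filter in rmqueue() from v1 would no longer be needed: a non-movable request simply never looks at the list that can hold CMA pages.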
在 2024/6/19 16:20, Barry Song 写道: > On Wed, Jun 19, 2024 at 5:35 PM Ge Yang <yangge1116@126.com> wrote: >> >> >> >> 在 2024/6/18 15:51, yangge1116 写道: >>> >>> >>> 在 2024/6/18 下午2:58, Barry Song 写道: >>>> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@126.com> wrote: >>>>> >>>>> >>>>> >>>>> 在 2024/6/18 下午12:10, Barry Song 写道: >>>>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@126.com> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> 在 2024/6/18 上午9:55, Barry Song 写道: >>>>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@126.com> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> 在 2024/6/17 下午8:47, yangge1116 写道: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> 在 2024/6/17 下午6:26, Barry Song 写道: >>>>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@126.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> From: yangge <yangge1116@126.com> >>>>>>>>>>>> >>>>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP >>>>>>>>>>>> list for >>>>>>>>>>>> THP-sized allocations") no longer differentiates the migration >>>>>>>>>>>> type >>>>>>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA >>>>>>>>>>>> page from >>>>>>>>>>>> the list, in some cases, it's not acceptable, for example, >>>>>>>>>>>> allocating >>>>>>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page. >>>>>>>>>>>> >>>>>>>>>>>> The patch forbids allocating non-CMA THP-sized page from >>>>>>>>>>>> THP-sized >>>>>>>>>>>> PCP list to avoid the issue above. >>>>>>>>>>> >>>>>>>>>>> Could you please describe the impact on users in the commit log? >>>>>>>>>> >>>>>>>>>> If a large number of CMA memory are configured in the system (for >>>>>>>>>> example, the CMA memory accounts for 50% of the system memory), >>>>>>>>>> starting >>>>>>>>>> virtual machine with device passthrough will get stuck. >>>>>>>>>> >>>>>>>>>> During starting virtual machine, it will call >>>>>>>>>> pin_user_pages_remote(..., >>>>>>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area, >>>>>>>>>> pin_user_pages_remote() will migrate the page from CMA area to >>>>>>>>>> non-CMA >>>>>>>>>> area because of FOLL_LONGTERM flag. If non-movable allocation >>>>>>>>>> requests >>>>>>>>>> return CMA memory, pin_user_pages_remote() will enter endless >>>>>>>>>> loops. >>>>>>>>>> >>>>>>>>>> backtrace: >>>>>>>>>> pin_user_pages_remote >>>>>>>>>> ----__gup_longterm_locked //cause endless loops in this function >>>>>>>>>> --------__get_user_pages_locked >>>>>>>>>> --------check_and_migrate_movable_pages //always check fail and >>>>>>>>>> continue >>>>>>>>>> to migrate >>>>>>>>>> ------------migrate_longterm_unpinnable_pages >>>>>>>>>> ----------------alloc_migration_target // non-movable allocation >>>>>>>>>> >>>>>>>>>>> Is it possible that some CMA memory might be used by non-movable >>>>>>>>>>> allocation requests? >>>>>>>>>> >>>>>>>>>> Yes. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> If so, will CMA somehow become unable to migrate, causing >>>>>>>>>>> cma_alloc() >>>>>>>>>>> to fail? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> No, it will cause endless loops in __gup_longterm_locked(). If >>>>>>>>>> non-movable allocation requests return CMA memory, >>>>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to >>>>>>>>>> another >>>>>>>>>> CMA page, which is useless and cause endless loops in >>>>>>>>>> __gup_longterm_locked(). >>>>>>>> >>>>>>>> This is only one perspective. We also need to consider the impact on >>>>>>>> CMA itself. 
For example, >>>>>>>> when CMA is borrowed by THP, and we need to reclaim it through >>>>>>>> cma_alloc() or dma_alloc_coherent(), >>>>>>>> we must move those pages out to ensure CMA's users can retrieve that >>>>>>>> contiguous memory. >>>>>>>> >>>>>>>> Currently, CMA's memory is occupied by non-movable pages, meaning we >>>>>>>> can't relocate them. >>>>>>>> As a result, cma_alloc() is more likely to fail. >>>>>>>> >>>>>>>>>> >>>>>>>>>> backtrace: >>>>>>>>>> pin_user_pages_remote >>>>>>>>>> ----__gup_longterm_locked //cause endless loops in this function >>>>>>>>>> --------__get_user_pages_locked >>>>>>>>>> --------check_and_migrate_movable_pages //always check fail and >>>>>>>>>> continue >>>>>>>>>> to migrate >>>>>>>>>> ------------migrate_longterm_unpinnable_pages >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for >>>>>>>>>>>> THP-sized allocations") >>>>>>>>>>>> Signed-off-by: yangge <yangge1116@126.com> >>>>>>>>>>>> --- >>>>>>>>>>>> mm/page_alloc.c | 10 ++++++++++ >>>>>>>>>>>> 1 file changed, 10 insertions(+) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>>>>>>>>> index 2e22ce5..0bdf471 100644 >>>>>>>>>>>> --- a/mm/page_alloc.c >>>>>>>>>>>> +++ b/mm/page_alloc.c >>>>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone >>>>>>>>>>>> *preferred_zone, >>>>>>>>>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order >>>>>>>>>>>>> 1)); >>>>>>>>>>>> >>>>>>>>>>>> if (likely(pcp_allowed_order(order))) { >>>>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >>>>>>>>>>>> + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & >>>>>>>>>>>> ALLOC_CMA || >>>>>>>>>>>> + order != >>>>>>>>>>>> HPAGE_PMD_ORDER) { >>>>>>>>>>>> + page = rmqueue_pcplist(preferred_zone, >>>>>>>>>>>> zone, >>>>>>>>>>>> order, >>>>>>>>>>>> + migratetype, >>>>>>>>>>>> alloc_flags); >>>>>>>>>>>> + if (likely(page)) >>>>>>>>>>>> + goto out; >>>>>>>>>>>> + } >>>>>>>>>>> >>>>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use >>>>>>>>>>> PCP. >>>>>>>>>>> But it >>>>>>>>>>> still seems better than causing the failure of CMA allocation. >>>>>>>>>>> >>>>>>>>>>> Is there a possible approach to avoiding adding CMA THP into >>>>>>>>>>> pcp from >>>>>>>>>>> the first >>>>>>>>>>> beginning? Otherwise, we might need a separate PCP for CMA. >>>>>>>>>>> >>>>>>>>> >>>>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding >>>>>>>>> adding CMA THP into pcp may incur a slight performance penalty. >>>>>>>>> >>>>>>>> >>>>>>>> But the majority of movable pages aren't CMA, right? >>>>>>> >>>>>>>> Do we have an estimate for >>>>>>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new >>>>>>>> cacheline, which >>>>>>>> the original intention for THP was to avoid by having only one >>>>>>>> PCP[1]? >>>>>>>> >>>>>>>> [1] >>>>>>>> https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/ >>>>>>>> >>>>>>> >>>>>>> The size of struct per_cpu_pages is 256 bytes in current code >>>>>>> containing >>>>>>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for >>>>>>> THP-sized >>>>>>> allocations"). 
>>>>>>> crash> struct per_cpu_pages >>>>>>> struct per_cpu_pages { >>>>>>> spinlock_t lock; >>>>>>> int count; >>>>>>> int high; >>>>>>> int high_min; >>>>>>> int high_max; >>>>>>> int batch; >>>>>>> u8 flags; >>>>>>> u8 alloc_factor; >>>>>>> u8 expire; >>>>>>> short free_count; >>>>>>> struct list_head lists[13]; >>>>>>> } >>>>>>> SIZE: 256 >>>>>>> >>>>>>> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP >>>>>>> list >>>>>>> for THP-sized allocations"), the size of struct per_cpu_pages is >>>>>>> 272 bytes. >>>>>>> crash> struct per_cpu_pages >>>>>>> struct per_cpu_pages { >>>>>>> spinlock_t lock; >>>>>>> int count; >>>>>>> int high; >>>>>>> int high_min; >>>>>>> int high_max; >>>>>>> int batch; >>>>>>> u8 flags; >>>>>>> u8 alloc_factor; >>>>>>> u8 expire; >>>>>>> short free_count; >>>>>>> struct list_head lists[15]; >>>>>>> } >>>>>>> SIZE: 272 >>>>>>> >>>>>>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for >>>>>>> THP-sized allocations") decrease one cacheline. >>>>>> >>>>>> the proposal is not reverting the patch but adding one CMA pcp. >>>>>> so it is "struct list_head lists[14]"; in this case, the size is still >>>>>> 256? >>>>>> >>>>> >>>>> Yes, the size is still 256. If add one PCP list, we will have 2 PCP >>>>> lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other >>>>> PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that >>>>> right? >>>> >>>> i am not quite sure about MIGRATE_RECLAIMABLE as we want to >>>> CMA is only used by movable. >>>> So it might be: >>>> MOVABLE and NON-MOVABLE. >>> >>> One PCP list is used by UNMOVABLE pages, and the other PCP list is used >>> by MOVABLE pages, seems it is feasible. UNMOVABLE PCP list contains >>> MIGRATE_UNMOVABLE pages and MIGRATE_RECLAIMABLE pages, and MOVABLE PCP >>> list contains MIGRATE_MOVABLE pages. >>> >> >> Is the following modification feasiable? >> >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >> -#define NR_PCP_THP 1 >> +#define NR_PCP_THP 2 >> +#define PCP_THP_MOVABLE 0 >> +#define PCP_THP_UNMOVABLE 1 >> #else >> #define NR_PCP_THP 0 >> #endif >> >> static inline unsigned int order_to_pindex(int migratetype, int order) >> { >> + int pcp_type = migratetype; >> + >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >> if (order > PAGE_ALLOC_COSTLY_ORDER) { >> VM_BUG_ON(order != HPAGE_PMD_ORDER); >> - return NR_LOWORDER_PCP_LISTS; >> + >> + if (migratetype != MIGRATE_MOVABLE) >> + pcp_type = PCP_THP_UNMOVABLE; >> + else >> + pcp_type = PCP_THP_MOVABLE; >> + >> + return NR_LOWORDER_PCP_LISTS + pcp_type; >> } >> #else >> VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); >> #endif >> >> - return (MIGRATE_PCPTYPES * order) + migratetype; >> + return (MIGRATE_PCPTYPES * order) + pcp_type; >> } >> > > a minimum change might be, then you can drop most new code. > Thanks, I will prepare the V2. 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 120a317d0938..cfe1e0625e38 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -588,6 +588,7 @@ static void bad_page(struct page *page, const char *reason) > > static inline unsigned int order_to_pindex(int migratetype, int order) > { > + bool __maybe_unused movable; > #ifdef CONFIG_CMA > /* > * We shouldn't get here for MIGRATE_CMA if those pages don't > @@ -600,7 +601,8 @@ static inline unsigned int order_to_pindex(int > migratetype, int order) > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > if (order > PAGE_ALLOC_COSTLY_ORDER) { > VM_BUG_ON(order != pageblock_order); > - return NR_LOWORDER_PCP_LISTS; > + movable = migratetype == MIGRATE_MOVABLE; > + return NR_LOWORDER_PCP_LISTS + movable; > } > #else > VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); > > >> >> >> @@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex) >> int order = pindex / MIGRATE_PCPTYPES; >> >> #ifdef CONFIG_TRANSPARENT_HUGEPAGE >> - if (pindex == NR_LOWORDER_PCP_LISTS) >> + if (order > PAGE_ALLOC_COSTLY_ORDER) >> order = HPAGE_PMD_ORDER; >> #else >> VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); >> >> >> >>>> >>>>> >>>>>> >>>>>>> >>>>>>>> >>>>>>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly >>>>>>>>> refer to it. >>>>>>>>> >>>>>>>>> >>>>>>>>>>>> +#else >>>>>>>>>>>> page = rmqueue_pcplist(preferred_zone, >>>>>>>>>>>> zone, order, >>>>>>>>>>>> migratetype, >>>>>>>>>>>> alloc_flags); >>>>>>>>>>>> if (likely(page)) >>>>>>>>>>>> goto out; >>>>>>>>>>>> +#endif >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> page = rmqueue_buddy(preferred_zone, zone, order, >>>>>>>>>>>> alloc_flags, >>>>>>>>>>>> -- >>>>>>>>>>>> 2.7.4 >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Barry >>>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5..0bdf471 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
 	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
 
 	if (likely(pcp_allowed_order(order))) {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
+				order != HPAGE_PMD_ORDER) {
+			page = rmqueue_pcplist(preferred_zone, zone, order,
+					migratetype, alloc_flags);
+			if (likely(page))
+				goto out;
+		}
+#else
 		page = rmqueue_pcplist(preferred_zone, zone, order,
 				migratetype, alloc_flags);
 		if (likely(page))
 			goto out;
+#endif
 	}
 
 	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
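For reference, the guard in the archived v1 patch above only bypasses the PCP in one narrow case. A quick userspace enumeration of the condition, treating CONFIG_CMA as enabled and using HPAGE_PMD_ORDER = 9 as on x86-64 (the ALLOC_CMA value here is illustrative, not the kernel's definition):

#include <stdio.h>
#include <stdbool.h>

#define HPAGE_PMD_ORDER	9
#define ALLOC_CMA	0x80	/* illustrative value for the sketch */

/* The v1 guard: take the PCP fast path unless this is a THP-sized
 * request that is NOT allowed to receive CMA memory. */
static bool may_use_pcp(unsigned int alloc_flags, int order)
{
	return (alloc_flags & ALLOC_CMA) || order != HPAGE_PMD_ORDER;
}

int main(void)
{
	int orders[] = { 0, 3, HPAGE_PMD_ORDER };

	for (unsigned int i = 0; i < 3; i++) {
		int order = orders[i];

		printf("order %d: ALLOC_CMA -> %d, !ALLOC_CMA -> %d\n",
		       order, may_use_pcp(ALLOC_CMA, order),
		       may_use_pcp(0, order));
	}
	return 0;
}

Only the non-ALLOC_CMA, THP-sized combination falls back to the buddy allocator; every other request keeps the PCP fast path, which is the trade-off the two-list approach discussed above would remove.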