| Message ID | 20220511081207.132034-1-vvghjk1234@gmail.com (mailing list archive) |
| --- | --- |
| State | New |
| Series | mm/page_alloc: Fix tracepoint mm_page_alloc_zone_locked() |
On Wed, May 11, 2022 at 05:12:07PM +0900, Wonhyuk Yang wrote:
> Currently, the tracepoint mm_page_alloc_zone_locked() doesn't show
> correct information.
>
> First, when alloc_flags has ALLOC_HARDER/ALLOC_CMA, a page can
> be allocated from MIGRATE_HIGHATOMIC/MIGRATE_CMA. Nevertheless, the
> tracepoint uses the requested migration type, not MIGRATE_HIGHATOMIC or
> MIGRATE_CMA.
>
> Second, after commit 44042b4498728 ("mm/page_alloc: allow high-order
> pages to be stored on the per-cpu lists") the percpu list can store
> high-order pages. But the tracepoint determines whether it is a refill
> of the percpu list by comparing the requested order with 0.
>
> To handle these problems, use the cached migration type from
> get_pcppage_migratetype() instead of the requested migration type.
> Then, make mm_page_alloc_zone_locked() be called from only two contexts
> (rmqueue_bulk, rmqueue). With a new argument called percpu_refill,
> it can show whether it is a refill of the percpu list correctly.
>

You're definitely right that the current tracepoint is broken.

I got momentarily confused because HIGHATOMIC and CMA are not stored on
PCP lists even though they are a pageblock migrate type. Superficially,
calling get_pcppage_migratetype on a page that cannot be a PCP page
seems silly, but in the context of this patch it happens to work because
the page was isolated with __rmqueue_smallest, which sets the PCP type
even if the page is not going to a PCP list.

The original intent of that tracepoint was to trace when pages were
removed from the buddy list. That would suggest this untested patch on
top of yours as a simplification:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0351808322ba..66a70b898130 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2476,6 +2476,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	del_page_from_free_list(page, zone, current_order);
 	expand(zone, page, order, current_order, migratetype);
 	set_pcppage_migratetype(page, migratetype);
+	trace_mm_page_alloc_zone_locked(page, order, migratetype,
+			pcp_allowed_order(order) && migratetype < MIGRATE_PCPTYPES);
 
 	return page;
 }
@@ -3025,7 +3027,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			int migratetype, unsigned int alloc_flags)
 {
 	int i, allocated = 0;
-	int mt;
 
 	/*
 	 * local_lock_irq held so equivalent to spin_lock_irqsave for
@@ -3053,9 +3054,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 */
 		list_add_tail(&page->lru, list);
 		allocated++;
-		mt = get_pcppage_migratetype(page);
-		trace_mm_page_alloc_zone_locked(page, order, mt, true);
-		if (is_migrate_cma(mt))
+		if (is_migrate_cma(get_pcppage_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
 	}
@@ -3704,7 +3703,6 @@ struct page *rmqueue(struct zone *preferred_zone,
 {
 	unsigned long flags;
 	struct page *page;
-	int mt;
 
 	if (likely(pcp_allowed_order(order))) {
 		/*
@@ -3734,17 +3732,15 @@ struct page *rmqueue(struct zone *preferred_zone,
 		 * reserved for high-order atomic allocation, so order-0
 		 * request should skip it.
 		 */
-		if (order > 0 && alloc_flags & ALLOC_HARDER) {
+		if (order > 0 && alloc_flags & ALLOC_HARDER)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
-		}
 		if (!page) {
 			page = __rmqueue(zone, order, migratetype, alloc_flags);
 			if (!page)
 				goto failed;
 		}
-		mt = get_pcppage_migratetype(page);
-		trace_mm_page_alloc_zone_locked(page, order, mt, false);
-		__mod_zone_freepage_state(zone, -(1 << order), mt);
+		__mod_zone_freepage_state(zone, -(1 << order),
+					  get_pcppage_migratetype(page));
 		spin_unlock_irqrestore(&zone->lock, flags);
 	} while (check_new_pages(page, order));
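
As background for the migratetype < MIGRATE_PCPTYPES check in the suggested hunk above: the migratetype enum, shown roughly as it appears in include/linux/mmzone.h for kernels of this era, places MIGRATE_HIGHATOMIC and MIGRATE_CMA at or after MIGRATE_PCPTYPES, which is why they are pageblock migrate types yet are never stored on the per-cpu lists and why the comparison filters them out:

enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_PCPTYPES,	/* number of migratetypes kept on the pcp lists */
	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
	MIGRATE_CMA,
#endif
#ifdef CONFIG_MEMORY_ISOLATION
	MIGRATE_ISOLATE,	/* cannot allocate from here */
#endif
	MIGRATE_TYPES
};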
On Wed, May 11, 2022 at 11:23 PM Mel Gorman <mgorman@suse.de> wrote:
>
> On Wed, May 11, 2022 at 05:12:07PM +0900, Wonhyuk Yang wrote:
> > Currently, the tracepoint mm_page_alloc_zone_locked() doesn't show
> > correct information.
> >
> > First, when alloc_flags has ALLOC_HARDER/ALLOC_CMA, a page can
> > be allocated from MIGRATE_HIGHATOMIC/MIGRATE_CMA. Nevertheless, the
> > tracepoint uses the requested migration type, not MIGRATE_HIGHATOMIC or
> > MIGRATE_CMA.
> >
> > Second, after commit 44042b4498728 ("mm/page_alloc: allow high-order
> > pages to be stored on the per-cpu lists") the percpu list can store
> > high-order pages. But the tracepoint determines whether it is a refill
> > of the percpu list by comparing the requested order with 0.
> >
> > To handle these problems, use the cached migration type from
> > get_pcppage_migratetype() instead of the requested migration type.
> > Then, make mm_page_alloc_zone_locked() be called from only two contexts
> > (rmqueue_bulk, rmqueue). With a new argument called percpu_refill,
> > it can show whether it is a refill of the percpu list correctly.
> >
>
> You're definitely right that the current tracepoint is broken.
>
> I got momentarily confused because HIGHATOMIC and CMA are not stored on
> PCP lists even though they are a pageblock migrate type. Superficially,
> calling get_pcppage_migratetype on a page that cannot be a PCP page
> seems silly, but in the context of this patch it happens to work because
> the page was isolated with __rmqueue_smallest, which sets the PCP type
> even if the page is not going to a PCP list.

Yes, I agree that calling get_pcppage_migratetype looks quite confusing.

> The original intent of that tracepoint was to trace when pages were
> removed from the buddy list. That would suggest this untested patch on
> top of yours as a simplification:
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0351808322ba..66a70b898130 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2476,6 +2476,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>  	del_page_from_free_list(page, zone, current_order);
>  	expand(zone, page, order, current_order, migratetype);
>  	set_pcppage_migratetype(page, migratetype);
> +	trace_mm_page_alloc_zone_locked(page, order, migratetype,
> +			pcp_allowed_order(order) && migratetype < MIGRATE_PCPTYPES);
>
>  	return page;
>  }

Interestingly, my first approach was quite similar to your suggestion. But I
noticed that there can be a request whose migration type is MOVABLE
and whose alloc_flags doesn't have ALLOC_CMA. In that case, the page is marked
as a percpu refill even though it was allocated from the buddy list directly.
Is there no problem if we just ignore this case?

> @@ -3025,7 +3027,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  			int migratetype, unsigned int alloc_flags)
>  {
>  	int i, allocated = 0;
> -	int mt;
>
>  	/*
>  	 * local_lock_irq held so equivalent to spin_lock_irqsave for
> @@ -3053,9 +3054,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 */
>  		list_add_tail(&page->lru, list);
>  		allocated++;
> -		mt = get_pcppage_migratetype(page);
> -		trace_mm_page_alloc_zone_locked(page, order, mt, true);
> -		if (is_migrate_cma(mt))
> +		if (is_migrate_cma(get_pcppage_migratetype(page)))
>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>  					      -(1 << order));
>  	}
> @@ -3704,7 +3703,6 @@ struct page *rmqueue(struct zone *preferred_zone,
>  {
>  	unsigned long flags;
>  	struct page *page;
> -	int mt;
>
>  	if (likely(pcp_allowed_order(order))) {
>  		/*
> @@ -3734,17 +3732,15 @@ struct page *rmqueue(struct zone *preferred_zone,
>  		 * reserved for high-order atomic allocation, so order-0
>  		 * request should skip it.
>  		 */
> -		if (order > 0 && alloc_flags & ALLOC_HARDER) {
> +		if (order > 0 && alloc_flags & ALLOC_HARDER)
>  			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> -		}
>  		if (!page) {
>  			page = __rmqueue(zone, order, migratetype, alloc_flags);
>  			if (!page)
>  				goto failed;
>  		}
> -		mt = get_pcppage_migratetype(page);
> -		trace_mm_page_alloc_zone_locked(page, order, mt, false);
> -		__mod_zone_freepage_state(zone, -(1 << order), mt);
> +		__mod_zone_freepage_state(zone, -(1 << order),
> +					  get_pcppage_migratetype(page));
>  		spin_unlock_irqrestore(&zone->lock, flags);
>  	} while (check_new_pages(page, order));
>
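
The case raised in the reply above appears to be the PCP bypass in rmqueue(): a MIGRATE_MOVABLE request without ALLOC_CMA skips the per-cpu lists (the movable pcp lists may hold CMA pages) and allocates directly from the buddy lists under zone->lock, yet the suggested trace call in __rmqueue_smallest would still report percpu_refill as true for a pcp-eligible order. A rough, paraphrased sketch of that path in the mm/page_alloc.c of this period (not the exact upstream code):

	if (likely(pcp_allowed_order(order))) {
		/*
		 * MIGRATE_MOVABLE pcp lists can hold MIGRATE_CMA pages, so
		 * skip the pcp lists when CMA pages are not allowed for
		 * this allocation.
		 */
		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
				migratetype != MIGRATE_MOVABLE) {
			page = rmqueue_pcplist(preferred_zone, zone, order,
					gfp_flags, migratetype, alloc_flags);
			goto out;
		}
	}
	/* Otherwise fall through to the buddy allocation under zone->lock. */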
On Thu, May 12, 2022 at 12:02:30AM +0900, Wonhyuk Yang wrote:
> > The original intent of that tracepoint was to trace when pages were
> > removed from the buddy list. That would suggest this untested patch on
> > top of yours as a simplification:
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 0351808322ba..66a70b898130 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2476,6 +2476,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> >  	del_page_from_free_list(page, zone, current_order);
> >  	expand(zone, page, order, current_order, migratetype);
> >  	set_pcppage_migratetype(page, migratetype);
> > +	trace_mm_page_alloc_zone_locked(page, order, migratetype,
> > +			pcp_allowed_order(order) && migratetype < MIGRATE_PCPTYPES);
> >
> >  	return page;
> >  }
>
> Interestingly, my first approach was quite similar to your suggestion. But I
> noticed that there can be a request whose migration type is MOVABLE
> and whose alloc_flags doesn't have ALLOC_CMA. In that case, the page is marked
> as a percpu refill even though it was allocated from the buddy list directly.
> Is there no problem if we just ignore this case?
>

I assume you are referring to the case where CMA allocations are being
balanced between regular and CMA areas. I think it's relatively harmless
if the percpu_refill field is not 100% accurate for that case. There are
also cases, such as the percpu list being too small to hold a THP, that are
not a percpu_refill either. If 100% accuracy is an issue, I would prefer
renaming it to percpu_eligible or just deleting it instead of adding
complexity for a tracepoint. The main value of that tracepoint is determining
what percentage of allocations are potentially contending on the zone lock at
a particular time.
On Thu, May 12, 2022 at 12:47 AM Mel Gorman <mgorman@suse.de> wrote:
>
> On Thu, May 12, 2022 at 12:02:30AM +0900, Wonhyuk Yang wrote:
> > > The original intent of that tracepoint was to trace when pages were
> > > removed from the buddy list. That would suggest this untested patch on
> > > top of yours as a simplification:
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 0351808322ba..66a70b898130 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2476,6 +2476,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> > >  	del_page_from_free_list(page, zone, current_order);
> > >  	expand(zone, page, order, current_order, migratetype);
> > >  	set_pcppage_migratetype(page, migratetype);
> > > +	trace_mm_page_alloc_zone_locked(page, order, migratetype,
> > > +			pcp_allowed_order(order) && migratetype < MIGRATE_PCPTYPES);
> > >
> > >  	return page;
> > >  }
> >
> > Interestingly, my first approach was quite similar to your suggestion. But I
> > noticed that there can be a request whose migration type is MOVABLE
> > and whose alloc_flags doesn't have ALLOC_CMA. In that case, the page is marked
> > as a percpu refill even though it was allocated from the buddy list directly.
> > Is there no problem if we just ignore this case?
> >
>
> I assume you are referring to the case where CMA allocations are being
> balanced between regular and CMA areas. I think it's relatively harmless
> if the percpu_refill field is not 100% accurate for that case. There are
> also cases, such as the percpu list being too small to hold a THP, that are
> not a percpu_refill either. If 100% accuracy is an issue, I would prefer
> renaming it to percpu_eligible or just deleting it instead of adding
> complexity for a tracepoint. The main value of that tracepoint is determining
> what percentage of allocations are potentially contending on the zone lock at
> a particular time.
>

Okay, I'll send a new one with your suggestions. Thanks.
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index ddc8c944f417..f89fb3afcd46 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -229,20 +229,23 @@ TRACE_EVENT(mm_page_alloc,
 
 DECLARE_EVENT_CLASS(mm_page,
 
-	TP_PROTO(struct page *page, unsigned int order, int migratetype),
+	TP_PROTO(struct page *page, unsigned int order, int migratetype,
+		 int percpu_refill),
 
-	TP_ARGS(page, order, migratetype),
+	TP_ARGS(page, order, migratetype, percpu_refill),
 
 	TP_STRUCT__entry(
 		__field(	unsigned long,	pfn		)
 		__field(	unsigned int,	order		)
 		__field(	int,		migratetype	)
+		__field(	int,		percpu_refill	)
 	),
 
 	TP_fast_assign(
 		__entry->pfn		= page ? page_to_pfn(page) : -1UL;
 		__entry->order		= order;
 		__entry->migratetype	= migratetype;
+		__entry->percpu_refill	= percpu_refill;
 	),
 
 	TP_printk("page=%p pfn=0x%lx order=%u migratetype=%d percpu_refill=%d",
@@ -250,14 +253,15 @@ DECLARE_EVENT_CLASS(mm_page,
 		__entry->pfn != -1UL ? __entry->pfn : 0,
 		__entry->order,
 		__entry->migratetype,
-		__entry->order == 0)
+		__entry->percpu_refill)
 );
 
 DEFINE_EVENT(mm_page, mm_page_alloc_zone_locked,
 
-	TP_PROTO(struct page *page, unsigned int order, int migratetype),
+	TP_PROTO(struct page *page, unsigned int order, int migratetype,
+		 int percpu_refill),
 
-	TP_ARGS(page, order, migratetype)
+	TP_ARGS(page, order, migratetype, percpu_refill)
 );
 
 TRACE_EVENT(mm_page_pcpu_drain,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e42038382c1..0351808322ba 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2999,7 +2999,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 				zone_page_state(zone, NR_FREE_PAGES) / 2) {
 			page = __rmqueue_cma_fallback(zone, order);
 			if (page)
-				goto out;
+				return page;
 		}
 	}
 retry:
@@ -3012,9 +3012,6 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 								alloc_flags))
 			goto retry;
 	}
-out:
-	if (page)
-		trace_mm_page_alloc_zone_locked(page, order, migratetype);
 	return page;
 }
 
@@ -3028,6 +3025,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			int migratetype, unsigned int alloc_flags)
 {
 	int i, allocated = 0;
+	int mt;
 
 	/*
 	 * local_lock_irq held so equivalent to spin_lock_irqsave for
@@ -3055,7 +3053,9 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 */
 		list_add_tail(&page->lru, list);
 		allocated++;
-		if (is_migrate_cma(get_pcppage_migratetype(page)))
+		mt = get_pcppage_migratetype(page);
+		trace_mm_page_alloc_zone_locked(page, order, mt, true);
+		if (is_migrate_cma(mt))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
 	}
@@ -3704,6 +3704,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 {
 	unsigned long flags;
 	struct page *page;
+	int mt;
 
 	if (likely(pcp_allowed_order(order))) {
 		/*
@@ -3735,16 +3736,15 @@ struct page *rmqueue(struct zone *preferred_zone,
 		 */
 		if (order > 0 && alloc_flags & ALLOC_HARDER) {
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
-			if (page)
-				trace_mm_page_alloc_zone_locked(page, order, migratetype);
 		}
 		if (!page) {
 			page = __rmqueue(zone, order, migratetype, alloc_flags);
 			if (!page)
 				goto failed;
 		}
-		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_pcppage_migratetype(page));
+		mt = get_pcppage_migratetype(page);
+		trace_mm_page_alloc_zone_locked(page, order, mt, false);
+		__mod_zone_freepage_state(zone, -(1 << order), mt);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	} while (check_new_pages(page, order));
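
For reference, once the extra argument is wired through, a line emitted by mm_page_alloc_zone_locked and rendered with the TP_printk() format above would look something like the following; the values, including the hashed page pointer, are purely illustrative:

  mm_page_alloc_zone_locked: page=00000000f4e1a2b3 pfn=0x12a400 order=0 migratetype=1 percpu_refill=1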
Currently, the tracepoint mm_page_alloc_zone_locked() doesn't show
correct information.

First, when alloc_flags has ALLOC_HARDER/ALLOC_CMA, a page can
be allocated from MIGRATE_HIGHATOMIC/MIGRATE_CMA. Nevertheless, the
tracepoint uses the requested migration type, not MIGRATE_HIGHATOMIC or
MIGRATE_CMA.

Second, after commit 44042b4498728 ("mm/page_alloc: allow high-order
pages to be stored on the per-cpu lists") the percpu list can store
high-order pages. But the tracepoint determines whether it is a refill
of the percpu list by comparing the requested order with 0.

To handle these problems, use the cached migration type from
get_pcppage_migratetype() instead of the requested migration type.
Then, make mm_page_alloc_zone_locked() be called from only two contexts
(rmqueue_bulk, rmqueue). With a new argument called percpu_refill,
it can show whether it is a refill of the percpu list correctly.

Cc: Baik Song An <bsahn@etri.re.kr>
Cc: Hong Yeon Kim <kimhy@etri.re.kr>
Cc: Taeung Song <taeung@reallinux.co.kr>
Cc: linuxgeek@linuxgeek.io
Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
---
 include/trace/events/kmem.h | 14 +++++++++-----
 mm/page_alloc.c             | 18 +++++++++---------
 2 files changed, 18 insertions(+), 14 deletions(-)
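
The second problem stems from commit 44042b4498728 making higher orders eligible for the per-cpu lists; the eligibility test is pcp_allowed_order(), sketched roughly below (paraphrased from mm/page_alloc.c of this period), so comparing the requested order with 0 no longer identifies a percpu-list refill:

static inline bool pcp_allowed_order(unsigned int order)
{
	/* Small orders are always cached on the per-cpu lists. */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return true;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* THP-sized allocations can also be kept on the per-cpu lists. */
	if (order == pageblock_order)
		return true;
#endif
	return false;
}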