Message ID | 1739514729-21265-1-git-send-email-yangge1116@126.com (mailing list archive)
State      | New
Series     | mm/hugetlb: wait for hugepage folios to be freed
On 14.02.25 07:32, yangge1116@126.com wrote:
> From: Ge Yang <yangge1116@126.com>
>
> Since the introduction of commit b65d4adbc0f0 ("mm: hugetlb: defer freeing
> of HugeTLB pages"), which supports deferring the freeing of HugeTLB pages,
> the allocation of contiguous memory through cma_alloc() may fail
> probabilistically.
>
> In the CMA allocation process, if it is found that the CMA area is occupied
> by in-use hugepage folios, these in-use hugepage folios need to be migrated
> to another location. When there are no available hugepage folios in the
> free HugeTLB pool during the migration of in-use HugeTLB pages, new folios
> are allocated from the buddy system. A temporary state is set on the newly
> allocated folio. Upon completion of the hugepage folio migration, the
> temporary state is transferred from the new folios to the old folios.
> Normally, when the old folios with the temporary state are freed, it is
> directly released back to the buddy system. However, due to the deferred
> freeing of HugeTLB pages, the PageBuddy() check fails, ultimately leading
> to the failure of cma_alloc().
>
> Here is a simplified call trace illustrating the process:
> cma_alloc()
>     ->__alloc_contig_migrate_range()          // Migrate in-use hugepage
>         ->unmap_and_move_huge_page()
>             ->folio_putback_hugetlb()         // Free old folios
>     ->test_pages_isolated()
>         ->__test_page_isolated_in_pageblock()
>             ->PageBuddy(page)                 // Check if the page is in buddy
>
> To resolve this issue, we have implemented a function named
> wait_for_hugepage_folios_freed(). This function ensures that the hugepage
> folios are properly released back to the buddy system after their migration
> is completed. By invoking wait_for_hugepage_folios_freed() following the
> migration process, we guarantee that when test_pages_isolated() is
> executed, it will successfully pass.

Okay, so after every successful migration -> put of src, we wait for the
src to actually get freed.

When migrating multiple hugetlb folios, we'd wait once per folio.

It reminds me a bit about pcp caches, where folios are !buddy until the
pcp was drained.

I wonder if that waiting should instead be done exactly once after
migrating multiple folios? For example, at the beginning of
test_pages_isolated(), to "flush" that state from any previous migration?

Thanks for all your effort around making CMA allocations / migration
more reliable.
On 2025/2/14 16:08, David Hildenbrand wrote:
> On 14.02.25 07:32, yangge1116@126.com wrote:
>> From: Ge Yang <yangge1116@126.com>
>>
>> [...]
>
> Okay, so after every successful migration -> put of src, we wait for the
> src to actually get freed.
>
> When migrating multiple hugetlb folios, we'd wait once per folio.
>
> It reminds me a bit about pcp caches, where folios are !buddy until the
> pcp was drained.
>
It seems that we only track unmovable, reclaimable, and movable pages on
the pcp lists. For specific details, please refer to the
free_frozen_pages() function.

> I wonder if that waiting should instead be done exactly once after
> migrating multiple folios? For example, at the beginning of
> test_pages_isolated(), to "flush" that state from any previous migration?
>
Yes, this can improve performance. I will make the modification in the
next version. Thank you.

> Thanks for all your effort around making CMA allocations / migration
> more reliable.
On 15.02.25 06:50, Ge Yang wrote:
> On 2025/2/14 16:08, David Hildenbrand wrote:
>> On 14.02.25 07:32, yangge1116@126.com wrote:
>>> From: Ge Yang <yangge1116@126.com>
>>>
>>> [...]
>>
>> Okay, so after every successful migration -> put of src, we wait for the
>> src to actually get freed.
>>
>> When migrating multiple hugetlb folios, we'd wait once per folio.
>>
>> It reminds me a bit about pcp caches, where folios are !buddy until the
>> pcp was drained.
>>
> It seems that we only track unmovable, reclaimable, and movable pages on
> the pcp lists. For specific details, please refer to the
> free_frozen_pages() function.

It reminded me about PCP caches, because we effectively also have to
wait for some stuck folios to properly get freed to the buddy.
On 2025/2/18 16:55, David Hildenbrand wrote:
> On 15.02.25 06:50, Ge Yang wrote:
>> On 2025/2/14 16:08, David Hildenbrand wrote:
>>> On 14.02.25 07:32, yangge1116@126.com wrote:
>>>> From: Ge Yang <yangge1116@126.com>
>>>>
>>>> [...]
>>>
>>> It reminds me a bit about pcp caches, where folios are !buddy until the
>>> pcp was drained.
>>>
>> It seems that we only track unmovable, reclaimable, and movable pages on
>> the pcp lists. For specific details, please refer to the
>> free_frozen_pages() function.
>
> It reminded me about PCP caches, because we effectively also have to
> wait for some stuck folios to properly get freed to the buddy.
>
It seems that when an isolated page is freed, it won't be placed back
into the PCP caches.
On 18.02.25 10:22, Ge Yang wrote:
> On 2025/2/18 16:55, David Hildenbrand wrote:
>> [...]
>>
>> It reminded me about PCP caches, because we effectively also have to
>> wait for some stuck folios to properly get freed to the buddy.
>>
> It seems that when an isolated page is freed, it won't be placed back
> into the PCP caches.

I recall there are cases when the page was in the pcp before the
isolation started, which is why we drain the pcp at some point (IIRC).
On 2025/2/18 17:41, David Hildenbrand wrote:
> On 18.02.25 10:22, Ge Yang wrote:
>> [...]
>>
>> It seems that when an isolated page is freed, it won't be placed back
>> into the PCP caches.
>
> I recall there are cases when the page was in the pcp before the
> isolation started, which is why we drain the pcp at some point (IIRC).
>
Yes, indeed, drain_all_pages(cc.zone) is currently executed before
__alloc_contig_migrate_range().
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6c6546b..c39e0d5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -697,6 +697,7 @@ bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
 int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 int replace_free_hugepage_folios(unsigned long start_pfn,
 		unsigned long end_pfn);
+void wait_for_hugepage_folios_freed(struct hstate *h);
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		unsigned long addr, bool cow_from_owner);
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
@@ -1092,6 +1093,10 @@ static inline int replace_free_hugepage_folios(unsigned long start_pfn,
 	return 0;
 }
 
+static inline void wait_for_hugepage_folios_freed(struct hstate *h)
+{
+}
+
 static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 					unsigned long addr,
 					bool cow_from_owner)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 30bc34d..64cae39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2955,6 +2955,13 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
 	return ret;
 }
 
+void wait_for_hugepage_folios_freed(struct hstate *h)
+{
+	WARN_ON(!h);
+
+	flush_free_hpage_work(h);
+}
+
 typedef enum {
 	/*
 	 * For either 0/1: we checked the per-vma resv map, and one resv
diff --git a/mm/migrate.c b/mm/migrate.c
index fb19a18..5dd1851 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1448,6 +1448,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
 	int page_was_mapped = 0;
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
+	unsigned long size;
 
 	if (folio_ref_count(src) == 1) {
 		/* page was freed from under us. So we are done. */
@@ -1533,9 +1534,20 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
 out_unlock:
 	folio_unlock(src);
 out:
-	if (rc == MIGRATEPAGE_SUCCESS)
+	if (rc == MIGRATEPAGE_SUCCESS) {
+		size = folio_size(src);
 		folio_putback_hugetlb(src);
-	else if (rc != -EAGAIN)
+
+		/*
+		 * Due to the deferred freeing of HugeTLB folios, the hugepage 'src' may
+		 * not immediately release to the buddy system. This can lead to failure
+		 * in allocating memory through the cma_alloc() function. To ensure that
+		 * the hugepage folios are properly released back to the buddy system,
+		 * we invoke the wait_for_hugepage_folios_freed() function to wait for
+		 * the release to complete.
+		 */
+		wait_for_hugepage_folios_freed(size_to_hstate(size));
+	} else if (rc != -EAGAIN)
 		list_move_tail(&src->lru, ret);
 
 	/*