Message ID | 20220808212836.111749-1-mike.kravetz@oracle.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | hugetlb: freeze allocated pages before creating hugetlb pages | expand |
On 2022/8/9 5:28, Mike Kravetz wrote: > When creating hugetlb pages, the hugetlb code must first allocate > contiguous pages from a low level allocator such as buddy, cma or > memblock. The pages returned from these low level allocators are > ref counted. This creates potential issues with other code taking > speculative references on these pages before they can be transformed to > a hugetlb page. This issue has been addressed with methods and code > such as that provided in [1]. > > Recent discussions about vmemmap freeing [2] have indicated that it > would be beneficial to freeze all sub pages, including the head page > of pages returned from low level allocators before converting to a > hugetlb page. This helps avoid races if want to replace the page > containing vmemmap for the head page. > > There have been proposals to change at least the buddy allocator to > return frozen pages as described at [3]. If such a change is made, it > can be employed by the hugetlb code. However, as mentioned above > hugetlb uses several low level allocators so each would need to be > modified to return frozen pages. For now, we can manually freeze the > returned pages. This is done in two places: > 1) alloc_buddy_huge_page, only the returned head page is ref counted. > We freeze the head page, retrying once in the VERY rare case where > there may be an inflated ref count. > 2) prep_compound_gigantic_page, for gigantic pages the current code > freezes all pages except the head page. New code will simply freeze > the head page as well. > > In a few other places, code checks for inflated ref counts on newly > allocated hugetlb pages. With the modifications to freeze after > allocating, this code can be removed. > > After hugetlb pages are freshly allocated, they are often added to the > hugetlb free lists. Since these pages were previously ref counted, this > was done via put_page() which would end up calling the hugetlb > destructor: free_huge_page. With changes to freeze pages, we simply > call free_huge_page directly to add the pages to the free list. > > In a few other places, freshly allocated hugetlb pages were immediately > put into use, and the expectation was they were already ref counted. In > these cases, we must manually ref count the page. > > [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/ > [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/ > [3] https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/ > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > --- > mm/hugetlb.c | 97 +++++++++++++++++++--------------------------------- > 1 file changed, 35 insertions(+), 62 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 28516881a1b2..6b90d85d545b 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1769,13 +1769,12 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, > { > int i, j; > int nr_pages = 1 << order; > - struct page *p = page + 1; > + struct page *p = page; > > /* we rely on prep_new_huge_page to set the destructor */ > set_compound_order(page, order); > - __ClearPageReserved(page); > __SetPageHead(page); > - for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { > + for (i = 0; i < nr_pages; i++, p = mem_map_next(p, page, i)) { > /* > * For gigantic hugepages allocated through bootmem at > * boot, it's safer to be consistent with the not-gigantic > @@ -1814,7 +1813,8 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, > } else { > VM_BUG_ON_PAGE(page_count(p), p); > } > - set_compound_head(p, page); > + if (i != 0) > + set_compound_head(p, page); It seems we forget to unfreeze the head page in out_error path. If unexpected inflated ref count occurs, the ref count of head page will become negative in free_gigantic_page? Thanks for your patch, Mike. I hope this will help solve the races with memory failure. ;) And I will take a more close review when I have enough time.
On 8/8/22 23:28, Mike Kravetz wrote: > When creating hugetlb pages, the hugetlb code must first allocate > contiguous pages from a low level allocator such as buddy, cma or > memblock. The pages returned from these low level allocators are > ref counted. This creates potential issues with other code taking > speculative references on these pages before they can be transformed to > a hugetlb page. This issue has been addressed with methods and code > such as that provided in [1]. > > Recent discussions about vmemmap freeing [2] have indicated that it > would be beneficial to freeze all sub pages, including the head page > of pages returned from low level allocators before converting to a > hugetlb page. This helps avoid races if want to replace the page > containing vmemmap for the head page. > > There have been proposals to change at least the buddy allocator to > return frozen pages as described at [3]. If such a change is made, it > can be employed by the hugetlb code. However, as mentioned above > hugetlb uses several low level allocators so each would need to be > modified to return frozen pages. For now, we can manually freeze the > returned pages. This is done in two places: > 1) alloc_buddy_huge_page, only the returned head page is ref counted. > We freeze the head page, retrying once in the VERY rare case where > there may be an inflated ref count. > 2) prep_compound_gigantic_page, for gigantic pages the current code > freezes all pages except the head page. New code will simply freeze > the head page as well. > > In a few other places, code checks for inflated ref counts on newly > allocated hugetlb pages. With the modifications to freeze after > allocating, this code can be removed. > > After hugetlb pages are freshly allocated, they are often added to the > hugetlb free lists. Since these pages were previously ref counted, this > was done via put_page() which would end up calling the hugetlb > destructor: free_huge_page. With changes to freeze pages, we simply > call free_huge_page directly to add the pages to the free list. > > In a few other places, freshly allocated hugetlb pages were immediately > put into use, and the expectation was they were already ref counted. In > these cases, we must manually ref count the page. > > [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/ > [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/ > [3] https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/ > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> <snip> > @@ -1918,6 +1918,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, > int order = huge_page_order(h); > struct page *page; > bool alloc_try_hard = true; > + bool retry = true; > > /* > * By default we always try hard to allocate the page with > @@ -1933,7 +1934,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, > gfp_mask |= __GFP_RETRY_MAYFAIL; > if (nid == NUMA_NO_NODE) > nid = numa_mem_id(); > +retry: > page = __alloc_pages(gfp_mask, order, nid, nmask); > + > + /* Freeze head page */ > + if (!page_ref_freeze(page, 1)) { > + __free_pages(page, order); > + if (retry) { /* retry once */ > + retry = false; > + goto retry; > + } > + /* WOW! twice in a row. */ > + pr_warn("HugeTLB head page unexpected inflated ref count\n"); > + page = NULL; The previous approach of waiting and hoping the temporary refcount increase will drop made sense to me, but yeah, what if it doesn't drop. So this freeing and reallocating makes sense even if it's theoretically wasteful. Should be moot anyway once we have the way to allocate frozen pages directly from buddy. > + } > + > if (page) > __count_vm_event(HTLB_BUDDY_PGALLOC); > else
On 08/09/22 10:48, Miaohe Lin wrote: > On 2022/8/9 5:28, Mike Kravetz wrote: <snip> > > There have been proposals to change at least the buddy allocator to > > return frozen pages as described at [3]. If such a change is made, it > > can be employed by the hugetlb code. However, as mentioned above > > hugetlb uses several low level allocators so each would need to be > > modified to return frozen pages. For now, we can manually freeze the > > returned pages. This is done in two places: > > 1) alloc_buddy_huge_page, only the returned head page is ref counted. > > We freeze the head page, retrying once in the VERY rare case where > > there may be an inflated ref count. > > 2) prep_compound_gigantic_page, for gigantic pages the current code > > freezes all pages except the head page. New code will simply freeze > > the head page as well. > > <snip> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 28516881a1b2..6b90d85d545b 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -1769,13 +1769,12 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, > > { > > int i, j; > > int nr_pages = 1 << order; > > - struct page *p = page + 1; > > + struct page *p = page; > > > > /* we rely on prep_new_huge_page to set the destructor */ > > set_compound_order(page, order); > > - __ClearPageReserved(page); > > __SetPageHead(page); > > - for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { > > + for (i = 0; i < nr_pages; i++, p = mem_map_next(p, page, i)) { > > /* > > * For gigantic hugepages allocated through bootmem at > > * boot, it's safer to be consistent with the not-gigantic > > @@ -1814,7 +1813,8 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, > > } else { > > VM_BUG_ON_PAGE(page_count(p), p); > > } > > - set_compound_head(p, page); > > + if (i != 0) > > + set_compound_head(p, page); > > It seems we forget to unfreeze the head page in out_error path. If unexpected inflated ref count occurs, > the ref count of head page will become negative in free_gigantic_page? Yes, thank you! I forgot to modify the error path to also fix up the head page.
> On Aug 9, 2022, at 05:28, Mike Kravetz <mike.kravetz@oracle.com> wrote: > > When creating hugetlb pages, the hugetlb code must first allocate > contiguous pages from a low level allocator such as buddy, cma or > memblock. The pages returned from these low level allocators are > ref counted. This creates potential issues with other code taking > speculative references on these pages before they can be transformed to > a hugetlb page. This issue has been addressed with methods and code > such as that provided in [1]. > > Recent discussions about vmemmap freeing [2] have indicated that it > would be beneficial to freeze all sub pages, including the head page > of pages returned from low level allocators before converting to a > hugetlb page. This helps avoid races if want to replace the page > containing vmemmap for the head page. > > There have been proposals to change at least the buddy allocator to > return frozen pages as described at [3]. If such a change is made, it > can be employed by the hugetlb code. However, as mentioned above > hugetlb uses several low level allocators so each would need to be > modified to return frozen pages. For now, we can manually freeze the > returned pages. This is done in two places: > 1) alloc_buddy_huge_page, only the returned head page is ref counted. > We freeze the head page, retrying once in the VERY rare case where > there may be an inflated ref count. > 2) prep_compound_gigantic_page, for gigantic pages the current code > freezes all pages except the head page. New code will simply freeze > the head page as well. > > In a few other places, code checks for inflated ref counts on newly > allocated hugetlb pages. With the modifications to freeze after > allocating, this code can be removed. > > After hugetlb pages are freshly allocated, they are often added to the > hugetlb free lists. Since these pages were previously ref counted, this > was done via put_page() which would end up calling the hugetlb > destructor: free_huge_page. With changes to freeze pages, we simply > call free_huge_page directly to add the pages to the free list. > > In a few other places, freshly allocated hugetlb pages were immediately > put into use, and the expectation was they were already ref counted. In > these cases, we must manually ref count the page. > > [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/ > [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/ > [3] https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/ > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > --- > mm/hugetlb.c | 97 +++++++++++++++++++--------------------------------- > 1 file changed, 35 insertions(+), 62 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 28516881a1b2..6b90d85d545b 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1769,13 +1769,12 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, > { > int i, j; > int nr_pages = 1 << order; > - struct page *p = page + 1; > + struct page *p = page; > > /* we rely on prep_new_huge_page to set the destructor */ > set_compound_order(page, order); > - __ClearPageReserved(page); > __SetPageHead(page); > - for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { > + for (i = 0; i < nr_pages; i++, p = mem_map_next(p, page, i)) { > /* > * For gigantic hugepages allocated through bootmem at > * boot, it's safer to be consistent with the not-gigantic > @@ -1814,7 +1813,8 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, > } else { > VM_BUG_ON_PAGE(page_count(p), p); > } > - set_compound_head(p, page); > + if (i != 0) > + set_compound_head(p, page); > } > atomic_set(compound_mapcount_ptr(page), -1); > atomic_set(compound_pincount_ptr(page), 0); > @@ -1918,6 +1918,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, > int order = huge_page_order(h); > struct page *page; > bool alloc_try_hard = true; > + bool retry = true; > > /* > * By default we always try hard to allocate the page with > @@ -1933,7 +1934,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, > gfp_mask |= __GFP_RETRY_MAYFAIL; > if (nid == NUMA_NO_NODE) > nid = numa_mem_id(); > +retry: > page = __alloc_pages(gfp_mask, order, nid, nmask); > + > + /* Freeze head page */ > + if (!page_ref_freeze(page, 1)) { Hi Mike, I saw Mattew has introduced a new helper alloc_frozen_pages() in thread [1] to allocate a frozen page. Then we do not need to handle an unexpected refcount case, which should be easy. Is there any consideration why we do not choose alloc_frozen_pages()? [1] https://lore.kernel.org/linux-mm/20220809171854.3725722-15-willy@infradead.org/T/#u Muchun, Thanks. > + __free_pages(page, order); > + if (retry) { /* retry once */ > + retry = false; > + goto retry; > + } > + /* WOW! twice in a row. */ > + pr_warn("HugeTLB head page unexpected inflated ref count\n"); > + page = NULL; > + } > + > if (page) > __count_vm_event(HTLB_BUDDY_PGALLOC); > else > @@ -1961,6 +1976,9 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, > /* > * Common helper to allocate a fresh hugetlb page. All specific allocators > * should use this function to get new hugetlb pages > + * > + * Note that returned page is 'frozen': ref count of head page and all tail > + * pages is zero. > */ > static struct page *alloc_fresh_huge_page(struct hstate *h, > gfp_t gfp_mask, int nid, nodemask_t *nmask, > @@ -2018,7 +2036,7 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed, > if (!page) > return 0; > > - put_page(page); /* free it into the hugepage allocator */ > + free_huge_page(page); /* free it into the hugepage allocator */ > > return 1; > } > @@ -2175,10 +2193,9 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn) > * Allocates a fresh surplus page from the page allocator. > */ > static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask, > - int nid, nodemask_t *nmask, bool zero_ref) > + int nid, nodemask_t *nmask) > { > struct page *page = NULL; > - bool retry = false; > > if (hstate_is_gigantic(h)) > return NULL; > @@ -2188,7 +2205,6 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask, > goto out_unlock; > spin_unlock_irq(&hugetlb_lock); > > -retry: > page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL); > if (!page) > return NULL; > @@ -2204,34 +2220,10 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask, > if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) { > SetHPageTemporary(page); > spin_unlock_irq(&hugetlb_lock); > - put_page(page); > + free_huge_page(page); > return NULL; > } > > - if (zero_ref) { > - /* > - * Caller requires a page with zero ref count. > - * We will drop ref count here. If someone else is holding > - * a ref, the page will be freed when they drop it. Abuse > - * temporary page flag to accomplish this. > - */ > - SetHPageTemporary(page); > - if (!put_page_testzero(page)) { > - /* > - * Unexpected inflated ref count on freshly allocated > - * huge. Retry once. > - */ > - pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n"); > - spin_unlock_irq(&hugetlb_lock); > - if (retry) > - return NULL; > - > - retry = true; > - goto retry; > - } > - ClearHPageTemporary(page); > - } > - > h->surplus_huge_pages++; > h->surplus_huge_pages_node[page_to_nid(page)]++; > > @@ -2253,6 +2245,9 @@ static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, > if (!page) > return NULL; > > + /* fresh huge pages are frozen */ > + set_page_refcounted(page); > + > /* > * We do not account these pages as surplus because they are only > * temporary and will be released properly on the last reference > @@ -2280,14 +2275,14 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h, > gfp_t gfp = gfp_mask | __GFP_NOWARN; > > gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL); > - page = alloc_surplus_huge_page(h, gfp, nid, nodemask, false); > + page = alloc_surplus_huge_page(h, gfp, nid, nodemask); > > /* Fallback to all nodes if page==NULL */ > nodemask = NULL; > } > > if (!page) > - page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask, false); > + page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask); > mpol_cond_put(mpol); > return page; > } > @@ -2358,7 +2353,7 @@ static int gather_surplus_pages(struct hstate *h, long delta) > spin_unlock_irq(&hugetlb_lock); > for (i = 0; i < needed; i++) { > page = alloc_surplus_huge_page(h, htlb_alloc_mask(h), > - NUMA_NO_NODE, NULL, true); > + NUMA_NO_NODE, NULL); > if (!page) { > alloc_ok = false; > break; > @@ -2720,7 +2715,6 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, > { > gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; > int nid = page_to_nid(old_page); > - bool alloc_retry = false; > struct page *new_page; > int ret = 0; > > @@ -2731,30 +2725,9 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, > * the pool. This simplifies and let us do most of the processing > * under the lock. > */ > -alloc_retry: > new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL); > if (!new_page) > return -ENOMEM; > - /* > - * If all goes well, this page will be directly added to the free > - * list in the pool. For this the ref count needs to be zero. > - * Attempt to drop now, and retry once if needed. It is VERY > - * unlikely there is another ref on the page. > - * > - * If someone else has a reference to the page, it will be freed > - * when they drop their ref. Abuse temporary page flag to accomplish > - * this. Retry once if there is an inflated ref count. > - */ > - SetHPageTemporary(new_page); > - if (!put_page_testzero(new_page)) { > - if (alloc_retry) > - return -EBUSY; > - > - alloc_retry = true; > - goto alloc_retry; > - } > - ClearHPageTemporary(new_page); > - > __prep_new_huge_page(h, new_page); > > retry: > @@ -2934,6 +2907,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, > } > spin_lock_irq(&hugetlb_lock); > list_add(&page->lru, &h->hugepage_activelist); > + set_page_refcounted(page); > /* Fall through */ > } > hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page); > @@ -3038,7 +3012,7 @@ static void __init gather_bootmem_prealloc(void) > if (prep_compound_gigantic_page(page, huge_page_order(h))) { > WARN_ON(PageReserved(page)); > prep_new_huge_page(h, page, page_to_nid(page)); > - put_page(page); /* add to the hugepage allocator */ > + free_huge_page(page); /* add to the hugepage allocator */ > } else { > /* VERY unlikely inflated ref count on a tail page */ > free_gigantic_page(page, huge_page_order(h)); > @@ -3070,7 +3044,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid) > &node_states[N_MEMORY], NULL); > if (!page) > break; > - put_page(page); /* free it into the hugepage allocator */ > + free_huge_page(page); /* free it into the hugepage allocator */ > } > cond_resched(); > } > @@ -3459,9 +3433,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page) > else > prep_compound_page(page + i, target_hstate->order); > set_page_private(page + i, 0); > - set_page_refcounted(page + i); > prep_new_huge_page(target_hstate, page + i, nid); > - put_page(page + i); > + free_huge_page(page + i); > } > mutex_unlock(&target_hstate->resize_lock); > > -- > 2.37.1 > >
On 08/10/22 14:20, Muchun Song wrote: > > On Aug 9, 2022, at 05:28, Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > When creating hugetlb pages, the hugetlb code must first allocate > > contiguous pages from a low level allocator such as buddy, cma or > > memblock. The pages returned from these low level allocators are > > ref counted. This creates potential issues with other code taking > > speculative references on these pages before they can be transformed to > > a hugetlb page. This issue has been addressed with methods and code > > such as that provided in [1]. > > > > Recent discussions about vmemmap freeing [2] have indicated that it > > would be beneficial to freeze all sub pages, including the head page > > of pages returned from low level allocators before converting to a > > hugetlb page. This helps avoid races if want to replace the page > > containing vmemmap for the head page. > > > > There have been proposals to change at least the buddy allocator to > > return frozen pages as described at [3]. If such a change is made, it > > can be employed by the hugetlb code. However, as mentioned above > > hugetlb uses several low level allocators so each would need to be > > modified to return frozen pages. For now, we can manually freeze the > > returned pages. This is done in two places: > > 1) alloc_buddy_huge_page, only the returned head page is ref counted. > > We freeze the head page, retrying once in the VERY rare case where > > there may be an inflated ref count. > > 2) prep_compound_gigantic_page, for gigantic pages the current code > > freezes all pages except the head page. New code will simply freeze > > the head page as well. > > > > In a few other places, code checks for inflated ref counts on newly > > allocated hugetlb pages. With the modifications to freeze after > > allocating, this code can be removed. > > > > After hugetlb pages are freshly allocated, they are often added to the > > hugetlb free lists. Since these pages were previously ref counted, this > > was done via put_page() which would end up calling the hugetlb > > destructor: free_huge_page. With changes to freeze pages, we simply > > call free_huge_page directly to add the pages to the free list. > > > > In a few other places, freshly allocated hugetlb pages were immediately > > put into use, and the expectation was they were already ref counted. In > > these cases, we must manually ref count the page. > > > > [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/ > > [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/ > > [3] https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/ > > > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > > --- > > mm/hugetlb.c | 97 +++++++++++++++++++--------------------------------- > > 1 file changed, 35 insertions(+), 62 deletions(-) > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 28516881a1b2..6b90d85d545b 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -1769,13 +1769,12 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, > > { > > int i, j; > > int nr_pages = 1 << order; > > - struct page *p = page + 1; > > + struct page *p = page; > > > > /* we rely on prep_new_huge_page to set the destructor */ > > set_compound_order(page, order); > > - __ClearPageReserved(page); > > __SetPageHead(page); > > - for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { > > + for (i = 0; i < nr_pages; i++, p = mem_map_next(p, page, i)) { > > /* > > * For gigantic hugepages allocated through bootmem at > > * boot, it's safer to be consistent with the not-gigantic > > @@ -1814,7 +1813,8 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, > > } else { > > VM_BUG_ON_PAGE(page_count(p), p); > > } > > - set_compound_head(p, page); > > + if (i != 0) > > + set_compound_head(p, page); > > } > > atomic_set(compound_mapcount_ptr(page), -1); > > atomic_set(compound_pincount_ptr(page), 0); > > @@ -1918,6 +1918,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, > > int order = huge_page_order(h); > > struct page *page; > > bool alloc_try_hard = true; > > + bool retry = true; > > > > /* > > * By default we always try hard to allocate the page with > > @@ -1933,7 +1934,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, > > gfp_mask |= __GFP_RETRY_MAYFAIL; > > if (nid == NUMA_NO_NODE) > > nid = numa_mem_id(); > > +retry: > > page = __alloc_pages(gfp_mask, order, nid, nmask); > > + > > + /* Freeze head page */ > > + if (!page_ref_freeze(page, 1)) { > > Hi Mike, > > I saw Mattew has introduced a new helper alloc_frozen_pages() in thread [1] to allocate a > frozen page. Then we do not need to handle an unexpected refcount case, which > should be easy. Is there any consideration why we do not choose alloc_frozen_pages()? I asked Matthew about these efforts before creating this patch. At the time, there were issues with the first version of his patch series and he wasn't sure when he would get around to looking into those issues. I then decided to proceed with manually freezing pages after allocation. The thought was that alloc_frozen_pages() could be added when it became available. The 'downstream changes' to deal with pages that have zero ref count should remain the same. IMO, these downstream changes are the more important parts of this patch. Shortly after sending this patch, Matthew took another look at his series and discovered the source of issues. He then sent this v2 series. Matthew will correct me if this is not accurate. > > [1] https://lore.kernel.org/linux-mm/20220809171854.3725722-15-willy@infradead.org/T/#u > I am happy to wait until Matthew's series moves forward, and then use the new interface.
> On Aug 11, 2022, at 06:38, Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 08/10/22 14:20, Muchun Song wrote: >>> On Aug 9, 2022, at 05:28, Mike Kravetz <mike.kravetz@oracle.com> wrote: >>> >>> When creating hugetlb pages, the hugetlb code must first allocate >>> contiguous pages from a low level allocator such as buddy, cma or >>> memblock. The pages returned from these low level allocators are >>> ref counted. This creates potential issues with other code taking >>> speculative references on these pages before they can be transformed to >>> a hugetlb page. This issue has been addressed with methods and code >>> such as that provided in [1]. >>> >>> Recent discussions about vmemmap freeing [2] have indicated that it >>> would be beneficial to freeze all sub pages, including the head page >>> of pages returned from low level allocators before converting to a >>> hugetlb page. This helps avoid races if want to replace the page >>> containing vmemmap for the head page. >>> >>> There have been proposals to change at least the buddy allocator to >>> return frozen pages as described at [3]. If such a change is made, it >>> can be employed by the hugetlb code. However, as mentioned above >>> hugetlb uses several low level allocators so each would need to be >>> modified to return frozen pages. For now, we can manually freeze the >>> returned pages. This is done in two places: >>> 1) alloc_buddy_huge_page, only the returned head page is ref counted. >>> We freeze the head page, retrying once in the VERY rare case where >>> there may be an inflated ref count. >>> 2) prep_compound_gigantic_page, for gigantic pages the current code >>> freezes all pages except the head page. New code will simply freeze >>> the head page as well. >>> >>> In a few other places, code checks for inflated ref counts on newly >>> allocated hugetlb pages. With the modifications to freeze after >>> allocating, this code can be removed. >>> >>> After hugetlb pages are freshly allocated, they are often added to the >>> hugetlb free lists. Since these pages were previously ref counted, this >>> was done via put_page() which would end up calling the hugetlb >>> destructor: free_huge_page. With changes to freeze pages, we simply >>> call free_huge_page directly to add the pages to the free list. >>> >>> In a few other places, freshly allocated hugetlb pages were immediately >>> put into use, and the expectation was they were already ref counted. In >>> these cases, we must manually ref count the page. >>> >>> [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/ >>> [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/ >>> [3] https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/ >>> >>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> >>> --- >>> mm/hugetlb.c | 97 +++++++++++++++++++--------------------------------- >>> 1 file changed, 35 insertions(+), 62 deletions(-) >>> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>> index 28516881a1b2..6b90d85d545b 100644 >>> --- a/mm/hugetlb.c >>> +++ b/mm/hugetlb.c >>> @@ -1769,13 +1769,12 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, >>> { >>> int i, j; >>> int nr_pages = 1 << order; >>> - struct page *p = page + 1; >>> + struct page *p = page; >>> >>> /* we rely on prep_new_huge_page to set the destructor */ >>> set_compound_order(page, order); >>> - __ClearPageReserved(page); >>> __SetPageHead(page); >>> - for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { >>> + for (i = 0; i < nr_pages; i++, p = mem_map_next(p, page, i)) { >>> /* >>> * For gigantic hugepages allocated through bootmem at >>> * boot, it's safer to be consistent with the not-gigantic >>> @@ -1814,7 +1813,8 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, >>> } else { >>> VM_BUG_ON_PAGE(page_count(p), p); >>> } >>> - set_compound_head(p, page); >>> + if (i != 0) >>> + set_compound_head(p, page); >>> } >>> atomic_set(compound_mapcount_ptr(page), -1); >>> atomic_set(compound_pincount_ptr(page), 0); >>> @@ -1918,6 +1918,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, >>> int order = huge_page_order(h); >>> struct page *page; >>> bool alloc_try_hard = true; >>> + bool retry = true; >>> >>> /* >>> * By default we always try hard to allocate the page with >>> @@ -1933,7 +1934,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, >>> gfp_mask |= __GFP_RETRY_MAYFAIL; >>> if (nid == NUMA_NO_NODE) >>> nid = numa_mem_id(); >>> +retry: >>> page = __alloc_pages(gfp_mask, order, nid, nmask); >>> + >>> + /* Freeze head page */ >>> + if (!page_ref_freeze(page, 1)) { >> >> Hi Mike, >> >> I saw Mattew has introduced a new helper alloc_frozen_pages() in thread [1] to allocate a >> frozen page. Then we do not need to handle an unexpected refcount case, which >> should be easy. Is there any consideration why we do not choose alloc_frozen_pages()? > > I asked Matthew about these efforts before creating this patch. At the > time, there were issues with the first version of his patch series and > he wasn't sure when he would get around to looking into those issues. > > I then decided to proceed with manually freezing pages after allocation. > The thought was that alloc_frozen_pages() could be added when it became > available. The 'downstream changes' to deal with pages that have zero > ref count should remain the same. IMO, these downstream changes are the > more important parts of this patch. > > Shortly after sending this patch, Matthew took another look at his > series and discovered the source of issues. He then sent this v2 > series. Matthew will correct me if this is not accurate. Thanks Mike, it is really clear to me. > >> >> [1] https://lore.kernel.org/linux-mm/20220809171854.3725722-15-willy@infradead.org/T/#u >> > > I am happy to wait until Matthew's series moves forward, and then use > the new interface. I agree. Let’s wait together. Muchun > > -- > Mike Kravetz
On 8/12/22 06:36, Muchun Song wrote: >> On Aug 11, 2022, at 06:38, Mike Kravetz <mike.kravetz@oracle.com> wrote: >> On 08/10/22 14:20, Muchun Song wrote: >>>> On Aug 9, 2022, at 05:28, Mike Kravetz <mike.kravetz@oracle.com> wrote: >>>> When creating hugetlb pages, the hugetlb code must first allocate >>>> contiguous pages from a low level allocator such as buddy, cma or >>>> memblock. The pages returned from these low level allocators are >>>> ref counted. This creates potential issues with other code taking >>>> speculative references on these pages before they can be transformed to >>>> a hugetlb page. This issue has been addressed with methods and code >>>> such as that provided in [1]. >>>> >>>> Recent discussions about vmemmap freeing [2] have indicated that it >>>> would be beneficial to freeze all sub pages, including the head page >>>> of pages returned from low level allocators before converting to a >>>> hugetlb page. This helps avoid races if want to replace the page >>>> containing vmemmap for the head page. >>>> >>>> There have been proposals to change at least the buddy allocator to >>>> return frozen pages as described at [3]. If such a change is made, it >>>> can be employed by the hugetlb code. However, as mentioned above >>>> hugetlb uses several low level allocators so each would need to be >>>> modified to return frozen pages. For now, we can manually freeze the >>>> returned pages. This is done in two places: >>>> 1) alloc_buddy_huge_page, only the returned head page is ref counted. >>>> We freeze the head page, retrying once in the VERY rare case where >>>> there may be an inflated ref count. >>>> 2) prep_compound_gigantic_page, for gigantic pages the current code >>>> freezes all pages except the head page. New code will simply freeze >>>> the head page as well. >>>> >>>> In a few other places, code checks for inflated ref counts on newly >>>> allocated hugetlb pages. With the modifications to freeze after >>>> allocating, this code can be removed. >>>> >>>> After hugetlb pages are freshly allocated, they are often added to the >>>> hugetlb free lists. Since these pages were previously ref counted, this >>>> was done via put_page() which would end up calling the hugetlb >>>> destructor: free_huge_page. With changes to freeze pages, we simply >>>> call free_huge_page directly to add the pages to the free list. >>>> >>>> In a few other places, freshly allocated hugetlb pages were immediately >>>> put into use, and the expectation was they were already ref counted. In >>>> these cases, we must manually ref count the page. >>>> >>>> [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/ >>>> [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/ >>>> [3] https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/ >>>> >>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> >>>> --- >>>> mm/hugetlb.c | 97 +++++++++++++++++++--------------------------------- >>>> 1 file changed, 35 insertions(+), 62 deletions(-) >>>> >>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>>> index 28516881a1b2..6b90d85d545b 100644 >>>> --- a/mm/hugetlb.c >>>> +++ b/mm/hugetlb.c >>>> @@ -1769,13 +1769,12 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, >>>> { >>>> int i, j; >>>> int nr_pages = 1 << order; >>>> - struct page *p = page + 1; >>>> + struct page *p = page; >>>> >>>> /* we rely on prep_new_huge_page to set the destructor */ >>>> set_compound_order(page, order); >>>> - __ClearPageReserved(page); >>>> __SetPageHead(page); >>>> - for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { >>>> + for (i = 0; i < nr_pages; i++, p = mem_map_next(p, page, i)) { >>>> /* >>>> * For gigantic hugepages allocated through bootmem at >>>> * boot, it's safer to be consistent with the not-gigantic >>>> @@ -1814,7 +1813,8 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, >>>> } else { >>>> VM_BUG_ON_PAGE(page_count(p), p); >>>> } >>>> - set_compound_head(p, page); >>>> + if (i != 0) >>>> + set_compound_head(p, page); >>>> } >>>> atomic_set(compound_mapcount_ptr(page), -1); >>>> atomic_set(compound_pincount_ptr(page), 0); >>>> @@ -1918,6 +1918,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, >>>> int order = huge_page_order(h); >>>> struct page *page; >>>> bool alloc_try_hard = true; >>>> + bool retry = true; >>>> >>>> /* >>>> * By default we always try hard to allocate the page with >>>> @@ -1933,7 +1934,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, >>>> gfp_mask |= __GFP_RETRY_MAYFAIL; >>>> if (nid == NUMA_NO_NODE) >>>> nid = numa_mem_id(); >>>> +retry: >>>> page = __alloc_pages(gfp_mask, order, nid, nmask); >>>> + >>>> + /* Freeze head page */ >>>> + if (!page_ref_freeze(page, 1)) { >>> >>> Hi Mike, >>> >>> I saw Mattew has introduced a new helper alloc_frozen_pages() in thread [1] to allocate a >>> frozen page. Then we do not need to handle an unexpected refcount case, which >>> should be easy. Is there any consideration why we do not choose alloc_frozen_pages()? >> >> I asked Matthew about these efforts before creating this patch. At the >> time, there were issues with the first version of his patch series and >> he wasn't sure when he would get around to looking into those issues. >> >> I then decided to proceed with manually freezing pages after allocation. >> The thought was that alloc_frozen_pages() could be added when it became >> available. The 'downstream changes' to deal with pages that have zero >> ref count should remain the same. IMO, these downstream changes are the >> more important parts of this patch. >> >> Shortly after sending this patch, Matthew took another look at his >> series and discovered the source of issues. He then sent this v2 >> series. Matthew will correct me if this is not accurate. > > Thanks Mike, it is really clear to me. > >> >>> >>> [1] https://lore.kernel.org/linux-mm/20220809171854.3725722-15-willy@infradead.org/T/#u >>> >> >> I am happy to wait until Matthew's series moves forward, and then use >> the new interface. > > I agree. Let’s wait together. Maybe this is a bit of bad suggestion, but considering that the enterity of kernels supporting hugetlb vmemmap optimization are using the old interface (the new one isn't yet settled it seems?) would it be better to go with something this patch, and then have a second patch (simpler one) that just switches to the new interface? Joao
On 08/25/22 15:35, Joao Martins wrote: > On 8/12/22 06:36, Muchun Song wrote: > >> On Aug 11, 2022, at 06:38, Mike Kravetz <mike.kravetz@oracle.com> wrote: > >> On 08/10/22 14:20, Muchun Song wrote: > >>>> On Aug 9, 2022, at 05:28, Mike Kravetz <mike.kravetz@oracle.com> wrote: <snip> > >>>> There have been proposals to change at least the buddy allocator to > >>>> return frozen pages as described at [3]. If such a change is made, it > >>>> can be employed by the hugetlb code. However, as mentioned above > >>>> hugetlb uses several low level allocators so each would need to be > >>>> modified to return frozen pages. For now, we can manually freeze the > >>>> returned pages. This is done in two places: > >>>> 1) alloc_buddy_huge_page, only the returned head page is ref counted. > >>>> We freeze the head page, retrying once in the VERY rare case where > >>>> there may be an inflated ref count. > >>>> 2) prep_compound_gigantic_page, for gigantic pages the current code > >>>> freezes all pages except the head page. New code will simply freeze > >>>> the head page as well. <snip> > >>>> /* > >>>> * By default we always try hard to allocate the page with > >>>> @@ -1933,7 +1934,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, > >>>> gfp_mask |= __GFP_RETRY_MAYFAIL; > >>>> if (nid == NUMA_NO_NODE) > >>>> nid = numa_mem_id(); > >>>> +retry: > >>>> page = __alloc_pages(gfp_mask, order, nid, nmask); > >>>> + > >>>> + /* Freeze head page */ > >>>> + if (!page_ref_freeze(page, 1)) { > >>> > >>> Hi Mike, > >>> > >>> I saw Mattew has introduced a new helper alloc_frozen_pages() in thread [1] to allocate a > >>> frozen page. Then we do not need to handle an unexpected refcount case, which > >>> should be easy. Is there any consideration why we do not choose alloc_frozen_pages()? > >> > >> I asked Matthew about these efforts before creating this patch. At the > >> time, there were issues with the first version of his patch series and > >> he wasn't sure when he would get around to looking into those issues. > >> > >> I then decided to proceed with manually freezing pages after allocation. > >> The thought was that alloc_frozen_pages() could be added when it became > >> available. The 'downstream changes' to deal with pages that have zero > >> ref count should remain the same. IMO, these downstream changes are the > >> more important parts of this patch. > >> > >> Shortly after sending this patch, Matthew took another look at his > >> series and discovered the source of issues. He then sent this v2 > >> series. Matthew will correct me if this is not accurate. > > > > Thanks Mike, it is really clear to me. > > > >> > >>> > >>> [1] https://lore.kernel.org/linux-mm/20220809171854.3725722-15-willy@infradead.org/T/#u > >>> > >> > >> I am happy to wait until Matthew's series moves forward, and then use > >> the new interface. > > > > I agree. Let’s wait together. > > Maybe this is a bit of bad suggestion, but considering that the enterity of kernels > supporting hugetlb vmemmap optimization are using the old interface (the new one isn't yet > settled it seems?) would it be better to go with something this patch, and then have a > second patch (simpler one) that just switches to the new interface? > My thoughts. Right now, the earliest any of this code would be merged is in 6.1. If the alloc_frozen_pages interface gors into 6.1, then it would make sense to just make this patch/change use it. If not, we can manually freeze as done here, and switch when alloc_frozen_pages becomes available. In either case, this could/should go into 6.1. We still have a bit of time to see if alloc_frozen_pages will make 6.1. Ideally, we would wait and see. Ideally, I (we) would help with Matthew's series. However, I am a bit concerned that we may forget about pushing this forward and miss the window. And, since Joao's optimization depends on this, that would miss the window as well. Matthew, any thoughts on the likelihood of alloc_frozen_pages going into 6.1?
On 08/25/22 15:31, Mike Kravetz wrote: > On 08/25/22 15:35, Joao Martins wrote: > > On 8/12/22 06:36, Muchun Song wrote: > > >> On Aug 11, 2022, at 06:38, Mike Kravetz <mike.kravetz@oracle.com> wrote: > > >> On 08/10/22 14:20, Muchun Song wrote: > > >>>> On Aug 9, 2022, at 05:28, Mike Kravetz <mike.kravetz@oracle.com> wrote: > <snip> > > >>>> There have been proposals to change at least the buddy allocator to > > >>>> return frozen pages as described at [3]. If such a change is made, it > > >>>> can be employed by the hugetlb code. However, as mentioned above > > >>>> hugetlb uses several low level allocators so each would need to be > > >>>> modified to return frozen pages. For now, we can manually freeze the > > >>>> returned pages. This is done in two places: > > >>>> 1) alloc_buddy_huge_page, only the returned head page is ref counted. > > >>>> We freeze the head page, retrying once in the VERY rare case where > > >>>> there may be an inflated ref count. > > >>>> 2) prep_compound_gigantic_page, for gigantic pages the current code > > >>>> freezes all pages except the head page. New code will simply freeze > > >>>> the head page as well. > <snip> > > >>>> /* > > >>>> * By default we always try hard to allocate the page with > > >>>> @@ -1933,7 +1934,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, > > >>>> gfp_mask |= __GFP_RETRY_MAYFAIL; > > >>>> if (nid == NUMA_NO_NODE) > > >>>> nid = numa_mem_id(); > > >>>> +retry: > > >>>> page = __alloc_pages(gfp_mask, order, nid, nmask); > > >>>> + > > >>>> + /* Freeze head page */ > > >>>> + if (!page_ref_freeze(page, 1)) { > > >>> > > >>> Hi Mike, > > >>> > > >>> I saw Mattew has introduced a new helper alloc_frozen_pages() in thread [1] to allocate a > > >>> frozen page. Then we do not need to handle an unexpected refcount case, which > > >>> should be easy. Is there any consideration why we do not choose alloc_frozen_pages()? > > >> > > >> I asked Matthew about these efforts before creating this patch. At the > > >> time, there were issues with the first version of his patch series and > > >> he wasn't sure when he would get around to looking into those issues. > > >> > > >> I then decided to proceed with manually freezing pages after allocation. > > >> The thought was that alloc_frozen_pages() could be added when it became > > >> available. The 'downstream changes' to deal with pages that have zero > > >> ref count should remain the same. IMO, these downstream changes are the > > >> more important parts of this patch. > > >> > > >> Shortly after sending this patch, Matthew took another look at his > > >> series and discovered the source of issues. He then sent this v2 > > >> series. Matthew will correct me if this is not accurate. > > > > > > Thanks Mike, it is really clear to me. > > > > > >> > > >>> > > >>> [1] https://lore.kernel.org/linux-mm/20220809171854.3725722-15-willy@infradead.org/T/#u > > >>> > > >> > > >> I am happy to wait until Matthew's series moves forward, and then use > > >> the new interface. > > > > > > I agree. Let’s wait together. > > > > Maybe this is a bit of bad suggestion, but considering that the enterity of kernels > > supporting hugetlb vmemmap optimization are using the old interface (the new one isn't yet > > settled it seems?) would it be better to go with something this patch, and then have a > > second patch (simpler one) that just switches to the new interface? > > > > My thoughts. > Right now, the earliest any of this code would be merged is in 6.1. If > the alloc_frozen_pages interface gors into 6.1, then it would make sense > to just make this patch/change use it. If not, we can manually freeze as > done here, and switch when alloc_frozen_pages becomes available. In either > case, this could/should go into 6.1. > > We still have a bit of time to see if alloc_frozen_pages will make 6.1. > Ideally, we would wait and see. Ideally, I (we) would help with Matthew's > series. However, I am a bit concerned that we may forget about pushing this > forward and miss the window. And, since Joao's optimization depends on this, > that would miss the window as well. > > Matthew, any thoughts on the likelihood of alloc_frozen_pages going into 6.1? Unless someone objects, my plan is to send an updated verstion of this patch (fixing __prep_compound_gigantic_page issue, thanks Miaohe!). We can then switch to using alloc_frozen_pages when it becomes available. With this patch in place, Joao should be able to send an updated version of the patch "mm/hugetlb_vmemmap: remap head page to newly allocated page".
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 28516881a1b2..6b90d85d545b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1769,13 +1769,12 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, { int i, j; int nr_pages = 1 << order; - struct page *p = page + 1; + struct page *p = page; /* we rely on prep_new_huge_page to set the destructor */ set_compound_order(page, order); - __ClearPageReserved(page); __SetPageHead(page); - for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { + for (i = 0; i < nr_pages; i++, p = mem_map_next(p, page, i)) { /* * For gigantic hugepages allocated through bootmem at * boot, it's safer to be consistent with the not-gigantic @@ -1814,7 +1813,8 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order, } else { VM_BUG_ON_PAGE(page_count(p), p); } - set_compound_head(p, page); + if (i != 0) + set_compound_head(p, page); } atomic_set(compound_mapcount_ptr(page), -1); atomic_set(compound_pincount_ptr(page), 0); @@ -1918,6 +1918,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int order = huge_page_order(h); struct page *page; bool alloc_try_hard = true; + bool retry = true; /* * By default we always try hard to allocate the page with @@ -1933,7 +1934,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, gfp_mask |= __GFP_RETRY_MAYFAIL; if (nid == NUMA_NO_NODE) nid = numa_mem_id(); +retry: page = __alloc_pages(gfp_mask, order, nid, nmask); + + /* Freeze head page */ + if (!page_ref_freeze(page, 1)) { + __free_pages(page, order); + if (retry) { /* retry once */ + retry = false; + goto retry; + } + /* WOW! twice in a row. */ + pr_warn("HugeTLB head page unexpected inflated ref count\n"); + page = NULL; + } + if (page) __count_vm_event(HTLB_BUDDY_PGALLOC); else @@ -1961,6 +1976,9 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, /* * Common helper to allocate a fresh hugetlb page. All specific allocators * should use this function to get new hugetlb pages + * + * Note that returned page is 'frozen': ref count of head page and all tail + * pages is zero. */ static struct page *alloc_fresh_huge_page(struct hstate *h, gfp_t gfp_mask, int nid, nodemask_t *nmask, @@ -2018,7 +2036,7 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed, if (!page) return 0; - put_page(page); /* free it into the hugepage allocator */ + free_huge_page(page); /* free it into the hugepage allocator */ return 1; } @@ -2175,10 +2193,9 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn) * Allocates a fresh surplus page from the page allocator. */ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask, - int nid, nodemask_t *nmask, bool zero_ref) + int nid, nodemask_t *nmask) { struct page *page = NULL; - bool retry = false; if (hstate_is_gigantic(h)) return NULL; @@ -2188,7 +2205,6 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask, goto out_unlock; spin_unlock_irq(&hugetlb_lock); -retry: page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL); if (!page) return NULL; @@ -2204,34 +2220,10 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask, if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) { SetHPageTemporary(page); spin_unlock_irq(&hugetlb_lock); - put_page(page); + free_huge_page(page); return NULL; } - if (zero_ref) { - /* - * Caller requires a page with zero ref count. - * We will drop ref count here. If someone else is holding - * a ref, the page will be freed when they drop it. Abuse - * temporary page flag to accomplish this. - */ - SetHPageTemporary(page); - if (!put_page_testzero(page)) { - /* - * Unexpected inflated ref count on freshly allocated - * huge. Retry once. - */ - pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n"); - spin_unlock_irq(&hugetlb_lock); - if (retry) - return NULL; - - retry = true; - goto retry; - } - ClearHPageTemporary(page); - } - h->surplus_huge_pages++; h->surplus_huge_pages_node[page_to_nid(page)]++; @@ -2253,6 +2245,9 @@ static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, if (!page) return NULL; + /* fresh huge pages are frozen */ + set_page_refcounted(page); + /* * We do not account these pages as surplus because they are only * temporary and will be released properly on the last reference @@ -2280,14 +2275,14 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h, gfp_t gfp = gfp_mask | __GFP_NOWARN; gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL); - page = alloc_surplus_huge_page(h, gfp, nid, nodemask, false); + page = alloc_surplus_huge_page(h, gfp, nid, nodemask); /* Fallback to all nodes if page==NULL */ nodemask = NULL; } if (!page) - page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask, false); + page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask); mpol_cond_put(mpol); return page; } @@ -2358,7 +2353,7 @@ static int gather_surplus_pages(struct hstate *h, long delta) spin_unlock_irq(&hugetlb_lock); for (i = 0; i < needed; i++) { page = alloc_surplus_huge_page(h, htlb_alloc_mask(h), - NUMA_NO_NODE, NULL, true); + NUMA_NO_NODE, NULL); if (!page) { alloc_ok = false; break; @@ -2720,7 +2715,6 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, { gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; int nid = page_to_nid(old_page); - bool alloc_retry = false; struct page *new_page; int ret = 0; @@ -2731,30 +2725,9 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, * the pool. This simplifies and let us do most of the processing * under the lock. */ -alloc_retry: new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL); if (!new_page) return -ENOMEM; - /* - * If all goes well, this page will be directly added to the free - * list in the pool. For this the ref count needs to be zero. - * Attempt to drop now, and retry once if needed. It is VERY - * unlikely there is another ref on the page. - * - * If someone else has a reference to the page, it will be freed - * when they drop their ref. Abuse temporary page flag to accomplish - * this. Retry once if there is an inflated ref count. - */ - SetHPageTemporary(new_page); - if (!put_page_testzero(new_page)) { - if (alloc_retry) - return -EBUSY; - - alloc_retry = true; - goto alloc_retry; - } - ClearHPageTemporary(new_page); - __prep_new_huge_page(h, new_page); retry: @@ -2934,6 +2907,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, } spin_lock_irq(&hugetlb_lock); list_add(&page->lru, &h->hugepage_activelist); + set_page_refcounted(page); /* Fall through */ } hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page); @@ -3038,7 +3012,7 @@ static void __init gather_bootmem_prealloc(void) if (prep_compound_gigantic_page(page, huge_page_order(h))) { WARN_ON(PageReserved(page)); prep_new_huge_page(h, page, page_to_nid(page)); - put_page(page); /* add to the hugepage allocator */ + free_huge_page(page); /* add to the hugepage allocator */ } else { /* VERY unlikely inflated ref count on a tail page */ free_gigantic_page(page, huge_page_order(h)); @@ -3070,7 +3044,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid) &node_states[N_MEMORY], NULL); if (!page) break; - put_page(page); /* free it into the hugepage allocator */ + free_huge_page(page); /* free it into the hugepage allocator */ } cond_resched(); } @@ -3459,9 +3433,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page) else prep_compound_page(page + i, target_hstate->order); set_page_private(page + i, 0); - set_page_refcounted(page + i); prep_new_huge_page(target_hstate, page + i, nid); - put_page(page + i); + free_huge_page(page + i); } mutex_unlock(&target_hstate->resize_lock);
When creating hugetlb pages, the hugetlb code must first allocate contiguous pages from a low level allocator such as buddy, cma or memblock. The pages returned from these low level allocators are ref counted. This creates potential issues with other code taking speculative references on these pages before they can be transformed to a hugetlb page. This issue has been addressed with methods and code such as that provided in [1]. Recent discussions about vmemmap freeing [2] have indicated that it would be beneficial to freeze all sub pages, including the head page of pages returned from low level allocators before converting to a hugetlb page. This helps avoid races if want to replace the page containing vmemmap for the head page. There have been proposals to change at least the buddy allocator to return frozen pages as described at [3]. If such a change is made, it can be employed by the hugetlb code. However, as mentioned above hugetlb uses several low level allocators so each would need to be modified to return frozen pages. For now, we can manually freeze the returned pages. This is done in two places: 1) alloc_buddy_huge_page, only the returned head page is ref counted. We freeze the head page, retrying once in the VERY rare case where there may be an inflated ref count. 2) prep_compound_gigantic_page, for gigantic pages the current code freezes all pages except the head page. New code will simply freeze the head page as well. In a few other places, code checks for inflated ref counts on newly allocated hugetlb pages. With the modifications to freeze after allocating, this code can be removed. After hugetlb pages are freshly allocated, they are often added to the hugetlb free lists. Since these pages were previously ref counted, this was done via put_page() which would end up calling the hugetlb destructor: free_huge_page. With changes to freeze pages, we simply call free_huge_page directly to add the pages to the free list. In a few other places, freshly allocated hugetlb pages were immediately put into use, and the expectation was they were already ref counted. In these cases, we must manually ref count the page. [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/ [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/ [3] https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/ Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> --- mm/hugetlb.c | 97 +++++++++++++++++++--------------------------------- 1 file changed, 35 insertions(+), 62 deletions(-)