
hugetlb: Convert alloc_buddy_hugetlb_folio to use a folio

Message ID 20240402200656.913841-1-willy@infradead.org (mailing list archive)
State New
Series hugetlb: Convert alloc_buddy_hugetlb_folio to use a folio

Commit Message

Matthew Wilcox April 2, 2024, 8:06 p.m. UTC
While this function returned a folio, it was still using __alloc_pages()
and __free_pages().  Use __folio_alloc() and folio_put() instead.  This
actually removes a call to compound_head(), but more importantly, it
prepares us for the move to memdescs.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/hugetlb.c | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)
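
For quick orientation, here is a condensed before/after sketch of the allocation path this patch converts (hand-simplified from the diff below; the node_alloc_noretry bookkeeping, the retry-once loop, and the vm-event counters are omitted):

	/* Before: page-based; page_folio() at the end costs a compound_head() call. */
	struct page *page = __alloc_pages(gfp_mask, order, nid, nmask);
	if (page && !page_ref_freeze(page, 1)) {
		__free_pages(page, order);
		page = NULL;
	}
	return page ? page_folio(page) : NULL;

	/* After: folio-based throughout; folio_put() drops the allocation
	 * without needing the order or a compound_head() lookup. */
	struct folio *folio = __folio_alloc(gfp_mask, order, nid, nmask);
	if (folio && !folio_ref_freeze(folio, 1)) {
		folio_put(folio);
		folio = NULL;
	}
	return folio;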

Comments

Sidhartha Kumar April 2, 2024, 9:19 p.m. UTC | #1
On 4/2/24 1:06 PM, Matthew Wilcox (Oracle) wrote:
> While this function returned a folio, it was still using __alloc_pages()
> and __free_pages().  Use __folio_alloc() and folio_put() instead.  This
> actually removes a call to compound_head(), but more importantly, it
> prepares us for the move to memdescs.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>   mm/hugetlb.c | 33 ++++++++++++++++-----------------
>   1 file changed, 16 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a3bffa8debde..5f1e0b1a0d57 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2177,13 +2177,13 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
>   		nodemask_t *node_alloc_noretry)
>   {
>   	int order = huge_page_order(h);
> -	struct page *page;
> +	struct folio *folio;
>   	bool alloc_try_hard = true;
>   	bool retry = true;
>   
>   	/*
> -	 * By default we always try hard to allocate the page with
> -	 * __GFP_RETRY_MAYFAIL flag.  However, if we are allocating pages in
> +	 * By default we always try hard to allocate the folio with
> +	 * __GFP_RETRY_MAYFAIL flag.  However, if we are allocating folios in
>   	 * a loop (to adjust global huge page counts) and previous allocation
>   	 * failed, do not continue to try hard on the same node.  Use the
>   	 * node_alloc_noretry bitmap to manage this state information.
> @@ -2196,43 +2196,42 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
>   	if (nid == NUMA_NO_NODE)
>   		nid = numa_mem_id();
>   retry:
> -	page = __alloc_pages(gfp_mask, order, nid, nmask);
> +	folio = __folio_alloc(gfp_mask, order, nid, nmask);
>   
> -	/* Freeze head page */
> -	if (page && !page_ref_freeze(page, 1)) {
> -		__free_pages(page, order);
> +	if (folio && !folio_ref_freeze(folio, 1)) {
> +		folio_put(folio);
>   		if (retry) {	/* retry once */
>   			retry = false;
>   			goto retry;
>   		}
>   		/* WOW!  twice in a row. */
> -		pr_warn("HugeTLB head page unexpected inflated ref count\n");
> -		page = NULL;
> +		pr_warn("HugeTLB unexpected inflated folio ref count\n");
> +		folio = NULL;
>   	}
>   
>   	/*
> -	 * If we did not specify __GFP_RETRY_MAYFAIL, but still got a page this
> -	 * indicates an overall state change.  Clear bit so that we resume
> -	 * normal 'try hard' allocations.
> +	 * If we did not specify __GFP_RETRY_MAYFAIL, but still got a
> +	 * folio this indicates an overall state change.  Clear bit so
> +	 * that we resume normal 'try hard' allocations.
>   	 */
> -	if (node_alloc_noretry && page && !alloc_try_hard)
> +	if (node_alloc_noretry && folio && !alloc_try_hard)
>   		node_clear(nid, *node_alloc_noretry);
>   
>   	/*
> -	 * If we tried hard to get a page but failed, set bit so that
> +	 * If we tried hard to get a folio but failed, set bit so that
>   	 * subsequent attempts will not try as hard until there is an
>   	 * overall state change.
>   	 */
> -	if (node_alloc_noretry && !page && alloc_try_hard)
> +	if (node_alloc_noretry && !folio && alloc_try_hard)
>   		node_set(nid, *node_alloc_noretry);
>   
> -	if (!page) {
> +	if (!folio) {
>   		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
>   		return NULL;
>   	}
>   
>   	__count_vm_event(HTLB_BUDDY_PGALLOC);
> -	return page_folio(page);
> +	return folio;
>   }
>   
>   static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h,
Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Oscar Salvador April 3, 2024, 4:55 a.m. UTC | #2
On Tue, Apr 02, 2024 at 09:06:54PM +0100, Matthew Wilcox (Oracle) wrote:
> While this function returned a folio, it was still using __alloc_pages()
> and __free_pages().  Use __folio_alloc() and folio_put() instead.  This
> actually removes a call to compound_head(), but more importantly, it
> prepares us for the move to memdescs.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

> -	page = __alloc_pages(gfp_mask, order, nid, nmask);
> +	folio = __folio_alloc(gfp_mask, order, nid, nmask);
>  
> -	/* Freeze head page */
> -	if (page && !page_ref_freeze(page, 1)) {
> -		__free_pages(page, order);
> +	if (folio && !folio_ref_freeze(folio, 1)) {
> +		folio_put(folio);

This made me look again at the problem we had in the past with
speculative refcounts vs hugetlb pages, and made me think about whether there
are any more users trying to defeat speculative refcounts this way.
It was discussed some time ago that maybe all pages returned from the buddy
allocator should have their refcounts frozen, to avoid this.
Muchun Song April 3, 2024, 6 a.m. UTC | #3
> On Apr 3, 2024, at 04:06, Matthew Wilcox (Oracle) <willy@infradead.org> wrote:
> 
> While this function returned a folio, it was still using __alloc_pages()
> and __free_pages().  Use __folio_alloc() and folio_put() instead.  This
> actually removes a call to compound_head(), but more importantly, it
> prepares us for the move to memdescs.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

Thanks.
Muchun Song April 3, 2024, 6:19 a.m. UTC | #4
> On Apr 3, 2024, at 12:55, Oscar Salvador <osalvador@suse.de> wrote:
> 
> On Tue, Apr 02, 2024 at 09:06:54PM +0100, Matthew Wilcox (Oracle) wrote:
>> While this function returned a folio, it was still using __alloc_pages()
>> and __free_pages().  Use __folio_alloc() and folio_put() instead.  This
>> actually removes a call to compound_head(), but more importantly, it
>> prepares us for the move to memdescs.
>> 
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> 
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> 
>> -	page = __alloc_pages(gfp_mask, order, nid, nmask);
>> +	folio = __folio_alloc(gfp_mask, order, nid, nmask);
>> 
>> -	/* Freeze head page */
>> -	if (page && !page_ref_freeze(page, 1)) {
>> -		__free_pages(page, order);
>> +	if (folio && !folio_ref_freeze(folio, 1)) {
>> +		folio_put(folio);
> 
> This made me look again at the problem we had in the past with
> speculative refcounts vs hugetlb pages, and made me think about whether there
> are any more users trying to defeat speculative refcounts this way.
> It was discussed some time ago that maybe all pages returned from the buddy
> allocator should have their refcounts frozen, to avoid this.

I think you mean this patch [1], right? With alloc_frozen_pages()
introduced, we could get rid of the trick from HugeTLB code.

[1] https://lore.kernel.org/linux-mm/B4889CC0-5D36-44E2-B901-CDC5226995A2@oracle.com/T/#m20457752e757cdafc99c7f5e6a3d8cbbb65fcd3e

> 
> 
> -- 
> Oscar Salvador
> SUSE Labs
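
As an aside on what the alloc_frozen_pages() idea referenced in [1] would mean for this path, a rough sketch only -- that interface is not merged in this tree, and the name and signature used below are assumptions, not the final API:

	/*
	 * Hypothetical: if the buddy allocator handed back the folio with its
	 * refcount already frozen at zero, the freeze-and-retry dance in
	 * alloc_buddy_hugetlb_folio() would go away, because a speculative
	 * walker could never have taken a reference in the first place.
	 */
	folio = folio_alloc_frozen(gfp_mask, order, nid, nmask); /* assumed name/signature */
	if (!folio) {
		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
		return NULL;
	}
	__count_vm_event(HTLB_BUDDY_PGALLOC);
	return folio;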
Oscar Salvador April 3, 2024, 7:25 a.m. UTC | #5
On Wed, Apr 03, 2024 at 02:19:19PM +0800, Muchun Song wrote:
> I think you mean this patch [1], right? With alloc_frozen_pages()
> introduced, we could get rid of the trick from HugeTLB code.

Ah yes, that one, thanks.
It would be nice, but having read the discussion I am kind of skeptical.

But maybe something to revisit.
Matthew Wilcox April 3, 2024, 9:17 p.m. UTC | #6
On Wed, Apr 03, 2024 at 09:25:41AM +0200, Oscar Salvador wrote:
> On Wed, Apr 03, 2024 at 02:19:19PM +0800, Muchun Song wrote:
> > I think you mean this patch [1], right? With alloc_frozen_pages()
> > introduced, we could get rid of the trick from HugeTLB code.
> 
> Ah yes, that one, thanks.
> It would be nice, but having read the discussion I am kind of skeptical.
> 
> But maybe something to revisit.

I haven't given up on it.  It's just currently parked, awaiting more
cleanups, some of which I have scheduled for the next merge window.
Part of the memdesc project will involve not having refcounts for some
memdescs.  Slab, percpu and pagetable don't need them, for example.

I think hugetlb is being unnecessarily paranoid here, tbh.  Or maybe
this part is just badly structured; if we're allocating a hugetlb folio,
it should be fine for its refcount to be temporarily elevated by someone
else.  Not sure I can figure out what's going on in
alloc_and_dissolve_hugetlb_folio() though.
Oscar Salvador April 4, 2024, 11:13 a.m. UTC | #7
On Wed, Apr 03, 2024 at 10:17:45PM +0100, Matthew Wilcox wrote:
> I think hugetlb is being unnecessarily paranoid here, tbh.  Or maybe
> this part is just badly structured; if we're allocating a hugetlb folio,
> it should be fine for its refcount to be temporarily elevated by someone
> else.  Not sure I can figure out what's going on in
> alloc_and_dissolve_hugetlb_folio() though.
 
AFAICR, the problem comes when we need to remap the pages for vmemmap
optimization [1].
So, IIUC:

1) if someone comes around and grabs a refcount (say something doing
   speculative stuff)
2) we do the remapping
3) that someone who took the refcount, now does a put_page()
4) vmemmap no longer points to the old page but the new one, meaning
   that that 'put_page()' is done on the wrong page.

@Muchun: Did I get this right?

[1] https://lore.kernel.org/linux-mm/YupRjWRiz4lPo+y7@FVFYT0MHHV2J/
Muchun Song April 7, 2024, 7:14 a.m. UTC | #8
> On Apr 4, 2024, at 19:13, Oscar Salvador <osalvador@suse.de> wrote:
> 
> On Wed, Apr 03, 2024 at 10:17:45PM +0100, Matthew Wilcox wrote:
>> I think hugetlb is being unnecessarily paranoid here, tbh.  Or maybe
>> this part is just badly structured; if we're allocating a hugetlb folio,
>> it should be fine for its refcount to be temporarily elevated by someone
>> else.  Not sure I can figure out what's going on in
>> alloc_and_dissolve_hugetlb_folio() though.
> 
> AFAICR, the problem comes when we need to remap the pages for vmemmap
> optimization [1].
> So, IIUC:
> 
> 1) if someone comes around and grabs a refcount (say something doing
>   speculative stuff)
> 2) we do the remapping
> 3) that someone who took the refcount, now does a put_page()
> 4) vmemmap no longer points to the old page but the new one, meaning
>   that that 'put_page()' is done on the wrong page.
> 
> @Muchun: Did I get this right?

Right. But let me clarify this again. We need to keep the content of the
physical page constant throughout the processing of HVO (i.e. between
copying the head vmemmap page and remapping it). A zero-referenced page
prevents others from updating the content of the page structs.

> 
> [1] https://lore.kernel.org/linux-mm/YupRjWRiz4lPo+y7@FVFYT0MHHV2J/

Yes, I've also pointed it out here.

Thanks.

> 
> 
> -- 
> Oscar Salvador
> SUSE Labs
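
To make the mechanism above concrete: speculative users take their references with get_page_unless_zero()/folio_try_get(), which fail on a page whose refcount is zero, so a folio frozen at zero cannot be pinned while its vmemmap is being copied and remapped. A minimal sketch of the two sides of that pattern (illustrative only, not the actual HVO code):

	/* Speculative side: refuses a folio whose refcount is 0 (i.e. frozen). */
	if (!folio_try_get(folio))
		return NULL;		/* back off; cannot take a reference */

	/* HugeTLB side: freeze the refcount 1 -> 0 before touching the struct
	 * pages, so no new references can appear in the meantime. */
	if (!folio_ref_freeze(folio, 1))
		goto busy;		/* transient reference held by someone else */
	/* ... safe to copy and remap the folio's vmemmap here ... */
	folio_ref_unfreeze(folio, 1);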

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a3bffa8debde..5f1e0b1a0d57 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2177,13 +2177,13 @@  static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
 		nodemask_t *node_alloc_noretry)
 {
 	int order = huge_page_order(h);
-	struct page *page;
+	struct folio *folio;
 	bool alloc_try_hard = true;
 	bool retry = true;
 
 	/*
-	 * By default we always try hard to allocate the page with
-	 * __GFP_RETRY_MAYFAIL flag.  However, if we are allocating pages in
+	 * By default we always try hard to allocate the folio with
+	 * __GFP_RETRY_MAYFAIL flag.  However, if we are allocating folios in
 	 * a loop (to adjust global huge page counts) and previous allocation
 	 * failed, do not continue to try hard on the same node.  Use the
 	 * node_alloc_noretry bitmap to manage this state information.
@@ -2196,43 +2196,42 @@  static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
 	if (nid == NUMA_NO_NODE)
 		nid = numa_mem_id();
 retry:
-	page = __alloc_pages(gfp_mask, order, nid, nmask);
+	folio = __folio_alloc(gfp_mask, order, nid, nmask);
 
-	/* Freeze head page */
-	if (page && !page_ref_freeze(page, 1)) {
-		__free_pages(page, order);
+	if (folio && !folio_ref_freeze(folio, 1)) {
+		folio_put(folio);
 		if (retry) {	/* retry once */
 			retry = false;
 			goto retry;
 		}
 		/* WOW!  twice in a row. */
-		pr_warn("HugeTLB head page unexpected inflated ref count\n");
-		page = NULL;
+		pr_warn("HugeTLB unexpected inflated folio ref count\n");
+		folio = NULL;
 	}
 
 	/*
-	 * If we did not specify __GFP_RETRY_MAYFAIL, but still got a page this
-	 * indicates an overall state change.  Clear bit so that we resume
-	 * normal 'try hard' allocations.
+	 * If we did not specify __GFP_RETRY_MAYFAIL, but still got a
+	 * folio this indicates an overall state change.  Clear bit so
+	 * that we resume normal 'try hard' allocations.
 	 */
-	if (node_alloc_noretry && page && !alloc_try_hard)
+	if (node_alloc_noretry && folio && !alloc_try_hard)
 		node_clear(nid, *node_alloc_noretry);
 
 	/*
-	 * If we tried hard to get a page but failed, set bit so that
+	 * If we tried hard to get a folio but failed, set bit so that
 	 * subsequent attempts will not try as hard until there is an
 	 * overall state change.
 	 */
-	if (node_alloc_noretry && !page && alloc_try_hard)
+	if (node_alloc_noretry && !folio && alloc_try_hard)
 		node_set(nid, *node_alloc_noretry);
 
-	if (!page) {
+	if (!folio) {
 		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 		return NULL;
 	}
 
 	__count_vm_event(HTLB_BUDDY_PGALLOC);
-	return page_folio(page);
+	return folio;
 }
 
 static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h,