
[PATCHv8,02/14] mm: Add support for unaccepted memory

Message ID 20221207014933.8435-3-kirill.shutemov@linux.intel.com (mailing list archive)
State New
Series mm, x86/cc: Implement support for unaccepted memory

Commit Message

Kirill A. Shutemov Dec. 7, 2022, 1:49 a.m. UTC
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during boot. It is easy to implement and it
    doesn't have a runtime cost once the system is booted. The downside
    is a very long boot time.

    Acceptance can be parallelized across multiple CPUs to keep it
    manageable (e.g. via DEFERRED_STRUCT_PAGE_INIT), but it tends to
    saturate memory bandwidth and does not scale beyond that point.

 2. Accept a block of memory on first use. It requires more
    infrastructure and changes in the page allocator to make it work,
    but it provides a good boot time.

    On-demand memory acceptance means latency spikes every time the
    kernel steps onto a new memory block. The spikes will go away once
    the workload's data set size stabilizes or all memory gets accepted.

 3. Accept all memory in the background. Introduce a thread (or
    multiple threads) that accepts memory proactively. It will minimize
    the time the system experiences latency spikes on memory allocation
    while keeping boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires a functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of this approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    the user experience.

Implement #2 for now. It is a reasonable default. Some workloads may
want to use #1 or #3, and they can be implemented later based on user
demand.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - page allocator has to accept memory on the first allocation of the
    page;

The memblock change is trivial.

The page allocator is modified to accept pages. New memory gets accepted
before putting pages on free lists. It is done lazily: only accept new
pages when we run out of already accepted memory.

An architecture has to provide two helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance.
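
For illustration, a minimal sketch of what the arch side could look like,
assuming a hypothetical bitmap in which each set bit marks a 2MB unit of
still-unaccepted memory and a hypothetical arch_accept_range() hook that
performs the platform-specific acceptance protocol (names, granularity and
the missing locking are illustrative only, not part of this patch):

#define UNIT_SHIFT	21			/* 2MB units, illustrative */
#define UNIT_SIZE	(1UL << UNIT_SHIFT)

static unsigned long *unaccepted_bitmap;	/* populated by boot code */

bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
{
	unsigned long last = DIV_ROUND_UP(end, UNIT_SIZE);
	unsigned long unit;

	for (unit = start >> UNIT_SHIFT; unit < last; unit++) {
		if (test_bit(unit, unaccepted_bitmap))
			return true;
	}

	return false;
}

void accept_memory(phys_addr_t start, phys_addr_t end)
{
	unsigned long last = DIV_ROUND_UP(end, UNIT_SIZE);
	unsigned long unit;

	for (unit = start >> UNIT_SHIFT; unit < last; unit++) {
		/* Accept each 2MB unit at most once. */
		if (test_and_clear_bit(unit, unaccepted_bitmap))
			arch_accept_range(unit << UNIT_SHIFT, UNIT_SIZE);
	}
}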

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
---
 include/linux/mmzone.h     |   5 ++
 include/linux/page-flags.h |  24 ++++++++
 mm/internal.h              |  12 ++++
 mm/memblock.c              |   9 +++
 mm/page_alloc.c            | 119 +++++++++++++++++++++++++++++++++++++
 5 files changed, 169 insertions(+)

Comments

Vlastimil Babka Dec. 9, 2022, 6:10 p.m. UTC | #1
On 12/7/22 02:49, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific to the Virtual Machine
> platform.
> 
> There are several ways kernel can deal with unaccepted memory:
> 
>  1. Accept all the memory during the boot. It is easy to implement and
>     it doesn't have runtime cost once the system is booted. The downside
>     is very long boot time.
> 
>     Accept can be parallelized to multiple CPUs to keep it manageable
>     (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>     memory bandwidth and does not scale beyond the point.
> 
>  2. Accept a block of memory on the first use. It requires more
>     infrastructure and changes in page allocator to make it work, but
>     it provides good boot time.
> 
>     On-demand memory accept means latency spikes every time kernel steps
>     onto a new memory block. The spikes will go away once workload data
>     set size gets stabilized or all memory gets accepted.
> 
>  3. Accept all memory in background. Introduce a thread (or multiple)
>     that gets memory accepted proactively. It will minimize time the
>     system experience latency spikes on memory allocation while keeping
>     low boot time.
> 
>     This approach cannot function on its own. It is an extension of #2:
>     background memory acceptance requires functional scheduler, but the
>     page allocator may need to tap into unaccepted memory before that.
> 
>     The downside of the approach is that these threads also steal CPU
>     cycles and memory bandwidth from the user's workload and may hurt
>     user experience.
> 
> Implement #2 for now. It is a reasonable default. Some workloads may
> want to use #1 or #3 and they can be implemented later based on user's
> demands.
> 
> Support of unaccepted memory requires a few changes in core-mm code:
> 
>   - memblock has to accept memory on allocation;
> 
>   - page allocator has to accept memory on the first allocation of the
>     page;
> 
> Memblock change is trivial.
> 
> The page allocator is modified to accept pages. New memory gets accepted
> before putting pages on free lists. It is done lazily: only accept new
> pages when we run out of already accepted memory.
> 
> Architecture has to provide two helpers if it wants to support
> unaccepted memory:
> 
>  - accept_memory() makes a range of physical addresses accepted.
> 
>  - range_contains_unaccepted_memory() checks anything within the range
>    of physical addresses requires acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock

Hi,

the intrusiveness to the page allocator seems to be much lower in this
version than in the older ones, thanks!

> ---
>  include/linux/mmzone.h     |   5 ++
>  include/linux/page-flags.h |  24 ++++++++
>  mm/internal.h              |  12 ++++
>  mm/memblock.c              |   9 +++
>  mm/page_alloc.c            | 119 +++++++++++++++++++++++++++++++++++++
>  5 files changed, 169 insertions(+)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 5f74891556f3..da335381e63f 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -822,6 +822,11 @@ struct zone {
>  	/* free areas of different sizes */
>  	struct free_area	free_area[MAX_ORDER];
>  
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	/* pages to be accepted */
> +	struct list_head	unaccepted_pages;
> +#endif
> +
>  	/* zone flags, see below */
>  	unsigned long		flags;
>  
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 0b0ae5084e60..ce953be8fe10 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -941,6 +941,7 @@ static inline bool is_page_hwpoison(struct page *page)
>  #define PG_offline	0x00000100
>  #define PG_table	0x00000200
>  #define PG_guard	0x00000400
> +#define PG_unaccepted	0x00000800
>  
>  #define PageType(page, flag)						\
>  	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> @@ -966,6 +967,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
>  	page->page_type |= PG_##lname;					\
>  }
>  
> +#define PAGE_TYPE_OPS_FALSE(uname)					\
> +static __always_inline int Page##uname(struct page *page)		\
> +{									\
> +	return false;							\
> +}									\
> +static __always_inline void __SetPage##uname(struct page *page)		\
> +{									\
> +}									\
> +static __always_inline void __ClearPage##uname(struct page *page)	\
> +{									\
> +}

Wonder if we need this. The page type is defined regardless of
CONFIG_UNACCEPTED_MEMORY, and with CONFIG_UNACCEPTED_MEMORY=n nobody will
actually be setting or checking the type, so we can simply keep the
"real" implementations?

> +
>  /*
>   * PageBuddy() indicates that the page is free and in the buddy system
>   * (see mm/page_alloc.c).
> @@ -996,6 +1009,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
>   */
>  PAGE_TYPE_OPS(Offline, offline)
>  
> +/*
> + * PageUnaccepted() indicates that the page has to be "accepted" before it can
> + * be read or written. The page allocator must call accept_page() before
> + * touching the page or returning it to the caller.
> + */
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +PAGE_TYPE_OPS(Unaccepted, unaccepted)
> +#else
> +PAGE_TYPE_OPS_FALSE(Unaccepted)
> +#endif
> +
>  extern void page_offline_freeze(void);
>  extern void page_offline_thaw(void);
>  extern void page_offline_begin(void);
> diff --git a/mm/internal.h b/mm/internal.h
> index 6b7ef495b56d..8ef4f88608ad 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -856,4 +856,16 @@ static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
>  	return !(vma->vm_flags & VM_SOFTDIRTY);
>  }
>  
> +#ifndef CONFIG_UNACCEPTED_MEMORY
> +static inline bool range_contains_unaccepted_memory(phys_addr_t start,
> +						    phys_addr_t end)
> +{
> +	return false;
> +}
> +
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +}
> +#endif
> +
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 511d4783dcf1..3bc404a5352a 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1423,6 +1423,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>  		 */
>  		kmemleak_alloc_phys(found, size, 0);
>  
> +	/*
> +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> +	 * require memory to be accepted before it can be used by the
> +	 * guest.
> +	 *
> +	 * Accept the memory of the allocated buffer.
> +	 */
> +	accept_memory(found, found + size);
> +
>  	return found;
>  }
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e60657875d3..6d597e833a73 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -450,6 +450,11 @@ EXPORT_SYMBOL(nr_online_nodes);
>  
>  int page_group_by_mobility_disabled __read_mostly;
>  
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +/* Counts number of zones with unaccepted pages. */
> +static DEFINE_STATIC_KEY_FALSE(unaccepted_pages);
> +#endif
> +
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  /*
>   * During boot we initialize deferred pages on-demand, as needed, but once
> @@ -1043,12 +1048,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
>  {
>  	struct free_area *area = &zone->free_area[order];
>  
> +	VM_BUG_ON_PAGE(PageUnevictable(page), page);

Hm I think we're not allowed to do VM_BUG_ON anymore per some semi-recent
thread with Linus, as new BUG_ONs are forbidden in general, and Fedora
builds with DEBUG_VM. So probably use VM_WARN_ON everywhere and bail out
where feasible (i.e. probably not here).

>  	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
>  }
>  
>  static inline void del_page_from_free_list(struct page *page, struct zone *zone,
>  					   unsigned int order)
>  {
> +	VM_BUG_ON_PAGE(PageUnevictable(page), page);
> +
>  	/* clear reported state and update reported page count */
>  	if (page_reported(page))
>  		__ClearPageReported(page);
> @@ -1728,6 +1736,97 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>  	__count_vm_events(PGFREE, 1 << order);
>  }
>  
> +static bool page_contains_unaccepted(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	phys_addr_t end = start + (PAGE_SIZE << order);
> +
> +	return range_contains_unaccepted_memory(start, end);
> +}
> +
> +static void accept_page(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +
> +	accept_memory(start, start + (PAGE_SIZE << order));
> +}
> +
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +
> +static bool try_to_accept_memory(struct zone *zone)
> +{
> +	unsigned long flags, order;
> +	struct page *page;
> +	bool last = false;
> +	int migratetype;
> +
> +	if (!static_branch_unlikely(&unaccepted_pages))
> +		return false;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	page = list_first_entry_or_null(&zone->unaccepted_pages,
> +					struct page, lru);
> +	if (!page) {
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +		return false;
> +	}
> +
> +	list_del(&page->lru);
> +	last = list_empty(&zone->unaccepted_pages);
> +
> +	order = page->private;
> +	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
> +
> +	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
> +	__mod_zone_freepage_state(zone, -1 << order, migratetype);
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	if (last)
> +		static_branch_dec(&unaccepted_pages);
> +
> +	accept_page(page, order);
> +	__ClearPageUnaccepted(page);
> +	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> +
> +	return true;
> +}
> +
> +static void __free_unaccepted(struct page *page, unsigned int order)
> +{
> +	struct zone *zone = page_zone(page);
> +	unsigned long flags;
> +	int migratetype;
> +	bool first = false;
> +
> +	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
> +	__SetPageUnaccepted(page);
> +	page->private = order;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	first = list_empty(&zone->unaccepted_pages);
> +	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
> +	list_add_tail(&page->lru, &zone->unaccepted_pages);
> +	__mod_zone_freepage_state(zone, 1 << order, migratetype);
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	if (first)
> +		static_branch_inc(&unaccepted_pages);
> +}
> +
> +#else
> +
> +static bool try_to_accept_memory(struct zone *zone)
> +{
> +	return false;
> +}
> +
> +static void __free_unaccepted(struct page *page, unsigned int order)
> +{
> +	BUILD_BUG();
> +}
> +
> +#endif /* CONFIG_UNACCEPTED_MEMORY */
> +
>  void __free_pages_core(struct page *page, unsigned int order)
>  {
>  	unsigned int nr_pages = 1 << order;
> @@ -1750,6 +1849,13 @@ void __free_pages_core(struct page *page, unsigned int order)
>  
>  	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
>  
> +	if (page_contains_unaccepted(page, order)) {
> +		if (order >= pageblock_order)
> +			return __free_unaccepted(page, order);

Would it be better to only do this when all pages are unaccepted, and accept
all of them as long as at least one in there is already accepted? Or not
worth the trouble of making page_contains_unaccepted() actually count all of
them, which would be necessary.

> +		else
> +			accept_page(page, order);
> +	}
> +
>  	/*
>  	 * Bypass PCP and place fresh pages right to the tail, primarily
>  	 * relevant for memory onlining.
> @@ -1910,6 +2016,9 @@ static void __init deferred_free_range(unsigned long pfn,
>  		return;
>  	}
>  
> +	/* Accept chunks smaller than page-block upfront */
> +	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
> +
>  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>  		if (pageblock_aligned(pfn))
>  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> @@ -4247,6 +4356,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  				       gfp_mask)) {
>  			int ret;
>  
> +			if (try_to_accept_memory(zone))
> +				goto try_this_zone;


This is under !zone_watermark_fast(), but as unaccepted pages are counted as
free pages, I think this is unlikely to trigger in practice AFAIU, since we
should therefore be passing the watermarks even if pages are unaccepted.

That's (intentionally IIRC) different from deferred init which doesn't count
the pages as free until they are initialized.

>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  			/*
>  			 * Watermark failed for this zone, but see if we can
> @@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  
>  			return page;
>  		} else {
> +			if (try_to_accept_memory(zone))
> +				goto try_this_zone;

On the other hand, here we failed the full rmqueue(), including the
potentially fragmenting fallbacks, so I'm worried that before we finally
fail all of that and resort to accepting more memory, we will have already
fragmented the already accepted memory more than necessary.

So one way to prevent that would be to move the acceptance into rmqueue()
so it happens before __rmqueue_fallback(), which I originally had in mind
and maybe suggested previously.

But maybe a less intrusive and more robust way would be to track how much
memory is unaccepted, and actually decrement that amount from free memory
in zone_watermark_fast() in order to force earlier failure of that check and
thus to accept more memory and give us a buffer of truly accepted and
available memory up to the high watermark, which should hopefully prevent
most of the fallbacks. Then the code I flagged above as currently
unnecessary would make perfect sense.

And maybe Mel will have some ideas as well.

> +
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  			/* Try again if zone has deferred pages */
>  			if (static_branch_unlikely(&deferred_pages)) {
> @@ -6935,6 +7050,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>  		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
>  		zone->free_area[order].nr_free = 0;
>  	}
> +
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	INIT_LIST_HEAD(&zone->unaccepted_pages);
> +#endif
>  }
>  
>  /*
Kirill A. Shutemov Dec. 9, 2022, 7:26 p.m. UTC | #2
On Fri, Dec 09, 2022 at 07:10:25PM +0100, Vlastimil Babka wrote:
> On 12/7/22 02:49, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, require memory to be accepted before it can be used by the
> > guest. Accepting happens via a protocol specific to the Virtual Machine
> > platform.
> > 
> > There are several ways kernel can deal with unaccepted memory:
> > 
> >  1. Accept all the memory during the boot. It is easy to implement and
> >     it doesn't have runtime cost once the system is booted. The downside
> >     is very long boot time.
> > 
> >     Accept can be parallelized to multiple CPUs to keep it manageable
> >     (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
> >     memory bandwidth and does not scale beyond the point.
> > 
> >  2. Accept a block of memory on the first use. It requires more
> >     infrastructure and changes in page allocator to make it work, but
> >     it provides good boot time.
> > 
> >     On-demand memory accept means latency spikes every time kernel steps
> >     onto a new memory block. The spikes will go away once workload data
> >     set size gets stabilized or all memory gets accepted.
> > 
> >  3. Accept all memory in background. Introduce a thread (or multiple)
> >     that gets memory accepted proactively. It will minimize time the
> >     system experience latency spikes on memory allocation while keeping
> >     low boot time.
> > 
> >     This approach cannot function on its own. It is an extension of #2:
> >     background memory acceptance requires functional scheduler, but the
> >     page allocator may need to tap into unaccepted memory before that.
> > 
> >     The downside of the approach is that these threads also steal CPU
> >     cycles and memory bandwidth from the user's workload and may hurt
> >     user experience.
> > 
> > Implement #2 for now. It is a reasonable default. Some workloads may
> > want to use #1 or #3 and they can be implemented later based on user's
> > demands.
> > 
> > Support of unaccepted memory requires a few changes in core-mm code:
> > 
> >   - memblock has to accept memory on allocation;
> > 
> >   - page allocator has to accept memory on the first allocation of the
> >     page;
> > 
> > Memblock change is trivial.
> > 
> > The page allocator is modified to accept pages. New memory gets accepted
> > before putting pages on free lists. It is done lazily: only accept new
> > pages when we run out of already accepted memory.
> > 
> > Architecture has to provide two helpers if it wants to support
> > unaccepted memory:
> > 
> >  - accept_memory() makes a range of physical addresses accepted.
> > 
> >  - range_contains_unaccepted_memory() checks anything within the range
> >    of physical addresses requires acceptance.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
> 
> Hi,
> 
> the intrusiveness to page allocator seems to be much better in this version
> than in the older ones, thanks!
> 
> > ---
> >  include/linux/mmzone.h     |   5 ++
> >  include/linux/page-flags.h |  24 ++++++++
> >  mm/internal.h              |  12 ++++
> >  mm/memblock.c              |   9 +++
> >  mm/page_alloc.c            | 119 +++++++++++++++++++++++++++++++++++++
> >  5 files changed, 169 insertions(+)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 5f74891556f3..da335381e63f 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -822,6 +822,11 @@ struct zone {
> >  	/* free areas of different sizes */
> >  	struct free_area	free_area[MAX_ORDER];
> >  
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +	/* pages to be accepted */
> > +	struct list_head	unaccepted_pages;
> > +#endif
> > +
> >  	/* zone flags, see below */
> >  	unsigned long		flags;
> >  
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 0b0ae5084e60..ce953be8fe10 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -941,6 +941,7 @@ static inline bool is_page_hwpoison(struct page *page)
> >  #define PG_offline	0x00000100
> >  #define PG_table	0x00000200
> >  #define PG_guard	0x00000400
> > +#define PG_unaccepted	0x00000800
> >  
> >  #define PageType(page, flag)						\
> >  	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> > @@ -966,6 +967,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
> >  	page->page_type |= PG_##lname;					\
> >  }
> >  
> > +#define PAGE_TYPE_OPS_FALSE(uname)					\
> > +static __always_inline int Page##uname(struct page *page)		\
> > +{									\
> > +	return false;							\
> > +}									\
> > +static __always_inline void __SetPage##uname(struct page *page)		\
> > +{									\
> > +}									\
> > +static __always_inline void __ClearPage##uname(struct page *page)	\
> > +{									\
> > +}
> 
> Wonder if we need this. The page type is defined regardless of
> CONFIG_UNACCEPTED_MEMORY and with CONFIG_UNACCEPTED_MEMORY=n nobody will be
> actually be setting or checking the type, so we can simply just keep the
> "real" implementations?

Okay. Makes sense.

> > +
> >  /*
> >   * PageBuddy() indicates that the page is free and in the buddy system
> >   * (see mm/page_alloc.c).
> > @@ -996,6 +1009,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
> >   */
> >  PAGE_TYPE_OPS(Offline, offline)
> >  
> > +/*
> > + * PageUnaccepted() indicates that the page has to be "accepted" before it can
> > + * be read or written. The page allocator must call accept_page() before
> > + * touching the page or returning it to the caller.
> > + */
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +PAGE_TYPE_OPS(Unaccepted, unaccepted)
> > +#else
> > +PAGE_TYPE_OPS_FALSE(Unaccepted)
> > +#endif
> > +
> >  extern void page_offline_freeze(void);
> >  extern void page_offline_thaw(void);
> >  extern void page_offline_begin(void);
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 6b7ef495b56d..8ef4f88608ad 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -856,4 +856,16 @@ static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
> >  	return !(vma->vm_flags & VM_SOFTDIRTY);
> >  }
> >  
> > +#ifndef CONFIG_UNACCEPTED_MEMORY
> > +static inline bool range_contains_unaccepted_memory(phys_addr_t start,
> > +						    phys_addr_t end)
> > +{
> > +	return false;
> > +}
> > +
> > +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> > +{
> > +}
> > +#endif
> > +
> >  #endif	/* __MM_INTERNAL_H */
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 511d4783dcf1..3bc404a5352a 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1423,6 +1423,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> >  		 */
> >  		kmemleak_alloc_phys(found, size, 0);
> >  
> > +	/*
> > +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> > +	 * require memory to be accepted before it can be used by the
> > +	 * guest.
> > +	 *
> > +	 * Accept the memory of the allocated buffer.
> > +	 */
> > +	accept_memory(found, found + size);
> > +
> >  	return found;
> >  }
> >  
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6e60657875d3..6d597e833a73 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -450,6 +450,11 @@ EXPORT_SYMBOL(nr_online_nodes);
> >  
> >  int page_group_by_mobility_disabled __read_mostly;
> >  
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +/* Counts number of zones with unaccepted pages. */
> > +static DEFINE_STATIC_KEY_FALSE(unaccepted_pages);
> > +#endif
> > +
> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >  /*
> >   * During boot we initialize deferred pages on-demand, as needed, but once
> > @@ -1043,12 +1048,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
> >  {
> >  	struct free_area *area = &zone->free_area[order];
> >  
> > +	VM_BUG_ON_PAGE(PageUnevictable(page), page);
> 
> Hm I think we're not allowed to do VM_BUG_ON anymore per some semi-recent
> thread with Linus, as new BUG_ONs are forbidden in general, and Fedora
> builds with DEBUG_VM. So probably use VM_WARN_ON everywhere and bail out
> where feasible (i.e. probably not here).

Okay, I can downgrade it to VM_WARN_ON_PAGE(), but it will crash soon
after anyway, on the first access to the supposedly free page that is
unaccepted.

> >  	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
> >  }
> >  
> >  static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> >  					   unsigned int order)
> >  {
> > +	VM_BUG_ON_PAGE(PageUnevictable(page), page);
> > +
> >  	/* clear reported state and update reported page count */
> >  	if (page_reported(page))
> >  		__ClearPageReported(page);
> > @@ -1728,6 +1736,97 @@ static void __free_pages_ok(struct page *page, unsigned int order,
> >  	__count_vm_events(PGFREE, 1 << order);
> >  }
> >  
> > +static bool page_contains_unaccepted(struct page *page, unsigned int order)
> > +{
> > +	phys_addr_t start = page_to_phys(page);
> > +	phys_addr_t end = start + (PAGE_SIZE << order);
> > +
> > +	return range_contains_unaccepted_memory(start, end);
> > +}
> > +
> > +static void accept_page(struct page *page, unsigned int order)
> > +{
> > +	phys_addr_t start = page_to_phys(page);
> > +
> > +	accept_memory(start, start + (PAGE_SIZE << order));
> > +}
> > +
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +
> > +static bool try_to_accept_memory(struct zone *zone)
> > +{
> > +	unsigned long flags, order;
> > +	struct page *page;
> > +	bool last = false;
> > +	int migratetype;
> > +
> > +	if (!static_branch_unlikely(&unaccepted_pages))
> > +		return false;
> > +
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +	page = list_first_entry_or_null(&zone->unaccepted_pages,
> > +					struct page, lru);
> > +	if (!page) {
> > +		spin_unlock_irqrestore(&zone->lock, flags);
> > +		return false;
> > +	}
> > +
> > +	list_del(&page->lru);
> > +	last = list_empty(&zone->unaccepted_pages);
> > +
> > +	order = page->private;
> > +	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
> > +
> > +	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
> > +	__mod_zone_freepage_state(zone, -1 << order, migratetype);
> > +	spin_unlock_irqrestore(&zone->lock, flags);
> > +
> > +	if (last)
> > +		static_branch_dec(&unaccepted_pages);
> > +
> > +	accept_page(page, order);
> > +	__ClearPageUnaccepted(page);
> > +	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> > +
> > +	return true;
> > +}
> > +
> > +static void __free_unaccepted(struct page *page, unsigned int order)
> > +{
> > +	struct zone *zone = page_zone(page);
> > +	unsigned long flags;
> > +	int migratetype;
> > +	bool first = false;
> > +
> > +	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
> > +	__SetPageUnaccepted(page);
> > +	page->private = order;
> > +
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +	first = list_empty(&zone->unaccepted_pages);
> > +	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
> > +	list_add_tail(&page->lru, &zone->unaccepted_pages);
> > +	__mod_zone_freepage_state(zone, 1 << order, migratetype);
> > +	spin_unlock_irqrestore(&zone->lock, flags);
> > +
> > +	if (first)
> > +		static_branch_inc(&unaccepted_pages);
> > +}
> > +
> > +#else
> > +
> > +static bool try_to_accept_memory(struct zone *zone)
> > +{
> > +	return false;
> > +}
> > +
> > +static void __free_unaccepted(struct page *page, unsigned int order)
> > +{
> > +	BUILD_BUG();
> > +}
> > +
> > +#endif /* CONFIG_UNACCEPTED_MEMORY */
> > +
> >  void __free_pages_core(struct page *page, unsigned int order)
> >  {
> >  	unsigned int nr_pages = 1 << order;
> > @@ -1750,6 +1849,13 @@ void __free_pages_core(struct page *page, unsigned int order)
> >  
> >  	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
> >  
> > +	if (page_contains_unaccepted(page, order)) {
> > +		if (order >= pageblock_order)
> > +			return __free_unaccepted(page, order);
> 
> Would it be better to only do this when all pages are unaccepted, and accept
> all of them as long as at least one in there is already accepted? Or not
> worth the trouble of making page_contains_unaccepted() actually count all of
> them, which would be necessary.

I don't think it's worth the change.

This is going to be a corner case that affects a few pages that happen to
be on the border between unaccepted and accepted memory. It makes no
difference in practice.

> > +		else
> > +			accept_page(page, order);
> > +	}
> > +
> >  	/*
> >  	 * Bypass PCP and place fresh pages right to the tail, primarily
> >  	 * relevant for memory onlining.
> > @@ -1910,6 +2016,9 @@ static void __init deferred_free_range(unsigned long pfn,
> >  		return;
> >  	}
> >  
> > +	/* Accept chunks smaller than page-block upfront */
> > +	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
> > +
> >  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
> >  		if (pageblock_aligned(pfn))
> >  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > @@ -4247,6 +4356,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  				       gfp_mask)) {
> >  			int ret;
> >  
> > +			if (try_to_accept_memory(zone))
> > +				goto try_this_zone;
> 
> 
> This is under !zone_watermark_fast(), but as unaccepted pages are counted as
> free pages, then I think this is unlikely to trigger in practice AFAIU as we
> should therefore be passing the watermarks even if pages are unaccepted.
> 
> That's (intentionally IIRC) different from deferred init which doesn't count
> the pages as free until they are initialized.

It is intentional in the sense that we don't want to accept pages
proactively. Deferred init needs to be finished before userspace starts.

> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >  			/*
> >  			 * Watermark failed for this zone, but see if we can
> > @@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  
> >  			return page;
> >  		} else {
> > +			if (try_to_accept_memory(zone))
> > +				goto try_this_zone;
> 
> On the other hand, here we failed the full rmqueue(), including the
> potentially fragmenting fallbacks, so I'm worried that before we finally
> fail all of that and resort to accepting more memory, we already fragmented
> the already accepted memory, more than necessary.

I'm not sure I follow. We accept memory in pageblock chunks. Do we want to
allocate from a free pageblock if we have other memory to tap from? It
doesn't make sense to me.

> So one way to prevent would be to move the acceptance into rmqueue() to
> happen before __rmqueue_fallback(), which I originally had in mind and maybe
> suggested that previously.

I guess it should be pretty straightforward to fail __rmqueue_fallback()
if there's a non-empty unaccepted_pages list and steer to
try_to_accept_memory() this way.

But I still don't understand why.

> But maybe less intrusive and more robust way would be to track how much
> memory is unaccepted, and actually decrement that amount  from free memory
> in zone_watermark_fast() in order to force earlier failure of that check and
> thus to accept more memory and give us a buffer of truly accepted and
> available memory up to high watermark, which should hopefully prevent most
> of the fallbacks. Then the code I flagged above as currently unecessary
> would make perfect sense.

The next patch adds per-node unaccepted memory accounting. We can move it
per-zone if it would help.

> And maybe Mel will have some ideas as well.

I don't have much expertise in the page allocator. Any input is valuable.

> > +
> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >  			/* Try again if zone has deferred pages */
> >  			if (static_branch_unlikely(&deferred_pages)) {
> > @@ -6935,6 +7050,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
> >  		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
> >  		zone->free_area[order].nr_free = 0;
> >  	}
> > +
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +	INIT_LIST_HEAD(&zone->unaccepted_pages);
> > +#endif
> >  }
> >  
> >  /*
>
Vlastimil Babka Dec. 9, 2022, 10:23 p.m. UTC | #3
On 12/9/22 20:26, Kirill A. Shutemov wrote:
>> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>> >  			/*
>> >  			 * Watermark failed for this zone, but see if we can
>> > @@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>> >  
>> >  			return page;
>> >  		} else {
>> > +			if (try_to_accept_memory(zone))
>> > +				goto try_this_zone;
>> 
>> On the other hand, here we failed the full rmqueue(), including the
>> potentially fragmenting fallbacks, so I'm worried that before we finally
>> fail all of that and resort to accepting more memory, we already fragmented
>> the already accepted memory, more than necessary.
> 
> I'm not sure I follow. We accept memory in pageblock chunks. Do we want to
> allocate from a free pageblock if we have other memory to tap from? It
> doesn't make sense to me.

The fragmentation avoidance based on migratetype does work with pageblock
granularity, so yeah, if you accept a single pageblock worth of memory and
then (through __rmqueue_fallback()) end up serving both movable and
unmovable allocations from it, the whole fragmentation avoidance mechanism
is defeated and you end up with unmovable allocations (e.g. page tables)
scattered over many pageblocks and inability to allocate any huge pages.

>> So one way to prevent would be to move the acceptance into rmqueue() to
>> happen before __rmqueue_fallback(), which I originally had in mind and maybe
>> suggested that previously.
> 
> I guess it should be pretty straight forward to fail __rmqueue_fallback()
> if there's non-empty unaccepted_pages list and steer to
> try_to_accept_memory() this way.

That could be a way indeed. We do have ALLOC_NOFRAGMENT, which it might be
possible to employ here.
But maybe the zone_watermark_fast() modification would be simpler yet
sufficient. It makes sense to me that we'd try to keep a high watermark
worth of pre-accepted memory. zone_watermark_fast() would fail at the low
watermark, so we could try accepting (high-low) at a time instead of a
single pageblock.

> But I still don't understand why.

To avoid what I described above.

>> But maybe less intrusive and more robust way would be to track how much
>> memory is unaccepted, and actually decrement that amount  from free memory
>> in zone_watermark_fast() in order to force earlier failure of that check and
>> thus to accept more memory and give us a buffer of truly accepted and
>> available memory up to high watermark, which should hopefully prevent most
>> of the fallbacks. Then the code I flagged above as currently unecessary
>> would make perfect sense.
> 
> The next patch adds per-node unaccepted memory accounting. We can move it
> per-zone if it would help.

Right.

>> And maybe Mel will have some ideas as well.
> 
> I don't have much expertise in page allocator. Any input is valuable.
> 
>> > +
>> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>> >  			/* Try again if zone has deferred pages */
>> >  			if (static_branch_unlikely(&deferred_pages)) {
>> > @@ -6935,6 +7050,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>> >  		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
>> >  		zone->free_area[order].nr_free = 0;
>> >  	}
>> > +
>> > +#ifdef CONFIG_UNACCEPTED_MEMORY
>> > +	INIT_LIST_HEAD(&zone->unaccepted_pages);
>> > +#endif
>> >  }
>> >  
>> >  /*
>> 
>
Kirill A. Shutemov Dec. 24, 2022, 4:46 p.m. UTC | #4
On Fri, Dec 09, 2022 at 11:23:50PM +0100, Vlastimil Babka wrote:
> On 12/9/22 20:26, Kirill A. Shutemov wrote:
> >> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >> >  			/*
> >> >  			 * Watermark failed for this zone, but see if we can
> >> > @@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >> >  
> >> >  			return page;
> >> >  		} else {
> >> > +			if (try_to_accept_memory(zone))
> >> > +				goto try_this_zone;
> >> 
> >> On the other hand, here we failed the full rmqueue(), including the
> >> potentially fragmenting fallbacks, so I'm worried that before we finally
> >> fail all of that and resort to accepting more memory, we already fragmented
> >> the already accepted memory, more than necessary.
> > 
> > I'm not sure I follow. We accept memory in pageblock chunks. Do we want to
> > allocate from a free pageblock if we have other memory to tap from? It
> > doesn't make sense to me.
> 
> The fragmentation avoidance based on migratetype does work with pageblock
> granularity, so yeah, if you accept a single pageblock worth of memory and
> then (through __rmqueue_fallback()) end up serving both movable and
> unmovable allocations from it, the whole fragmentation avoidance mechanism
> is defeated and you end up with unmovable allocations (e.g. page tables)
> scattered over many pageblocks and inability to allocate any huge pages.
> 
> >> So one way to prevent would be to move the acceptance into rmqueue() to
> >> happen before __rmqueue_fallback(), which I originally had in mind and maybe
> >> suggested that previously.
> > 
> > I guess it should be pretty straight forward to fail __rmqueue_fallback()
> > if there's non-empty unaccepted_pages list and steer to
> > try_to_accept_memory() this way.
> 
> That could be a way indeed. We do have ALLOC_NOFRAGMENT which could be
> possible to employ here.
> But maybe the zone_watermark_fast() modification would be simpler yet
> sufficient. It makes sense to me that we'd try to keep a high watermark
> worth of pre-accepted memory. zone_watermark_fast() would fail at low
> watermark, so we could try accepting (high-low) at a time instead of single
> pageblock.

Looks like we already have __zone_watermark_unusable_free(), which seems
to match the use case rather closely. We only need to switch unaccepted
memory to per-zone accounting.

The fixup below is supposed to do the trick, but I'm not sure how to test
fragmentation avoidance properly.

Any suggestions?

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ca6f0590be21..1bd2d245edee 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -483,7 +483,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
 #ifdef CONFIG_UNACCEPTED_MEMORY
 			     ,
-			     nid, K(node_page_state(pgdat, NR_UNACCEPTED))
+			     nid, K(sum_zone_node_page_state(nid, NR_UNACCEPTED))
 #endif
 			    );
 	len += hugetlb_report_node_meminfo(buf, len, nid);
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 789b77c7b6df..e9c05b4c457c 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -157,7 +157,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 
 #ifdef CONFIG_UNACCEPTED_MEMORY
 	show_val_kb(m, "Unaccepted:     ",
-		    global_node_page_state(NR_UNACCEPTED));
+		    global_zone_page_state(NR_UNACCEPTED));
 #endif
 
 	hugetlb_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9c762e8175fc..8b5800cd4424 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -152,6 +152,9 @@ enum zone_stat_item {
 	NR_ZSPAGES,		/* allocated in zsmalloc */
 #endif
 	NR_FREE_CMA_PAGES,
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	NR_UNACCEPTED,
+#endif
 	NR_VM_ZONE_STAT_ITEMS };
 
 enum node_stat_item {
@@ -198,9 +201,6 @@ enum node_stat_item {
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
 	NR_KERNEL_STACK_KB,	/* measured in KiB */
-#ifdef CONFIG_UNACCEPTED_MEMORY
-	NR_UNACCEPTED,
-#endif
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	NR_KERNEL_SCS_KB,	/* measured in KiB */
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e80e8d398863..404b267332a9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1779,7 +1779,7 @@ static bool try_to_accept_memory(struct zone *zone)
 
 	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
 	__mod_zone_freepage_state(zone, -1 << order, migratetype);
-	__mod_node_page_state(page_pgdat(page), NR_UNACCEPTED, -1 << order);
+	__mod_zone_page_state(zone, NR_UNACCEPTED, -1 << order);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	if (last)
@@ -1808,7 +1808,7 @@ static void __free_unaccepted(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
 	list_add_tail(&page->lru, &zone->unaccepted_pages);
 	__mod_zone_freepage_state(zone, 1 << order, migratetype);
-	__mod_node_page_state(page_pgdat(page), NR_UNACCEPTED, 1 << order);
+	__mod_zone_page_state(zone, NR_UNACCEPTED, 1 << order);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	if (first)
@@ -4074,6 +4074,9 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
 	if (!(alloc_flags & ALLOC_CMA))
 		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
 #endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	unusable_free += zone_page_state(z, NR_UNACCEPTED);
+#endif
 
 	return unusable_free;
 }
Borislav Petkov Dec. 26, 2022, 12:23 p.m. UTC | #5
On Wed, Dec 07, 2022 at 04:49:21AM +0300, Kirill A. Shutemov wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e60657875d3..6d597e833a73 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -450,6 +450,11 @@ EXPORT_SYMBOL(nr_online_nodes);
>  
>  int page_group_by_mobility_disabled __read_mostly;
>  
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +/* Counts number of zones with unaccepted pages. */
> +static DEFINE_STATIC_KEY_FALSE(unaccepted_pages);

Then call it what it is:

static DEFINE_STATIC_KEY_FALSE(zones_with_unaccepted_pages);

so that when reading the code it is self-describing:

        if (!static_branch_unlikely(&zones_with_unaccepted_pages))
                return false;
Kirill A. Shutemov Dec. 27, 2022, 3:18 a.m. UTC | #6
On Mon, Dec 26, 2022 at 01:23:18PM +0100, Borislav Petkov wrote:
> On Wed, Dec 07, 2022 at 04:49:21AM +0300, Kirill A. Shutemov wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6e60657875d3..6d597e833a73 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -450,6 +450,11 @@ EXPORT_SYMBOL(nr_online_nodes);
> >  
> >  int page_group_by_mobility_disabled __read_mostly;
> >  
> > +#ifdef CONFIG_UNACCEPTED_MEMORY
> > +/* Counts number of zones with unaccepted pages. */
> > +static DEFINE_STATIC_KEY_FALSE(unaccepted_pages);
> 
> Then call it what it is:
> 
> static DEFINE_STATIC_KEY_FALSE(zones_with_unaccepted_pages);
> 
> so that when reading the code it is self-describing:
> 
>         if (!static_branch_unlikely(&zones_with_unaccepted_pages))
>                 return false;

Looks good. Will fix for the next version.
Vlastimil Babka Jan. 12, 2023, 11:59 a.m. UTC | #7
On 12/24/22 17:46, Kirill A. Shutemov wrote:
> On Fri, Dec 09, 2022 at 11:23:50PM +0100, Vlastimil Babka wrote:
>> On 12/9/22 20:26, Kirill A. Shutemov wrote:
>> >> >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>> >> >  			/*
>> >> >  			 * Watermark failed for this zone, but see if we can
>> >> > @@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>> >> >  
>> >> >  			return page;
>> >> >  		} else {
>> >> > +			if (try_to_accept_memory(zone))
>> >> > +				goto try_this_zone;
>> >> 
>> >> On the other hand, here we failed the full rmqueue(), including the
>> >> potentially fragmenting fallbacks, so I'm worried that before we finally
>> >> fail all of that and resort to accepting more memory, we already fragmented
>> >> the already accepted memory, more than necessary.
>> > 
>> > I'm not sure I follow. We accept memory in pageblock chunks. Do we want to
>> > allocate from a free pageblock if we have other memory to tap from? It
>> > doesn't make sense to me.
>> 
>> The fragmentation avoidance based on migratetype does work with pageblock
>> granularity, so yeah, if you accept a single pageblock worth of memory and
>> then (through __rmqueue_fallback()) end up serving both movable and
>> unmovable allocations from it, the whole fragmentation avoidance mechanism
>> is defeated and you end up with unmovable allocations (e.g. page tables)
>> scattered over many pageblocks and inability to allocate any huge pages.
>> 
>> >> So one way to prevent would be to move the acceptance into rmqueue() to
>> >> happen before __rmqueue_fallback(), which I originally had in mind and maybe
>> >> suggested that previously.
>> > 
>> > I guess it should be pretty straight forward to fail __rmqueue_fallback()
>> > if there's non-empty unaccepted_pages list and steer to
>> > try_to_accept_memory() this way.
>> 
>> That could be a way indeed. We do have ALLOC_NOFRAGMENT which could be
>> possible to employ here.
>> But maybe the zone_watermark_fast() modification would be simpler yet
>> sufficient. It makes sense to me that we'd try to keep a high watermark
>> worth of pre-accepted memory. zone_watermark_fast() would fail at low
>> watermark, so we could try accepting (high-low) at a time instead of single
>> pageblock.
> 
> Looks like we already have __zone_watermark_unusable_free() that seems
> match use-case rather closely. We only need switch unaccepted memory to
> per-zone accounting.

Could work. I'd still suggest also making try_to_accept_memory() accept
up to the high watermark, not a single pageblock.
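
For illustration, a rough sketch of that direction, assuming a hypothetical
try_to_accept_memory_one() helper carrying roughly the current body of
try_to_accept_memory() (accept one chunk from zone->unaccepted_pages and
return false once the list is empty):

static bool try_to_accept_memory(struct zone *zone)
{
	long to_accept;
	bool ret = false;

	/* How much is missing to get back above the high watermark? */
	to_accept = high_wmark_pages(zone) -
		    (zone_page_state(zone, NR_FREE_PAGES) -
		     __zone_watermark_unusable_free(zone, 0, 0));

	/* Accept one chunk at a time until the deficit is covered. */
	while (to_accept > 0) {
		if (!try_to_accept_memory_one(zone))
			break;
		ret = true;
		to_accept -= pageblock_nr_pages;
	}

	return ret;
}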
> The fixup below suppose to do the trick, but I'm not sure how to test
> fragmentation avoidance properly.
> 
> Any suggestions?

Haven't done that for years, maybe Mel knows better. But from what I
remember, I'd compare /proc/pagetypeinfo with and without memory acceptance,
and collect the mm_page_alloc_extfrag tracepoint. If there are more of these
events happening, it's bad. Ideally with a workload that stresses both
userspace (movable) allocations and kernel allocations. Again, Mel might
have suggestions for an mmtests config?

> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index ca6f0590be21..1bd2d245edee 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -483,7 +483,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  #endif
>  #ifdef CONFIG_UNACCEPTED_MEMORY
>  			     ,
> -			     nid, K(node_page_state(pgdat, NR_UNACCEPTED))
> +			     nid, K(sum_zone_node_page_state(nid, NR_UNACCEPTED))
>  #endif
>  			    );
>  	len += hugetlb_report_node_meminfo(buf, len, nid);
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 789b77c7b6df..e9c05b4c457c 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -157,7 +157,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  
>  #ifdef CONFIG_UNACCEPTED_MEMORY
>  	show_val_kb(m, "Unaccepted:     ",
> -		    global_node_page_state(NR_UNACCEPTED));
> +		    global_zone_page_state(NR_UNACCEPTED));
>  #endif
>  
>  	hugetlb_report_meminfo(m);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 9c762e8175fc..8b5800cd4424 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -152,6 +152,9 @@ enum zone_stat_item {
>  	NR_ZSPAGES,		/* allocated in zsmalloc */
>  #endif
>  	NR_FREE_CMA_PAGES,
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	NR_UNACCEPTED,
> +#endif
>  	NR_VM_ZONE_STAT_ITEMS };
>  
>  enum node_stat_item {
> @@ -198,9 +201,6 @@ enum node_stat_item {
>  	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
>  	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
>  	NR_KERNEL_STACK_KB,	/* measured in KiB */
> -#ifdef CONFIG_UNACCEPTED_MEMORY
> -	NR_UNACCEPTED,
> -#endif
>  #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
>  	NR_KERNEL_SCS_KB,	/* measured in KiB */
>  #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e80e8d398863..404b267332a9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1779,7 +1779,7 @@ static bool try_to_accept_memory(struct zone *zone)
>  
>  	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
>  	__mod_zone_freepage_state(zone, -1 << order, migratetype);
> -	__mod_node_page_state(page_pgdat(page), NR_UNACCEPTED, -1 << order);
> +	__mod_zone_page_state(zone, NR_UNACCEPTED, -1 << order);
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
>  	if (last)
> @@ -1808,7 +1808,7 @@ static void __free_unaccepted(struct page *page, unsigned int order)
>  	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
>  	list_add_tail(&page->lru, &zone->unaccepted_pages);
>  	__mod_zone_freepage_state(zone, 1 << order, migratetype);
> -	__mod_node_page_state(page_pgdat(page), NR_UNACCEPTED, 1 << order);
> +	__mod_zone_page_state(zone, NR_UNACCEPTED, 1 << order);
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
>  	if (first)
> @@ -4074,6 +4074,9 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
>  	if (!(alloc_flags & ALLOC_CMA))
>  		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
>  #endif
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	unusable_free += zone_page_state(z, NR_UNACCEPTED);
> +#endif
>  
>  	return unusable_free;
>  }
Mel Gorman Jan. 16, 2023, 1:04 p.m. UTC | #8
On Wed, Dec 07, 2022 at 04:49:21AM +0300, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific to the Virtual Machine
> platform.
> 
> There are several ways kernel can deal with unaccepted memory:
> 
>  1. Accept all the memory during the boot. It is easy to implement and
>     it doesn't have runtime cost once the system is booted. The downside
>     is very long boot time.
> 
>     Accept can be parallelized to multiple CPUs to keep it manageable
>     (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>     memory bandwidth and does not scale beyond the point.
> 

This should be an option via the kernel command line if at all possible,
and as soon as possible, because if issues like fragmentation show up as
highlighted by Vlastimil, then there will be a need to do a comparison.
Furthermore, while boot time is the primary focus of this series, it
ignores the ramp-up time of an application that ultimately uses a high
percentage of memory. A latency-sensitive application that stalls while
accepting memory during ramp-up may be more undesirable than a long boot
time because it's unpredictable. While it's not exactly the same, something
similar sometimes happens for ext4 with lazy inode init after filesystem
creation, and that feature can be disabled because it screws with
benchmarks based on fresh filesystems. That problem only happens on
filesystem init though, not on every boot.
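
A sketch of what such a command-line switch could look like (purely
illustrative; the parameter name, values and handler are assumptions, not
something this series provides):

static bool accept_memory_eagerly __ro_after_init;

static int __init accept_memory_setup(char *str)
{
	if (!str)
		return -EINVAL;

	if (!strcmp(str, "eager"))
		accept_memory_eagerly = true;	/* option #1: accept all at boot */
	else if (!strcmp(str, "lazy"))
		accept_memory_eagerly = false;	/* option #2: accept on first use */
	else
		return -EINVAL;

	return 0;
}
early_param("accept_memory", accept_memory_setup);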

>  2. Accept a block of memory on the first use. It requires more
>     infrastructure and changes in page allocator to make it work, but
>     it provides good boot time.
> 
>     On-demand memory accept means latency spikes every time kernel steps
>     onto a new memory block. The spikes will go away once workload data
>     set size gets stabilized or all memory gets accepted.
> 

For very large machines (and > 1TB virtual machines do exist), this may
take a long time.

>  3. Accept all memory in background. Introduce a thread (or multiple)
>     that gets memory accepted proactively. It will minimize time the
>     system experience latency spikes on memory allocation while keeping
>     low boot time.
> 
>     This approach cannot function on its own. It is an extension of #2:
>     background memory acceptance requires functional scheduler, but the
>     page allocator may need to tap into unaccepted memory before that.
> 
>     The downside of the approach is that these threads also steal CPU
>     cycles and memory bandwidth from the user's workload and may hurt
>     user experience.
> 

This is not necessarily true: while the background thread (or threads,
one per zone, each using one local CPU) is running, an allocation request
could block on a waitqueue until the background thread makes progress.
That means the application would only stall if the background thread
is too slow or there are too many allocation requests. With that setup,
it would be possible for an application to never block on accepting memory.
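
A rough sketch of the shape this could take, with thread creation, locking
and lifetime management omitted and all names illustrative (the waitqueue
and the per-zone kthread are assumptions, not part of this series):

static DECLARE_WAIT_QUEUE_HEAD(accept_memory_wq);

static int accept_memory_kthread(void *data)
{
	struct zone *zone = data;

	/* Proactively accept memory, one chunk per iteration. */
	while (!kthread_should_stop() && try_to_accept_memory(zone)) {
		/* Wake allocation requests stalled on unaccepted memory. */
		wake_up_all(&accept_memory_wq);
		cond_resched();
	}

	return 0;
}

/*
 * Instead of accepting memory itself, an allocating task that fails the
 * watermark check could block until the background thread catches up:
 */
static void wait_for_accepted_memory(struct zone *zone)
{
	wait_event(accept_memory_wq,
		   list_empty(&zone->unaccepted_pages) ||
		   zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0));
}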


> Implement #2 for now. It is a reasonable default. Some workloads may
> want to use #1 or #3 and they can be implemented later based on user's
> demands.
> 
> Support of unaccepted memory requires a few changes in core-mm code:
> 
>   - memblock has to accept memory on allocation;
> 
>   - page allocator has to accept memory on the first allocation of the
>     page;
> 
> Memblock change is trivial.
> 
> The page allocator is modified to accept pages. New memory gets accepted
> before putting pages on free lists. It is done lazily: only accept new
> pages when we run out of already accepted memory.
> 

And this is the primary disadvantage of 2 compared to 3: stalls are
guaranteed for some allocating tasks until all memory is accepted.

> Architecture has to provide two helpers if it wants to support
> unaccepted memory:
> 
>  - accept_memory() makes a range of physical addresses accepted.
> 

accept_memory is introduced later in the series and it's an arch-specific
decision on what range to accept. pageblock order would be the minimum
to help fragmentation avoidance, but MAX_ORDER-1 would make more sense.
MAX_ORDER-1 would amortise the cost of acceptance, reduce potential issues
with fragmentation and reduce the total number of stalls required to
accept memory. The minimum range to accept should be generically enforced
in accept_memory, which calls arch_accept_memory with the range. accept_page
-> accept_memory is fine, but the core->arch-specific divide should be clear.
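
A minimal sketch of that core/arch split, with the generic side enforcing a
MAX_ORDER-1 granularity before calling into the architecture (illustrative
only; clamping to valid memory ranges and locking are omitted, and
arch_accept_memory() is the arch hook named above, not something this patch
defines):

void accept_memory(phys_addr_t start, phys_addr_t end)
{
	/* Never accept less than a MAX_ORDER-1 aligned chunk at a time. */
	phys_addr_t unit = PAGE_SIZE << (MAX_ORDER - 1);

	start = round_down(start, unit);
	end = round_up(end, unit);

	if (range_contains_unaccepted_memory(start, end))
		arch_accept_memory(start, end);
}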

>  - range_contains_unaccepted_memory() checks anything within the range
>    of physical addresses requires acceptance.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
> ---
>  include/linux/mmzone.h     |   5 ++
>  include/linux/page-flags.h |  24 ++++++++
>  mm/internal.h              |  12 ++++
>  mm/memblock.c              |   9 +++
>  mm/page_alloc.c            | 119 +++++++++++++++++++++++++++++++++++++
>  5 files changed, 169 insertions(+)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 5f74891556f3..da335381e63f 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -822,6 +822,11 @@ struct zone {
>  	/* free areas of different sizes */
>  	struct free_area	free_area[MAX_ORDER];
>  
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	/* pages to be accepted */
> +	struct list_head	unaccepted_pages;
> +#endif
> +

If MAX_ORDER-1 ranges were accepted at a time, it could be documented that
the list_head only contains pages of a single, fixed order.

>  	/* zone flags, see below */
>  	unsigned long		flags;
>  
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 0b0ae5084e60..ce953be8fe10 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -941,6 +941,7 @@ static inline bool is_page_hwpoison(struct page *page)
>  #define PG_offline	0x00000100
>  #define PG_table	0x00000200
>  #define PG_guard	0x00000400
> +#define PG_unaccepted	0x00000800
>  
>  #define PageType(page, flag)						\
>  	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> @@ -966,6 +967,18 @@ static __always_inline void __ClearPage##uname(struct page *page)	\
>  	page->page_type |= PG_##lname;					\
>  }
>  
> +#define PAGE_TYPE_OPS_FALSE(uname)					\
> +static __always_inline int Page##uname(struct page *page)		\
> +{									\
> +	return false;							\
> +}									\
> +static __always_inline void __SetPage##uname(struct page *page)		\
> +{									\
> +}									\
> +static __always_inline void __ClearPage##uname(struct page *page)	\
> +{									\
> +}
> +
>  /*
>   * PageBuddy() indicates that the page is free and in the buddy system
>   * (see mm/page_alloc.c).
> @@ -996,6 +1009,17 @@ PAGE_TYPE_OPS(Buddy, buddy)
>   */
>  PAGE_TYPE_OPS(Offline, offline)
>  
> +/*
> + * PageUnaccepted() indicates that the page has to be "accepted" before it can
> + * be read or written. The page allocator must call accept_page() before
> + * touching the page or returning it to the caller.
> + */
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +PAGE_TYPE_OPS(Unaccepted, unaccepted)
> +#else
> +PAGE_TYPE_OPS_FALSE(Unaccepted)
> +#endif
> +
>  extern void page_offline_freeze(void);
>  extern void page_offline_thaw(void);
>  extern void page_offline_begin(void);
> diff --git a/mm/internal.h b/mm/internal.h
> index 6b7ef495b56d..8ef4f88608ad 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -856,4 +856,16 @@ static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
>  	return !(vma->vm_flags & VM_SOFTDIRTY);
>  }
>  
> +#ifndef CONFIG_UNACCEPTED_MEMORY
> +static inline bool range_contains_unaccepted_memory(phys_addr_t start,
> +						    phys_addr_t end)
> +{
> +	return false;
> +}
> +
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +}
> +#endif
> +
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 511d4783dcf1..3bc404a5352a 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1423,6 +1423,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
>  		 */
>  		kmemleak_alloc_phys(found, size, 0);
>  
> +	/*
> +	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
> +	 * require memory to be accepted before it can be used by the
> +	 * guest.
> +	 *
> +	 * Accept the memory of the allocated buffer.
> +	 */
> +	accept_memory(found, found + size);
> +
>  	return found;
>  }
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e60657875d3..6d597e833a73 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -450,6 +450,11 @@ EXPORT_SYMBOL(nr_online_nodes);
>  
>  int page_group_by_mobility_disabled __read_mostly;
>  
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +/* Counts number of zones with unaccepted pages. */
> +static DEFINE_STATIC_KEY_FALSE(unaccepted_pages);
> +#endif
> +
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  /*
>   * During boot we initialize deferred pages on-demand, as needed, but once
> @@ -1043,12 +1048,15 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
>  {
>  	struct free_area *area = &zone->free_area[order];
>  
> +	VM_BUG_ON_PAGE(PageUnevictable(page), page);
>  	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
>  }
>  
>  static inline void del_page_from_free_list(struct page *page, struct zone *zone,
>  					   unsigned int order)
>  {
> +	VM_BUG_ON_PAGE(PageUnevictable(page), page);
> +
>  	/* clear reported state and update reported page count */
>  	if (page_reported(page))
>  		__ClearPageReported(page);

Aside from what Vlastimil said, it's not commented *why* these VM_BUG_ON_PAGE
(or WARN) checks could trigger and why Unevictable specifically. It could be a
long time after these patches are merged before they are used in production,
so if it does happen, it'd be nice to have a comment explaining why it could
occur.
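
For example, assuming the intent is to catch an unaccepted page reaching the
free lists (sketch only, not necessarily the check that was meant here):

	/*
	 * Unaccepted pages are parked on zone->unaccepted_pages and only
	 * enter the buddy allocator via try_to_accept_memory(). If this
	 * fires, a page was put on a free list without being accepted.
	 */
	VM_BUG_ON_PAGE(PageUnaccepted(page), page);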

> @@ -1728,6 +1736,97 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>  	__count_vm_events(PGFREE, 1 << order);
>  }
>  
> +static bool page_contains_unaccepted(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +	phys_addr_t end = start + (PAGE_SIZE << order);
> +
> +	return range_contains_unaccepted_memory(start, end);
> +}
> +

If pages are accepted in MAX_ORDER-1 ranges enforced by the core, it will
not be necessary to check the whole range; checking one page will be enough
to know whether the whole block needs to be accepted.
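
Under that assumption the helper collapses to something like (untested):

static bool page_contains_unaccepted(struct page *page, unsigned int order)
{
	phys_addr_t start = page_to_phys(page);

	/*
	 * Acceptance is done in whole MAX_ORDER-1 blocks, so the state of
	 * the first page covers the entire block.
	 */
	return range_contains_unaccepted_memory(start, start + PAGE_SIZE);
}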

> +static void accept_page(struct page *page, unsigned int order)
> +{
> +	phys_addr_t start = page_to_phys(page);
> +
> +	accept_memory(start, start + (PAGE_SIZE << order));
> +}
> +

Generically enforce a minimum range -- MAX_ORDER-1 recommended.

> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +
> +static bool try_to_accept_memory(struct zone *zone)
> +{
> +	unsigned long flags, order;
> +	struct page *page;
> +	bool last = false;
> +	int migratetype;
> +
> +	if (!static_branch_unlikely(&unaccepted_pages))
> +		return false;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	page = list_first_entry_or_null(&zone->unaccepted_pages,
> +					struct page, lru);
> +	if (!page) {
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +		return false;
> +	}
> +

This could probably be checked locklessly first, as the number of unaccepted
pages only goes down, never up, unless memory hotplug is a factor?
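
Something like this (untested) would do as a lockless early exit, and the
same helper is handy elsewhere:

static inline bool zone_has_unaccepted_pages(struct zone *zone)
{
	/*
	 * Lockless check: after boot the unaccepted list only shrinks
	 * (barring memory hotplug), so a stale answer is harmless and at
	 * worst costs one more check under zone->lock.
	 */
	return !list_empty_careful(&zone->unaccepted_pages);
}

try_to_accept_memory() could then return false before taking zone->lock
whenever this says the list is empty.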

> +	list_del(&page->lru);
> +	last = list_empty(&zone->unaccepted_pages);
> +
> +	order = page->private;

If blocks are released in a fixed size such as MAX_ORDER-1, then the order
stored in page->private is irrelevant.

> +	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
> +
> +	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));

As is the migratetype, because the migratetype will become whichever type
first splits at least a pageblock out of the range.

> +	__mod_zone_freepage_state(zone, -1 << order, migratetype);
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	if (last)
> +		static_branch_dec(&unaccepted_pages);
> +
> +	accept_page(page, order);
> +	__ClearPageUnaccepted(page);
> +	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
> +

If accept_page returned the first struct page of the block it freed, this
could be simplified. The corner cases, zone boundaries or holes within a zone
that are not MAX_ORDER-1 aligned, should be accepted unconditionally in early
boot.

> +	return true;
> +}
> +
> +static void __free_unaccepted(struct page *page, unsigned int order)
> +{
> +	struct zone *zone = page_zone(page);
> +	unsigned long flags;
> +	int migratetype;
> +	bool first = false;
> +
> +	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);

Regardless of forced alignment, this should be a WARN_ON that bails out. It
will most likely trigger early in boot and, if it affects one machine, it
will happen every time.
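
i.e. something like (untested; WARN_ON_ONCE to avoid log spam given it would
likely trigger repeatedly):

	if (WARN_ON_ONCE(order > MAX_ORDER || order < pageblock_order))
		return;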

> +	__SetPageUnaccepted(page);
> +	page->private = order;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	first = list_empty(&zone->unaccepted_pages);
> +	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
> +	list_add_tail(&page->lru, &zone->unaccepted_pages);
> +	__mod_zone_freepage_state(zone, 1 << order, migratetype);
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +

Enforcing alignment would also simplify fragmentation handling here.

> +	if (first)
> +		static_branch_inc(&unaccepted_pages);
> +}
> +
> +#else
> +
> +static bool try_to_accept_memory(struct zone *zone)
> +{
> +	return false;
> +}
> +
> +static void __free_unaccepted(struct page *page, unsigned int order)
> +{
> +	BUILD_BUG();
> +}
> +
> +#endif /* CONFIG_UNACCEPTED_MEMORY */
> +
>  void __free_pages_core(struct page *page, unsigned int order)
>  {
>  	unsigned int nr_pages = 1 << order;
> @@ -1750,6 +1849,13 @@ void __free_pages_core(struct page *page, unsigned int order)
>  
>  	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
>  
> +	if (page_contains_unaccepted(page, order)) {
> +		if (order >= pageblock_order)
> +			return __free_unaccepted(page, order);
> +		else
> +			accept_page(page, order);
> +	}
> +

If MAX_ORDER-1 is the minimum unit of acceptance then this changes. It would
have to be checked whether this is a zone boundary and, if so, the pages
should be accepted immediately with a comment; otherwise a MAX_ORDER-1 page
gets added to the unaccepted list.
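
Roughly (untested, with the alignment condition only illustrative):

	if (page_contains_unaccepted(page, order)) {
		/*
		 * Only defer properly aligned MAX_ORDER-1 blocks; zone
		 * boundaries and holes get accepted here and now.
		 */
		if (order == MAX_ORDER - 1 &&
		    IS_ALIGNED(page_to_pfn(page), 1 << (MAX_ORDER - 1)))
			return __free_unaccepted(page, order);
		accept_page(page, order);
	}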

>  	/*
>  	 * Bypass PCP and place fresh pages right to the tail, primarily
>  	 * relevant for memory onlining.
> @@ -1910,6 +2016,9 @@ static void __init deferred_free_range(unsigned long pfn,
>  		return;
>  	}
>  
> +	/* Accept chunks smaller than page-block upfront */
> +	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
> +
>  	for (i = 0; i < nr_pages; i++, page++, pfn++) {
>  		if (pageblock_aligned(pfn))
>  			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> @@ -4247,6 +4356,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  				       gfp_mask)) {
>  			int ret;
>  
> +			if (try_to_accept_memory(zone))
> +				goto try_this_zone;
> +

This is a potential fragmentation hazard because, by the time this point is
reached, the watermarks are low. By then, pageblocks may already have become
mixed due to __rmqueue_fallback and the mixing of blocks may be persistent.
__rmqueue_fallback should fail if there is still unaccepted memory.

That said, what you have here is fine; it's just not enough to avoid
the hazard on its own.
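
Roughly, at the top of __rmqueue_fallback() (untested, reusing the
hypothetical zone_has_unaccepted_pages() helper from above):

	/*
	 * Refuse to steal pages from other migratetypes while the zone
	 * still has unaccepted memory. The allocation gets NULL back,
	 * more memory is accepted via try_to_accept_memory() and the
	 * attempt is retried.
	 */
	if (zone_has_unaccepted_pages(zone))
		return false;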

>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  			/*
>  			 * Watermark failed for this zone, but see if we can
> @@ -4299,6 +4411,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  
>  			return page;
>  		} else {
> +			if (try_to_accept_memory(zone))
> +				goto try_this_zone;
> +

This has a similar problem: fallbacks could have already occurred. By
forbidding fallbacks until all memory within a zone is accepted, this should
be less of an issue because NULL would be returned, more memory would be
accepted and the allocation retried. It would be a little race-prone, as
another request could allocate the newly accepted memory before the retry,
but it would eventually resolve itself.

Vlastimil was correct in terms of how to gauge how serious fragmentation
issues are early on. Tracking /proc/pagetypeinfo helps but is hard to use
because it has no information on mixed pageblocks. It's better to track the
mm_page_alloc_extfrag tracepoint and, at the very least, count the number of
times it fires when all memory is accepted up-front versus with deferred
acceptance. Any major difference in counts is potentially problematic.

At the very least, a kernel build using all CPUs early in boot would be a
start. Slightly better would be to run a number of separate kernel builds
that use a high percentage of memory. As those allocations are a good mix,
but short-lived, it would also help to run fio in the background creating
many 4K files to generate an overall mix of page table, slab and movable
allocations, some of which are short-lived due to kernbench and others that
are more persistent due to fio. I don't have something suitable automated in
mmtests at the moment as it's been a while since I investigated fragmentation
specifically and the old tests took too long to complete on larger systems.

>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  			/* Try again if zone has deferred pages */
>  			if (static_branch_unlikely(&deferred_pages)) {
> @@ -6935,6 +7050,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>  		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
>  		zone->free_area[order].nr_free = 0;
>  	}
> +
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +	INIT_LIST_HEAD(&zone->unaccepted_pages);
> +#endif
>  }
>  
>  /*
> -- 
> 2.38.0
>
diff mbox series

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5f74891556f3..da335381e63f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -822,6 +822,11 @@  struct zone {
 	/* free areas of different sizes */
 	struct free_area	free_area[MAX_ORDER];
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	/* pages to be accepted */
+	struct list_head	unaccepted_pages;
+#endif
+
 	/* zone flags, see below */
 	unsigned long		flags;
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0b0ae5084e60..ce953be8fe10 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -941,6 +941,7 @@  static inline bool is_page_hwpoison(struct page *page)
 #define PG_offline	0x00000100
 #define PG_table	0x00000200
 #define PG_guard	0x00000400
+#define PG_unaccepted	0x00000800
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -966,6 +967,18 @@  static __always_inline void __ClearPage##uname(struct page *page)	\
 	page->page_type |= PG_##lname;					\
 }
 
+#define PAGE_TYPE_OPS_FALSE(uname)					\
+static __always_inline int Page##uname(struct page *page)		\
+{									\
+	return false;							\
+}									\
+static __always_inline void __SetPage##uname(struct page *page)		\
+{									\
+}									\
+static __always_inline void __ClearPage##uname(struct page *page)	\
+{									\
+}
+
 /*
  * PageBuddy() indicates that the page is free and in the buddy system
  * (see mm/page_alloc.c).
@@ -996,6 +1009,17 @@  PAGE_TYPE_OPS(Buddy, buddy)
  */
 PAGE_TYPE_OPS(Offline, offline)
 
+/*
+ * PageUnaccepted() indicates that the page has to be "accepted" before it can
+ * be read or written. The page allocator must call accept_page() before
+ * touching the page or returning it to the caller.
+ */
+#ifdef CONFIG_UNACCEPTED_MEMORY
+PAGE_TYPE_OPS(Unaccepted, unaccepted)
+#else
+PAGE_TYPE_OPS_FALSE(Unaccepted)
+#endif
+
 extern void page_offline_freeze(void);
 extern void page_offline_thaw(void);
 extern void page_offline_begin(void);
diff --git a/mm/internal.h b/mm/internal.h
index 6b7ef495b56d..8ef4f88608ad 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -856,4 +856,16 @@  static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
 	return !(vma->vm_flags & VM_SOFTDIRTY);
 }
 
+#ifndef CONFIG_UNACCEPTED_MEMORY
+static inline bool range_contains_unaccepted_memory(phys_addr_t start,
+						    phys_addr_t end)
+{
+	return false;
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index 511d4783dcf1..3bc404a5352a 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1423,6 +1423,15 @@  phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 		 */
 		kmemleak_alloc_phys(found, size, 0);
 
+	/*
+	 * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
+	 * require memory to be accepted before it can be used by the
+	 * guest.
+	 *
+	 * Accept the memory of the allocated buffer.
+	 */
+	accept_memory(found, found + size);
+
 	return found;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e60657875d3..6d597e833a73 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -450,6 +450,11 @@  EXPORT_SYMBOL(nr_online_nodes);
 
 int page_group_by_mobility_disabled __read_mostly;
 
+#ifdef CONFIG_UNACCEPTED_MEMORY
+/* Counts number of zones with unaccepted pages. */
+static DEFINE_STATIC_KEY_FALSE(unaccepted_pages);
+#endif
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 /*
  * During boot we initialize deferred pages on-demand, as needed, but once
@@ -1043,12 +1048,15 @@  static inline void move_to_free_list(struct page *page, struct zone *zone,
 {
 	struct free_area *area = &zone->free_area[order];
 
+	VM_BUG_ON_PAGE(PageUnevictable(page), page);
 	list_move_tail(&page->buddy_list, &area->free_list[migratetype]);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 					   unsigned int order)
 {
+	VM_BUG_ON_PAGE(PageUnevictable(page), page);
+
 	/* clear reported state and update reported page count */
 	if (page_reported(page))
 		__ClearPageReported(page);
@@ -1728,6 +1736,97 @@  static void __free_pages_ok(struct page *page, unsigned int order,
 	__count_vm_events(PGFREE, 1 << order);
 }
 
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	phys_addr_t end = start + (PAGE_SIZE << order);
+
+	return range_contains_unaccepted_memory(start, end);
+}
+
+static void accept_page(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+
+	accept_memory(start, start + (PAGE_SIZE << order));
+}
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+
+static bool try_to_accept_memory(struct zone *zone)
+{
+	unsigned long flags, order;
+	struct page *page;
+	bool last = false;
+	int migratetype;
+
+	if (!static_branch_unlikely(&unaccepted_pages))
+		return false;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	page = list_first_entry_or_null(&zone->unaccepted_pages,
+					struct page, lru);
+	if (!page) {
+		spin_unlock_irqrestore(&zone->lock, flags);
+		return false;
+	}
+
+	list_del(&page->lru);
+	last = list_empty(&zone->unaccepted_pages);
+
+	order = page->private;
+	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
+
+	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
+	__mod_zone_freepage_state(zone, -1 << order, migratetype);
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	if (last)
+		static_branch_dec(&unaccepted_pages);
+
+	accept_page(page, order);
+	__ClearPageUnaccepted(page);
+	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
+
+	return true;
+}
+
+static void __free_unaccepted(struct page *page, unsigned int order)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long flags;
+	int migratetype;
+	bool first = false;
+
+	VM_BUG_ON(order > MAX_ORDER || order < pageblock_order);
+	__SetPageUnaccepted(page);
+	page->private = order;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	first = list_empty(&zone->unaccepted_pages);
+	migratetype = get_pfnblock_migratetype(page, page_to_pfn(page));
+	list_add_tail(&page->lru, &zone->unaccepted_pages);
+	__mod_zone_freepage_state(zone, 1 << order, migratetype);
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	if (first)
+		static_branch_inc(&unaccepted_pages);
+}
+
+#else
+
+static bool try_to_accept_memory(struct zone *zone)
+{
+	return false;
+}
+
+static void __free_unaccepted(struct page *page, unsigned int order)
+{
+	BUILD_BUG();
+}
+
+#endif /* CONFIG_UNACCEPTED_MEMORY */
+
 void __free_pages_core(struct page *page, unsigned int order)
 {
 	unsigned int nr_pages = 1 << order;
@@ -1750,6 +1849,13 @@  void __free_pages_core(struct page *page, unsigned int order)
 
 	atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
 
+	if (page_contains_unaccepted(page, order)) {
+		if (order >= pageblock_order)
+			return __free_unaccepted(page, order);
+		else
+			accept_page(page, order);
+	}
+
 	/*
 	 * Bypass PCP and place fresh pages right to the tail, primarily
 	 * relevant for memory onlining.
@@ -1910,6 +2016,9 @@  static void __init deferred_free_range(unsigned long pfn,
 		return;
 	}
 
+	/* Accept chunks smaller than page-block upfront */
+	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
+
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if (pageblock_aligned(pfn))
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
@@ -4247,6 +4356,9 @@  get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				       gfp_mask)) {
 			int ret;
 
+			if (try_to_accept_memory(zone))
+				goto try_this_zone;
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 			/*
 			 * Watermark failed for this zone, but see if we can
@@ -4299,6 +4411,9 @@  get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 
 			return page;
 		} else {
+			if (try_to_accept_memory(zone))
+				goto try_this_zone;
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 			/* Try again if zone has deferred pages */
 			if (static_branch_unlikely(&deferred_pages)) {
@@ -6935,6 +7050,10 @@  static void __meminit zone_init_free_lists(struct zone *zone)
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
 	}
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	INIT_LIST_HEAD(&zone->unaccepted_pages);
+#endif
 }
 
 /*