Message ID | 20220614120231.48165-3-kirill.shutemov@linux.intel.com (mailing list archive)
---|---
State | New
Series | mm, x86/cc: Implement support for unaccepted memory
On 6/14/2022 2:02 PM, Kirill A. Shutemov wrote: > UEFI Specification version 2.9 introduces the concept of memory > acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD > SEV-SNP, require memory to be accepted before it can be used by the > guest. Accepting happens via a protocol specific to the Virtual Machine > platform. > > There are several ways kernel can deal with unaccepted memory: > > 1. Accept all the memory during the boot. It is easy to implement and > it doesn't have runtime cost once the system is booted. The downside > is very long boot time. > > Accept can be parallelized to multiple CPUs to keep it manageable > (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate > memory bandwidth and does not scale beyond the point. > > 2. Accept a block of memory on the first use. It requires more > infrastructure and changes in page allocator to make it work, but > it provides good boot time. > > On-demand memory accept means latency spikes every time kernel steps > onto a new memory block. The spikes will go away once workload data > set size gets stabilized or all memory gets accepted. > > 3. Accept all memory in background. Introduce a thread (or multiple) > that gets memory accepted proactively. It will minimize time the > system experience latency spikes on memory allocation while keeping > low boot time. > > This approach cannot function on its own. It is an extension of #2: > background memory acceptance requires functional scheduler, but the > page allocator may need to tap into unaccepted memory before that. > > The downside of the approach is that these threads also steal CPU > cycles and memory bandwidth from the user's workload and may hurt > user experience. > > Implement #2 for now. It is a reasonable default. Some workloads may > want to use #1 or #3 and they can be implemented later based on user's > demands. > > Support of unaccepted memory requires a few changes in core-mm code: > > - memblock has to accept memory on allocation; > > - page allocator has to accept memory on the first allocation of the > page; > > Memblock change is trivial. > > The page allocator is modified to accept pages on the first allocation. > The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is > used to indicate that the page requires acceptance. > > Architecture has to provide two helpers if it wants to support > unaccepted memory: > > - accept_memory() makes a range of physical addresses accepted. > > - range_contains_unaccepted_memory() checks anything within the range > of physical addresses requires acceptance. > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock > Reviewed-by: David Hildenbrand <david@redhat.com> > --- > include/linux/page-flags.h | 31 +++++++++++++ > mm/internal.h | 12 +++++ > mm/memblock.c | 9 ++++ > mm/page_alloc.c | 89 +++++++++++++++++++++++++++++++++++++- > 4 files changed, 139 insertions(+), 2 deletions(-) > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index e66f7aa3191d..444ba8f4bfa0 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -942,6 +942,14 @@ static inline bool is_page_hwpoison(struct page *page) > #define PG_offline 0x00000100 > #define PG_table 0x00000200 > #define PG_guard 0x00000400 > +#define PG_unaccepted 0x00000800 > + > +/* > + * Page types allowed at page_expected_state() > + * > + * PageUnaccepted() will get cleared in post_alloc_hook(). 
> + */ > +#define PAGE_TYPES_EXPECTED PG_unaccepted > > #define PageType(page, flag) \ > ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) > @@ -967,6 +975,18 @@ static __always_inline void __ClearPage##uname(struct page *page) \ > page->page_type |= PG_##lname; \ > } > > +#define PAGE_TYPE_OPS_FALSE(uname) \ > +static __always_inline int Page##uname(struct page *page) \ > +{ \ > + return false; \ > +} \ > +static __always_inline void __SetPage##uname(struct page *page) \ > +{ \ > +} \ > +static __always_inline void __ClearPage##uname(struct page *page) \ > +{ \ > +} > + > /* > * PageBuddy() indicates that the page is free and in the buddy system > * (see mm/page_alloc.c). > @@ -997,6 +1017,17 @@ PAGE_TYPE_OPS(Buddy, buddy) > */ > PAGE_TYPE_OPS(Offline, offline) > > +/* > + * PageUnaccepted() indicates that the page has to be "accepted" before it can > + * be read or written. The page allocator must call accept_page() before > + * touching the page or returning it to the caller. > + */ > +#ifdef CONFIG_UNACCEPTED_MEMORY > +PAGE_TYPE_OPS(Unaccepted, unaccepted) > +#else > +PAGE_TYPE_OPS_FALSE(Unaccepted) > +#endif > + > extern void page_offline_freeze(void); > extern void page_offline_thaw(void); > extern void page_offline_begin(void); > diff --git a/mm/internal.h b/mm/internal.h > index c0f8fbe0445b..7c1a653edfc8 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -861,4 +861,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags); > > DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats); > > +#ifndef CONFIG_UNACCEPTED_MEMORY > +static inline bool range_contains_unaccepted_memory(phys_addr_t start, > + phys_addr_t end) > +{ > + return false; > +} > + > +static inline void accept_memory(phys_addr_t start, phys_addr_t end) > +{ > +} > +#endif > + > #endif /* __MM_INTERNAL_H */ > diff --git a/mm/memblock.c b/mm/memblock.c > index e4f03a6e8e56..a1f7f8b304d5 100644 > --- a/mm/memblock.c > +++ b/mm/memblock.c > @@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, > */ > kmemleak_alloc_phys(found, size, 0, 0); > > + /* > + * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, > + * require memory to be accepted before it can be used by the > + * guest. > + * > + * Accept the memory of the allocated buffer. > + */ > + accept_memory(found, found + size); > + > return found; > } > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index e008a3df0485..279c2746aaa8 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -122,6 +122,12 @@ typedef int __bitwise fpi_t; > */ > #define FPI_SKIP_KASAN_POISON ((__force fpi_t)BIT(2)) > > +/* > + * Check if the page needs to be marked as PageUnaccepted(). > + * Used for the new pages added to the buddy allocator for the first time. > + */ > +#define FPI_UNACCEPTED_SLOWPATH ((__force fpi_t)BIT(3)) > + > /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ > static DEFINE_MUTEX(pcp_batch_high_lock); > #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8) > @@ -993,6 +999,35 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, > NULL) != NULL; > } > > +/* > + * Page acceptance can be very slow. Do not call under critical locks. 
> + */ > +static void accept_page(struct page *page, unsigned int order) > +{ > + phys_addr_t start = page_to_phys(page); > + int i; > + > + if (!PageUnaccepted(page)) > + return; > + > + accept_memory(start, start + (PAGE_SIZE << order)); > + __ClearPageUnaccepted(page); > + > + /* Assert that there is no PageUnaccepted() on tail pages */ > + if (IS_ENABLED(CONFIG_DEBUG_VM)) { > + for (i = 1; i < (1 << order); i++) > + VM_BUG_ON_PAGE(PageUnaccepted(page + i), page + i); > + } > +} > + > +static bool page_contains_unaccepted(struct page *page, unsigned int order) > +{ > + phys_addr_t start = page_to_phys(page); > + phys_addr_t end = start + (PAGE_SIZE << order); > + > + return range_contains_unaccepted_memory(start, end); > +} > + > /* > * Freeing function for a buddy system allocator. > * > @@ -1027,6 +1062,7 @@ static inline void __free_one_page(struct page *page, > unsigned long combined_pfn; > struct page *buddy; > bool to_tail; > + bool page_needs_acceptance = false; > > VM_BUG_ON(!zone_is_initialized(zone)); > VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page); > @@ -1038,6 +1074,11 @@ static inline void __free_one_page(struct page *page, > VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); > VM_BUG_ON_PAGE(bad_range(zone, page), page); > > + if (PageUnaccepted(page)) { > + page_needs_acceptance = true; > + __ClearPageUnaccepted(page); > + } > + > while (order < MAX_ORDER - 1) { > if (compaction_capture(capc, page, order, migratetype)) { > __mod_zone_freepage_state(zone, -(1 << order), > @@ -1072,6 +1113,13 @@ static inline void __free_one_page(struct page *page, > clear_page_guard(zone, buddy, order, migratetype); > else > del_page_from_free_list(buddy, zone, order); > + > + /* Mark page unaccepted if any of merged pages were unaccepted */ > + if (PageUnaccepted(buddy)) { > + page_needs_acceptance = true; > + __ClearPageUnaccepted(buddy); > + } > + > combined_pfn = buddy_pfn & pfn; > page = page + (combined_pfn - pfn); > pfn = combined_pfn; > @@ -1081,6 +1129,23 @@ static inline void __free_one_page(struct page *page, > done_merging: > set_buddy_order(page, order); > > + /* > + * The page gets marked as PageUnaccepted() if any of merged-in pages > + * is PageUnaccepted(). > + * > + * New pages, just being added to buddy allocator, do not have > + * PageUnaccepted() set. FPI_UNACCEPTED_SLOWPATH indicates that the > + * page is new and page_contains_unaccepted() check is required to > + * determinate if acceptance is required. > + * > + * Avoid calling page_contains_unaccepted() if it is known that the page > + * needs acceptance. It can be costly. > + */ > + if (!page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED_SLOWPATH)) > + page_needs_acceptance = page_contains_unaccepted(page, order); > + if (page_needs_acceptance) > + __SetPageUnaccepted(page); > + > if (fpi_flags & FPI_TO_TAIL) > to_tail = true; > else if (is_shuffle_order(order)) > @@ -1164,7 +1229,13 @@ int split_free_page(struct page *free_page, > static inline bool page_expected_state(struct page *page, > unsigned long check_flags) > { > - if (unlikely(atomic_read(&page->_mapcount) != -1)) > + /* > + * The page must not be mapped to userspace and must not have > + * a PageType other than listed in PAGE_TYPES_EXPECTED. > + * > + * Note, bit cleared means the page type is set. 
> + */ > + if (unlikely((atomic_read(&page->_mapcount) | PAGE_TYPES_EXPECTED) != -1)) > return false; > > if (unlikely((unsigned long)page->mapping | > @@ -1669,7 +1740,9 @@ void __free_pages_core(struct page *page, unsigned int order) > * Bypass PCP and place fresh pages right to the tail, primarily > * relevant for memory onlining. > */ > - __free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON); > + __free_pages_ok(page, order, > + FPI_TO_TAIL | FPI_SKIP_KASAN_POISON | > + FPI_UNACCEPTED_SLOWPATH); > } > > #ifdef CONFIG_NUMA > @@ -1822,6 +1895,9 @@ static void __init deferred_free_range(unsigned long pfn, > return; > } > > + /* Accept chunks smaller than page-block upfront */ > + accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages)); > + > for (i = 0; i < nr_pages; i++, page++, pfn++) { > if ((pfn & (pageblock_nr_pages - 1)) == 0) > set_pageblock_migratetype(page, MIGRATE_MOVABLE); > @@ -2281,6 +2357,13 @@ static inline void expand(struct zone *zone, struct page *page, > if (set_page_guard(zone, &page[size], high, migratetype)) > continue; > > + /* > + * Transfer PageUnaccepted() to the newly split pages so > + * they can be accepted after dropping the zone lock. > + */ > + if (PageUnaccepted(page)) > + __SetPageUnaccepted(&page[size]); > + > add_to_free_list(&page[size], zone, high, migratetype); > set_buddy_order(&page[size], high); > } > @@ -2411,6 +2494,8 @@ inline void post_alloc_hook(struct page *page, unsigned int order, > */ > kernel_unpoison_pages(page, 1 << order); > > + accept_page(page, order); > + > /* > * As memory initialization might be integrated into KASAN, > * KASAN unpoisoning and memory initializion code must be Reviewed previous version(v6) of this patch. Seems no change in this version except '!PageUnaccepted' check addition in function 'accept_page' and rename of function 'page_contains_unaccepted'. Acked-by: Pankaj Gupta <pankaj.gupta@amd.com>
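For reference, the PAGE_TYPES_EXPECTED hunk above relies on the inverted page_type encoding, where a *cleared* bit means the type is *set*. A short worked example using the constants from the patch (arithmetic only, not new kernel code):

```c
/*
 * page_type shares storage with _mapcount and counts "down" from all
 * ones: setting a page type clears its bit, so a freshly initialized
 * page with no type and no mappings reads as -1 (0xffffffff).
 */
#define PG_buddy		0x00000080
#define PG_unaccepted		0x00000800
#define PAGE_TYPES_EXPECTED	PG_unaccepted

/* __SetPageUnaccepted():  0xffffffff & ~PG_unaccepted = 0xfffff7ff      */
/* page_expected_state():  0xfffff7ff | PAGE_TYPES_EXPECTED
 *                          = 0xffffffff = -1             -> passes      */
/* stray PG_buddy page:    0xffffff7f | PAGE_TYPES_EXPECTED
 *                          = 0xffffff7f != -1            -> fails       */
```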
On 6/14/22 07:02, Kirill A. Shutemov wrote: > UEFI Specification version 2.9 introduces the concept of memory > acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD > SEV-SNP, require memory to be accepted before it can be used by the > guest. Accepting happens via a protocol specific to the Virtual Machine > platform. > > There are several ways kernel can deal with unaccepted memory: > > 1. Accept all the memory during the boot. It is easy to implement and > it doesn't have runtime cost once the system is booted. The downside > is very long boot time. > > Accept can be parallelized to multiple CPUs to keep it manageable > (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate > memory bandwidth and does not scale beyond the point. > > 2. Accept a block of memory on the first use. It requires more > infrastructure and changes in page allocator to make it work, but > it provides good boot time. > > On-demand memory accept means latency spikes every time kernel steps > onto a new memory block. The spikes will go away once workload data > set size gets stabilized or all memory gets accepted. > > 3. Accept all memory in background. Introduce a thread (or multiple) > that gets memory accepted proactively. It will minimize time the > system experience latency spikes on memory allocation while keeping > low boot time. > > This approach cannot function on its own. It is an extension of #2: > background memory acceptance requires functional scheduler, but the > page allocator may need to tap into unaccepted memory before that. > > The downside of the approach is that these threads also steal CPU > cycles and memory bandwidth from the user's workload and may hurt > user experience. > > Implement #2 for now. It is a reasonable default. Some workloads may > want to use #1 or #3 and they can be implemented later based on user's > demands. > > Support of unaccepted memory requires a few changes in core-mm code: > > - memblock has to accept memory on allocation; > > - page allocator has to accept memory on the first allocation of the > page; > > Memblock change is trivial. > > The page allocator is modified to accept pages on the first allocation. > The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is > used to indicate that the page requires acceptance. > > Architecture has to provide two helpers if it wants to support > unaccepted memory: > > - accept_memory() makes a range of physical addresses accepted. > > - range_contains_unaccepted_memory() checks anything within the range > of physical addresses requires acceptance. > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock > Reviewed-by: David Hildenbrand <david@redhat.com> > --- > include/linux/page-flags.h | 31 +++++++++++++ > mm/internal.h | 12 +++++ > mm/memblock.c | 9 ++++ > mm/page_alloc.c | 89 +++++++++++++++++++++++++++++++++++++- > 4 files changed, 139 insertions(+), 2 deletions(-) > > diff --git a/mm/memblock.c b/mm/memblock.c > index e4f03a6e8e56..a1f7f8b304d5 100644 > --- a/mm/memblock.c > +++ b/mm/memblock.c > @@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, > */ > kmemleak_alloc_phys(found, size, 0, 0); > > + /* > + * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, > + * require memory to be accepted before it can be used by the > + * guest. > + * > + * Accept the memory of the allocated buffer. 
> + */ > + accept_memory(found, found + size); The SNP support will kmalloc a descriptor that can be used to supply the range to the hypervisor using the GHCB/VMGEXIT. But kmalloc won't work when called this early, so we are likely to need an early_accept_memory or some kind of flag to know whether this is an early call or not in order to use a static descriptor in the file. Thanks, Tom > + > return found; > } >
On 6/17/22 14:28, Tom Lendacky wrote: > On 6/14/22 07:02, Kirill A. Shutemov wrote: >> UEFI Specification version 2.9 introduces the concept of memory >> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD >> SEV-SNP, require memory to be accepted before it can be used by the >> guest. Accepting happens via a protocol specific to the Virtual Machine >> platform. >> >> There are several ways kernel can deal with unaccepted memory: >> >> 1. Accept all the memory during the boot. It is easy to implement and >> it doesn't have runtime cost once the system is booted. The downside >> is very long boot time. >> >> Accept can be parallelized to multiple CPUs to keep it manageable >> (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate >> memory bandwidth and does not scale beyond the point. >> >> 2. Accept a block of memory on the first use. It requires more >> infrastructure and changes in page allocator to make it work, but >> it provides good boot time. >> >> On-demand memory accept means latency spikes every time kernel steps >> onto a new memory block. The spikes will go away once workload data >> set size gets stabilized or all memory gets accepted. >> >> 3. Accept all memory in background. Introduce a thread (or multiple) >> that gets memory accepted proactively. It will minimize time the >> system experience latency spikes on memory allocation while keeping >> low boot time. >> >> This approach cannot function on its own. It is an extension of #2: >> background memory acceptance requires functional scheduler, but the >> page allocator may need to tap into unaccepted memory before that. >> >> The downside of the approach is that these threads also steal CPU >> cycles and memory bandwidth from the user's workload and may hurt >> user experience. >> >> Implement #2 for now. It is a reasonable default. Some workloads may >> want to use #1 or #3 and they can be implemented later based on user's >> demands. >> >> Support of unaccepted memory requires a few changes in core-mm code: >> >> - memblock has to accept memory on allocation; >> >> - page allocator has to accept memory on the first allocation of the >> page; >> >> Memblock change is trivial. >> >> The page allocator is modified to accept pages on the first allocation. >> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is >> used to indicate that the page requires acceptance. >> >> Architecture has to provide two helpers if it wants to support >> unaccepted memory: >> >> - accept_memory() makes a range of physical addresses accepted. >> >> - range_contains_unaccepted_memory() checks anything within the range >> of physical addresses requires acceptance. >> >> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >> Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock >> Reviewed-by: David Hildenbrand <david@redhat.com> >> --- >> include/linux/page-flags.h | 31 +++++++++++++ >> mm/internal.h | 12 +++++ >> mm/memblock.c | 9 ++++ >> mm/page_alloc.c | 89 +++++++++++++++++++++++++++++++++++++- >> 4 files changed, 139 insertions(+), 2 deletions(-) >> > >> diff --git a/mm/memblock.c b/mm/memblock.c >> index e4f03a6e8e56..a1f7f8b304d5 100644 >> --- a/mm/memblock.c >> +++ b/mm/memblock.c >> @@ -1405,6 +1405,15 @@ phys_addr_t __init >> memblock_alloc_range_nid(phys_addr_t size, >> */ >> kmemleak_alloc_phys(found, size, 0, 0); >> + /* >> + * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, >> + * require memory to be accepted before it can be used by the >> + * guest. 
>> + * >> + * Accept the memory of the allocated buffer. >> + */ >> + accept_memory(found, found + size); > > The SNP support will kmalloc a descriptor that can be used to supply the > range to the hypervisor using the GHCB/VMGEXIT. But kmalloc won't work > when called this early, so we are likely to need an early_accept_memory or > some kind of flag to know whether this is an early call or not in order to > use a static descriptor in the file. Then again, the accept_memory() call from mm/page_alloc.c can also be early... maybe I can check system_state and make a determination as to which SNP page state change mechanism to use (GHCB vs MSR protocol). It might not be performance optimal, though, as the MSR protocol is one 4K page at a time. Thanks, Tom > > Thanks, > Tom > >> + >> return found; >> }
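A minimal sketch of the dispatch Tom is considering, with sev_snp_accept_memory(), snp_accept_memory_msr() and snp_accept_memory_ghcb() as made-up stand-ins for the real SEV-SNP page-state-change paths:

```c
/*
 * Illustrative only: use the MSR-based protocol (usable before the
 * allocator is up, but limited to one 4K page per exit) early in boot,
 * and the batched GHCB-based protocol once the system is running.
 * All helper names here are assumptions for this sketch.
 */
void sev_snp_accept_memory(phys_addr_t start, phys_addr_t end)
{
	phys_addr_t pa;

	if (system_state == SYSTEM_BOOTING) {
		for (pa = start; pa < end; pa += PAGE_SIZE)
			snp_accept_memory_msr(pa);	/* slow, 4K at a time */
		return;
	}

	snp_accept_memory_ghcb(start, end);		/* batched via GHCB */
}
```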
On Tue, Jun 14, 2022 at 03:02:19PM +0300, Kirill A. Shutemov wrote: > On-demand memory accept means latency spikes every time kernel steps > onto a new memory block. The spikes will go away once workload data > set size gets stabilized or all memory gets accepted. What does that mean? If we're accepting 2M pages and considering referential locality, how are those "spikes" even noticeable?
On 7/21/22 08:14, Borislav Petkov wrote: > On Tue, Jun 14, 2022 at 03:02:19PM +0300, Kirill A. Shutemov wrote: >> On-demand memory accept means latency spikes every time kernel steps >> onto a new memory block. The spikes will go away once workload data >> set size gets stabilized or all memory gets accepted. > What does that mean? > > If we're accepting 2M pages and considering referential locality, how > are those "spikes" even noticeable? Acceptance is slow and the heavy lifting is done inside the TDX module. It involves flushing old aliases out of the caches and initializing the memory integrity metadata for every cacheline. This implementation does acceptance in 2MB chunks while holding a global lock. So, those (effective) 2MB clflush+memset's (plus a few thousand cycles for the hypercall/tdcall transitions) can't happen in parallel. They are serialized and must wait on each other. If you have a few hundred CPUs all trying to allocate memory (say, doing the first kernel compile after a reboot), this is going to be very, very painful for a while. That said, I think this is the right place to _start_. There is going to need to be some kind of follow-on solution (likely background acceptance of some kind). But, even with that solution, *this* code is still needed to handle the degenerate case where the background accepter can't keep up with foreground memory needs.
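A rough model of the serialization Dave describes; the bitmap, lock, and arch_accept_memory() hook are placeholders rather than code from this series:

```c
/*
 * Every 2MB unit is accepted under one global lock, so CPUs that fault
 * into fresh memory at the same time queue up behind each other while
 * the TDX module flushes and re-initializes the whole 2MB chunk.
 */
static DEFINE_SPINLOCK(unaccepted_memory_lock);
static unsigned long *unaccepted_bitmap;	/* one bit per 2MB unit */

void accept_memory(phys_addr_t start, phys_addr_t end)
{
	phys_addr_t pa;

	for (pa = ALIGN_DOWN(start, PMD_SIZE); pa < end; pa += PMD_SIZE) {
		unsigned long bit = pa >> PMD_SHIFT;

		spin_lock(&unaccepted_memory_lock);
		if (test_bit(bit, unaccepted_bitmap)) {
			arch_accept_memory(pa, pa + PMD_SIZE);	/* slow part */
			clear_bit(bit, unaccepted_bitmap);
		}
		spin_unlock(&unaccepted_memory_lock);
	}
}
```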
On Thu, Jul 21, 2022 at 08:49:31AM -0700, Dave Hansen wrote: > Acceptance is slow and the heavy lifting is done inside the TDX module. > It involves flushing old aliases out of the caches and initializing the > memory integrity metadata for every cacheline. This implementation does > acceptance in 2MB chunks while holding a global lock. Oh, fun. > So, those (effective) 2MB clflush+memset's (plus a few thousand cycles > for the hypercall/tdcall transitions) So this sounds strange - page validation on AMD - judging by the pseudocode of the PVALIDATE insn - does a bunch of sanity checks on the gVA of the page and then installs it into the RMP and also "PVALIDATE performs the same segmentation and paging checks as a 1-byte read. PVALIDATE does not invalidate TLB caches." But that still sounds a lot less work than what the TDX module needs to do... > can't happen in parallel. They are serialized and must wait on each > other. Ofc, the Intel version of the RMP table needs to be protected. :-) > If you have a few hundred CPUs all trying to allocate memory (say, > doing the first kernel compile after a reboot), this is going to be > very, very painful for a while. > > That said, I think this is the right place to _start_. There is going > to need to be some kind of follow-on solution (likely background > acceptance of some kind). But, even with that solution, *this* code > is still needed to handle the degenerate case where the background > accepter can't keep up with foreground memory needs. I'm still catering to the view that it should be a two-tier thing: you validate during boot a certain amount - say 4G - a size for which the boot delay is acceptable and you do the rest on-demand along with a background accepter. That should give you the best of both worlds... Thx.
On 7/22/22 12:18, Borislav Petkov wrote: > On Thu, Jul 21, 2022 at 08:49:31AM -0700, Dave Hansen wrote: >> So, those (effective) 2MB clflush+memset's (plus a few thousand cycles >> for the hypercall/tdcall transitions) > > So this sounds strange - page validation on AMD - judging by the > pseudocode of the PVALIDATE insn - does a bunch of sanity checks on the > gVA of the page and then installs it into the RMP and also "PVALIDATE > performs the same segmentation and paging checks as a 1-byte read. > PVALIDATE does not invalidate TLB caches." > > But that still sounds a lot less work than what the TDX module needs to > do... Sure does... *Something* has to manage the cache coherency so that old physical aliases of the converted memory don't write back and clobber new data. But, maybe the hardware is doing that now. >> If you have a few hundred CPUs all trying to allocate memory (say, >> doing the first kernel compile after a reboot), this is going to be >> very, very painful for a while. >> >> That said, I think this is the right place to _start_. There is going >> to need to be some kind of follow-on solution (likely background >> acceptance of some kind). But, even with that solution, *this* code >> is still needed to handle the degenerate case where the background >> accepter can't keep up with foreground memory needs. > > I'm still catering to the view that it should be a two-tier thing: you > validate during boot a certain amount - say 4G - a size for which the > boot delay is acceptable and you do the rest on-demand along with a > background accepter. > > That should give you the best of both worlds... Yeah, that two-tier system is the way it's happening today from what I understand. This whole conversation is about how to handle the >4GB memory.
On Fri, Jul 22, 2022 at 12:30:36PM -0700, Dave Hansen wrote: > Sure does... *Something* has to manage the cache coherency so that old > physical aliases of the converted memory don't write back and clobber > new data. But, maybe the hardware is doing that now. Let's hope. > Yeah, that two-tier system is the way it's happening today from what > I understand. This whole conversation is about how to handle the >4GB > memory. Would it be possible to pre-accept a bunch of mem - think "pre-fault" - from userspace? I.e., I'm thinking some huge process is going to start in the VM, VM userspace goes and causes a chunk of memory to be pre-accepted and then the process starts and runs more-or-less smoothly as the majority of its memory has already been "prepared". Or does that not make any sense from mm perspective?
On 25.07.22 14:23, Borislav Petkov wrote: > On Fri, Jul 22, 2022 at 12:30:36PM -0700, Dave Hansen wrote: >> Sure does... *Something* has to manage the cache coherency so that old >> physical aliases of the converted memory don't write back and clobber >> new data. But, maybe the hardware is doing that now. > > Let's hope. > >> Yeah, that two-tier system is the way it's happening today from what >> I understand. This whole conversation is about how to handle the >4GB >> memory. > > Would it be possible to pre-accept a bunch of mem - think "pre-fault" - > from userspace? > > I.e., I'm thinking some huge process is going to start in the VM, VM > userspace goes and causes a chunk of memory to be pre-accepted and then > the process starts and runs more-or-less smoothly as the majority of its > memory has already been "prepared". > > Or does that not make any sense from mm perspective? > The less core-MM code to handle unaccepted memory the better. Meaning, that any kind of additional pre-acceptance (in addition to what we have here) needs good justification.
On Mon, Jul 25, 2022 at 02:38:57PM +0200, David Hildenbrand wrote: > The less core-MM code to handle unaccepted memory the better. Meaning, > that any kind of additional pre-acceptance (in addition to what we have > here) needs good justification. I actually mean to have a user interface to have the core code pre-accept. I.e., interface to what will be already there anyway. Or is that frowned upon too?
On Mon, Jul 25, 2022 at 02:23:20PM +0200, Borislav Petkov wrote: > On Fri, Jul 22, 2022 at 12:30:36PM -0700, Dave Hansen wrote: > > Sure does... *Something* has to manage the cache coherency so that old > > physical aliases of the converted memory don't write back and clobber > > new data. But, maybe the hardware is doing that now. > > Let's hope. > > > Yeah, that two-tier system is the way it's happening today from what > > I understand. This whole conversation is about how to handle the >4GB > > memory. > > Would it be possible to pre-accept a bunch of mem - think "pre-fault" - > from userspace? > > I.e., I'm thinking some huge process is going to start in the VM, VM > userspace goes and causes a chunk of memory to be pre-accepted and then > the process starts and runs more-or-less smoothly as the majority of its > memory has already been "prepared". An application in the VM can do mlock() or mmap(..., MAP_POPULATE, ...) and this will essentially force acceptance of that memory. But there's no sysctl or something for that. > Or does that not make any sense from mm perspective? > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette
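What that pre-faulting can look like from inside the guest, as plain userspace code with no new interfaces assumed:

```c
/* Pre-fault (and thereby pre-accept) 1 GiB before the real workload runs. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* mlock(p, len) would have a similar effect and also pin the pages. */
	printf("pre-faulted %zu MiB\n", len >> 20);
	return 0;
}
```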
On Mon, Jul 25, 2022 at 04:00:14PM +0300, Mike Rapoport wrote: > An application in the VM can do mlock() or mmap(..., MAP_POPULATE, ...) and > this will essentially force acceptance of that memory. Ah, cool, that's what I meant. > But there's no sysctl or something for that. Yeah, no need. I was simply wondering whether one can relocate the acceptance work to the moment prior to starting the process so that it can run smoothly once started and doesn't cause spikes due to on-demand acceptance. At least not too many. Thx.
On 25.07.22 14:53, Borislav Petkov wrote: > On Mon, Jul 25, 2022 at 02:38:57PM +0200, David Hildenbrand wrote: >> The less core-MM code to handle unaccepted memory the better. Meaning, >> that any kind of additional pre-acceptance (in addition to what we have >> here) needs good justification. > > I actually mean to have a user interface to have the core code > pre-accept. I.e., interface to what will be already there anyway. > Ah, got it. The "issue" with ordinary prefaulting in user space is that you'll also end up zeroing the memory. But yeah, it should just accept a bunch of memory. > Or is that frowned upon too? I was assuming you'd want some additional interface that only accept memory without actually temporarily allocating it.
On 6/14/22 14:02, Kirill A. Shutemov wrote: > UEFI Specification version 2.9 introduces the concept of memory > acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD > SEV-SNP, require memory to be accepted before it can be used by the > guest. Accepting happens via a protocol specific to the Virtual Machine > platform. > > There are several ways kernel can deal with unaccepted memory: > > 1. Accept all the memory during the boot. It is easy to implement and > it doesn't have runtime cost once the system is booted. The downside > is very long boot time. > > Accept can be parallelized to multiple CPUs to keep it manageable > (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate > memory bandwidth and does not scale beyond the point. > > 2. Accept a block of memory on the first use. It requires more > infrastructure and changes in page allocator to make it work, but > it provides good boot time. > > On-demand memory accept means latency spikes every time kernel steps > onto a new memory block. The spikes will go away once workload data > set size gets stabilized or all memory gets accepted. > > 3. Accept all memory in background. Introduce a thread (or multiple) > that gets memory accepted proactively. It will minimize time the > system experience latency spikes on memory allocation while keeping > low boot time. > > This approach cannot function on its own. It is an extension of #2: > background memory acceptance requires functional scheduler, but the > page allocator may need to tap into unaccepted memory before that. > > The downside of the approach is that these threads also steal CPU > cycles and memory bandwidth from the user's workload and may hurt > user experience. > > Implement #2 for now. It is a reasonable default. Some workloads may > want to use #1 or #3 and they can be implemented later based on user's > demands. > > Support of unaccepted memory requires a few changes in core-mm code: > > - memblock has to accept memory on allocation; > > - page allocator has to accept memory on the first allocation of the > page; > > Memblock change is trivial. > > The page allocator is modified to accept pages on the first allocation. > The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is > used to indicate that the page requires acceptance. > > Architecture has to provide two helpers if it wants to support > unaccepted memory: > > - accept_memory() makes a range of physical addresses accepted. > > - range_contains_unaccepted_memory() checks anything within the range > of physical addresses requires acceptance. > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock > Reviewed-by: David Hildenbrand <david@redhat.com> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed before, but it's really not great how this affects the core page allocator paths. Wouldn't it be possible to only release pages to page allocator when accepted, and otherwise use some new per-zone variables together with the bitmap to track how much exactly is where to accept? Then it could be hooked in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT - if we fail zone_watermark_fast() and there are unaccepted pages in the zone, accept them and continue. With a static key to flip in case we eventually accept everything. Because this is really similar scenario to the deferred init and that one was solved in a way that adds minimal overhead. 
> --- > include/linux/page-flags.h | 31 +++++++++++++ > mm/internal.h | 12 +++++ > mm/memblock.c | 9 ++++ > mm/page_alloc.c | 89 +++++++++++++++++++++++++++++++++++++- > 4 files changed, 139 insertions(+), 2 deletions(-) > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index e66f7aa3191d..444ba8f4bfa0 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -942,6 +942,14 @@ static inline bool is_page_hwpoison(struct page *page) > #define PG_offline 0x00000100 > #define PG_table 0x00000200 > #define PG_guard 0x00000400 > +#define PG_unaccepted 0x00000800 > + > +/* > + * Page types allowed at page_expected_state() > + * > + * PageUnaccepted() will get cleared in post_alloc_hook(). > + */ > +#define PAGE_TYPES_EXPECTED PG_unaccepted > > #define PageType(page, flag) \ > ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) > @@ -967,6 +975,18 @@ static __always_inline void __ClearPage##uname(struct page *page) \ > page->page_type |= PG_##lname; \ > } > > +#define PAGE_TYPE_OPS_FALSE(uname) \ > +static __always_inline int Page##uname(struct page *page) \ > +{ \ > + return false; \ > +} \ > +static __always_inline void __SetPage##uname(struct page *page) \ > +{ \ > +} \ > +static __always_inline void __ClearPage##uname(struct page *page) \ > +{ \ > +} > + > /* > * PageBuddy() indicates that the page is free and in the buddy system > * (see mm/page_alloc.c). > @@ -997,6 +1017,17 @@ PAGE_TYPE_OPS(Buddy, buddy) > */ > PAGE_TYPE_OPS(Offline, offline) > > +/* > + * PageUnaccepted() indicates that the page has to be "accepted" before it can > + * be read or written. The page allocator must call accept_page() before > + * touching the page or returning it to the caller. > + */ > +#ifdef CONFIG_UNACCEPTED_MEMORY > +PAGE_TYPE_OPS(Unaccepted, unaccepted) > +#else > +PAGE_TYPE_OPS_FALSE(Unaccepted) > +#endif > + > extern void page_offline_freeze(void); > extern void page_offline_thaw(void); > extern void page_offline_begin(void); > diff --git a/mm/internal.h b/mm/internal.h > index c0f8fbe0445b..7c1a653edfc8 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -861,4 +861,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags); > > DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats); > > +#ifndef CONFIG_UNACCEPTED_MEMORY > +static inline bool range_contains_unaccepted_memory(phys_addr_t start, > + phys_addr_t end) > +{ > + return false; > +} > + > +static inline void accept_memory(phys_addr_t start, phys_addr_t end) > +{ > +} > +#endif > + > #endif /* __MM_INTERNAL_H */ > diff --git a/mm/memblock.c b/mm/memblock.c > index e4f03a6e8e56..a1f7f8b304d5 100644 > --- a/mm/memblock.c > +++ b/mm/memblock.c > @@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, > */ > kmemleak_alloc_phys(found, size, 0, 0); > > + /* > + * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, > + * require memory to be accepted before it can be used by the > + * guest. > + * > + * Accept the memory of the allocated buffer. > + */ > + accept_memory(found, found + size); > + > return found; > } > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index e008a3df0485..279c2746aaa8 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -122,6 +122,12 @@ typedef int __bitwise fpi_t; > */ > #define FPI_SKIP_KASAN_POISON ((__force fpi_t)BIT(2)) > > +/* > + * Check if the page needs to be marked as PageUnaccepted(). > + * Used for the new pages added to the buddy allocator for the first time. 
> + */ > +#define FPI_UNACCEPTED_SLOWPATH ((__force fpi_t)BIT(3)) > + > /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ > static DEFINE_MUTEX(pcp_batch_high_lock); > #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8) > @@ -993,6 +999,35 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, > NULL) != NULL; > } > > +/* > + * Page acceptance can be very slow. Do not call under critical locks. > + */ > +static void accept_page(struct page *page, unsigned int order) > +{ > + phys_addr_t start = page_to_phys(page); > + int i; > + > + if (!PageUnaccepted(page)) > + return; > + > + accept_memory(start, start + (PAGE_SIZE << order)); > + __ClearPageUnaccepted(page); > + > + /* Assert that there is no PageUnaccepted() on tail pages */ > + if (IS_ENABLED(CONFIG_DEBUG_VM)) { > + for (i = 1; i < (1 << order); i++) > + VM_BUG_ON_PAGE(PageUnaccepted(page + i), page + i); > + } > +} > + > +static bool page_contains_unaccepted(struct page *page, unsigned int order) > +{ > + phys_addr_t start = page_to_phys(page); > + phys_addr_t end = start + (PAGE_SIZE << order); > + > + return range_contains_unaccepted_memory(start, end); > +} > + > /* > * Freeing function for a buddy system allocator. > * > @@ -1027,6 +1062,7 @@ static inline void __free_one_page(struct page *page, > unsigned long combined_pfn; > struct page *buddy; > bool to_tail; > + bool page_needs_acceptance = false; > > VM_BUG_ON(!zone_is_initialized(zone)); > VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page); > @@ -1038,6 +1074,11 @@ static inline void __free_one_page(struct page *page, > VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page); > VM_BUG_ON_PAGE(bad_range(zone, page), page); > > + if (PageUnaccepted(page)) { > + page_needs_acceptance = true; > + __ClearPageUnaccepted(page); > + } > + > while (order < MAX_ORDER - 1) { > if (compaction_capture(capc, page, order, migratetype)) { > __mod_zone_freepage_state(zone, -(1 << order), > @@ -1072,6 +1113,13 @@ static inline void __free_one_page(struct page *page, > clear_page_guard(zone, buddy, order, migratetype); > else > del_page_from_free_list(buddy, zone, order); > + > + /* Mark page unaccepted if any of merged pages were unaccepted */ > + if (PageUnaccepted(buddy)) { > + page_needs_acceptance = true; > + __ClearPageUnaccepted(buddy); > + } > + > combined_pfn = buddy_pfn & pfn; > page = page + (combined_pfn - pfn); > pfn = combined_pfn; > @@ -1081,6 +1129,23 @@ static inline void __free_one_page(struct page *page, > done_merging: > set_buddy_order(page, order); > > + /* > + * The page gets marked as PageUnaccepted() if any of merged-in pages > + * is PageUnaccepted(). > + * > + * New pages, just being added to buddy allocator, do not have > + * PageUnaccepted() set. FPI_UNACCEPTED_SLOWPATH indicates that the > + * page is new and page_contains_unaccepted() check is required to > + * determinate if acceptance is required. > + * > + * Avoid calling page_contains_unaccepted() if it is known that the page > + * needs acceptance. It can be costly. 
> + */ > + if (!page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED_SLOWPATH)) > + page_needs_acceptance = page_contains_unaccepted(page, order); > + if (page_needs_acceptance) > + __SetPageUnaccepted(page); > + > if (fpi_flags & FPI_TO_TAIL) > to_tail = true; > else if (is_shuffle_order(order)) > @@ -1164,7 +1229,13 @@ int split_free_page(struct page *free_page, > static inline bool page_expected_state(struct page *page, > unsigned long check_flags) > { > - if (unlikely(atomic_read(&page->_mapcount) != -1)) > + /* > + * The page must not be mapped to userspace and must not have > + * a PageType other than listed in PAGE_TYPES_EXPECTED. > + * > + * Note, bit cleared means the page type is set. > + */ > + if (unlikely((atomic_read(&page->_mapcount) | PAGE_TYPES_EXPECTED) != -1)) > return false; > > if (unlikely((unsigned long)page->mapping | > @@ -1669,7 +1740,9 @@ void __free_pages_core(struct page *page, unsigned int order) > * Bypass PCP and place fresh pages right to the tail, primarily > * relevant for memory onlining. > */ > - __free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON); > + __free_pages_ok(page, order, > + FPI_TO_TAIL | FPI_SKIP_KASAN_POISON | > + FPI_UNACCEPTED_SLOWPATH); > } > > #ifdef CONFIG_NUMA > @@ -1822,6 +1895,9 @@ static void __init deferred_free_range(unsigned long pfn, > return; > } > > + /* Accept chunks smaller than page-block upfront */ > + accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages)); > + > for (i = 0; i < nr_pages; i++, page++, pfn++) { > if ((pfn & (pageblock_nr_pages - 1)) == 0) > set_pageblock_migratetype(page, MIGRATE_MOVABLE); > @@ -2281,6 +2357,13 @@ static inline void expand(struct zone *zone, struct page *page, > if (set_page_guard(zone, &page[size], high, migratetype)) > continue; > > + /* > + * Transfer PageUnaccepted() to the newly split pages so > + * they can be accepted after dropping the zone lock. > + */ > + if (PageUnaccepted(page)) > + __SetPageUnaccepted(&page[size]); > + > add_to_free_list(&page[size], zone, high, migratetype); > set_buddy_order(&page[size], high); > } > @@ -2411,6 +2494,8 @@ inline void post_alloc_hook(struct page *page, unsigned int order, > */ > kernel_unpoison_pages(page, 1 << order); > > + accept_page(page, order); > + > /* > * As memory initialization might be integrated into KASAN, > * KASAN unpoisoning and memory initializion code must be
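A rough sketch of the alternative Vlastimil outlines above (accept at the failed watermark check instead of tagging pages); zone->unaccepted_pages and cc_accept_pageblock() are illustrative names only:

```c
/*
 * Would be called from get_page_from_freelist() when
 * zone_watermark_fast() fails, mirroring how deferred struct page init
 * grows a zone on demand: accept one pageblock, then retry the zone.
 */
static bool try_to_accept_memory(struct zone *zone)
{
	if (!atomic_long_read(&zone->unaccepted_pages))
		return false;

	/*
	 * Accept a pageblock-sized chunk outside the zone lock and free
	 * it into the buddy allocator; the caller re-checks watermarks.
	 */
	cc_accept_pageblock(zone);
	atomic_long_sub(1 << pageblock_order, &zone->unaccepted_pages);
	return true;
}
```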
On 05.08.22 13:49, Vlastimil Babka wrote: > On 6/14/22 14:02, Kirill A. Shutemov wrote: >> UEFI Specification version 2.9 introduces the concept of memory >> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD >> SEV-SNP, require memory to be accepted before it can be used by the >> guest. Accepting happens via a protocol specific to the Virtual Machine >> platform. >> >> There are several ways kernel can deal with unaccepted memory: >> >> 1. Accept all the memory during the boot. It is easy to implement and >> it doesn't have runtime cost once the system is booted. The downside >> is very long boot time. >> >> Accept can be parallelized to multiple CPUs to keep it manageable >> (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate >> memory bandwidth and does not scale beyond the point. >> >> 2. Accept a block of memory on the first use. It requires more >> infrastructure and changes in page allocator to make it work, but >> it provides good boot time. >> >> On-demand memory accept means latency spikes every time kernel steps >> onto a new memory block. The spikes will go away once workload data >> set size gets stabilized or all memory gets accepted. >> >> 3. Accept all memory in background. Introduce a thread (or multiple) >> that gets memory accepted proactively. It will minimize time the >> system experience latency spikes on memory allocation while keeping >> low boot time. >> >> This approach cannot function on its own. It is an extension of #2: >> background memory acceptance requires functional scheduler, but the >> page allocator may need to tap into unaccepted memory before that. >> >> The downside of the approach is that these threads also steal CPU >> cycles and memory bandwidth from the user's workload and may hurt >> user experience. >> >> Implement #2 for now. It is a reasonable default. Some workloads may >> want to use #1 or #3 and they can be implemented later based on user's >> demands. >> >> Support of unaccepted memory requires a few changes in core-mm code: >> >> - memblock has to accept memory on allocation; >> >> - page allocator has to accept memory on the first allocation of the >> page; >> >> Memblock change is trivial. >> >> The page allocator is modified to accept pages on the first allocation. >> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is >> used to indicate that the page requires acceptance. >> >> Architecture has to provide two helpers if it wants to support >> unaccepted memory: >> >> - accept_memory() makes a range of physical addresses accepted. >> >> - range_contains_unaccepted_memory() checks anything within the range >> of physical addresses requires acceptance. >> >> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >> Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock >> Reviewed-by: David Hildenbrand <david@redhat.com> > > Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed > before, but it's really not great how this affects the core page allocator > paths. Wouldn't it be possible to only release pages to page allocator when > accepted, and otherwise use some new per-zone variables together with the > bitmap to track how much exactly is where to accept? Then it could be hooked > in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT - > if we fail zone_watermark_fast() and there are unaccepted pages in the zone, > accept them and continue. With a static key to flip in case we eventually > accept everything. 
Because this is really similar scenario to the deferred > init and that one was solved in a way that adds minimal overhead. I kind of like just having the memory stats being correct (e.g., free memory) and acceptance being an internal detail to be triggered when allocating pages -- just like the arch_alloc_page() callback. I'm sure we could optimize for the !unaccepted memory via static keys also in this version with some checks at the right places if we find this to hurt performance?
On 8/5/22 14:09, David Hildenbrand wrote: > On 05.08.22 13:49, Vlastimil Babka wrote: >> On 6/14/22 14:02, Kirill A. Shutemov wrote: >>> UEFI Specification version 2.9 introduces the concept of memory >>> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD >>> SEV-SNP, require memory to be accepted before it can be used by the >>> guest. Accepting happens via a protocol specific to the Virtual Machine >>> platform. >>> >>> There are several ways kernel can deal with unaccepted memory: >>> >>> 1. Accept all the memory during the boot. It is easy to implement and >>> it doesn't have runtime cost once the system is booted. The downside >>> is very long boot time. >>> >>> Accept can be parallelized to multiple CPUs to keep it manageable >>> (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate >>> memory bandwidth and does not scale beyond the point. >>> >>> 2. Accept a block of memory on the first use. It requires more >>> infrastructure and changes in page allocator to make it work, but >>> it provides good boot time. >>> >>> On-demand memory accept means latency spikes every time kernel steps >>> onto a new memory block. The spikes will go away once workload data >>> set size gets stabilized or all memory gets accepted. >>> >>> 3. Accept all memory in background. Introduce a thread (or multiple) >>> that gets memory accepted proactively. It will minimize time the >>> system experience latency spikes on memory allocation while keeping >>> low boot time. >>> >>> This approach cannot function on its own. It is an extension of #2: >>> background memory acceptance requires functional scheduler, but the >>> page allocator may need to tap into unaccepted memory before that. >>> >>> The downside of the approach is that these threads also steal CPU >>> cycles and memory bandwidth from the user's workload and may hurt >>> user experience. >>> >>> Implement #2 for now. It is a reasonable default. Some workloads may >>> want to use #1 or #3 and they can be implemented later based on user's >>> demands. >>> >>> Support of unaccepted memory requires a few changes in core-mm code: >>> >>> - memblock has to accept memory on allocation; >>> >>> - page allocator has to accept memory on the first allocation of the >>> page; >>> >>> Memblock change is trivial. >>> >>> The page allocator is modified to accept pages on the first allocation. >>> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is >>> used to indicate that the page requires acceptance. >>> >>> Architecture has to provide two helpers if it wants to support >>> unaccepted memory: >>> >>> - accept_memory() makes a range of physical addresses accepted. >>> >>> - range_contains_unaccepted_memory() checks anything within the range >>> of physical addresses requires acceptance. >>> >>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >>> Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock >>> Reviewed-by: David Hildenbrand <david@redhat.com> >> >> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed >> before, but it's really not great how this affects the core page allocator >> paths. Wouldn't it be possible to only release pages to page allocator when >> accepted, and otherwise use some new per-zone variables together with the >> bitmap to track how much exactly is where to accept? 
Then it could be hooked >> in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT - >> if we fail zone_watermark_fast() and there are unaccepted pages in the zone, >> accept them and continue. With a static key to flip in case we eventually >> accept everything. Because this is really similar scenario to the deferred >> init and that one was solved in a way that adds minimal overhead. > > I kind of like just having the memory stats being correct (e.g., free > memory) and acceptance being an internal detail to be triggered when > allocating pages -- just like the arch_alloc_page() callback. Hm, good point about the stats. Could be tweaked perhaps so it appears correct on the outside, but might be tricky. > I'm sure we could optimize for the !unaccepted memory via static keys > also in this version with some checks at the right places if we find > this to hurt performance? It would be great if we would at least somehow hit the necessary code only when dealing with a >=pageblock size block. The bitmap approach and accepting everything smaller uprofront actually seems rather compatible. Yet in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size while merging. A list that sits besides the existing free_area, contains only >=pageblock order sizes of unaccepted pages (no migratetype distinguished) and we tap into it approximately before __rmqueue_fallback()? There would be some trickery around releasing zone-lock for doing accept_memory(), but should be manageable.
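A sketch of that side-list idea; zone->unaccepted_list, the lock-drop handling, and the call site are assumptions for illustration:

```c
/*
 * Pageblock-order chunks parked on a per-zone list that is separate
 * from free_area[], consumed roughly where __rmqueue_fallback() would
 * otherwise kick in. Called with zone->lock held (irq-save details
 * omitted for brevity).
 */
static bool rmqueue_unaccepted(struct zone *zone)
{
	struct page *page;

	page = list_first_entry_or_null(&zone->unaccepted_list,
					struct page, lru);
	if (!page)
		return false;

	list_del(&page->lru);

	/* Acceptance is slow: drop the zone lock around it. */
	spin_unlock(&zone->lock);
	accept_page(page, pageblock_order);
	spin_lock(&zone->lock);

	/* Now it is ordinary memory; merge it into free_area[]. */
	__free_one_page(page, page_to_pfn(page), zone, pageblock_order,
			get_pageblock_migratetype(page), FPI_TO_TAIL);
	return true;
}
```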
On 05.08.22 15:38, Vlastimil Babka wrote: > On 8/5/22 14:09, David Hildenbrand wrote: >> On 05.08.22 13:49, Vlastimil Babka wrote: >>> On 6/14/22 14:02, Kirill A. Shutemov wrote: >>>> UEFI Specification version 2.9 introduces the concept of memory >>>> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD >>>> SEV-SNP, require memory to be accepted before it can be used by the >>>> guest. Accepting happens via a protocol specific to the Virtual Machine >>>> platform. >>>> >>>> There are several ways kernel can deal with unaccepted memory: >>>> >>>> 1. Accept all the memory during the boot. It is easy to implement and >>>> it doesn't have runtime cost once the system is booted. The downside >>>> is very long boot time. >>>> >>>> Accept can be parallelized to multiple CPUs to keep it manageable >>>> (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate >>>> memory bandwidth and does not scale beyond the point. >>>> >>>> 2. Accept a block of memory on the first use. It requires more >>>> infrastructure and changes in page allocator to make it work, but >>>> it provides good boot time. >>>> >>>> On-demand memory accept means latency spikes every time kernel steps >>>> onto a new memory block. The spikes will go away once workload data >>>> set size gets stabilized or all memory gets accepted. >>>> >>>> 3. Accept all memory in background. Introduce a thread (or multiple) >>>> that gets memory accepted proactively. It will minimize time the >>>> system experience latency spikes on memory allocation while keeping >>>> low boot time. >>>> >>>> This approach cannot function on its own. It is an extension of #2: >>>> background memory acceptance requires functional scheduler, but the >>>> page allocator may need to tap into unaccepted memory before that. >>>> >>>> The downside of the approach is that these threads also steal CPU >>>> cycles and memory bandwidth from the user's workload and may hurt >>>> user experience. >>>> >>>> Implement #2 for now. It is a reasonable default. Some workloads may >>>> want to use #1 or #3 and they can be implemented later based on user's >>>> demands. >>>> >>>> Support of unaccepted memory requires a few changes in core-mm code: >>>> >>>> - memblock has to accept memory on allocation; >>>> >>>> - page allocator has to accept memory on the first allocation of the >>>> page; >>>> >>>> Memblock change is trivial. >>>> >>>> The page allocator is modified to accept pages on the first allocation. >>>> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is >>>> used to indicate that the page requires acceptance. >>>> >>>> Architecture has to provide two helpers if it wants to support >>>> unaccepted memory: >>>> >>>> - accept_memory() makes a range of physical addresses accepted. >>>> >>>> - range_contains_unaccepted_memory() checks anything within the range >>>> of physical addresses requires acceptance. >>>> >>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >>>> Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock >>>> Reviewed-by: David Hildenbrand <david@redhat.com> >>> >>> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed >>> before, but it's really not great how this affects the core page allocator >>> paths. Wouldn't it be possible to only release pages to page allocator when >>> accepted, and otherwise use some new per-zone variables together with the >>> bitmap to track how much exactly is where to accept? 
Then it could be hooked >>> in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT - >>> if we fail zone_watermark_fast() and there are unaccepted pages in the zone, >>> accept them and continue. With a static key to flip in case we eventually >>> accept everything. Because this is really similar scenario to the deferred >>> init and that one was solved in a way that adds minimal overhead. >> >> I kind of like just having the memory stats being correct (e.g., free >> memory) and acceptance being an internal detail to be triggered when >> allocating pages -- just like the arch_alloc_page() callback. > > Hm, good point about the stats. Could be tweaked perhaps so it appears > correct on the outside, but might be tricky. > >> I'm sure we could optimize for the !unaccepted memory via static keys >> also in this version with some checks at the right places if we find >> this to hurt performance? > > It would be great if we would at least somehow hit the necessary code only > when dealing with a >=pageblock size block. The bitmap approach and > accepting everything smaller uprofront actually seems rather compatible. Yet > in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size > while merging. > > A list that sits besides the existing free_area, contains only >=pageblock > order sizes of unaccepted pages (no migratetype distinguished) and we tap > into it approximately before __rmqueue_fallback()? There would be some > trickery around releasing zone-lock for doing accept_memory(), but should be > manageable. > Just curious, do we have a microbenchmark that is able to reveal the impact of such code changes before we start worrying?
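One possible microbenchmark of the kind David asks about: time the first touch of fresh anonymous memory inside a guest, which is where both the allocator changes and on-demand acceptance would show up (illustrative userspace code):

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

static double now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
	size_t len  = 4UL << 30;	/* 4 GiB of anonymous memory */
	size_t step = 2UL << 20;	/* touch one 2 MiB block at a time */
	size_t off;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (off = 0; off < len; off += step) {
		double t0 = now_ns();

		memset(p + off, 1, step);	/* first touch: fault + possibly accept */
		printf("%zu MiB\t%.0f ns\n", off >> 20, now_ns() - t0);
	}
	return 0;
}
```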
On 8/5/22 06:38, Vlastimil Babka wrote: >> I'm sure we could optimize for the !unaccepted memory via static keys >> also in this version with some checks at the right places if we find >> this to hurt performance? > It would be great if we would at least somehow hit the necessary code only > when dealing with a >=pageblock size block. The bitmap approach and > accepting everything smaller uprofront actually seems rather compatible. Yet > in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size > while merging. Needing to check PageUnaccepted() during the merge is fallout from moving the acceptance to post_alloc_hook(). I _think_ an earlier version of this did page acceptance under the zone lock, closer to where the page comes off the 2M/4M lists. But, page acceptance is horribly slow, so I asked Kirill to move it out from under the zone lock. Doing it in post_alloc_hook() (after the zone lock is dropped) makes a lot of sense since we do zeroing in there and zeroing is also nice and slow. But, post_alloc_hook() is long after the 2M page has been split and that means that we have to deal with potentially unaccepted pages during merges. I think there are three basic options: 1. This patch: Do acceptance after the zone lock is dropped and deal with mixed-acceptance merges 2. Do acceptance under the zone lock as pages come off the 2M/4M lists, but before the page is split. 3. Pull the page off the 2M/4M lists, drop the zone lock, accept it, then put it back. I'm not sure any of those other options are better.
On 8/5/22 07:22, David Hildenbrand wrote: >> A list that sits besides the existing free_area, contains only >=pageblock >> order sizes of unaccepted pages (no migratetype distinguished) and we tap >> into it approximately before __rmqueue_fallback()? There would be some >> trickery around releasing zone-lock for doing accept_memory(), but should be >> manageable. >> > Just curious, do we have a microbenchmark that is able to reveal the > impact of such code changes before we start worrying? Nope. I went looking to see if I could find any impact. I think Kirill did too. Too bad that effort didn't make it into the changelog yet. The merging check at least is just checking a field in a cache-hot 'struct page'. The common case is probably three instructions: load to a register check the bit jump if not set It adds a wee bit of icache pressure, but it's also the kind of thing that should be a piece of cake for the branch predictors. That dynamic check could easily be wrapped by a static branch. But, that first requires more code to go dig in the nooks and crannies of the page allocator to make sure *ALL* pages are accepted.
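For reference, after expanding the PageType()/PAGE_TYPE_OPS() macros from the patch, the check Dave calls cheap boils down to roughly:

    static __always_inline int PageUnaccepted(struct page *page)
    {
            /* One load of page->page_type, a mask and a compare. */
            return (page->page_type & (PAGE_TYPE_BASE | PG_unaccepted)) ==
                    PAGE_TYPE_BASE;
    }

Guarding the callers with a static_branch_unlikely() key that gets disabled once all memory is accepted would, as Dave notes, turn even that into a patched-out jump in the common case.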
On 8/5/22 16:41, Dave Hansen wrote: > On 8/5/22 06:38, Vlastimil Babka wrote: >>> I'm sure we could optimize for the !unaccepted memory via static keys >>> also in this version with some checks at the right places if we find >>> this to hurt performance? >> It would be great if we would at least somehow hit the necessary code only >> when dealing with a >=pageblock size block. The bitmap approach and >> accepting everything smaller uprofront actually seems rather compatible. Yet >> in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size >> while merging. > > Needing to check PageUnaccepted() during the merge is fallout from > moving the acceptance to post_alloc_hook(). I _think_ an earlier > version of this did page acceptance under the zone lock, closer to where > the page comes off the 2M/4M lists. > > But, page acceptance is horribly slow, so I asked Kirill to move it out > from under the zone lock. Doing it in post_alloc_hook() (after the zone > lock is dropped) makes a lot of sense since we do zeroing in there and > zeroing is also nice and slow. > > But, post_alloc_hook() is long after the 2M page has been split and that > means that we have to deal with potentially unaccepted pages during merges. > > I think there are three basic options: > > 1. This patch: Do acceptance after the zone lock is dropped and deal > with mixed-acceptance merges > 2. Do acceptance under the zone lock as pages come off the 2M/4M lists, > but before the page is split. Rather not, as acceptance can be slow and we shouldn't hog the zone lock while doing it. > 3. Pull the page off the 2M/4M lists, drop the zone lock, accept it, > then put it back. Worth trying, IMHO. Perhaps easier to manage if the lists are distinct from the normal ones, as I suggested. > I'm not sure any of those other options are better.
On 8/5/22 11:17, Vlastimil Babka wrote: >> 3. Pull the page off the 2M/4M lists, drop the zone lock, accept it, >> then put it back. > Worth trying, IMHO. Perhaps easier to manage if the lists are distinct from > the normal ones, as I suggested. I was playing with another series recently where I did this, momentarily taking pages off some of the high-order lists and dropping the zone lock. Kirill, if you go looking at this, just make sure that you don't let this happen to too much memory at once. You might end up yanking memory out of the allocator that's not reflected in NR_FREE_PAGES. You might, for instance, want to make sure that only a small number of threads can have pulled memory off the free lists at once. Something *logically* like this:

    // Limit to two threads at once:
    atomic_t nr_accepting_threads = ATOMIC_INIT(2);

    page = del_page_from_free_list();
    if (!PageAccepted(page)) {
            if (atomic_dec_and_test(&nr_accepting_threads)) {
                    // already at the thread limit
                    add_page_to_free_list(page, ...);
                    spin_unlock_irq(&zone->lock);
                    // wait for a slot...
                    spin_lock_irq(&zone->lock);
                    goto retry;
            } else {
                    spin_unlock_irq(&zone->lock);
                    accept_page(page);
                    spin_lock_irq(&zone->lock);
                    add_page_to_free_list(page, ...);
                    // do merging if it was a 2M page
            }
    }

It's a little nasty because the whole thing is not a sleepable context. I also know that the merging code needs some refactoring if you want to do merging with 2M pages here. It might all get easier if you move all the page allocator stuff to only work at the 4M granularity. In any case, I'm not trying to dissuade anyone from listening to the other reviewer feedback. Just trying to save you a few cycles on a similar problem I was looking at recently.
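A slightly more explicit version of that limit, written as a pair of helpers; the counter, the limit and the names are illustrative only, not existing kernel API:

    #define MAX_ACCEPTING_THREADS   2

    static atomic_t nr_accepting_threads = ATOMIC_INIT(0);

    /* Called under zone->lock before pulling a page off to accept it. */
    static bool try_get_accept_slot(void)
    {
            if (atomic_inc_return(&nr_accepting_threads) <= MAX_ACCEPTING_THREADS)
                    return true;

            atomic_dec(&nr_accepting_threads);
            return false;
    }

    /* Called once the accepted page is back on the free list. */
    static void put_accept_slot(void)
    {
            atomic_dec(&nr_accepting_threads);
    }

A caller that fails to get a slot would put the page back, drop the zone lock, back off briefly and retry, exactly as in the pseudocode above.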
On Fri, Aug 05, 2022 at 01:49:41PM +0200, Vlastimil Babka wrote: > On 6/14/22 14:02, Kirill A. Shutemov wrote: > > UEFI Specification version 2.9 introduces the concept of memory > > acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD > > SEV-SNP, require memory to be accepted before it can be used by the > > guest. Accepting happens via a protocol specific to the Virtual Machine > > platform. > > > > There are several ways kernel can deal with unaccepted memory: > > > > 1. Accept all the memory during the boot. It is easy to implement and > > it doesn't have runtime cost once the system is booted. The downside > > is very long boot time. > > > > Accept can be parallelized to multiple CPUs to keep it manageable > > (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate > > memory bandwidth and does not scale beyond the point. > > > > 2. Accept a block of memory on the first use. It requires more > > infrastructure and changes in page allocator to make it work, but > > it provides good boot time. > > > > On-demand memory accept means latency spikes every time kernel steps > > onto a new memory block. The spikes will go away once workload data > > set size gets stabilized or all memory gets accepted. > > > > 3. Accept all memory in background. Introduce a thread (or multiple) > > that gets memory accepted proactively. It will minimize time the > > system experience latency spikes on memory allocation while keeping > > low boot time. > > > > This approach cannot function on its own. It is an extension of #2: > > background memory acceptance requires functional scheduler, but the > > page allocator may need to tap into unaccepted memory before that. > > > > The downside of the approach is that these threads also steal CPU > > cycles and memory bandwidth from the user's workload and may hurt > > user experience. > > > > Implement #2 for now. It is a reasonable default. Some workloads may > > want to use #1 or #3 and they can be implemented later based on user's > > demands. > > > > Support of unaccepted memory requires a few changes in core-mm code: > > > > - memblock has to accept memory on allocation; > > > > - page allocator has to accept memory on the first allocation of the > > page; > > > > Memblock change is trivial. > > > > The page allocator is modified to accept pages on the first allocation. > > The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is > > used to indicate that the page requires acceptance. > > > > Architecture has to provide two helpers if it wants to support > > unaccepted memory: > > > > - accept_memory() makes a range of physical addresses accepted. > > > > - range_contains_unaccepted_memory() checks anything within the range > > of physical addresses requires acceptance. > > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock > > Reviewed-by: David Hildenbrand <david@redhat.com> > > Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed > before, but it's really not great how this affects the core page allocator > paths. Wouldn't it be possible to only release pages to page allocator when > accepted, and otherwise use some new per-zone variables together with the > bitmap to track how much exactly is where to accept? Then it could be hooked > in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT - > if we fail zone_watermark_fast() and there are unaccepted pages in the zone, > accept them and continue. 
With a static key to flip in case we eventually > accept everything. Because this is really similar scenario to the deferred > init and that one was solved in a way that adds minimal overhead. > I think it might be more straight-forward to always accept pages in the size of the pageblock. Smaller ranges shouldn't matter because they have been accepted in deferred_free_range. In expand, if PageUnaccepted is set on a pageblock-sized page, take it off the list, drop the zone->lock leaving IRQs disabled, accept the memory and reacquire the lock to split the page into the required order. IRQs being left disabled is unfortunate but even if the acceptance is slow, it's presumably not so slow as to cause major problems. This would reduce and probably eliminate the need to do the assert check in accept_page. It might also simplify __free_one_page if it's known that a pageblock range of pages are either all accepted or unaccepted. Lastly, the default behaviour should probably be "accept all memory at boot" and use Kconfig to allow acceptance to be deferred or overridden by command line. There are at least two reasons for this. Even though this is a virtual machine, there still may be latency sensitive applications running early in boot using pinned vcpu->pcpu and no memory overcommit. The unpredictable performance of the application early in boot may be unacceptable and unavoidable. It might take a long time but it could eventually generate bug reports about "unpredictable performance early in boot" that will be hard to track down unless accept_memory is observed using perf at the right time. Even when that does happen, there will need to be an option to turn it off if the unpredictable performance cannot be tolerated. Second, any benchmarking done early in boot is likely to be disrupted making the series a potential bisection magnet that masks a performance bug elsewhere in the merge window.
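A minimal sketch of the expand()-time variant Mel describes above; the function and the re-add step are illustrative, while PageUnaccepted(), accept_page() and the free-list helpers come from the series and the existing allocator:

    static void accept_block_before_split(struct zone *zone, struct page *page,
                                          int migratetype)
    {
            if (!PageUnaccepted(page))
                    return;

            /* Take the whole pageblock off so nobody sees it half-accepted. */
            del_page_from_free_list(page, zone, pageblock_order);

            /* Drop zone->lock but leave IRQs disabled, as described above. */
            spin_unlock(&zone->lock);
            accept_page(page, pageblock_order);
            spin_lock(&zone->lock);

            /* Put it back and let expand() split it to the requested order. */
            add_to_free_list(page, zone, pageblock_order, migratetype);
    }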
> > > The unpredictable performance of the application early in boot may be > unacceptable and unavoidable. It might take a long time but it could > eventually generate bug reports about "unpredictable performance early > in boot" that will be hard to track down unless accept_memory is observed > using perf at the right time. Even when that does happen, there will need > to be an option to turn it off if the unpredictable performance cannot > be tolerated. Second, any benchmarking done early in boot is likely to > be disrupted making the series a potential bisection magnet that masks a > performance bug elsewhere in the merge window. I'm doing some boot performance tests now before I run some workload memory acceptance latency tests. Note that this testing is on AMD SEV-SNP, so this patch series on top of the AMD guest patches v12, plus a patch Brijesh Singh wrote to define __accept_memory for SEV-SNP https://github.com/AMDESE/linux/commit/ecae2582666d50ce1e633975d703d2f904183ece I was getting pretty consistent boot times, only going up slightly as the memory size increased, but at 256GB, the VM crashes because it touches some unaccepted memory without first accepting it. 255GB boots fine. The stack track is in mm/page_alloc.c. I've done a little investigation, but I can't account for why there's a hard cutoff of correctness at 256GB [ 0.065563] RIP: 0010:memmap_init_range+0x108/0x173 [ 0.066309] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d d9 a2 66 ff 48 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff ff ff [ 0.069108] RSP: 0000:ffffffffad603dc8 EFLAGS: 00010082 ORIG_RAX: 0000000000000404 [ 0.070193] RAX: ffffdba740000048 RBX: 0000000000000001 RCX: 0000000000000000 [ 0.071170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffdba740000040 [ 0.072224] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000 [ 0.073283] R10: 0000000000000001 R11: ffffffffad645c60 R12: 0000000000000000 [ 0.074304] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000 [ 0.075285] FS: 0000000000000000(0000) GS:ffffffffadd6c000(0000) knlGS:0000000000000000 [ 0.076365] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.077194] CR2: ffffdba740000074 CR3: 0008001ee3a0c000 CR4: 00000000000606b0 [ 0.078209] Call Trace: [ 0.078524] <TASK> [ 0.078887] ? free_area_init+0x5c1/0x66c [ 0.079417] ? zone_sizes_init+0x52/0x6c [ 0.079934] ? setup_arch+0xa55/0xb6d [ 0.080417] ? start_kernel+0x64/0x65a [ 0.080897] ? secondary_startup_64_no_verify+0xd6/0xdb [ 0.081620] </TASK> > > -- > Mel Gorman > SUSE Labs
On 8/15/22 16:08, Dionna Amalie Glaze wrote: >> >> >> The unpredictable performance of the application early in boot may be >> unacceptable and unavoidable. It might take a long time but it could >> eventually generate bug reports about "unpredictable performance early >> in boot" that will be hard to track down unless accept_memory is observed >> using perf at the right time. Even when that does happen, there will need >> to be an option to turn it off if the unpredictable performance cannot >> be tolerated. Second, any benchmarking done early in boot is likely to >> be disrupted making the series a potential bisection magnet that masks a >> performance bug elsewhere in the merge window. > > I'm doing some boot performance tests now before I run some workload > memory acceptance latency tests. > Note that this testing is on AMD SEV-SNP, so this patch series on top > of the AMD guest patches v12, plus a > patch Brijesh Singh wrote to define __accept_memory for SEV-SNP > https://github.com/AMDESE/linux/commit/ecae2582666d50ce1e633975d703d2f904183ece Note that there is a bug in Brijesh's version of the patch and it will almost exclusively use the MSR protocol. Please try the version of the patch that I recently sent up based on the current unaccepted memory tree from Kirill. https://lore.kernel.org/lkml/cover.1660579062.git.thomas.lendacky@amd.com/ Thanks, Tom > > I was getting pretty consistent boot times, only going up slightly as > the memory size increased, but at 256GB, the VM crashes because it > touches some unaccepted memory without first accepting it. 255GB boots > fine. > > The stack track is in mm/page_alloc.c. I've done a little > investigation, but I can't account for why there's a hard cutoff of > correctness at 256GB > > [ 0.065563] RIP: 0010:memmap_init_range+0x108/0x173 > [ 0.066309] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c > 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d d9 a2 66 ff 48 > 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff > ff ff > [ 0.069108] RSP: 0000:ffffffffad603dc8 EFLAGS: 00010082 ORIG_RAX: > 0000000000000404 > [ 0.070193] RAX: ffffdba740000048 RBX: 0000000000000001 RCX: 0000000000000000 > [ 0.071170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffdba740000040 > [ 0.072224] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000 > [ 0.073283] R10: 0000000000000001 R11: ffffffffad645c60 R12: 0000000000000000 > [ 0.074304] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000 > [ 0.075285] FS: 0000000000000000(0000) GS:ffffffffadd6c000(0000) > knlGS:0000000000000000 > [ 0.076365] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 0.077194] CR2: ffffdba740000074 CR3: 0008001ee3a0c000 CR4: 00000000000606b0 > [ 0.078209] Call Trace: > [ 0.078524] <TASK> > [ 0.078887] ? free_area_init+0x5c1/0x66c > [ 0.079417] ? zone_sizes_init+0x52/0x6c > [ 0.079934] ? setup_arch+0xa55/0xb6d > [ 0.080417] ? start_kernel+0x64/0x65a > [ 0.080897] ? secondary_startup_64_no_verify+0xd6/0xdb > [ 0.081620] </TASK> > >> >> -- >> Mel Gorman >> SUSE Labs > > >
> > The stack track is in mm/page_alloc.c. I've done a little > > investigation, but I can't account for why there's a hard cutoff of > > correctness at 256GB > > > > [ 0.065563] RIP: 0010:memmap_init_range+0x108/0x173 > > [ 0.066309] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c > > 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d d9 a2 66 ff 48 > > 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff > > ff ff > > [ 0.069108] RSP: 0000:ffffffffad603dc8 EFLAGS: 00010082 ORIG_RAX: > > 0000000000000404 > > [ 0.070193] RAX: ffffdba740000048 RBX: 0000000000000001 RCX: 0000000000000000 > > [ 0.071170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffdba740000040 > > [ 0.072224] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000 > > [ 0.073283] R10: 0000000000000001 R11: ffffffffad645c60 R12: 0000000000000000 > > [ 0.074304] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000 > > [ 0.075285] FS: 0000000000000000(0000) GS:ffffffffadd6c000(0000) > > knlGS:0000000000000000 > > [ 0.076365] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 0.077194] CR2: ffffdba740000074 CR3: 0008001ee3a0c000 CR4: 00000000000606b0 > > [ 0.078209] Call Trace: > > [ 0.078524] <TASK> > > [ 0.078887] ? free_area_init+0x5c1/0x66c > > [ 0.079417] ? zone_sizes_init+0x52/0x6c > > [ 0.079934] ? setup_arch+0xa55/0xb6d > > [ 0.080417] ? start_kernel+0x64/0x65a > > [ 0.080897] ? secondary_startup_64_no_verify+0xd6/0xdb > > [ 0.081620] </TASK> > > Note that there is a bug in Brijesh's version of the patch and it will > almost exclusively use the MSR protocol. Please try the version of the > patch that I recently sent up based on the current unaccepted memory tree > from Kirill. > I've now tested this patch set with Tom's new patch set, and it appears to be that the problem with 256GB is more likely to be due to this unaccepted memory patch set rather than something AMD-specific. Kirill, do you see any problems with 256GB on TDX? It seems there is some unaccepted memory getting touched in memmap_init_range when the VM's memory size is at least 256GB
On 8/29/22 09:02, Dionna Amalie Glaze wrote: >>> The stack track is in mm/page_alloc.c. I've done a little >>> investigation, but I can't account for why there's a hard cutoff of >>> correctness at 256GB >>> >>> [ 0.065563] RIP: 0010:memmap_init_range+0x108/0x173 >>> [ 0.066309] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c >>> 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d d9 a2 66 ff 48 >>> 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff >>> ff ff >>> [ 0.069108] RSP: 0000:ffffffffad603dc8 EFLAGS: 00010082 ORIG_RAX: >>> 0000000000000404 >>> [ 0.070193] RAX: ffffdba740000048 RBX: 0000000000000001 RCX: 0000000000000000 >>> [ 0.071170] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffdba740000040 >>> [ 0.072224] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000 >>> [ 0.073283] R10: 0000000000000001 R11: ffffffffad645c60 R12: 0000000000000000 >>> [ 0.074304] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000 >>> [ 0.075285] FS: 0000000000000000(0000) GS:ffffffffadd6c000(0000) >>> knlGS:0000000000000000 >>> [ 0.076365] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 0.077194] CR2: ffffdba740000074 CR3: 0008001ee3a0c000 CR4: 00000000000606b0 >>> [ 0.078209] Call Trace: >>> [ 0.078524] <TASK> >>> [ 0.078887] ? free_area_init+0x5c1/0x66c >>> [ 0.079417] ? zone_sizes_init+0x52/0x6c >>> [ 0.079934] ? setup_arch+0xa55/0xb6d >>> [ 0.080417] ? start_kernel+0x64/0x65a >>> [ 0.080897] ? secondary_startup_64_no_verify+0xd6/0xdb >>> [ 0.081620] </TASK> >> Note that there is a bug in Brijesh's version of the patch and it will >> almost exclusively use the MSR protocol. Please try the version of the >> patch that I recently sent up based on the current unaccepted memory tree >> from Kirill. >> > I've now tested this patch set with Tom's new patch set, and it > appears to be that the problem with 256GB is more likely to be due to > this unaccepted memory patch set rather than something AMD-specific. > > Kirill, do you see any problems with 256GB on TDX? It seems there is > some unaccepted memory getting touched in memmap_init_range when the > VM's memory size is at least 256GB It really helps this kind of stuff if you can post the *actual* error. I assume this was a page fault, so there should have been some other stuff before the RIP:... Another thing that's really nice is to do the disassembly of the "Code:" or share disassembly of memmap_init_range. Even nicer would be to give an faddr2line of the RIP value and track down which C code was actually at fault. It's *possible* to look into these things from what you posted, but it's just slow. I'm sure Kirill will appreciate the help.
> > It really helps this kind of stuff if you can post the *actual* error. > I assume this was a page fault, so there should have been some other > stuff before the RIP:... > I posted the error on August 15th. I was bumping in my last post since I confirmed with Tom Lendacky that it wasn't AMD's patches at fault. Here's a new dump below that matches the disassembly: [ 0.043137] Faking a node at [mem 0x0000000000000000-0x000000403fffffff] [ 0.044018] NODE_DATA(0) allocated [mem 0x403fffc000-0x403fffffff] [ 0.044922] Zone ranges: [ 0.045250] DMA [mem 0x0000000000001000-0x0000000000ffffff] [ 0.046039] DMA32 [mem 0x0000000001000000-0x00000000ffffffff] [ 0.046828] Normal [mem 0x0000000100000000-0x000000403fffffff] [ 0.047657] Movable zone start for each node [ 0.048201] Early memory node ranges [ 0.048674] node 0: [mem 0x0000000000001000-0x000000000009ffff] [ 0.049474] node 0: [mem 0x0000000000100000-0x000000000080cfff [ 0.050274] node 0: [mem 0x000000000080f000-0x00000000beceefff] [ 0.051074] node 0: [mem 0x00000000befff000-0x00000000bfbb0fff] [ 0.051874] node 0: [mem 0x00000000bfbb2000-0x00000000bffdbfff] [ 0.052674] node 0: [mem 0x0000000100000000-0x000000403fffffff] [ 0.053530] Initmem setup node 0 [mem 0x0000000000001000-0x000000403fffffff] PANIC: Unsupported exit-code 0x404 in early #VC exception (IP: 0xfffffffface0cdd0) [ 0.056667] CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc6-173762-gffb12b02c6d7-dirty #1 [ 0.057744] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [ 0.058920] RIP: 0010:memmap_init_range+0x11d/0x188 [ 0.059686] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d a4 1e 65 ff 48 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff ff ff [ 0.062121] RSP: 0000:ffffffffac603dc0 EFLAGS: 00010082 ORIG_RAX: 0000000000000404 [ 0.063087] RAX: ffffda1ac0000048 RBX: 0000000000000001 RCX: 0000000000000000 [ 0.063998] RDX: 0300000000000000 RSI: 0000000000000000 RDI: ffffda1ac000004 [ 0.064944] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000 [ 0.065873] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 0.066782] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000 [ 0.067695] FS: 0000000000000000(0000) GS:ffffffffacd88000(0000) knlGS:0000000000000000 [ 0.068727] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.069488] CR2: ffffda1ac0000074 CR3: 00080020b680c000 CR4: 00000000000606b0 [ 0.070397] Call Trace: [ 0.070710] <TASK> [ 0.070976] ? free_area_init+0x724/0x7d4 [ 0.071486] ? zone_sizes_init+0x52/0x6c [ 0.071986] ? setup_arch+0xa55/0xb77 [ 0.072453] ? start_kernel+0x64/0x65f [ 0.072931] ? secondary_startup_64_no_verify+0xd6/0xdb [ 0.073598] </TASK> Note this is a crash in SEV-SNP, but I assume we'd get the same #VE in TDX. > Another thing that's really nice is to do the disassembly of the "Code:" > or share disassembly of memmap_init_range. 
0000000000000172 <memmap_init_range>: 172: 41 56 push %r14 174: 89 f0 mov %esi,%eax 176: 45 89 ce mov %r9d,%r14d 179: 41 55 push %r13 17b: 4c 8d 2c 39 lea (%rcx,%rdi,1),%r13 17f: 41 54 push %r12 181: 49 89 d4 mov %rdx,%r12 184: 49 8d 55 ff lea -0x1(%r13),%rdx 188: 48 3b 15 00 00 00 00 cmp 0x0(%rip),%rdx # 18f <memmap_init_range+0x1d> 18f: 55 push %rbp 190: 53 push %rbx 191: 48 89 cb mov %rcx,%rbx 194: 76 07 jbe 19d <memmap_init_range+0x2b> 196: 48 89 15 00 00 00 00 mov %rdx,0x0(%rip) # 19d <memmap_init_range+0x2b> 19d: 4c 89 e5 mov %r12,%rbp 1a0: ba 03 00 00 00 mov $0x3,%edx 1a5: 48 c1 e0 3a shl $0x3a,%rax 1a9: 48 c1 e5 38 shl $0x38,%rbp 1ad: 48 c1 e2 38 shl $0x38,%rdx 1b1: 48 21 d5 and %rdx,%rbp 1b4: 48 09 c5 or %rax,%rbp 1b7: 49 39 dd cmp %rbx,%r13 1ba: 0f 86 31 01 00 00 jbe 2f1 <memmap_init_range+0x17f> 1c0: 45 85 f6 test %r14d,%r14d 1c3: 0f 85 b4 00 00 00 jne 27d <memmap_init_range+0x10b> 1c9: 49 83 fc 03 cmp $0x3,%r12 1cd: 0f 94 c1 sete %cl 1d0: 22 0d 00 00 00 00 and 0x0(%rip),%cl # 1d6 <memmap_init_range+0x64> 1d6: 0f 84 a1 00 00 00 je 27d <memmap_init_range+0x10b> 1dc: 48 8b 15 00 00 00 00 mov 0x0(%rip),%rdx # 1e3 <memmap_init_range+0x71> 1e3: 48 85 d2 test %rdx,%rdx 1e6: 74 10 je 1f8 <memmap_init_range+0x86> 1e8: 48 8b 42 08 mov 0x8(%rdx),%rax 1ec: 48 03 02 add (%rdx),%rax 1ef: 48 c1 e8 0c shr $0xc,%rax 1f3: 48 39 d8 cmp %rbx,%rax 1f6: 77 55 ja 24d <memmap_init_range+0xdb> 1f8: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 1ff <memmap_init_range+0x8d> 1ff: 4c 6b 05 00 00 00 00 imul $0x18,0x0(%rip),%r8 # 207 <memmap_init_range+0x95> 206: 18 207: 31 f6 xor %esi,%esi 209: 48 89 05 00 00 00 00 mov %rax,0x0(%rip) # 210 <memmap_init_range+0x9e> 210: 49 01 c0 add %rax,%r8 213: 48 89 c7 mov %rax,%rdi 216: 4c 39 c0 cmp %r8,%rax 219: 73 26 jae 241 <memmap_init_range+0xcf> 21b: 48 8b 57 08 mov 0x8(%rdi),%rdx 21f: 48 03 17 add (%rdi),%rdx 222: 48 83 c0 18 add $0x18,%rax 226: 48 c1 ea 0c shr $0xc,%rdx 22a: 48 39 da cmp %rbx,%rdx 22d: 76 0e jbe 23d <memmap_init_range+0xcb> 22f: 40 84 f6 test %sil,%sil 232: 74 19 je 24d <memmap_init_range+0xdb> 234: 48 89 3d 00 00 00 00 mov %rdi,0x0(%rip) # 23b <memmap_init_range+0xc9> 23b: eb 10 jmp 24d <memmap_init_range+0xdb> 23d: 89 ce mov %ecx,%esi 23f: eb d2 jmp 213 <memmap_init_range+0xa1> 241: 40 84 f6 test %sil,%sil 244: 74 07 je 24d <memmap_init_range+0xdb> 246: 48 89 05 00 00 00 00 mov %rax,0x0(%rip) # 24d <memmap_init_range+0xdb> 24d: 48 8b 15 00 00 00 00 mov 0x0(%rip),%rdx # 254 <memmap_init_range+0xe2> 254: 48 8b 02 mov (%rdx),%rax 257: 48 8d 88 ff 0f 00 00 lea 0xfff(%rax),%rcx 25e: 48 c1 e9 0c shr $0xc,%rcx 262: 48 39 d9 cmp %rbx,%rcx 265: 77 16 ja 27d <memmap_init_range+0x10b> 267: f6 42 10 02 testb $0x2,0x10(%rdx) 26b: 74 10 je 27d <memmap_init_range+0x10b> 26d: 48 03 42 08 add 0x8(%rdx),%rax 271: 48 c1 e8 0c shr $0xc,%rax 275: 48 89 c3 mov %rax,%rbx 278: e9 3a ff ff ff jmp 1b7 <memmap_init_range+0x45> 27d: 48 89 df mov %rbx,%rdi 280: 48 c1 e7 06 shl $0x6,%rdi 284: 48 03 3d 00 00 00 00 add 0x0(%rip),%rdi # 28b <memmap_init_range+0x119> 28b: 48 8d 47 08 lea 0x8(%rdi),%rax 28f: c7 47 34 01 00 00 00 movl $0x1,0x34(%rdi) # Here's where the crash RIP is. 
296: 48 c7 47 38 00 00 00 movq $0x0,0x38(%rdi) 29d: 00 29e: c7 47 30 ff ff ff ff movl $0xffffffff,0x30(%rdi) 2a5: 48 c7 47 28 00 00 00 movq $0x0,0x28(%rdi) 2ac: 00 2ad: 48 c7 47 20 00 00 00 movq $0x0,0x20(%rdi) 2b4: 00 2b5: 48 c7 47 18 00 00 00 movq $0x0,0x18(%rdi) 2bc: 00 2bd: 48 89 2f mov %rbp,(%rdi) 2c0: 48 89 47 08 mov %rax,0x8(%rdi) 2c4: 48 89 47 10 mov %rax,0x10(%rdi) 2c8: 41 83 fe 01 cmp $0x1,%r14d 2cc: 75 05 jne 2d3 <memmap_init_range+0x161> 2ce: 48 0f ba 2f 0c btsq $0xc,(%rdi) 2d3: f7 c3 ff 01 00 00 test $0x1ff,%ebx 2d9: 75 0e jne 2e9 <memmap_init_range+0x177> 2db: 8b 74 24 38 mov 0x38(%rsp),%esi 2df: e8 00 00 00 00 call 2e4 <memmap_init_range+0x172> 2e4: e8 00 00 00 00 call 2e9 <memmap_init_range+0x177> 2e9: 48 ff c3 inc %rbx 2ec: e9 c6 fe ff ff jmp 1b7 <memmap_init_range+0x45> 2f1: 5b pop %rbx 2f2: 5d pop %rbp 2f3: 41 5c pop %r12 2f5: 41 5d pop %r13 2f7: 41 5e pop %r14 2f9: c3 ret > Even nicer would be to give > an faddr2line of the RIP value and track down which C code was actually > at fault. arch_atomic_set arch/x86/include/asm/atomic.h:41 of INIT_LIST_HEAD in __init_single_page, called from memmap_init_range. -- -Dionna Glaze, PhD (she/her)
On Tue, Sep 06, 2022 at 10:50:42AM -0700, Dionna Amalie Glaze wrote: > > > > It really helps this kind of stuff if you can post the *actual* error. > > I assume this was a page fault, so there should have been some other > > stuff before the RIP:... > > > > I posted the error on August 15th. I was bumping in my last post > since I confirmed with Tom Lendacky that it wasn't AMD's patches at > fault. > Here's a new dump below that matches the disassembly: > > [ 0.043137] Faking a node at [mem 0x0000000000000000-0x000000403fffffff] > [ 0.044018] NODE_DATA(0) allocated [mem 0x403fffc000-0x403fffffff] > [ 0.044922] Zone ranges: > [ 0.045250] DMA [mem 0x0000000000001000-0x0000000000ffffff] > [ 0.046039] DMA32 [mem 0x0000000001000000-0x00000000ffffffff] > [ 0.046828] Normal [mem 0x0000000100000000-0x000000403fffffff] > [ 0.047657] Movable zone start for each node > [ 0.048201] Early memory node ranges > [ 0.048674] node 0: [mem 0x0000000000001000-0x000000000009ffff] > [ 0.049474] node 0: [mem 0x0000000000100000-0x000000000080cfff > [ 0.050274] node 0: [mem 0x000000000080f000-0x00000000beceefff] > [ 0.051074] node 0: [mem 0x00000000befff000-0x00000000bfbb0fff] > [ 0.051874] node 0: [mem 0x00000000bfbb2000-0x00000000bffdbfff] > [ 0.052674] node 0: [mem 0x0000000100000000-0x000000403fffffff] > [ 0.053530] Initmem setup node 0 [mem 0x0000000000001000-0x000000403fffffff] > PANIC: Unsupported exit-code 0x404 in early #VC exception (IP: > 0xfffffffface0cdd0) > [ 0.056667] CPU: 0 PID: 0 Comm: swapper Not tainted > 5.17.0-rc6-173762-gffb12b02c6d7-dirty #1 > [ 0.057744] Hardware name: Google Google Compute Engine/Google > Compute Engine, BIOS Google 01/01/2011 > [ 0.058920] RIP: 0010:memmap_init_range+0x11d/0x188 > [ 0.059686] Code: 77 16 f6 42 10 02 74 10 48 03 42 08 48 c1 e8 0c > 48 89 c3 e9 3a ff ff ff 48 89 df 48 c1 e7 06 48 03 3d a4 1e 65 ff 48 > 8d 47 08 <c7> 47 34 01 00 00 00 48 c7 47 38 00 00 00 00 c7 47 30 ff ff > ff ff > [ 0.062121] RSP: 0000:ffffffffac603dc0 EFLAGS: 00010082 ORIG_RAX: > 0000000000000404 > [ 0.063087] RAX: ffffda1ac0000048 RBX: 0000000000000001 RCX: 0000000000000000 > [ 0.063998] RDX: 0300000000000000 RSI: 0000000000000000 RDI: ffffda1ac000004 > [ 0.064944] RBP: 0000000000000000 R08: 0000000000001000 R09: 0000000000000000 > [ 0.065873] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 > [ 0.066782] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000 > [ 0.067695] FS: 0000000000000000(0000) GS:ffffffffacd88000(0000) > knlGS:0000000000000000 > [ 0.068727] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 0.069488] CR2: ffffda1ac0000074 CR3: 00080020b680c000 CR4: 00000000000606b0 > [ 0.070397] Call Trace: > [ 0.070710] <TASK> > [ 0.070976] ? free_area_init+0x724/0x7d4 > [ 0.071486] ? zone_sizes_init+0x52/0x6c > [ 0.071986] ? setup_arch+0xa55/0xb77 > [ 0.072453] ? start_kernel+0x64/0x65f > [ 0.072931] ? secondary_startup_64_no_verify+0xd6/0xdb > [ 0.073598] </TASK> > > Note this is a crash in SEV-SNP, but I assume we'd get the same #VE in TDX. > > > Another thing that's really nice is to do the disassembly of the "Code:" > > or share disassembly of memmap_init_range. 
> > 0000000000000172 <memmap_init_range>: > 172: 41 56 push %r14 > 174: 89 f0 mov %esi,%eax > 176: 45 89 ce mov %r9d,%r14d > 179: 41 55 push %r13 > 17b: 4c 8d 2c 39 lea (%rcx,%rdi,1),%r13 > 17f: 41 54 push %r12 > 181: 49 89 d4 mov %rdx,%r12 > 184: 49 8d 55 ff lea -0x1(%r13),%rdx > 188: 48 3b 15 00 00 00 00 cmp 0x0(%rip),%rdx # 18f > <memmap_init_range+0x1d> > 18f: 55 push %rbp > 190: 53 push %rbx > 191: 48 89 cb mov %rcx,%rbx > 194: 76 07 jbe 19d <memmap_init_range+0x2b> > 196: 48 89 15 00 00 00 00 mov %rdx,0x0(%rip) # 19d > <memmap_init_range+0x2b> > 19d: 4c 89 e5 mov %r12,%rbp > 1a0: ba 03 00 00 00 mov $0x3,%edx > 1a5: 48 c1 e0 3a shl $0x3a,%rax > 1a9: 48 c1 e5 38 shl $0x38,%rbp > 1ad: 48 c1 e2 38 shl $0x38,%rdx > 1b1: 48 21 d5 and %rdx,%rbp > 1b4: 48 09 c5 or %rax,%rbp > 1b7: 49 39 dd cmp %rbx,%r13 > 1ba: 0f 86 31 01 00 00 jbe 2f1 <memmap_init_range+0x17f> > 1c0: 45 85 f6 test %r14d,%r14d > 1c3: 0f 85 b4 00 00 00 jne 27d <memmap_init_range+0x10b> > 1c9: 49 83 fc 03 cmp $0x3,%r12 > 1cd: 0f 94 c1 sete %cl > 1d0: 22 0d 00 00 00 00 and 0x0(%rip),%cl # 1d6 > <memmap_init_range+0x64> > 1d6: 0f 84 a1 00 00 00 je 27d <memmap_init_range+0x10b> > 1dc: 48 8b 15 00 00 00 00 mov 0x0(%rip),%rdx # 1e3 > <memmap_init_range+0x71> > 1e3: 48 85 d2 test %rdx,%rdx > 1e6: 74 10 je 1f8 <memmap_init_range+0x86> > 1e8: 48 8b 42 08 mov 0x8(%rdx),%rax > 1ec: 48 03 02 add (%rdx),%rax > 1ef: 48 c1 e8 0c shr $0xc,%rax > 1f3: 48 39 d8 cmp %rbx,%rax > 1f6: 77 55 ja 24d <memmap_init_range+0xdb> > 1f8: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 1ff > <memmap_init_range+0x8d> > 1ff: 4c 6b 05 00 00 00 00 imul $0x18,0x0(%rip),%r8 # > 207 <memmap_init_range+0x95> > 206: 18 > 207: 31 f6 xor %esi,%esi > 209: 48 89 05 00 00 00 00 mov %rax,0x0(%rip) # 210 > <memmap_init_range+0x9e> > 210: 49 01 c0 add %rax,%r8 > 213: 48 89 c7 mov %rax,%rdi > 216: 4c 39 c0 cmp %r8,%rax > 219: 73 26 jae 241 <memmap_init_range+0xcf> > 21b: 48 8b 57 08 mov 0x8(%rdi),%rdx > 21f: 48 03 17 add (%rdi),%rdx > 222: 48 83 c0 18 add $0x18,%rax > 226: 48 c1 ea 0c shr $0xc,%rdx > 22a: 48 39 da cmp %rbx,%rdx > 22d: 76 0e jbe 23d <memmap_init_range+0xcb> > 22f: 40 84 f6 test %sil,%sil > 232: 74 19 je 24d <memmap_init_range+0xdb> > 234: 48 89 3d 00 00 00 00 mov %rdi,0x0(%rip) # 23b > <memmap_init_range+0xc9> > 23b: eb 10 jmp 24d <memmap_init_range+0xdb> > 23d: 89 ce mov %ecx,%esi > 23f: eb d2 jmp 213 <memmap_init_range+0xa1> > 241: 40 84 f6 test %sil,%sil > 244: 74 07 je 24d <memmap_init_range+0xdb> > 246: 48 89 05 00 00 00 00 mov %rax,0x0(%rip) # 24d > <memmap_init_range+0xdb> > 24d: 48 8b 15 00 00 00 00 mov 0x0(%rip),%rdx # 254 > <memmap_init_range+0xe2> > 254: 48 8b 02 mov (%rdx),%rax > 257: 48 8d 88 ff 0f 00 00 lea 0xfff(%rax),%rcx > 25e: 48 c1 e9 0c shr $0xc,%rcx > 262: 48 39 d9 cmp %rbx,%rcx > 265: 77 16 ja 27d <memmap_init_range+0x10b> > 267: f6 42 10 02 testb $0x2,0x10(%rdx) > 26b: 74 10 je 27d <memmap_init_range+0x10b> > 26d: 48 03 42 08 add 0x8(%rdx),%rax > 271: 48 c1 e8 0c shr $0xc,%rax > 275: 48 89 c3 mov %rax,%rbx > 278: e9 3a ff ff ff jmp 1b7 <memmap_init_range+0x45> > 27d: 48 89 df mov %rbx,%rdi > 280: 48 c1 e7 06 shl $0x6,%rdi > 284: 48 03 3d 00 00 00 00 add 0x0(%rip),%rdi # 28b > <memmap_init_range+0x119> > 28b: 48 8d 47 08 lea 0x8(%rdi),%rax > 28f: c7 47 34 01 00 00 00 movl $0x1,0x34(%rdi) # Here's > where the crash RIP is. 
> 296: 48 c7 47 38 00 00 00 movq $0x0,0x38(%rdi) > 29d: 00 > 29e: c7 47 30 ff ff ff ff movl $0xffffffff,0x30(%rdi) > 2a5: 48 c7 47 28 00 00 00 movq $0x0,0x28(%rdi) > 2ac: 00 > 2ad: 48 c7 47 20 00 00 00 movq $0x0,0x20(%rdi) > 2b4: 00 > 2b5: 48 c7 47 18 00 00 00 movq $0x0,0x18(%rdi) > 2bc: 00 > 2bd: 48 89 2f mov %rbp,(%rdi) > 2c0: 48 89 47 08 mov %rax,0x8(%rdi) > 2c4: 48 89 47 10 mov %rax,0x10(%rdi) > 2c8: 41 83 fe 01 cmp $0x1,%r14d > 2cc: 75 05 jne 2d3 <memmap_init_range+0x161> > 2ce: 48 0f ba 2f 0c btsq $0xc,(%rdi) > 2d3: f7 c3 ff 01 00 00 test $0x1ff,%ebx > 2d9: 75 0e jne 2e9 <memmap_init_range+0x177> > 2db: 8b 74 24 38 mov 0x38(%rsp),%esi > 2df: e8 00 00 00 00 call 2e4 <memmap_init_range+0x172> > 2e4: e8 00 00 00 00 call 2e9 <memmap_init_range+0x177> > 2e9: 48 ff c3 inc %rbx > 2ec: e9 c6 fe ff ff jmp 1b7 <memmap_init_range+0x45> > 2f1: 5b pop %rbx > 2f2: 5d pop %rbp > 2f3: 41 5c pop %r12 > 2f5: 41 5d pop %r13 > 2f7: 41 5e pop %r14 > 2f9: c3 ret > > > Even nicer would be to give > > an faddr2line of the RIP value and track down which C code was actually > > at fault. > > arch_atomic_set > arch/x86/include/asm/atomic.h:41 > > of INIT_LIST_HEAD in __init_single_page, called from memmap_init_range. Looks like the first access to the memory map fails, although I think it's not in INIT_LIST_HEAD() but rather in init_page_count(). I'd start with making sure that page_alloc::memmap_alloc() actually returns accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map will poisoned in this function, so my guess is it'd crash there. > -- > -Dionna Glaze, PhD (she/her)
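One way to follow that hint with the helpers this series already provides is a (hypothetical) sanity check right after the memory map is allocated, for example:

    /* Hypothetical debug aid: warn if memmap_alloc() handed back a range
     * that still contains unaccepted memory. */
    static void __init check_memmap_accepted(void *ptr, phys_addr_t size)
    {
            phys_addr_t start = __pa(ptr);

            WARN_ON(range_contains_unaccepted_memory(start, start + size));
    }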
> > Looks like the first access to the memory map fails, although I think > it's not in INIT_LIST_HEAD() but rather in init_page_count(). > > I'd start with making sure that page_alloc::memmap_alloc() actually returns > accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map > will poisoned in this function, so my guess is it'd crash there. > That's a wonderful hint, thank you! I did not run this test CONFIG_DEBUG_VM set, but you think it's possible it could still be here? > -- > Sincerely yours, > Mike.
On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote: > > > > Looks like the first access to the memory map fails, although I think > > it's not in INIT_LIST_HEAD() but rather in init_page_count(). > > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns > > accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map > > will poisoned in this function, so my guess is it'd crash there. > > > > That's a wonderful hint, thank you! I did not run this test > CONFIG_DEBUG_VM set, but you think it's possible it could still be > here? It depends on how you configured your kernel. Say, defconfig does not set it.
On 9/8/22 14:28, Mike Rapoport wrote: > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote: >>> >>> Looks like the first access to the memory map fails, although I think >>> it's not in INIT_LIST_HEAD() but rather in init_page_count(). >>> >>> I'd start with making sure that page_alloc::memmap_alloc() actually returns >>> accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map >>> will poisoned in this function, so my guess is it'd crash there. >>> >> >> That's a wonderful hint, thank you! I did not run this test >> CONFIG_DEBUG_VM set, but you think it's possible it could still be >> here? > > It depends on how you configured your kernel. Say, defconfig does not set > it. > I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP and fails in memmap_init_range() when attempting to add the first PFN. It looks like the underlying page that is backing the vmemmap has not been accepted (I receive a #VC 0x404 => page not validated). Kirill, is this a path that you've looked at? It would appear that somewhere in the vmemmap_populate_hugepages() path, some memory acceptance needs to be done for the pages that are used to back vmemmap. I'm not very familiar with this code, so I'm not sure why everything works for a guest with 255GB of memory, but then fails for a guest with 256GB of memory. Thanks, Tom
On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote: > On 9/8/22 14:28, Mike Rapoport wrote: > > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote: > > > > > > > > Looks like the first access to the memory map fails, although I think > > > > it's not in INIT_LIST_HEAD() but rather in init_page_count(). > > > > > > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns > > > > accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map > > > > will poisoned in this function, so my guess is it'd crash there. > > > > > > > > > > That's a wonderful hint, thank you! I did not run this test > > > CONFIG_DEBUG_VM set, but you think it's possible it could still be > > > here? > > > > It depends on how you configured your kernel. Say, defconfig does not set > > it. > > > > I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP > and fails in memmap_init_range() when attempting to add the first PFN. It > looks like the underlying page that is backing the vmemmap has not been > accepted (I receive a #VC 0x404 => page not validated). > > Kirill, is this a path that you've looked at? It would appear that somewhere > in the vmemmap_populate_hugepages() path, some memory acceptance needs to be > done for the pages that are used to back vmemmap. I'm not very familiar with > this code, so I'm not sure why everything works for a guest with 255GB of > memory, but then fails for a guest with 256GB of memory. Hm. I don't have machine that large at hands at the moment. And I have not looked at the codepath before. I will try to look into the issue.
On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote: > On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote: > > On 9/8/22 14:28, Mike Rapoport wrote: > > > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote: > > > > > > > > > > Looks like the first access to the memory map fails, although I think > > > > > it's not in INIT_LIST_HEAD() but rather in init_page_count(). > > > > > > > > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns > > > > > accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map > > > > > will poisoned in this function, so my guess is it'd crash there. > > > > > > > > > > > > > That's a wonderful hint, thank you! I did not run this test > > > > CONFIG_DEBUG_VM set, but you think it's possible it could still be > > > > here? > > > > > > It depends on how you configured your kernel. Say, defconfig does not set > > > it. > > > > > > > I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP > > and fails in memmap_init_range() when attempting to add the first PFN. It > > looks like the underlying page that is backing the vmemmap has not been > > accepted (I receive a #VC 0x404 => page not validated). > > > > Kirill, is this a path that you've looked at? It would appear that somewhere > > in the vmemmap_populate_hugepages() path, some memory acceptance needs to be > > done for the pages that are used to back vmemmap. I'm not very familiar with > > this code, so I'm not sure why everything works for a guest with 255GB of > > memory, but then fails for a guest with 256GB of memory. > > Hm. I don't have machine that large at hands at the moment. And I have not > looked at the codepath before. > > I will try to look into the issue. I'd add some printks to verify we actually try to accept the allocated memory. E.g. something like the patch below: diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c index 9ec2304272dc..8f00639facc4 100644 --- a/arch/x86/mm/unaccepted_memory.c +++ b/arch/x86/mm/unaccepted_memory.c @@ -22,6 +22,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end) if (!boot_params.unaccepted_memory) return; + if (system_state == SYSTEM_BOOTING) + pr_info("%s: start: %pa end: %pa %pS\n", __func__, &start, &end, (void *)_RET_IP_); + bitmap = __va(boot_params.unaccepted_memory); range_start = start / PMD_SIZE; diff --git a/mm/memblock.c b/mm/memblock.c index a1f7f8b304d5..029dd520102d 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1535,7 +1535,7 @@ void * __init memblock_alloc_exact_nid_raw( phys_addr_t min_addr, phys_addr_t max_addr, int nid) { - memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n", + pr_info("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n", __func__, (u64)size, (u64)align, nid, &min_addr, &max_addr, (void *)_RET_IP_); @@ -1567,7 +1567,7 @@ void * __init memblock_alloc_try_nid_raw( phys_addr_t min_addr, phys_addr_t max_addr, int nid) { - memblock_dbg("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n", + pr_info("%s: %llu bytes align=0x%llx nid=%d from=%pa max_addr=%pa %pS\n", __func__, (u64)size, (u64)align, nid, &min_addr, &max_addr, (void *)_RET_IP_); > -- > Kiryl Shutsemau / Kirill A. Shutemov >
On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote: > On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote: > > On 9/8/22 14:28, Mike Rapoport wrote: > > > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote: > > > > > > > > > > Looks like the first access to the memory map fails, although I think > > > > > it's not in INIT_LIST_HEAD() but rather in init_page_count(). > > > > > > > > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns > > > > > accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map > > > > > will poisoned in this function, so my guess is it'd crash there. > > > > > > > > > > > > > That's a wonderful hint, thank you! I did not run this test > > > > CONFIG_DEBUG_VM set, but you think it's possible it could still be > > > > here? > > > > > > It depends on how you configured your kernel. Say, defconfig does not set > > > it. > > > > > > > I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP > > and fails in memmap_init_range() when attempting to add the first PFN. It > > looks like the underlying page that is backing the vmemmap has not been > > accepted (I receive a #VC 0x404 => page not validated). > > > > Kirill, is this a path that you've looked at? It would appear that somewhere > > in the vmemmap_populate_hugepages() path, some memory acceptance needs to be > > done for the pages that are used to back vmemmap. I'm not very familiar with > > this code, so I'm not sure why everything works for a guest with 255GB of > > memory, but then fails for a guest with 256GB of memory. > > Hm. I don't have machine that large at hands at the moment. And I have not > looked at the codepath before. > > I will try to look into the issue. I'm not able to trigger the bug. With help of vm.overcommit_memory=1, I was managed boot TDX guest to shell with 256G and 1T of guest memory just fine. Any chance it is SEV-SNP specific? Or maybe there some difference in kernel config? Could you share yours?
On 9/26/22 07:10, Kirill A. Shutemov wrote: > On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote: >> On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote: >>> On 9/8/22 14:28, Mike Rapoport wrote: >>>> On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote: >>>>>> >>>>>> Looks like the first access to the memory map fails, although I think >>>>>> it's not in INIT_LIST_HEAD() but rather in init_page_count(). >>>>>> >>>>>> I'd start with making sure that page_alloc::memmap_alloc() actually returns >>>>>> accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map >>>>>> will poisoned in this function, so my guess is it'd crash there. >>>>>> >>>>> >>>>> That's a wonderful hint, thank you! I did not run this test >>>>> CONFIG_DEBUG_VM set, but you think it's possible it could still be >>>>> here? >>>> >>>> It depends on how you configured your kernel. Say, defconfig does not set >>>> it. >>>> >>> >>> I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP >>> and fails in memmap_init_range() when attempting to add the first PFN. It >>> looks like the underlying page that is backing the vmemmap has not been >>> accepted (I receive a #VC 0x404 => page not validated). >>> >>> Kirill, is this a path that you've looked at? It would appear that somewhere >>> in the vmemmap_populate_hugepages() path, some memory acceptance needs to be >>> done for the pages that are used to back vmemmap. I'm not very familiar with >>> this code, so I'm not sure why everything works for a guest with 255GB of >>> memory, but then fails for a guest with 256GB of memory. >> >> Hm. I don't have machine that large at hands at the moment. And I have not >> looked at the codepath before. >> >> I will try to look into the issue. > > I'm not able to trigger the bug. > > With help of vm.overcommit_memory=1, I was managed boot TDX guest to shell > with 256G and 1T of guest memory just fine. > > Any chance it is SEV-SNP specific? There's always a chance. I'll do some more tracing and see what I can find to try and be certain. > > Or maybe there some difference in kernel config? Could you share yours? Yes, I'll send that to you off-list. Thanks, Tom >
On Mon, Sep 26, 2022 at 08:38:34AM -0500, Tom Lendacky wrote: > On 9/26/22 07:10, Kirill A. Shutemov wrote: > > On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote: > > > On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote: > > > > On 9/8/22 14:28, Mike Rapoport wrote: > > > > > On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote: > > > > > > > > > > > > > > Looks like the first access to the memory map fails, although I think > > > > > > > it's not in INIT_LIST_HEAD() but rather in init_page_count(). > > > > > > > > > > > > > > I'd start with making sure that page_alloc::memmap_alloc() actually returns > > > > > > > accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the memory map > > > > > > > will poisoned in this function, so my guess is it'd crash there. > > > > > > > > > > > > > > > > > > > That's a wonderful hint, thank you! I did not run this test > > > > > > CONFIG_DEBUG_VM set, but you think it's possible it could still be > > > > > > here? > > > > > > > > > > It depends on how you configured your kernel. Say, defconfig does not set > > > > > it. > > > > > > > > > > > > > I also hit the issue at 256GB. My config is using CONFIG_SPARSEMEM_VMEMMAP > > > > and fails in memmap_init_range() when attempting to add the first PFN. It > > > > looks like the underlying page that is backing the vmemmap has not been > > > > accepted (I receive a #VC 0x404 => page not validated). > > > > > > > > Kirill, is this a path that you've looked at? It would appear that somewhere > > > > in the vmemmap_populate_hugepages() path, some memory acceptance needs to be > > > > done for the pages that are used to back vmemmap. I'm not very familiar with > > > > this code, so I'm not sure why everything works for a guest with 255GB of > > > > memory, but then fails for a guest with 256GB of memory. > > > > > > Hm. I don't have machine that large at hands at the moment. And I have not > > > looked at the codepath before. > > > > > > I will try to look into the issue. > > > > I'm not able to trigger the bug. > > > > With help of vm.overcommit_memory=1, I was managed boot TDX guest to shell > > with 256G and 1T of guest memory just fine. > > > > Any chance it is SEV-SNP specific? > > There's always a chance. I'll do some more tracing and see what I can find > to try and be certain. > > > > > Or maybe there some difference in kernel config? Could you share yours? > > Yes, I'll send that to you off-list. Still nothing with your config :/
On 9/26/22 08:38, Tom Lendacky wrote: > On 9/26/22 07:10, Kirill A. Shutemov wrote: >> On Sat, Sep 24, 2022 at 04:03:02AM +0300, Kirill A. Shutemov wrote: >>> On Thu, Sep 22, 2022 at 09:31:12AM -0500, Tom Lendacky wrote: >>>> On 9/8/22 14:28, Mike Rapoport wrote: >>>>> On Thu, Sep 08, 2022 at 09:23:07AM -0700, Dionna Amalie Glaze wrote: >>>>>>> >>>>>>> Looks like the first access to the memory map fails, although I think >>>>>>> it's not in INIT_LIST_HEAD() but rather in init_page_count(). >>>>>>> >>>>>>> I'd start with making sure that page_alloc::memmap_alloc() actually >>>>>>> returns >>>>>>> accepted memory. If you build kernel with CONFIG_DEBUG_VM=y the >>>>>>> memory map >>>>>>> will poisoned in this function, so my guess is it'd crash there. >>>>>>> >>>>>> >>>>>> That's a wonderful hint, thank you! I did not run this test >>>>>> CONFIG_DEBUG_VM set, but you think it's possible it could still be >>>>>> here? >>>>> >>>>> It depends on how you configured your kernel. Say, defconfig does not >>>>> set >>>>> it. >>>>> >>>> >>>> I also hit the issue at 256GB. My config is using >>>> CONFIG_SPARSEMEM_VMEMMAP >>>> and fails in memmap_init_range() when attempting to add the first PFN. It >>>> looks like the underlying page that is backing the vmemmap has not been >>>> accepted (I receive a #VC 0x404 => page not validated). >>>> >>>> Kirill, is this a path that you've looked at? It would appear that >>>> somewhere >>>> in the vmemmap_populate_hugepages() path, some memory acceptance needs >>>> to be >>>> done for the pages that are used to back vmemmap. I'm not very >>>> familiar with >>>> this code, so I'm not sure why everything works for a guest with 255GB of >>>> memory, but then fails for a guest with 256GB of memory. >>> >>> Hm. I don't have machine that large at hands at the moment. And I have not >>> looked at the codepath before. >>> >>> I will try to look into the issue. >> >> I'm not able to trigger the bug. >> >> With help of vm.overcommit_memory=1, I was managed boot TDX guest to shell >> with 256G and 1T of guest memory just fine. >> >> Any chance it is SEV-SNP specific? > > There's always a chance. I'll do some more tracing and see what I can find > to try and be certain. Indeed, it was an issue in the SNP path, shifting the number of pages as an unsigned int by PAGE_SIZE and adding to an unsigned long. The value to be shifted was 0x100000 and resulted in 0. Changing the number of pages to an unsigned long fixed it. There are a few places where that is being done, so I'll address those with some pre-patches as fixes. Thanks for verifying it was working for you. Tom > >> >> Or maybe there some difference in kernel config? Could you share yours? > > Yes, I'll send that to you off-list. > > Thanks, > Tom > >>
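For the record, a minimal illustration of the truncation Tom describes; the function names are made up, the real code lives in the SEV-SNP guest support patches, and the shift in question is by PAGE_SHIFT:

    /* BUG: with a 32-bit npages the shift is evaluated in 32 bits, so
     * 0x100000 << PAGE_SHIFT wraps to 0. */
    static unsigned long region_size_buggy(unsigned int npages)
    {
            return npages << PAGE_SHIFT;
    }

    /* Widening npages (or casting it before the shift) keeps the high bits:
     * 0x100000 pages now yields 0x100000000. */
    static unsigned long region_size_fixed(unsigned long npages)
    {
            return npages << PAGE_SHIFT;
    }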
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index e66f7aa3191d..444ba8f4bfa0 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -942,6 +942,14 @@ static inline bool is_page_hwpoison(struct page *page) #define PG_offline 0x00000100 #define PG_table 0x00000200 #define PG_guard 0x00000400 +#define PG_unaccepted 0x00000800 + +/* + * Page types allowed at page_expected_state() + * + * PageUnaccepted() will get cleared in post_alloc_hook(). + */ +#define PAGE_TYPES_EXPECTED PG_unaccepted #define PageType(page, flag) \ ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) @@ -967,6 +975,18 @@ static __always_inline void __ClearPage##uname(struct page *page) \ page->page_type |= PG_##lname; \ } +#define PAGE_TYPE_OPS_FALSE(uname) \ +static __always_inline int Page##uname(struct page *page) \ +{ \ + return false; \ +} \ +static __always_inline void __SetPage##uname(struct page *page) \ +{ \ +} \ +static __always_inline void __ClearPage##uname(struct page *page) \ +{ \ +} + /* * PageBuddy() indicates that the page is free and in the buddy system * (see mm/page_alloc.c). @@ -997,6 +1017,17 @@ PAGE_TYPE_OPS(Buddy, buddy) */ PAGE_TYPE_OPS(Offline, offline) +/* + * PageUnaccepted() indicates that the page has to be "accepted" before it can + * be read or written. The page allocator must call accept_page() before + * touching the page or returning it to the caller. + */ +#ifdef CONFIG_UNACCEPTED_MEMORY +PAGE_TYPE_OPS(Unaccepted, unaccepted) +#else +PAGE_TYPE_OPS_FALSE(Unaccepted) +#endif + extern void page_offline_freeze(void); extern void page_offline_thaw(void); extern void page_offline_begin(void); diff --git a/mm/internal.h b/mm/internal.h index c0f8fbe0445b..7c1a653edfc8 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -861,4 +861,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags); DECLARE_PER_CPU(struct per_cpu_nodestat, boot_nodestats); +#ifndef CONFIG_UNACCEPTED_MEMORY +static inline bool range_contains_unaccepted_memory(phys_addr_t start, + phys_addr_t end) +{ + return false; +} + +static inline void accept_memory(phys_addr_t start, phys_addr_t end) +{ +} +#endif + #endif /* __MM_INTERNAL_H */ diff --git a/mm/memblock.c b/mm/memblock.c index e4f03a6e8e56..a1f7f8b304d5 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1405,6 +1405,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, */ kmemleak_alloc_phys(found, size, 0, 0); + /* + * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, + * require memory to be accepted before it can be used by the + * guest. + * + * Accept the memory of the allocated buffer. + */ + accept_memory(found, found + size); + return found; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e008a3df0485..279c2746aaa8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -122,6 +122,12 @@ typedef int __bitwise fpi_t; */ #define FPI_SKIP_KASAN_POISON ((__force fpi_t)BIT(2)) +/* + * Check if the page needs to be marked as PageUnaccepted(). + * Used for the new pages added to the buddy allocator for the first time. + */ +#define FPI_UNACCEPTED_SLOWPATH ((__force fpi_t)BIT(3)) + /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ static DEFINE_MUTEX(pcp_batch_high_lock); #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8) @@ -993,6 +999,35 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, NULL) != NULL; } +/* + * Page acceptance can be very slow. Do not call under critical locks. 
+ */
+static void accept_page(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	int i;
+
+	if (!PageUnaccepted(page))
+		return;
+
+	accept_memory(start, start + (PAGE_SIZE << order));
+	__ClearPageUnaccepted(page);
+
+	/* Assert that there is no PageUnaccepted() on tail pages */
+	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+		for (i = 1; i < (1 << order); i++)
+			VM_BUG_ON_PAGE(PageUnaccepted(page + i), page + i);
+	}
+}
+
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+	phys_addr_t start = page_to_phys(page);
+	phys_addr_t end = start + (PAGE_SIZE << order);
+
+	return range_contains_unaccepted_memory(start, end);
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -1027,6 +1062,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long combined_pfn;
 	struct page *buddy;
 	bool to_tail;
+	bool page_needs_acceptance = false;
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
@@ -1038,6 +1074,11 @@ static inline void __free_one_page(struct page *page,
 	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
+	if (PageUnaccepted(page)) {
+		page_needs_acceptance = true;
+		__ClearPageUnaccepted(page);
+	}
+
 	while (order < MAX_ORDER - 1) {
 		if (compaction_capture(capc, page, order, migratetype)) {
 			__mod_zone_freepage_state(zone, -(1 << order),
@@ -1072,6 +1113,13 @@ static inline void __free_one_page(struct page *page,
 			clear_page_guard(zone, buddy, order, migratetype);
 		else
 			del_page_from_free_list(buddy, zone, order);
+
+		/* Mark page unaccepted if any of merged pages were unaccepted */
+		if (PageUnaccepted(buddy)) {
+			page_needs_acceptance = true;
+			__ClearPageUnaccepted(buddy);
+		}
+
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -1081,6 +1129,23 @@ static inline void __free_one_page(struct page *page,
 done_merging:
 	set_buddy_order(page, order);
 
+	/*
+	 * The page gets marked as PageUnaccepted() if any of merged-in pages
+	 * is PageUnaccepted().
+	 *
+	 * New pages, just being added to buddy allocator, do not have
+	 * PageUnaccepted() set. FPI_UNACCEPTED_SLOWPATH indicates that the
+	 * page is new and page_contains_unaccepted() check is required to
+	 * determinate if acceptance is required.
+	 *
+	 * Avoid calling page_contains_unaccepted() if it is known that the page
+	 * needs acceptance. It can be costly.
+	 */
+	if (!page_needs_acceptance && (fpi_flags & FPI_UNACCEPTED_SLOWPATH))
+		page_needs_acceptance = page_contains_unaccepted(page, order);
+	if (page_needs_acceptance)
+		__SetPageUnaccepted(page);
+
 	if (fpi_flags & FPI_TO_TAIL)
 		to_tail = true;
 	else if (is_shuffle_order(order))
@@ -1164,7 +1229,13 @@ int split_free_page(struct page *free_page,
 static inline bool page_expected_state(struct page *page,
 					unsigned long check_flags)
 {
-	if (unlikely(atomic_read(&page->_mapcount) != -1))
+	/*
+	 * The page must not be mapped to userspace and must not have
+	 * a PageType other than listed in PAGE_TYPES_EXPECTED.
+	 *
+	 * Note, bit cleared means the page type is set.
+	 */
+	if (unlikely((atomic_read(&page->_mapcount) | PAGE_TYPES_EXPECTED) != -1))
 		return false;
 
 	if (unlikely((unsigned long)page->mapping |
@@ -1669,7 +1740,9 @@ void __free_pages_core(struct page *page, unsigned int order)
 	 * Bypass PCP and place fresh pages right to the tail, primarily
 	 * relevant for memory onlining.
 	 */
-	__free_pages_ok(page, order, FPI_TO_TAIL | FPI_SKIP_KASAN_POISON);
+	__free_pages_ok(page, order,
+			FPI_TO_TAIL | FPI_SKIP_KASAN_POISON |
+			FPI_UNACCEPTED_SLOWPATH);
 }
 
 #ifdef CONFIG_NUMA
@@ -1822,6 +1895,9 @@ static void __init deferred_free_range(unsigned long pfn,
 		return;
 	}
 
+	/* Accept chunks smaller than page-block upfront */
+	accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
+
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if ((pfn & (pageblock_nr_pages - 1)) == 0)
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
@@ -2281,6 +2357,13 @@ static inline void expand(struct zone *zone, struct page *page,
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
+		/*
+		 * Transfer PageUnaccepted() to the newly split pages so
+		 * they can be accepted after dropping the zone lock.
+		 */
+		if (PageUnaccepted(page))
+			__SetPageUnaccepted(&page[size]);
+
 		add_to_free_list(&page[size], zone, high, migratetype);
 		set_buddy_order(&page[size], high);
 	}
@@ -2411,6 +2494,8 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	 */
 	kernel_unpoison_pages(page, 1 << order);
 
+	accept_page(page, order);
+
 	/*
 	 * As memory initialization might be integrated into KASAN,
 	 * KASAN unpoisoning and memory initializion code must be