Message ID: 20250204004038.1680123-1-jthoughton@google.com (mailing list archive)
Series: KVM: x86/mmu: Age sptes locklessly
On Tue, 04 Feb 2025 00:40:27 +0000, James Houghton wrote:
> By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> waiting for aging. This contention reduction improves guest performance
> and saves a significant amount of Google Cloud's CPU usage, and it has
> valuable improvements for ChromeOS, as Yu has mentioned previously[1].
>
> Please see v8[8] for some performance results using
> access_tracking_perf_test patched to use MGLRU.
>
> [...]

Applied to kvm-x86 mmu, thanks!

[01/11] KVM: Rename kvm_handle_hva_range()
        https://github.com/kvm-x86/linux/commit/374ccd63600b
[02/11] KVM: Add lockless memslot walk to KVM
        https://github.com/kvm-x86/linux/commit/aa34b811650c
[03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine
        https://github.com/kvm-x86/linux/commit/e29b74920e6f
[04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
        https://github.com/kvm-x86/linux/commit/b146a9b34aed
[05/11] KVM: x86/mmu: Rename spte_has_volatile_bits() to spte_needs_atomic_write()
        https://github.com/kvm-x86/linux/commit/61d65f2dc766
[06/11] KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young
        https://github.com/kvm-x86/linux/commit/e25c2332346f
[07/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0
        https://github.com/kvm-x86/linux/commit/8c403cf23119
[08/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock
        https://github.com/kvm-x86/linux/commit/9fb13ba6b5ff
[09/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
        https://github.com/kvm-x86/linux/commit/4834eaded91e
[10/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
        https://github.com/kvm-x86/linux/commit/bb6c7749ccee
[11/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns
        https://github.com/kvm-x86/linux/commit/af3b6a9eba48

--
https://github.com/kvm-x86/linux/tree/next
On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> waiting for aging. This contention reduction improves guest performance
> and saves a significant amount of Google Cloud's CPU usage, and it has
> valuable improvements for ChromeOS, as Yu has mentioned previously[1].
>
> Please see v8[8] for some performance results using
> access_tracking_perf_test patched to use MGLRU.
>
> Neither access_tracking_perf_test nor mmu_stress_test trigger any
> splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.

Hi, I have a question about this patch series and about the
access_tracking_perf_test:

Some time ago, I investigated a failure in access_tracking_perf_test which
shows up in our CI.

The root cause was that 'folio_clear_idle' doesn't clear the idle bit when
MGLRU is enabled, and overall I got the impression that MGLRU is not
compatible with idle page tracking.

I thought that this patch series and the 'mm: multi-gen LRU: Have secondary
MMUs participate in MM_WALK' patch series could address this, but the test
still fails.

For reference, the exact problem is:

1. Idle bits for guest memory under test are set via
   /sys/kernel/mm/page_idle/bitmap.

2. The guest dirties memory, which leads to A/D bits being set in the
   secondary mappings.

3. NUMA autobalancing code write-protects the guest memory. KVM in response
   evicts the SPTE mappings with the A/D bit set, and while doing so tells
   mm that the pages were accessed, using 'folio_mark_accessed' (via
   kvm_set_page_accessed (*)). But due to MGLRU the call doesn't clear the
   idle bit, so all traces of the guest access disappear and the kernel
   thinks that the page is still idle.

The root cause of this is that folio_mark_accessed doesn't do what it is
supposed to do.

Calling 'folio_clear_idle(folio);' in the MGLRU case in folio_mark_accessed()
will probably fix this, but I don't have enough confidence to say whether
that is all that is needed. If it is, I can send a patch.

This patch makes the test pass (but only on 6.12 kernels and below, see
below):

diff --git a/mm/swap.c b/mm/swap.c
index 59f30a981c6f..2013e1f4d572 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -460,7 +460,7 @@ void folio_mark_accessed(struct folio *folio)
 {
 	if (lru_gen_enabled()) {
 		folio_inc_refs(folio);
-		return;
+		goto clear_idle_bit;
 	}
 
 	if (!folio_test_referenced(folio)) {
@@ -485,6 +485,7 @@ void folio_mark_accessed(struct folio *folio)
 		folio_clear_referenced(folio);
 		workingset_activation(folio);
 	}
+clear_idle_bit:
 	if (folio_test_idle(folio))
 		folio_clear_idle(folio);
 }

To always reproduce this, it is best to use a patch that makes the test run
in a loop, like the one below (although the test fails without it as well).
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..829774e325fa 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -131,6 +131,7 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
 	uint64_t pages = vcpu_args->pages;
 	uint64_t page;
 	uint64_t still_idle = 0;
+	uint64_t failed_to_mark_idle = 0;
 	uint64_t no_pfn = 0;
 	int page_idle_fd;
 	int pagemap_fd;
@@ -160,6 +161,14 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
 		}
 
 		mark_page_idle(page_idle_fd, pfn);
+
+
+		if (!is_page_idle(page_idle_fd, pfn)) {
+			failed_to_mark_idle++;
+			continue;
+		}
+
+
 	}
 
 	/*
@@ -183,16 +192,15 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
 	 * explicitly flush the TLB when aging SPTEs.  As a result, more pages
 	 * are cached and the guest won't see the "idle" bit cleared.
 	 */
-	if (still_idle >= pages / 10) {
+	//if (still_idle >= pages / 10) {
 #ifdef __x86_64__
-		TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
-			    "vCPU%d: Too many pages still idle (%lu out of %lu)",
-			    vcpu_idx, still_idle, pages);
+		// TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
+		//	    "vCPU%d: Too many pages still idle (%lu out of %lu)",
+		//	    vcpu_idx, still_idle, pages);
 #endif
-	printf("WARNING: vCPU%d: Too many pages still idle (%lu out of %lu), "
-	       "this will affect performance results.\n",
-	       vcpu_idx, still_idle, pages);
-	}
+	printf("vCPU%d: idle pages: %lu out of %lu, failed to mark idle: %lu no pfn: %lu\n",
+	       vcpu_idx, still_idle, pages, failed_to_mark_idle, no_pfn);
+	//}
 
 	close(page_idle_fd);
 	close(pagemap_fd);
@@ -315,14 +323,16 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory");
 
 	/* As a control, read and write to the populated memory first. */
-	access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to populated memory");
-	access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
+	//access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to populated memory");
+	//access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
 
 	/* Repeat on memory that has been marked as idle. */
+again:
 	mark_memory_idle(vm, nr_vcpus);
 	access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
-	mark_memory_idle(vm, nr_vcpus);
-	access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
+	//mark_memory_idle(vm, nr_vcpus);
+	//access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
+	goto again;
 
 	memstress_join_vcpu_threads(nr_vcpus);
 	memstress_destroy_vm(vm);

With the above patch applied, you will notice after 4-6 iterations that the
number of still-idle pages soars:

Populating memory : 0.798882357s
vCPU0: idle pages: 0 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle : 3.003939277s
Writing to idle memory : 0.503653562s
vCPU0: idle pages: 0 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle : 3.060128175s
Writing to idle memory : 0.502705587s
vCPU0: idle pages: 2048 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle : 3.039294079s
Writing to idle memory : 0.092227612s
vCPU0: idle pages: 0 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle : 3.046216234s
Writing to idle memory : 0.295077724s
vCPU0: idle pages: 132558 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle : 2.711946690s
Writing to idle memory : 0.302882502s

...
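For reference, the mark_page_idle()/is_page_idle() helpers the test uses boil
down to 64-bit accesses to /sys/kernel/mm/page_idle/bitmap: bit (pfn % 64) of
the u64 at byte offset (pfn / 64) * 8 is the idle bit for that pfn. A minimal
sketch of that interface follows; the helper names are illustrative, not the
selftest's exact code, and error handling is elided:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Writing a set bit marks the page idle; the kernel also clears the
 * Accessed bit in the page tables, so a later access clears the idle bit.
 */
static void mark_pfn_idle(int page_idle_fd, uint64_t pfn)
{
	uint64_t bits = 1ULL << (pfn % 64);

	pwrite(page_idle_fd, &bits, sizeof(bits), (pfn / 64) * sizeof(bits));
}

/* Returns nonzero if the pfn is still idle, i.e. not accessed since it
 * was marked idle. */
static int is_pfn_idle(int page_idle_fd, uint64_t pfn)
{
	uint64_t bits = 0;

	pread(page_idle_fd, &bits, sizeof(bits), (pfn / 64) * sizeof(bits));
	return !!(bits & (1ULL << (pfn % 64)));
}

Both helpers take an fd open on /sys/kernel/mm/page_idle/bitmap.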
(*) It turns out that since kernel 6.13, the code that set the accessed bit
in the primary paging structures when a secondary mapping was zapped has been
*removed*. I bisected this to commit:

66bc627e7fee KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs

So now access_tracking_perf_test is broken regardless of MGLRU.

Any ideas on how to fix all this mess?

Best regards,
	Maxim Levitsky

>
> === Previous Versions ===
>
> Since v8[8]:
>  - Re-added the kvm_handle_hva_range helpers and applied Sean's
>    kvm_{handle -> age}_hva_range rename.
>  - Renamed spte_has_volatile_bits() to spte_needs_atomic_write() and
>    removed its Accessed bit check. Undid change to
>    tdp_mmu_spte_need_atomic_write().
>  - Renamed KVM_MMU_NOTIFIER_{YOUNG -> AGING}_LOCKLESS.
>  - cpu_relax(), lockdep, preempt_disable(), and locking fixups for
>    per-rmap lock (thanks Lai and Sean).
>  - Renamed kvm_{has -> may_have}_shadow_mmu_sptes().
>  - Rebased onto latest kvm/next, including changing
>    for_each_tdp_mmu_root_rcu to use `types`.
>  - Dropped MGLRU changes from access_tracking_perf_test.
>  - Picked up Acked-bys from Yu. (thank you!)
>
> Since v7[7]:
>  - Dropped MGLRU changes.
>  - Dropped DAMON cleanup.
>  - Dropped MMU notifier changes completely.
>  - Made shadow MMU aging *always* lockless, not just lockless when the
>    now-removed "fast_only" clear notifier was used.
>  - Given that the MGLRU changes no longer introduce a new MGLRU
>    capability, drop the new capability check from the selftest.
>  - Rebased on top of latest kvm-x86/next, including the x86 mmu changes
>    for marking pages as dirty.
>
> Since v6[6]:
>  - Rebased on top of kvm-x86/next and Sean's lockless rmap walking
>    changes.
>  - Removed HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY (thanks DavidM).
>  - Split up kvm_age_gfn() / kvm_test_age_gfn() optimizations (thanks
>    DavidM and Sean).
>  - Improved new MMU notifier documentation (thanks DavidH).
>  - Dropped arm64 locking change.
>  - No longer retry for CAS failure in TDP MMU non-A/D case (thanks
>    Sean).
>  - Added some R-bys and A-bys.
>
> Since v5[5]:
>  - Reworked test_clear_young_fast_only() into a new parameter for the
>    existing notifiers (thanks Sean).
>  - Added mmu_notifier.has_fast_aging to tell mm if calling fast-only
>    notifiers should be done.
>  - Added mm_has_fast_young_notifiers() to inform users if calling
>    fast-only notifier helpers is worthwhile (for look-around to use).
>  - Changed MGLRU to invoke a single notifier instead of two when
>    aging and doing look-around (thanks Yu).
>  - For KVM/x86, check indirect_shadow_pages > 0 instead of
>    kvm_memslots_have_rmaps() when collecting age information
>    (thanks Sean).
>  - For KVM/arm, some fixes from Oliver.
>  - Small fixes to access_tracking_perf_test.
>  - Added missing !MMU_NOTIFIER version of mmu_notifier_clear_young().
>
> Since v4[4]:
>  - Removed Kconfig that controlled when aging was enabled. Aging will
>    be done whenever the architecture supports it (thanks Yu).
>  - Added a new MMU notifier, test_clear_young_fast_only(), specifically
>    for MGLRU to use.
>  - Add kvm_fast_{test_,}age_gfn, implemented by x86.
>  - Fix locking for clear_flush_young().
>  - Added KVM_MMU_NOTIFIER_YOUNG_LOCKLESS to clean up locking changes
>    (thanks Sean).
>  - Fix WARN_ON and other cleanup for the arm64 locking changes
>    (thanks Oliver).
>
> Since v3[3]:
>  - Vastly simplified the series (thanks David). Removed mmu notifier
>    batching logic entirely.
>  - Cleaned up how locking is done for mmu_notifier_test/clear_young
>    (thanks David).
>  - Look-around is now only done when there are no secondary MMUs
>    subscribed to MMU notifiers.
>  - CONFIG_LRU_GEN_WALKS_SECONDARY_MMU has been added.
>  - Fixed the lockless implementation of kvm_{test,}age_gfn for x86
>    (thanks David).
>  - Added MGLRU functional and performance tests to
>    access_tracking_perf_test (thanks Axel).
>  - In v3, an mm would be completely ignored (for aging) if there was a
>    secondary MMU but support for secondary MMU walking was missing. Now,
>    missing secondary MMU walking support simply skips the notifier
>    calls (except for eviction).
>  - Added a sanity check that range->lockless and range->on_lock are
>    never both provided for the memslot walk.
>
> For the changes since v2[2], see v3.
>
> Based on latest kvm/next.
>
> [1]: https://lore.kernel.org/kvm/CAOUHufYS0XyLEf_V+q5SCW54Zy2aW5nL8CnSWreM8d1rX5NKYg@mail.gmail.com/
> [2]: https://lore.kernel.org/kvmarm/20230526234435.662652-1-yuzhao@google.com/
> [3]: https://lore.kernel.org/linux-mm/20240401232946.1837665-1-jthoughton@google.com/
> [4]: https://lore.kernel.org/linux-mm/20240529180510.2295118-1-jthoughton@google.com/
> [5]: https://lore.kernel.org/linux-mm/20240611002145.2078921-1-jthoughton@google.com/
> [6]: https://lore.kernel.org/linux-mm/20240724011037.3671523-1-jthoughton@google.com/
> [7]: https://lore.kernel.org/kvm/20240926013506.860253-1-jthoughton@google.com/
> [8]: https://lore.kernel.org/kvm/20241105184333.2305744-1-jthoughton@google.com/
>
> James Houghton (7):
>   KVM: Rename kvm_handle_hva_range()
>   KVM: Add lockless memslot walk to KVM
>   KVM: x86/mmu: Factor out spte atomic bit clearing routine
>   KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
>   KVM: x86/mmu: Rename spte_has_volatile_bits() to
>     spte_needs_atomic_write()
>   KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as
>     young
>   KVM: x86/mmu: Only check gfn age in shadow MMU if
>     indirect_shadow_pages > 0
>
> Sean Christopherson (4):
>   KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o
>     mmu_lock
>   KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of
>     mmu_lock
>   KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
>   KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging
>     gfns
>
>  Documentation/virt/kvm/locking.rst |   4 +-
>  arch/x86/include/asm/kvm_host.h    |   4 +-
>  arch/x86/kvm/Kconfig               |   1 +
>  arch/x86/kvm/mmu/mmu.c             | 364 +++++++++++++++++++++--------
>  arch/x86/kvm/mmu/spte.c            |  19 +-
>  arch/x86/kvm/mmu/spte.h            |   2 +-
>  arch/x86/kvm/mmu/tdp_iter.h        |  26 ++-
>  arch/x86/kvm/mmu/tdp_mmu.c         |  36 ++-
>  include/linux/kvm_host.h           |   1 +
>  virt/kvm/Kconfig                   |   2 +
>  virt/kvm/kvm_main.c                |  56 +++--
>  11 files changed, 364 insertions(+), 151 deletions(-)
>
>
> base-commit: f7bafceba76e9ab475b413578c1757ee18c3e44b
On Tue, Feb 18, 2025, Maxim Levitsky wrote:
> On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> > By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> > vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> > waiting for aging. This contention reduction improves guest performance
> > and saves a significant amount of Google Cloud's CPU usage, and it has
> > valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> >
> > Please see v8[8] for some performance results using
> > access_tracking_perf_test patched to use MGLRU.
> >
> > Neither access_tracking_perf_test nor mmu_stress_test trigger any
> > splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.
>
> Hi, I have a question about this patch series and about the
> access_tracking_perf_test:
>
> Some time ago, I investigated a failure in access_tracking_perf_test which
> shows up in our CI.
>
> The root cause was that 'folio_clear_idle' doesn't clear the idle bit when
> MGLRU is enabled, and overall I got the impression that MGLRU is not
> compatible with idle page tracking.
>
> I thought that this patch series and the 'mm: multi-gen LRU: Have secondary
> MMUs participate in MM_WALK' patch series could address this, but the test
> still fails.
>
> For reference, the exact problem is:
>
> 1. Idle bits for guest memory under test are set via
>    /sys/kernel/mm/page_idle/bitmap.
>
> 2. The guest dirties memory, which leads to A/D bits being set in the
>    secondary mappings.
>
> 3. NUMA autobalancing code write-protects the guest memory. KVM in response
>    evicts the SPTE mappings with the A/D bit set, and while doing so tells
>    mm that the pages were accessed, using 'folio_mark_accessed' (via
>    kvm_set_page_accessed (*)). But due to MGLRU the call doesn't clear the
>    idle bit, so all traces of the guest access disappear and the kernel
>    thinks that the page is still idle.
>
> The root cause of this is that folio_mark_accessed doesn't do what it is
> supposed to do.
>
> Calling 'folio_clear_idle(folio);' in the MGLRU case in folio_mark_accessed()
> will probably fix this, but I don't have enough confidence to say whether
> that is all that is needed. If it is, I can send a patch.

My understanding is that the behavior is deliberate.  Per Yu[1], page_idle/bitmap
effectively isn't supported by MGLRU.

[1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com

> This patch makes the test pass (but only on 6.12 kernels and below, see
> below):
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 59f30a981c6f..2013e1f4d572 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -460,7 +460,7 @@ void folio_mark_accessed(struct folio *folio)
>  {
>  	if (lru_gen_enabled()) {
>  		folio_inc_refs(folio);
> -		return;
> +		goto clear_idle_bit;
>  	}
>  
>  	if (!folio_test_referenced(folio)) {
> @@ -485,6 +485,7 @@ void folio_mark_accessed(struct folio *folio)
>  		folio_clear_referenced(folio);
>  		workingset_activation(folio);
>  	}
> +clear_idle_bit:
>  	if (folio_test_idle(folio))
>  		folio_clear_idle(folio);
>  }
>
> To always reproduce this, it is best to use a patch that makes the test run
> in a loop, like the one below (although the test fails without it as well).

..

> With the above patch applied, you will notice after 4-6 iterations that the
> number of still-idle pages soars:
>
> Populating memory : 0.798882357s

...
> vCPU0: idle pages: 132558 out of 262144, failed to mark idle: 0 no pfn: 0
> Mark memory idle : 2.711946690s
> Writing to idle memory : 0.302882502s
>
> ...
>
> (*) It turns out that since kernel 6.13, the code that set the accessed bit
> in the primary paging structures when a secondary mapping was zapped has been
> *removed*. I bisected this to commit:
>
> 66bc627e7fee KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs
>
> So now access_tracking_perf_test is broken regardless of MGLRU.

Just to confirm, do you see failures on 6.13 with MGLRU disabled?

> Any ideas on how to fix all this mess?

The easy answer is to skip the test if MGLRU is in use, or if NUMA balancing is
enabled.  In a real-world scenario, if the guest is actually accessing the pages
that get PROT_NONE'd by NUMA balancing, they will be marked accessed when they're
faulted back in.  There's a window where page_idle/bitmap could be read between
making the VMA PROT_NONE and re-accessing the page from the guest, but IMO that's
one of the many flaws of NUMA balancing.

That said, one thing is quite odd.  In the failing case, *half* of the guest pages
are still idle.  That's quite insane.

Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
different node, and that causes NUMA balancing to go crazy and zap pretty much
all of guest memory.  If that's what's happening, then a better solution for the
NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
pin it to a single pCPU?).
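For reference, detecting the two skip conditions suggested above amounts to
two one-line file reads. A minimal sketch, assuming the standard upstream
knob paths (/sys/kernel/mm/lru_gen/enabled and
/proc/sys/kernel/numa_balancing); the helper names are illustrative:

#include <stdio.h>
#include <stdlib.h>

/* Returns the first integer in @path, or 0 if the file doesn't exist. */
static long read_knob(const char *path)
{
	FILE *f = fopen(path, "r");
	char buf[32] = "0";

	if (f) {
		fgets(buf, sizeof(buf), f);
		fclose(f);
	}
	/* Base 0 so that lru_gen's hex bitmask (e.g. "0x0007") parses too. */
	return strtol(buf, NULL, 0);
}

static int mglru_enabled(void)
{
	return read_knob("/sys/kernel/mm/lru_gen/enabled") != 0;
}

static int numa_balancing_enabled(void)
{
	/* numa_balancing can be 0 (off), 1, or 2; nonzero means enabled. */
	return read_knob("/proc/sys/kernel/numa_balancing") != 0;
}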
On Tue, Feb 18, 2025 at 5:13 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Feb 18, 2025, Maxim Levitsky wrote:
> > On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> > > By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> > > vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> > > waiting for aging. This contention reduction improves guest performance
> > > and saves a significant amount of Google Cloud's CPU usage, and it has
> > > valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> > >
> > > Please see v8[8] for some performance results using
> > > access_tracking_perf_test patched to use MGLRU.
> > >
> > > Neither access_tracking_perf_test nor mmu_stress_test trigger any
> > > splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.
> >
> > Hi, I have a question about this patch series and about the
> > access_tracking_perf_test:
> >
> > Some time ago, I investigated a failure in access_tracking_perf_test which
> > shows up in our CI.
> >
> > The root cause was that 'folio_clear_idle' doesn't clear the idle bit when
> > MGLRU is enabled, and overall I got the impression that MGLRU is not
> > compatible with idle page tracking.
> >
> > I thought that this patch series and the 'mm: multi-gen LRU: Have secondary
> > MMUs participate in MM_WALK' patch series could address this, but the test
> > still fails.
> >
> > For reference, the exact problem is:
> >
> > 1. Idle bits for guest memory under test are set via
> >    /sys/kernel/mm/page_idle/bitmap.
> >
> > 2. The guest dirties memory, which leads to A/D bits being set in the
> >    secondary mappings.
> >
> > 3. NUMA autobalancing code write-protects the guest memory. KVM in response
> >    evicts the SPTE mappings with the A/D bit set, and while doing so tells
> >    mm that the pages were accessed, using 'folio_mark_accessed' (via
> >    kvm_set_page_accessed (*)). But due to MGLRU the call doesn't clear the
> >    idle bit, so all traces of the guest access disappear and the kernel
> >    thinks that the page is still idle.
> >
> > The root cause of this is that folio_mark_accessed doesn't do what it is
> > supposed to do.
> >
> > Calling 'folio_clear_idle(folio);' in the MGLRU case in folio_mark_accessed()
> > will probably fix this, but I don't have enough confidence to say whether
> > that is all that is needed. If it is, I can send a patch.
>
> My understanding is that the behavior is deliberate.  Per Yu[1], page_idle/bitmap
> effectively isn't supported by MGLRU.
>
> [1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com

Yu's suggestion was to look at the generation numbers themselves, and that
is exactly what my patched access_tracking_perf_test does[2]. :)

So I think to make this work with MGLRU, I'll re-post my
access_tracking_perf_test patch, but if MGLRU is enabled, always use the
MGLRU debugfs instead of using page_idle/bitmap. It needs some cleanup
first though.

[2]: https://lore.kernel.org/kvm/20241105184333.2305744-12-jthoughton@google.com/
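For reference, the MGLRU debugfs file mentioned here
(/sys/kernel/debug/lru_gen) prints, for each memcg and node, one line per
generation (generation number, age in ms, anon pages, file pages), so a test
can detect accesses by watching pages move between generations rather than by
reading idle bits. A rough parsing sketch; the format handling below is an
assumption based on the documented lru_gen debugfs layout, not code from the
re-posted patch:

#include <stdio.h>
#include <string.h>

/*
 * Returns the largest generation number listed for @memcg_path (e.g. "/"
 * for the root memcg) on @node, or -1 on failure.  Sketch only: single
 * pass, minimal validation.
 */
static long max_gen_nr(const char *memcg_path, int node)
{
	FILE *f = fopen("/sys/kernel/debug/lru_gen", "r");
	char line[256], path[200];
	int id, in_memcg = 0, in_node = 0;
	long gen, max_gen = -1;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, " memcg %d %199s", &id, path) == 2) {
			in_memcg = !strcmp(path, memcg_path);
			in_node = 0;
		} else if (sscanf(line, " node %d", &id) == 1) {
			in_node = in_memcg && id == node;
		} else if (in_node && sscanf(line, " %ld", &gen) == 1) {
			if (gen > max_gen)
				max_gen = gen;
		}
	}
	fclose(f);
	return max_gen;
}

Recording the generation before the access phase and checking that accessed
pages land in a newer generation afterwards would be the MGLRU analogue of
the idle-bit check.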
> > Any ideas on how to fix all this mess?
>
> The easy answer is to skip the test if MGLRU is in use, or if NUMA balancing is
> enabled.  In a real-world scenario, if the guest is actually accessing the pages
> that get PROT_NONE'd by NUMA balancing, they will be marked accessed when they're
> faulted back in.  There's a window where page_idle/bitmap could be read between
> making the VMA PROT_NONE and re-accessing the page from the guest, but IMO that's
> one of the many flaws of NUMA balancing.
>
> That said, one thing is quite odd.  In the failing case, *half* of the guest pages
> are still idle.  That's quite insane.
>
> Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> different node, and that causes NUMA balancing to go crazy and zap pretty much
> all of guest memory.  If that's what's happening, then a better solution for the
> NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> pin it to a single pCPU?).

+1 to this idea, if this is really what's going on.

If NUMA balancing is only migrating a few pages, the 90% threshold in the
test should be low enough that we tolerate the few pages that were moved.
Or we could just print a warning (instead of failing) if NUMA balancing is
enabled.
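For reference, the hard-pin variant suggested above is a one-liner with
sched_setaffinity(2). A minimal sketch; the CPU choice and helper name are
illustrative, and a real fix would pick a CPU on the node backing the guest
memory:

#define _GNU_SOURCE
#include <sched.h>

/*
 * Hard-pin the calling (vCPU) thread to a single pCPU so that it can never
 * be migrated to a different NUMA node.  Passing pid 0 to
 * sched_setaffinity() affects the calling thread.
 */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}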