
[v9,00/11] KVM: x86/mmu: Age sptes locklessly

Message ID: 20250204004038.1680123-1-jthoughton@google.com

Message

James Houghton Feb. 4, 2025, 12:40 a.m. UTC
By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
waiting for aging. This contention reduction improves guest performance
and saves a significant amount of Google Cloud's CPU usage, and it has
valuable improvements for ChromeOS, as Yu has mentioned previously[1].
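
To illustrate what "aging locklessly" means in practice, here is a minimal
userspace sketch of the core idea: test and clear an Accessed bit with a
compare-and-swap loop rather than under mmu_lock held for write. The SPTE
layout and function name below are invented for the example; this is not the
actual KVM code in this series.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: a made-up "SPTE" with an Accessed bit, aged via CAS. */
#define SPTE_ACCESSED (1ull << 5)

static bool test_and_clear_young(_Atomic uint64_t *sptep)
{
        uint64_t old = atomic_load_explicit(sptep, memory_order_relaxed);

        while (old & SPTE_ACCESSED) {
                /*
                 * On failure the CAS reloads 'old' with the current value
                 * (a concurrent fault/zap may have changed the SPTE), and
                 * the loop retries from that value.
                 */
                if (atomic_compare_exchange_weak(sptep, &old,
                                                 old & ~SPTE_ACCESSED))
                        return true;    /* was young, Accessed bit now clear */
        }
        return false;                   /* was not young */
}

int main(void)
{
        _Atomic uint64_t spte = 0xabc0 | SPTE_ACCESSED;

        printf("young: %d\n", test_and_clear_young(&spte));    /* 1 */
        printf("young: %d\n", test_and_clear_young(&spte));    /* 0 */
        return 0;
}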

Please see v8[8] for some performance results using
access_tracking_perf_test patched to use MGLRU.

Neither access_tracking_perf_test nor mmu_stress_test trigger any
splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.

=== Previous Versions ===

Since v8[8]:
 - Re-added the kvm_handle_hva_range helpers and applied Sean's
   kvm_{handle -> age}_hva_range rename.
 - Renamed spte_has_volatile_bits() to spte_needs_atomic_write() and
   removed its Accessed bit check. Undid change to
   tdp_mmu_spte_need_atomic_write().
 - Renamed KVM_MMU_NOTIFIER_{YOUNG -> AGING}_LOCKLESS.
 - cpu_relax(), lockdep, preempt_disable(), and locking fixups for
   per-rmap lock (thanks Lai and Sean).
 - Renamed kvm_{has -> may_have}_shadow_mmu_sptes().
 - Rebased onto latest kvm/next, including changing
   for_each_tdp_mmu_root_rcu to use `types`.
 - Dropped MGLRU changes from access_tracking_perf_test.
 - Picked up Acked-bys from Yu. (thank you!)

Since v7[7]:
 - Dropped MGLRU changes.
 - Dropped DAMON cleanup.
 - Dropped MMU notifier changes completely.
 - Made shadow MMU aging *always* lockless, not just lockless when the
   now-removed "fast_only" clear notifier was used.
 - Dropped the new capability check from the selftest, since the MGLRU
   changes no longer introduce a new MGLRU capability.
 - Rebased on top of latest kvm-x86/next, including the x86 mmu changes
   for marking pages as dirty.

Since v6[6]:
 - Rebased on top of kvm-x86/next and Sean's lockless rmap walking
   changes.
 - Removed HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY (thanks DavidM).
 - Split up kvm_age_gfn() / kvm_test_age_gfn() optimizations (thanks
   DavidM and Sean).
 - Improved new MMU notifier documentation (thanks DavidH).
 - Dropped arm64 locking change.
 - No longer retry for CAS failure in TDP MMU non-A/D case (thanks
   Sean).
 - Added some R-bys and A-bys.

Since v5[5]:
 - Reworked test_clear_young_fast_only() into a new parameter for the
   existing notifiers (thanks Sean).
 - Added mmu_notifier.has_fast_aging to tell mm if calling fast-only
   notifiers should be done.
 - Added mm_has_fast_young_notifiers() to inform users if calling
   fast-only notifier helpers is worthwhile (for look-around to use).
 - Changed MGLRU to invoke a single notifier instead of two when
   aging and doing look-around (thanks Yu).
 - For KVM/x86, check indirect_shadow_pages > 0 instead of
   kvm_memslots_have_rmaps() when collecting age information
   (thanks Sean).
 - For KVM/arm, some fixes from Oliver.
 - Small fixes to access_tracking_perf_test.
 - Added missing !MMU_NOTIFIER version of mmu_notifier_clear_young().

Since v4[4]:
 - Removed Kconfig that controlled when aging was enabled. Aging will
   be done whenever the architecture supports it (thanks Yu).
 - Added a new MMU notifier, test_clear_young_fast_only(), specifically
   for MGLRU to use.
 - Added kvm_fast_{test_,}age_gfn, implemented by x86.
 - Fixed locking for clear_flush_young().
 - Added KVM_MMU_NOTIFIER_YOUNG_LOCKLESS to clean up locking changes
   (thanks Sean).
 - Fixed a WARN_ON and did other cleanup for the arm64 locking changes
   (thanks Oliver).

Since v3[3]:
 - Vastly simplified the series (thanks David). Removed mmu notifier
   batching logic entirely.
 - Cleaned up how locking is done for mmu_notifier_test/clear_young
   (thanks David).
 - Look-around is now only done when there are no secondary MMUs
   subscribed to MMU notifiers.
 - CONFIG_LRU_GEN_WALKS_SECONDARY_MMU has been added.
 - Fixed the lockless implementation of kvm_{test,}age_gfn for x86
   (thanks David).
 - Added MGLRU functional and performance tests to
   access_tracking_perf_test (thanks Axel).
 - In v3, an mm would be completely ignored (for aging) if there was a
   secondary MMU but support for secondary MMU walking was missing. Now,
   missing secondary MMU walking support simply skips the notifier
   calls (except for eviction).
 - Added a sanity check that range->lockless and range->on_lock are
   never both provided for the memslot walk.

For the changes since v2[2], see v3.

Based on latest kvm/next.

[1]: https://lore.kernel.org/kvm/CAOUHufYS0XyLEf_V+q5SCW54Zy2aW5nL8CnSWreM8d1rX5NKYg@mail.gmail.com/
[2]: https://lore.kernel.org/kvmarm/20230526234435.662652-1-yuzhao@google.com/
[3]: https://lore.kernel.org/linux-mm/20240401232946.1837665-1-jthoughton@google.com/
[4]: https://lore.kernel.org/linux-mm/20240529180510.2295118-1-jthoughton@google.com/
[5]: https://lore.kernel.org/linux-mm/20240611002145.2078921-1-jthoughton@google.com/
[6]: https://lore.kernel.org/linux-mm/20240724011037.3671523-1-jthoughton@google.com/
[7]: https://lore.kernel.org/kvm/20240926013506.860253-1-jthoughton@google.com/
[8]: https://lore.kernel.org/kvm/20241105184333.2305744-1-jthoughton@google.com/

James Houghton (7):
  KVM: Rename kvm_handle_hva_range()
  KVM: Add lockless memslot walk to KVM
  KVM: x86/mmu: Factor out spte atomic bit clearing routine
  KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
  KVM: x86/mmu: Rename spte_has_volatile_bits() to
    spte_needs_atomic_write()
  KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as
    young
  KVM: x86/mmu: Only check gfn age in shadow MMU if
    indirect_shadow_pages > 0

Sean Christopherson (4):
  KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o
    mmu_lock
  KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of
    mmu_lock
  KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
  KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging
    gfns

 Documentation/virt/kvm/locking.rst |   4 +-
 arch/x86/include/asm/kvm_host.h    |   4 +-
 arch/x86/kvm/Kconfig               |   1 +
 arch/x86/kvm/mmu/mmu.c             | 364 +++++++++++++++++++++--------
 arch/x86/kvm/mmu/spte.c            |  19 +-
 arch/x86/kvm/mmu/spte.h            |   2 +-
 arch/x86/kvm/mmu/tdp_iter.h        |  26 ++-
 arch/x86/kvm/mmu/tdp_mmu.c         |  36 ++-
 include/linux/kvm_host.h           |   1 +
 virt/kvm/Kconfig                   |   2 +
 virt/kvm/kvm_main.c                |  56 +++--
 11 files changed, 364 insertions(+), 151 deletions(-)


base-commit: f7bafceba76e9ab475b413578c1757ee18c3e44b

Comments

Sean Christopherson Feb. 15, 2025, 12:50 a.m. UTC | #1
On Tue, 04 Feb 2025 00:40:27 +0000, James Houghton wrote:
> By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> waiting for aging. This contention reduction improves guest performance
> and saves a significant amount of Google Cloud's CPU usage, and it has
> valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> 
> Please see v8[8] for some performance results using
> access_tracking_perf_test patched to use MGLRU.
> 
> [...]

Applied to kvm-x86 mmu, thanks!

[01/11] KVM: Rename kvm_handle_hva_range()
        https://github.com/kvm-x86/linux/commit/374ccd63600b
[02/11] KVM: Add lockless memslot walk to KVM
        https://github.com/kvm-x86/linux/commit/aa34b811650c
[03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine
        https://github.com/kvm-x86/linux/commit/e29b74920e6f
[04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
        https://github.com/kvm-x86/linux/commit/b146a9b34aed
[05/11] KVM: x86/mmu: Rename spte_has_volatile_bits() to spte_needs_atomic_write()
        https://github.com/kvm-x86/linux/commit/61d65f2dc766
[06/11] KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young
        https://github.com/kvm-x86/linux/commit/e25c2332346f
[07/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0
        https://github.com/kvm-x86/linux/commit/8c403cf23119
[08/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock
        https://github.com/kvm-x86/linux/commit/9fb13ba6b5ff
[09/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
        https://github.com/kvm-x86/linux/commit/4834eaded91e
[10/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
        https://github.com/kvm-x86/linux/commit/bb6c7749ccee
[11/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns
        https://github.com/kvm-x86/linux/commit/af3b6a9eba48

--
https://github.com/kvm-x86/linux/tree/next
Maxim Levitsky Feb. 18, 2025, 7:29 p.m. UTC | #2
On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> waiting for aging. This contention reduction improves guest performance
> and saves a significant amount of Google Cloud's CPU usage, and it has
> valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> 
> Please see v8[8] for some performance results using
> access_tracking_perf_test patched to use MGLRU.
> 
> Neither access_tracking_perf_test nor mmu_stress_test trigger any
> splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.


Hi, I have a question about this patch series and about the access_tracking_perf_test:

Some time ago, I investigated a failure in access_tracking_perf_test which shows up in our CI.

The root cause was that 'folio_clear_idle' doesn't clear the idle bit when MGLRU is enabled,
and overall I got the impression that MGLRU is not compatible with idle page tracking.


I thought that this patch series and the 'mm: multi-gen LRU: Have secondary MMUs participate in MM_WALK' 
patch series could address this but the test still fails.


For reference, the exact problem is:

1. Idle bits for guest memory under test are set via /sys/kernel/mm/page_idle/bitmap

2. Guest dirties memory, which leads to A/D bits being set in the secondary mappings.

3. The NUMA autobalancing code write-protects the guest memory. In response, KVM evicts the SPTE mappings with the A/D bits set,
   and while doing so tells mm that the pages were accessed using 'folio_mark_accessed' (via kvm_set_page_accessed (*)),
   but due to MGLRU the call doesn't clear the idle bit, so all traces of the guest access disappear
   and the kernel thinks that the pages are still idle.
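
For reference, the page_idle bitmap mentioned in step 1 is a file of 64-bit
words, one bit per pfn. A minimal sketch of the mark/check helpers, similar in
spirit to the selftest's mark_page_idle()/is_page_idle() but simplified and
without error handling:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Each 8-byte word of /sys/kernel/mm/page_idle/bitmap covers 64 pfns.
 * Writing a set bit marks that page idle; reading it back reports whether
 * the page is still considered idle (i.e. not accessed since it was marked).
 */
static void mark_page_idle(int page_idle_fd, uint64_t pfn)
{
        uint64_t bit = 1ull << (pfn % 64);

        pwrite(page_idle_fd, &bit, sizeof(bit), (pfn / 64) * sizeof(bit));
}

static int is_page_idle(int page_idle_fd, uint64_t pfn)
{
        uint64_t word = 0;

        pread(page_idle_fd, &word, sizeof(word), (pfn / 64) * sizeof(word));
        return !!(word & (1ull << (pfn % 64)));
}

/* Usage: int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR); */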

I can say that the root cause of this is that folio_mark_accessed doesn't do what it is supposed to do.

Calling 'folio_clear_idle(folio);' in the MGLRU case in folio_mark_accessed()
will probably fix this, but I don't have enough confidence
to say whether that is all that is needed to fix this.
If it is, I can send a patch.


This patch makes the test pass (but only on kernel 6.12 and below; see below):

diff --git a/mm/swap.c b/mm/swap.c
index 59f30a981c6f..2013e1f4d572 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -460,7 +460,7 @@ void folio_mark_accessed(struct folio *folio)
 {
        if (lru_gen_enabled()) {
                folio_inc_refs(folio);
-               return;
+               goto clear_idle_bit;
        }
 
        if (!folio_test_referenced(folio)) {
@@ -485,6 +485,7 @@ void folio_mark_accessed(struct folio *folio)
                folio_clear_referenced(folio);
                workingset_activation(folio);
        }
+clear_idle_bit:
        if (folio_test_idle(folio))
                folio_clear_idle(folio);
 }


To reproduce this reliably, it is best to patch the test to run in a loop,
as below (although the test fails without this as well).


diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..829774e325fa 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -131,6 +131,7 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
        uint64_t pages = vcpu_args->pages;
        uint64_t page;
        uint64_t still_idle = 0;
+       uint64_t failed_to_mark_idle = 0;
        uint64_t no_pfn = 0;
        int page_idle_fd;
        int pagemap_fd;
@@ -160,6 +161,14 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
                }
 
                mark_page_idle(page_idle_fd, pfn);
+
+
+                if (!is_page_idle(page_idle_fd, pfn)) {
+                        failed_to_mark_idle++;
+                        continue;
+                }
+
+
        }
 
        /*
@@ -183,16 +192,15 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
         * explicitly flush the TLB when aging SPTEs.  As a result, more pages
         * are cached and the guest won't see the "idle" bit cleared.
         */
-       if (still_idle >= pages / 10) {
+       //if (still_idle >= pages / 10) {
 #ifdef __x86_64__
-               TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
-                           "vCPU%d: Too many pages still idle (%lu out of %lu)",
-                           vcpu_idx, still_idle, pages);
+       //      TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
+       //                  "vCPU%d: Too many pages still idle (%lu out of %lu)",
+       //                  vcpu_idx, still_idle, pages);
 #endif
-               printf("WARNING: vCPU%d: Too many pages still idle (%lu out of %lu), "
-                      "this will affect performance results.\n",
-                      vcpu_idx, still_idle, pages);
-       }
+               printf("vCPU%d: idle pages: %lu out of %lu, failed to mark idle: %lu no pfn: %lu\n",
+                      vcpu_idx, still_idle, pages, failed_to_mark_idle, no_pfn);
+       //}
 
        close(page_idle_fd);
        close(pagemap_fd);
@@ -315,14 +323,16 @@ static void run_test(enum vm_guest_mode mode, void *arg)
        access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory");
 
        /* As a control, read and write to the populated memory first. */
-       access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to populated memory");
-       access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
+       //access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to populated memory");
+       //access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory");
 
        /* Repeat on memory that has been marked as idle. */
+again:
        mark_memory_idle(vm, nr_vcpus);
        access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory");
-       mark_memory_idle(vm, nr_vcpus);
-       access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
+       //mark_memory_idle(vm, nr_vcpus);
+       //access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory");
+       goto again;
 
        memstress_join_vcpu_threads(nr_vcpus);
        memstress_destroy_vm(vm);


With the above patch applied, you will notice after 4-6 iterations that the number of still idle
pages soars:

Populating memory             : 0.798882357s
vCPU0: idle pages: 0 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle              : 3.003939277s
Writing to idle memory        : 0.503653562s
vCPU0: idle pages: 0 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle              : 3.060128175s
Writing to idle memory        : 0.502705587s
vCPU0: idle pages: 2048 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle              : 3.039294079s
Writing to idle memory        : 0.092227612s
vCPU0: idle pages: 0 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle              : 3.046216234s
Writing to idle memory        : 0.295077724s
vCPU0: idle pages: 132558 out of 262144, failed to mark idle: 0 no pfn: 0
Mark memory idle              : 2.711946690s
Writing to idle memory        : 0.302882502s

...



(*) It turns out that since kernel 6.13, the code that sets the accessed bit in the primary paging
structure when the secondary mapping is zapped was *removed*. I bisected this to commit:

66bc627e7fee KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs

So now the access_tracking_perf_test is broken regardless of MGLRU.

Any ideas on how to fix all this mess?


Best regards,
	Maxim Levitsky


> [...]
Sean Christopherson Feb. 19, 2025, 1:13 a.m. UTC | #3
On Tue, Feb 18, 2025, Maxim Levitsky wrote:
> On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> > By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> > vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> > waiting for aging. This contention reduction improves guest performance
> > and saves a significant amount of Google Cloud's CPU usage, and it has
> > valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> > 
> > Please see v8[8] for some performance results using
> > access_tracking_perf_test patched to use MGLRU.
> > 
> > Neither access_tracking_perf_test nor mmu_stress_test trigger any
> > splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.
> 
> 
> Hi, I have a question about this patch series and about the
> access_tracking_perf_test:
> 
> Some time ago, I investigated a failure in access_tracking_perf_test which
> shows up in our CI.
> 
> The root cause was that 'folio_clear_idle' doesn't clear the idle bit when
> MGLRU is enabled, and overall I got the impression that MGLRU is not
> compatible with idle page tracking.
>
> I thought that this patch series and the 'mm: multi-gen LRU: Have secondary
> MMUs participate in MM_WALK' patch series could address this but the test
> still fails.
> 
> 
> For the reference the exact problem is:
> 
> 1. Idle bits for guest memory under test are set via /sys/kernel/mm/page_idle/bitmap
> 
> 2. Guest dirties memory, which leads to A/D bits being set in the secondary mappings.
> 
> 3. A NUMA autobalance code write protects the guest memory. KVM in response
>    evicts the SPTE mappings with A/D bit set, and while doing so tells mm
>    that pages were accessed using 'folio_mark_accessed' (via kvm_set_page_accessed (*) )
>    but due to MGLRU the call doesn't clear the idle bit and thus all the traces
>    of the guest access disappear and the kernel thinks that the page is still idle.
> 
> I can say that the root cause of this is that folio_mark_accessed doesn't do
> what it supposed to do.
> 
> Calling 'folio_clear_idle(folio);' in MGLRU case in folio_mark_accessed()
> will probably fix this but I don't have enough confidence to say if this is
> all that is needed to fix this.  If this is the case I can send a patch.

My understanding is that the behavior is deliberate.  Per Yu[1], page_idle/bitmap
effectively isn't supported by MGLRU.

[1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com

> This patch makes the test pass (but only on 6.12 kernel and below, see below):
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 59f30a981c6f..2013e1f4d572 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -460,7 +460,7 @@ void folio_mark_accessed(struct folio *folio)
>  {
>         if (lru_gen_enabled()) {
>                 folio_inc_refs(folio);
> -               return;
> +               goto clear_idle_bit;
>         }
>  
>         if (!folio_test_referenced(folio)) {
> @@ -485,6 +485,7 @@ void folio_mark_accessed(struct folio *folio)
>                 folio_clear_referenced(folio);
>                 workingset_activation(folio);
>         }
> +clear_idle_bit:
>         if (folio_test_idle(folio))
>                 folio_clear_idle(folio);
>  }
> 
> 
> To always reproduce this, it is best to use a patch to make the test run in a
> loop, like below (although the test fails without this as well).

..

> With the above patch applied, you will notice after 4-6 iterations that the
> number of still idle pages soars:
> 
> Populating memory             : 0.798882357s

...

> vCPU0: idle pages: 132558 out of 262144, failed to mark idle: 0 no pfn: 0
> Mark memory idle              : 2.711946690s
> Writing to idle memory        : 0.302882502s
> 
> ...
> 
> (*) Turns out that since kernel 6.13, this code that sets accessed bit in the
> primary paging structure, when the secondary was zapped was *removed*. I
> bisected this to commit:
> 
> 66bc627e7fee KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs
> 
> So now the access_tracking_test is broken regardless of MGLRU.

Just to confirm, do you see failures on 6.13 with MGLRU disabled?  

> Any ideas on how to fix all this mess?

The easy answer is to skip the test if MGLRU is in use, or if NUMA balancing is
enabled.  In a real-world scenario, if the guest is actually accessing the pages
that get PROT_NONE'd by NUMA balancing, they will be marked accessed when they're
faulted back in.  There's a window where page_idle/bitmap could be read between
making the VMA PROT_NONE and re-accessing the page from the guest, but IMO that's
one of the many flaws of NUMA balancing.
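
A minimal sketch of what such a skip could look like; the knobs below are the
standard MGLRU and NUMA-balancing ones, but the helper itself is hypothetical,
not something the test has today:

#include <stdio.h>
#include <stdlib.h>

static int read_hex_ulong(const char *path, unsigned long *val)
{
        FILE *f = fopen(path, "r");

        if (!f)
                return -1;
        if (fscanf(f, "%lx", val) != 1)
                *val = 0;
        fclose(f);
        return 0;
}

/*
 * Hypothetical helper: skip the page_idle-based checks when MGLRU or NUMA
 * autobalancing would make the idle bits unreliable.
 */
static void skip_if_idle_bitmap_unreliable(void)
{
        unsigned long caps = 0, numa_balancing = 0;

        /* Non-zero caps in lru_gen/enabled mean MGLRU is enabled. */
        if (!read_hex_ulong("/sys/kernel/mm/lru_gen/enabled", &caps) && caps) {
                puts("SKIP: MGLRU enabled; page_idle/bitmap is not reliable");
                exit(4);                /* KSFT_SKIP */
        }

        /* NUMA balancing can zap secondary mappings behind the test's back. */
        if (!read_hex_ulong("/proc/sys/kernel/numa_balancing",
                            &numa_balancing) && numa_balancing) {
                puts("SKIP: NUMA balancing enabled; results may be skewed");
                exit(4);                /* KSFT_SKIP */
        }
}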

That said, one thing is quite odd.  In the failing case, *half* of the guest pages
are still idle.  That's quite insane.

Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
different node, and that causes NUMA balancing to go crazy and zap pretty much
all of guest memory.  If that's what's happening, then a better solution for the
NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
pin it to a single pCPU?).
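
For example, hard-pinning the vCPU thread only takes a few lines with pthread
affinity; a sketch (which CPU to pick and where to call this are placeholders,
not the test's actual plumbing):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch: pin the calling (vCPU) thread to a single pCPU so NUMA balancing
 * has no reason to migrate its memory to another node mid-test.
 */
static void pin_self_to_cpu(int cpu)
{
        cpu_set_t set;
        int ret;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        ret = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (ret)
                fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(ret));
}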
James Houghton Feb. 19, 2025, 6:56 p.m. UTC | #4
On Tue, Feb 18, 2025 at 5:13 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Feb 18, 2025, Maxim Levitsky wrote:
> > On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> > > By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> > > vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> > > waiting for aging. This contention reduction improves guest performance
> > > and saves a significant amount of Google Cloud's CPU usage, and it has
> > > valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> > >
> > > Please see v8[8] for some performance results using
> > > access_tracking_perf_test patched to use MGLRU.
> > >
> > > Neither access_tracking_perf_test nor mmu_stress_test trigger any
> > > splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.
> >
> >
> > Hi, I have a question about this patch series and about the
> > access_tracking_perf_test:
> >
> > Some time ago, I investigated a failure in access_tracking_perf_test which
> > shows up in our CI.
> >
> > The root cause was that 'folio_clear_idle' doesn't clear the idle bit when
> > MGLRU is enabled, and overall I got the impression that MGLRU is not
> > compatible with idle page tracking.
> >
> > I thought that this patch series and the 'mm: multi-gen LRU: Have secondary
> > MMUs participate in MM_WALK' patch series could address this but the test
> > still fails.
> >
> >
> > For the reference the exact problem is:
> >
> > 1. Idle bits for guest memory under test are set via /sys/kernel/mm/page_idle/bitmap
> >
> > 2. Guest dirties memory, which leads to A/D bits being set in the secondary mappings.
> >
> > 3. A NUMA autobalance code write protects the guest memory. KVM in response
> >    evicts the SPTE mappings with A/D bit set, and while doing so tells mm
> >    that pages were accessed using 'folio_mark_accessed' (via kvm_set_page_accessed (*) )
> >    but due to MGLRU the call doesn't clear the idle bit and thus all the traces
> >    of the guest access disappear and the kernel thinks that the page is still idle.
> >
> > I can say that the root cause of this is that folio_mark_accessed doesn't do
> > what it supposed to do.
> >
> > Calling 'folio_clear_idle(folio);' in MGLRU case in folio_mark_accessed()
> > will probably fix this but I don't have enough confidence to say if this is
> > all that is needed to fix this.  If this is the case I can send a patch.
>
> My understanding is that the behavior is deliberate.  Per Yu[1], page_idle/bitmap
> effectively isn't supported by MGLRU.
>
> [1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com

Yu's suggestion was to look at the generation numbers themselves, and
that is exactly what my patched access_tracking_perf_test does[2]. :)

So I think to make this work with MGLRU, I'll re-post my
access_tracking_perf_test patch, but if MGLRU is enabled, always use
the MGLRU debugfs instead of using page_idle/bitmap. It needs some
cleanup first though.

[2]: https://lore.kernel.org/kvm/20241105184333.2305744-12-jthoughton@google.com/
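
Roughly the idea, assuming the debugfs format documented in
Documentation/admin-guide/mm/multigen_lru.rst (per-memcg, per-node lines of
"gen_nr age_ms nr_anon nr_file"); this is a simplified sketch, not the actual
selftest patch:

#include <stdio.h>
#include <string.h>

/*
 * Sketch: return the newest generation number listed in
 * /sys/kernel/debug/lru_gen for the memcg whose path contains @memcg_path.
 * The patched test compares generation numbers instead of idle bits:
 * accessed memory lands in a newer generation after aging.
 */
static long lru_gen_max_gen(const char *memcg_path)
{
        char line[256];
        long gen, max_gen = -1;
        int in_memcg = 0;
        FILE *f = fopen("/sys/kernel/debug/lru_gen", "r");

        if (!f)
                return -1;

        while (fgets(line, sizeof(line), f)) {
                if (!strncmp(line, "memcg", 5))
                        in_memcg = strstr(line, memcg_path) != NULL;
                else if (in_memcg && sscanf(line, " %ld", &gen) == 1 &&
                         gen > max_gen)
                        max_gen = gen;
        }
        fclose(f);
        return max_gen;
}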

> > Any ideas on how to fix all this mess?
>
> The easy answer is to skip the test if MGLRU is in use, or if NUMA balancing is
> enabled.  In a real-world scenario, if the guest is actually accessing the pages
> that get PROT_NONE'd by NUMA balancing, they will be marked accessed when they're
> faulted back in.  There's a window where page_idle/bitmap could be read between
> making the VMA PROT_NONE and re-accessing the page from the guest, but IMO that's
> one of the many flaws of NUMA balancing.
>
> That said, one thing is quite odd.  In the failing case, *half* of the guest pages
> are still idle.  That's quite insane.
>
> Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
> different node, and that causes NUMA balancing to go crazy and zap pretty much
> all of guest memory.  If that's what's happening, then a better solution for the
> NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
> pin it to a single pCPU?).

+1 to this idea, if this is really what's going on. If NUMA balancing
is only migrating a few pages, the 90% threshold in the test should be
low enough that we tolerate the few pages that were moved.

Or we could just print a warning (instead of failing) if NUMA balancing is enabled.