[v7,00/18] mm: multi-gen LRU: Walk secondary MMU page tables while aging

Message ID 20240926013506.860253-1-jthoughton@google.com (mailing list archive)

Message

James Houghton Sept. 26, 2024, 1:34 a.m. UTC
This patchset makes it possible for MGLRU to consult secondary MMUs
while doing aging, not just during eviction. This allows for more
accurate reclaim decisions, which is especially important for proactive
reclaim.

This series includes:
1. Cleanup and support for lockless memslot walks in KVM (patches
   1-2).
2. Support for lockless aging for x86 TDP MMU (patches 3-4).
3. Further small optimizations (patches 5-6).
4. Support for lockless harvesting of access information for the x86
   shadow MMU (patches 7-10).
5. Some mm cleanup (patch 11).
6. Add fast-only aging MMU notifiers (patches 12-13).
7. Support fast-only aging in KVM/x86 (patches 14-16).
8. Have KVM participate in MGLRU aging (patch 17).
9. Updates to the access_tracking_perf_test to verify MGLRU
   functionality (patch 18).
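
As a rough illustration of what "lockless aging" in item 2 means, here is
a minimal userspace sketch using C11 atomics. It is not code from the
series: TOY_ACCESSED_BIT and toy_age_entry() are hypothetical names, and
the bit position is arbitrary rather than the real SPTE layout. The idea
is only that the Accessed bit can be tested and cleared in one atomic
operation, without holding a lock such as mmu_lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical Accessed-bit position; not the real SPTE layout. */
#define TOY_ACCESSED_BIT (UINT64_C(1) << 5)

/*
 * Atomically clear the Accessed bit of one entry and report whether it
 * was set. No lock is taken; a concurrent update to other bits of the
 * entry is preserved because only the Accessed bit is masked off.
 */
static bool toy_age_entry(_Atomic uint64_t *entry)
{
	uint64_t old = atomic_fetch_and(entry, ~TOY_ACCESSED_BIT);

	return (old & TOY_ACCESSED_BIT) != 0;
}
```

Calling toy_age_entry() twice on the same entry with no access in
between reports "young" the first time and "old" the second time.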

Patches 1-10 are pure optimizations and could be applied without the
rest of the series, though the lockless shadow MMU patches become more
useful in the context of MGLRU aging.

Please note that mmu_notifier_test_young_fast_only() is added but not
used in this series. I am happy to remove it if that would be
appropriate.

The fast-only notifiers serve a particular purpose: for aging, we
neither want to delay other operations (e.g. unmapping for eviction)
nor do we want to be delayed by these other operations ourselves. By
default, the implementations of test_young() and clear_young() are meant
to be *accurate*, not fast. The fast-only notifiers will only give age
information that can be gathered fast.
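
To make the accurate-versus-fast distinction concrete, here is a toy
model in plain C. It is only a sketch under stated assumptions, not the
notifier implementation: struct toy_spte, its needs_lock field, and
toy_test_young() are all hypothetical. With fast_only set, entries whose
age cannot be read cheaply are skipped instead of waited for, so the
answer may under-report young pages.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one secondary-MMU mapping; not a KVM structure. */
struct toy_spte {
	bool accessed;   /* Accessed bit as hardware would have set it */
	bool needs_lock; /* true if reading it requires the slow, locked path */
};

/*
 * Toy test_young(): returns true if any entry is known to be young.
 * With fast_only set, entries that would need the slow path are skipped
 * rather than waited for, so a young page can be missed.
 */
static bool toy_test_young(const struct toy_spte *sptes, size_t n,
			   bool fast_only)
{
	for (size_t i = 0; i < n; i++) {
		if (fast_only && sptes[i].needs_lock)
			continue; /* age information not cheaply available */
		if (sptes[i].accessed)
			return true;
	}
	return false;
}
```

In this model, a range whose only young entry needs the slow path is
reported young by the accurate call and not young by the fast-only call.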

The fast-only notifiers are non-trivially implemented for only x86. The
TDP MMU and the shadow MMU are both supported, but the shadow MMU will
not actually age sptes locklessly if A/D bits in the spte have been
disabled (i.e., if L1 disables them).

access_tracking_perf_test now has a mode (-p) to check performance of
MGLRU aging while the VM is faulting memory in.

This series has been tested with access_tracking_perf_test and Sean's
mmu_stress_test[6], both with tdp_mmu=0 and tdp_mmu=1.

=== Previous Versions ===

Since v6[1]:
 - Rebased on top of kvm-x86/next and Sean's lockless rmap walking
   changes[6].
 - Removed HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY (thanks DavidM).
 - Split up kvm_age_gfn() / kvm_test_age_gfn() optimizations (thanks
   DavidM and Sean).
 - Improved new MMU notifier documentation (thanks DavidH).
 - Dropped arm64 locking change.
 - No longer retry for CAS failure in TDP MMU non-A/D case (thanks
   Sean).
 - Added some R-bys and A-bys.

Since v5[2]:
 - Reworked test_clear_young_fast_only() into a new parameter for the
   existing notifiers (thanks Sean).
 - Added mmu_notifier.has_fast_aging to tell mm whether the fast-only
   notifiers should be called.
 - Added mm_has_fast_young_notifiers() to inform users if calling
   fast-only notifier helpers is worthwhile (for look-around to use).
 - Changed MGLRU to invoke a single notifier instead of two when
   aging and doing look-around (thanks Yu).
 - For KVM/x86, check indirect_shadow_pages > 0 instead of
   kvm_memslots_have_rmaps() when collecting age information
   (thanks Sean).
 - For KVM/arm, some fixes from Oliver.
 - Small fixes to access_tracking_perf_test.
 - Added missing !MMU_NOTIFIER version of mmu_notifier_clear_young().

Since v4[3]:
 - Removed Kconfig that controlled when aging was enabled. Aging will
   be done whenever the architecture supports it (thanks Yu).
 - Added a new MMU notifier, test_clear_young_fast_only(), specifically
   for MGLRU to use.
 - Add kvm_fast_{test_,}age_gfn, implemented by x86.
 - Fix locking for clear_flush_young().
 - Added KVM_MMU_NOTIFIER_YOUNG_LOCKLESS to clean up locking changes
   (thanks Sean).
 - Fix WARN_ON and other cleanup for the arm64 locking changes
   (thanks Oliver).

Since v3[4]:
 - Vastly simplified the series (thanks David). Removed mmu notifier
   batching logic entirely.
 - Cleaned up how locking is done for mmu_notifier_test/clear_young
   (thanks David).
 - Look-around is now only done when there are no secondary MMUs
   subscribed to MMU notifiers.
 - CONFIG_LRU_GEN_WALKS_SECONDARY_MMU has been added.
 - Fixed the lockless implementation of kvm_{test,}age_gfn for x86
   (thanks David).
 - Added MGLRU functional and performance tests to
   access_tracking_perf_test (thanks Axel).
 - In v3, an mm would be completely ignored (for aging) if there was a
   secondary MMU but support for secondary MMU walking was missing. Now,
   missing secondary MMU walking support simply skips the notifier
   calls (except for eviction).
 - Added a sanity check that range->lockless and range->on_lock are
   never both provided for the memslot walk.
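
The last item above amounts to a mutual-exclusion check on the walk
parameters. A minimal sketch, assuming a simplified stand-in for the
range structure (struct toy_range, toy_on_lock() and
toy_range_is_valid() are hypothetical names, not the KVM definitions):

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for the memslot walk parameters. */
struct toy_range {
	void (*on_lock)(void); /* callback to run once the lock is held */
	bool lockless;         /* walk the memslots without taking the lock */
};

/* Example callback, present only so the check can be exercised. */
static void toy_on_lock(void)
{
}

/*
 * A lockless walk never takes the lock, so an on_lock callback could
 * never run; asking for both is treated as a caller bug.
 */
static bool toy_range_is_valid(const struct toy_range *range)
{
	return !(range->lockless && range->on_lock != NULL);
}
```

A range with only lockless set, or only on_lock set, passes the check;
a range with both is rejected.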

For the changes since v2[5], see v3.

Based on latest kvm-x86/next.

[1]: https://lore.kernel.org/linux-mm/20240724011037.3671523-1-jthoughton@google.com/
[2]: https://lore.kernel.org/linux-mm/20240611002145.2078921-1-jthoughton@google.com/
[3]: https://lore.kernel.org/linux-mm/20240529180510.2295118-1-jthoughton@google.com/
[4]: https://lore.kernel.org/linux-mm/20240401232946.1837665-1-jthoughton@google.com/
[5]: https://lore.kernel.org/kvmarm/20230526234435.662652-1-yuzhao@google.com/
[6]: https://lore.kernel.org/kvm/20240809194335.1726916-1-seanjc@google.com/

James Houghton (14):
  KVM: Remove kvm_handle_hva_range helper functions
  KVM: Add lockless memslot walk to KVM
  KVM: x86/mmu: Factor out spte atomic bit clearing routine
  KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
  KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn
  KVM: x86/mmu: Only check gfn age in shadow MMU if
    indirect_shadow_pages > 0
  mm: Add missing mmu_notifier_clear_young for !MMU_NOTIFIER
  mm: Add has_fast_aging to struct mmu_notifier
  mm: Add fast_only bool to test_young and clear_young MMU notifiers
  KVM: Pass fast_only to kvm_{test_,}age_gfn
  KVM: x86/mmu: Locklessly harvest access information from shadow MMU
  KVM: x86/mmu: Enable has_fast_aging
  mm: multi-gen LRU: Have secondary MMUs participate in aging
  KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test

Sean Christopherson (4):
  KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o
    mmu_lock
  KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of
    mmu_lock
  KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
  KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging
    gfns

 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 arch/x86/include/asm/kvm_host.h               |   4 +-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        | 355 ++++++++++++----
 arch/x86/kvm/mmu/tdp_iter.h                   |  27 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  57 ++-
 include/linux/kvm_host.h                      |   2 +
 include/linux/mmu_notifier.h                  |  82 +++-
 include/linux/mmzone.h                        |   6 +-
 include/trace/events/kvm.h                    |  19 +-
 mm/damon/vaddr.c                              |   2 -
 mm/mmu_notifier.c                             |  38 +-
 mm/rmap.c                                     |   9 +-
 mm/vmscan.c                                   | 148 +++++--
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/access_tracking_perf_test.c | 369 +++++++++++++++--
 .../selftests/kvm/include/lru_gen_util.h      |  55 +++
 .../testing/selftests/kvm/lib/lru_gen_util.c  | 391 ++++++++++++++++++
 virt/kvm/Kconfig                              |   3 +
 virt/kvm/kvm_main.c                           | 124 +++---
 20 files changed, 1451 insertions(+), 248 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h
 create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c


base-commit: 3cc25d5adcfd2a2c33baa0b2a1979c2dbc9b990b

Comments

Sean Christopherson Oct. 14, 2024, 11:22 p.m. UTC | #1
On Thu, Sep 26, 2024, James Houghton wrote:
> This patchset makes it possible for MGLRU to consult secondary MMUs
> while doing aging, not just during eviction. This allows for more
> accurate reclaim decisions, which is especially important for proactive
> reclaim.

...

> James Houghton (14):
>   KVM: Remove kvm_handle_hva_range helper functions
>   KVM: Add lockless memslot walk to KVM
>   KVM: x86/mmu: Factor out spte atomic bit clearing routine
>   KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
>   KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn
>   KVM: x86/mmu: Only check gfn age in shadow MMU if
>     indirect_shadow_pages > 0
>   mm: Add missing mmu_notifier_clear_young for !MMU_NOTIFIER
>   mm: Add has_fast_aging to struct mmu_notifier
>   mm: Add fast_only bool to test_young and clear_young MMU notifiers

Per offline discussions, there's a non-zero chance that fast_only won't be needed,
because it may be preferable to incorporate secondary MMUs into MGLRU, even if
they don't support "fast" aging.

What's the status on that front?  Even if the status is "TBD", it'd be very helpful
to let others know, so that they don't spend time reviewing code that might be
completely thrown away.

>   KVM: Pass fast_only to kvm_{test_,}age_gfn
>   KVM: x86/mmu: Locklessly harvest access information from shadow MMU
>   KVM: x86/mmu: Enable has_fast_aging
>   mm: multi-gen LRU: Have secondary MMUs participate in aging
>   KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test
> 
> Sean Christopherson (4):
>   KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o
>     mmu_lock
>   KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of
>     mmu_lock
>   KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
>   KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging
>     gfns

James Houghton Oct. 15, 2024, 12:07 a.m. UTC | #2
On Mon, Oct 14, 2024 at 4:22 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Sep 26, 2024, James Houghton wrote:
> > This patchset makes it possible for MGLRU to consult secondary MMUs
> > while doing aging, not just during eviction. This allows for more
> > accurate reclaim decisions, which is especially important for proactive
> > reclaim.
>
> ...
>
> > James Houghton (14):
> >   KVM: Remove kvm_handle_hva_range helper functions
> >   KVM: Add lockless memslot walk to KVM
> >   KVM: x86/mmu: Factor out spte atomic bit clearing routine
> >   KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
> >   KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn
> >   KVM: x86/mmu: Only check gfn age in shadow MMU if
> >     indirect_shadow_pages > 0
> >   mm: Add missing mmu_notifier_clear_young for !MMU_NOTIFIER
> >   mm: Add has_fast_aging to struct mmu_notifier
> >   mm: Add fast_only bool to test_young and clear_young MMU notifiers
>
> Per offline discussions, there's a non-zero chance that fast_only won't be needed,
> because it may be preferable to incorporate secondary MMUs into MGLRU, even if
> they don't support "fast" aging.
>
> What's the status on that front?  Even if the status is "TBD", it'd be very helpful
> to let others know, so that they don't spend time reviewing code that might be
> completely thrown away.

The fast_only MMU notifier changes will probably be removed in v8.

ChromeOS folks found that the way MGLRU *currently* interacts with KVM
is problematic. That is, today, with the MM_WALK MGLRU capability
enabled, normal PTEs have their Accessed bits cleared via a page table
scan and then during an rmap walk upon attempted eviction, whereas
KVM SPTEs only have their Accessed bits cleared via the rmap walk at
eviction time. So KVM SPTEs have their Accessed bits cleared less
frequently than normal PTEs, and therefore they appear younger than
they should.

It turns out that this causes tab open latency regressions on ChromeOS
where a significant amount of memory is being used by a VM. IIUC, the
fix for this is to have MGLRU age SPTEs as often as it ages normal
PTEs; i.e., it should call the correct MMU notifiers each time it
clears A bits on PTEs. The final patch in this series sort of does
this, but instead of calling the new fast_only notifier, we need to
call the normal test/clear_young() notifiers regardless of how fast
they are.
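
A toy model of the skew described above, in plain C. This is only a
sketch and not MGLRU code: struct toy_page, toy_test_and_clear() and
toy_young_at_eviction() are hypothetical. A page is touched once and
then sits idle through some number of aging scans before eviction is
attempted. If the scans clear its Accessed bit (as for normal PTEs), the
eviction-time walk sees it as old; if they do not (as for KVM SPTEs
today), the stale bit makes it look young.

```c
#include <stdbool.h>

/* Hypothetical model of one mapping's Accessed bit. */
struct toy_page {
	bool accessed;
};

/* Report whether the page looked young, and clear the bit. */
static bool toy_test_and_clear(struct toy_page *page)
{
	bool young = page->accessed;

	page->accessed = false;
	return young;
}

/*
 * The page is touched once, then left idle for idle_scans aging passes.
 * aged_by_scans selects whether those passes clear the Accessed bit.
 * Returns true if the eviction-time walk still sees the page as young.
 */
static bool toy_young_at_eviction(unsigned int idle_scans, bool aged_by_scans)
{
	struct toy_page page = { .accessed = true };

	for (unsigned int i = 0; i < idle_scans; i++) {
		if (aged_by_scans)
			toy_test_and_clear(&page);
	}
	return toy_test_and_clear(&page);
}
```

With three idle scans, the page that is aged by scans is old at eviction
time, while the page that is only checked at eviction still looks young
even though it has not been touched since.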

This also means that the MGLRU changes no longer depend on the KVM
optimizations, as they can be motivated independently.

Yu, have I gotten anything wrong here? Do you have any more details to share?

Yu Zhao Oct. 15, 2024, 10:47 p.m. UTC | #3
On Mon, Oct 14, 2024 at 6:07 PM James Houghton <jthoughton@google.com> wrote:
>
> On Mon, Oct 14, 2024 at 4:22 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Sep 26, 2024, James Houghton wrote:
> > > This patchset makes it possible for MGLRU to consult secondary MMUs
> > > while doing aging, not just during eviction. This allows for more
> > > accurate reclaim decisions, which is especially important for proactive
> > > reclaim.
> >
> > ...
> >
> > > James Houghton (14):
> > >   KVM: Remove kvm_handle_hva_range helper functions
> > >   KVM: Add lockless memslot walk to KVM
> > >   KVM: x86/mmu: Factor out spte atomic bit clearing routine
> > >   KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
> > >   KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn
> > >   KVM: x86/mmu: Only check gfn age in shadow MMU if
> > >     indirect_shadow_pages > 0
> > >   mm: Add missing mmu_notifier_clear_young for !MMU_NOTIFIER
> > >   mm: Add has_fast_aging to struct mmu_notifier
> > >   mm: Add fast_only bool to test_young and clear_young MMU notifiers
> >
> > Per offline discussions, there's a non-zero chance that fast_only won't be needed,
> > because it may be preferable to incorporate secondary MMUs into MGLRU, even if
> > they don't support "fast" aging.
> >
> > What's the status on that front?  Even if the status is "TBD", it'd be very helpful
> > to let others know, so that they don't spend time reviewing code that might be
> > completely thrown away.
>
> The fast_only MMU notifier changes will probably be removed in v8.
>
> ChromeOS folks found that the way MGLRU *currently* interacts with KVM
> is problematic. That is, today, with the MM_WALK MGLRU capability
> enabled, normal PTEs have their Accessed bits cleared via a page table
> scan and then during an rmap walk upon attempted eviction, whereas,
> KVM SPTEs only have their Accessed bits cleared via the rmap walk at
> eviction time. So KVM SPTEs have their Accessed bits cleared less
> frequently than normal PTEs, and therefore they appear younger than
> they should.
>
> It turns out that this causes tab open latency regressions on ChromeOS
> where a significant amount of memory is being used by a VM. IIUC, the
> fix for this is to have MGLRU age SPTEs as often as it ages normal
> PTEs; i.e., it should call the correct MMU notifiers each time it
> clears A bits on PTEs. The final patch in this series sort of does
> this, but instead of calling the new fast_only notifier, we need to
> call the normal test/clear_young() notifiers regardless of how fast
> they are.
>
> This also means that the MGLRU changes no longer depend on the KVM
> optimizations, as they can be motivated independently.
>
> Yu, have I gotten anything wrong here? Do you have any more details to share?

Yes, that's precisely the problem. My original justification [1] for
not scanning the KVM MMU when lockless is not supported turned out to
be harmful to some workloads too.

On one hand, scanning the KVM MMU when not lockless can cause KVM MMU
lock contention; on the other hand, not scanning the KVM MMU can skew
anon/file LRU aging and thrash page cache. Given that the lock
contention is being tackled, the latter seems to be the lesser of two
evils.

[1] https://lore.kernel.org/linux-mm/CAOUHufYFHKLwt1PWp2uS6g174GZYRZURWJAmdUWs5eaKmhEeyQ@mail.gmail.com/