Message ID | 20240724011037.3671523-4-jthoughton@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm: multi-gen LRU: Walk secondary MMU page tables while aging |
On Tue, Jul 23, 2024 at 6:11 PM James Houghton <jthoughton@google.com> wrote:
>
> Replace the MMU write locks (taken in the memslot iteration loop) for
> read locks.
>
> Grabbing the read lock instead of the write lock is safe because the
> only requirement we have is that the stage-2 page tables do not get
> deallocated while we are walking them. The stage2_age_walker() callback
> is safe to race with itself; update the comment to reflect the
> synchronization change.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---

Here is some data to show that this patch at least *can* be helpful:

# arm64 patched to do aging (i.e., set HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY)
# The test is faulting memory in while doing aging as fast as possible.
# taskset -c 0-32 ./access_tracking_perf_test -l -r /dev/cgroup/memory
-p -v 32 -m 3

# Write lock
vcpu wall time : 3.039207157s
lru_gen avg pass duration : 1.660541541s, (passes:2, total:3.321083083s)

# Read lock
vcpu wall time : 3.010848445s
lru_gen avg pass duration : 0.306623698s, (passes:11, total:3.372860688s)

Aging is able to run significantly faster, but vCPU runtime isn't
affected much (in this test).

It would be really nice to motivate this patch with a test that didn't
require patching the kernel...

Oliver and Marc, please let me know if you'd like to see more data. I'm
also happy to simply drop this patch.
On Thu, Jul 25, 2024, James Houghton wrote:
> On Tue, Jul 23, 2024 at 6:11 PM James Houghton <jthoughton@google.com> wrote:
> >
> > Replace the MMU write locks (taken in the memslot iteration loop) for
> > read locks.
> >
> > Grabbing the read lock instead of the write lock is safe because the
> > only requirement we have is that the stage-2 page tables do not get
> > deallocated while we are walking them. The stage2_age_walker() callback
> > is safe to race with itself; update the comment to reflect the
> > synchronization change.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
>
> Here is some data to show that this patch at least *can* be helpful:
>
> # arm64 patched to do aging (i.e., set HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY)
> # The test is faulting memory in while doing aging as fast as possible.
> # taskset -c 0-32 ./access_tracking_perf_test -l -r /dev/cgroup/memory
> -p -v 32 -m 3
>
> # Write lock
> vcpu wall time : 3.039207157s
> lru_gen avg pass duration : 1.660541541s, (passes:2, total:3.321083083s)
>
> # Read lock
> vcpu wall time : 3.010848445s
> lru_gen avg pass duration : 0.306623698s, (passes:11, total:3.372860688s)
>
> Aging is able to run significantly faster, but vCPU runtime isn't
> affected much (in this test).

Were you expecting vCPU runtime to improve (more)? If so, lack of movement could
be due to KVM arm64 taking mmap_lock for read when handling faults:

https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com
On Fri, Aug 16, 2024 at 6:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jul 25, 2024, James Houghton wrote:
> > On Tue, Jul 23, 2024 at 6:11 PM James Houghton <jthoughton@google.com> wrote:
> > >
> > > Replace the MMU write locks (taken in the memslot iteration loop) for
> > > read locks.
> > >
> > > Grabbing the read lock instead of the write lock is safe because the
> > > only requirement we have is that the stage-2 page tables do not get
> > > deallocated while we are walking them. The stage2_age_walker() callback
> > > is safe to race with itself; update the comment to reflect the
> > > synchronization change.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > ---
> >
> > Here is some data to show that this patch at least *can* be helpful:
> >
> > # arm64 patched to do aging (i.e., set HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY)
> > # The test is faulting memory in while doing aging as fast as possible.
> > # taskset -c 0-32 ./access_tracking_perf_test -l -r /dev/cgroup/memory
> > -p -v 32 -m 3
> >
> > # Write lock
> > vcpu wall time : 3.039207157s
> > lru_gen avg pass duration : 1.660541541s, (passes:2, total:3.321083083s)
> >
> > # Read lock
> > vcpu wall time : 3.010848445s
> > lru_gen avg pass duration : 0.306623698s, (passes:11, total:3.372860688s)
> >
> > Aging is able to run significantly faster, but vCPU runtime isn't
> > affected much (in this test).
>
> Were you expecting vCPU runtime to improve (more)? If so, lack of movement could
> be due to KVM arm64 taking mmap_lock for read when handling faults:
>
> https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com

For the above test, I don't think it's mmap_lock -- the reclaim path,
e.g., when zswapping guest memory, has two stages: aging (scanning
PTEs) and eviction (unmapping PTEs). Only testing the former isn't
realistic at all.

IOW, for a r/w lock use case, only testing the read lock path would be
bad coverage.
On Fri, Aug 16, 2024 at 07:03:27PM -0600, Yu Zhao wrote:
> On Fri, Aug 16, 2024 at 6:46 PM Sean Christopherson <seanjc@google.com> wrote:

[...]

> > Were you expecting vCPU runtime to improve (more)? If so, lack of movement could
> > be due to KVM arm64 taking mmap_lock for read when handling faults:
> >
> > https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com
>
> For the above test, I don't think it's mmap_lock

Yeah, I don't think this is related to the mmap_lock.

James is likely using hardware that has FEAT_HAFDBS, so vCPUs won't
fault for an Access flag update. Even if he's on a machine w/o it,
Access flag faults are handled outside the mmap_lock.

Forcing SW management of the AF at stage-2 would be the best case for
demonstrating the locking improvement:

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index a24a2a857456..a640e8a8c6ea 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -669,8 +669,6 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
	 * happen to be running on a design that has unadvertised support for
	 * HAFDBS. Here be dragons.
	 */
-	if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38))
-		vtcr |= VTCR_EL2_HA;
 #endif /* CONFIG_ARM64_HW_AFDBM */

	if (kvm_lpa2_is_enabled())

Changing the config option would work too, but I wasn't sure if
FEAT_HAFDBS on the primary MMU influenced MGLRU heuristics.

> -- the reclaim path,
> e.g., when zswapping guest memory, has two stages: aging (scanning
> PTEs) and eviction (unmapping PTEs). Only testing the former isn't
> realistic at all.

AIUI, the intention of this test data is to provide some justification
for why Marc + I should consider the locking change *outside* of any
MMU notifier changes. So from that POV, this is meant as a hacked
up microbenchmark and not meant to be realistic.

And really, the arm64 change has nothing to do with this series at
this point, which is disappointing. In the interest of moving this
feature along for both architectures, would you be able to help James
with:

- Identifying a benchmark that you believe is realistic

- Suggestions on how to run that benchmark on Google infrastructure

Asking since you had a setup / data earlier on when you were carrying
the series. Hopefully with supportive data we can get arm64 to opt-in
to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.
On Mon, Aug 19, 2024, Oliver Upton wrote:
> On Fri, Aug 16, 2024 at 07:03:27PM -0600, Yu Zhao wrote:
> > On Fri, Aug 16, 2024 at 6:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> [...]
>
> > > Were you expecting vCPU runtime to improve (more)? If so, lack of movement could
> > > be due to KVM arm64 taking mmap_lock for read when handling faults:
> > >
> > > https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com
> >
> > For the above test, I don't think it's mmap_lock
>
> Yeah, I don't think this is related to the mmap_lock.
>
> James is likely using hardware that has FEAT_HAFDBS, so vCPUs won't
> fault for an Access flag update.

Huh, didn't know that was a thing on ARM.

Ooh, that lends even more credence to my assertion that marking folios
accessed in handle_access_fault() can go away[*].

I assume hardware-assisted updates means this code in handle_access_fault()
will no longer be hit, as KVM simply won't ever get access faults? If so,
I'll add that info to the changelog.

	if (kvm_pte_valid(pte))
		kvm_set_pfn_accessed(kvm_pte_to_pfn(pte));

[*] https://lore.kernel.org/all/20240726235234.228822-83-seanjc@google.com

> Even if he's on a machine w/o it, Access flag faults are handled outside the
> mmap_lock.

Oh, right, they go down handle_access_fault(), not user_mem_abort().
Reviewing late Friday afternoon, never a good idea ;-)
On Mon, Aug 19, 2024 at 1:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Fri, Aug 16, 2024 at 07:03:27PM -0600, Yu Zhao wrote:
> > On Fri, Aug 16, 2024 at 6:46 PM Sean Christopherson <seanjc@google.com> wrote:
>
> [...]
>
> > > Were you expecting vCPU runtime to improve (more)? If so, lack of movement could
> > > be due to KVM arm64 taking mmap_lock for read when handling faults:

I had no real expectation. I was hoping that maybe there could be a vCPU
runtime improvement, given that user_mem_abort() (being called because
we're faulting memory in continuously in this test) has to take the KVM
MMU lock for reading, and aging is taking it for reading vs. writing. I
think that's why aging is a lot slower when using the write lock: it is
waiting for the readers to drop the lock, but I guess the delay on the
*readers* due to the pending writer seems to be pretty minimal.

> > >
> > > https://lore.kernel.org/all/Zr0ZbPQHVNzmvwa6@google.com
> >
> > For the above test, I don't think it's mmap_lock
>
> Yeah, I don't think this is related to the mmap_lock.
>
> James is likely using hardware that has FEAT_HAFDBS, so vCPUs won't
> fault for an Access flag update. Even if he's on a machine w/o it,
> Access flag faults are handled outside the mmap_lock.

Yeah I was running on Ampere Altra CPUs.

> Forcing SW management of the AF at stage-2 would be the best case for
> demonstrating the locking improvement:
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index a24a2a857456..a640e8a8c6ea 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -669,8 +669,6 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
>          * happen to be running on a design that has unadvertised support for
>          * HAFDBS. Here be dragons.
>          */
> -       if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38))
> -               vtcr |= VTCR_EL2_HA;
>  #endif /* CONFIG_ARM64_HW_AFDBM */
>
>         if (kvm_lpa2_is_enabled())

Thanks!

> Changing the config option would work too, but I wasn't sure if
> FEAT_HAFDBS on the primary MMU influenced MGLRU heuristics.

Indeed, disabling CONFIG_ARM64_HW_AFDBM will cause MGLRU not to do aging.

> > -- the reclaim path,
> > e.g., when zswapping guest memory, has two stages: aging (scanning
> > PTEs) and eviction (unmapping PTEs). Only testing the former isn't
> > realistic at all.
>
> AIUI, the intention of this test data is to provide some justification
> for why Marc + I should consider the locking change *outside* of any
> MMU notifier changes. So from that POV, this is meant as a hacked
> up microbenchmark and not meant to be realistic.
>
> And really, the arm64 change has nothing to do with this series at
> this point, which is disappointing. In the interest of moving this
> feature along for both architectures, would you be able help James
> with:
>
> - Identifying a benchmark that you believe is realistic
>
> - Suggestions on how to run that benchmark on Google infrastructure
>
> Asking since you had a setup / data earlier on when you were carrying
> the series. Hopefully with supportive data we can get arm64 to opt-in
> to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.

I'll keep trying some other approaches I can take for getting similar
testing that Yu had; it is somewhat difficult for me to reproduce
those tests (and it really shouldn't be.... sorry).

I think it makes most sense for me to drop the arm64 patch for now and
re-propose it (or something stronger) alongside enabling aging. Does
that sound ok?
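To make the reader/writer interaction described above concrete outside of
KVM, here is a minimal userspace sketch of the same pattern (an
illustration only, not kernel code): a pool of reader threads stands in
for vCPUs taking mmu_lock for read in user_mem_abort(), and one thread
stands in for the aging pass, taking the lock either for write (the old
behaviour) or for read (this patch). The thread count, sleep intervals,
and the rwlock_demo.c file name are invented for the example, and pthread
rwlocks are reader-preferring by default (unlike the kernel's queued
rwlock), so expect only a qualitative echo of the lru_gen pass counts
quoted earlier.

/*
 * rwlock_demo.c - userspace analogy of the mmu_lock contention pattern.
 * Build with: cc -O2 -pthread rwlock_demo.c -o rwlock_demo
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define NR_VCPUS 32

static pthread_rwlock_t mmu_lock = PTHREAD_RWLOCK_INITIALIZER;
static atomic_bool stop;
static bool aging_takes_write_lock = true;	/* flip to compare modes */

static void *vcpu_thread(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		pthread_rwlock_rdlock(&mmu_lock);	/* fault path */
		usleep(10);				/* "install a PTE" */
		pthread_rwlock_unlock(&mmu_lock);
	}
	return NULL;
}

static void *aging_thread(void *arg)
{
	unsigned long passes = 0;

	(void)arg;
	while (!atomic_load(&stop)) {
		if (aging_takes_write_lock)
			pthread_rwlock_wrlock(&mmu_lock);
		else
			pthread_rwlock_rdlock(&mmu_lock);
		usleep(10);				/* "scan and clear AF" */
		pthread_rwlock_unlock(&mmu_lock);
		passes++;
	}
	printf("aging passes completed: %lu\n", passes);
	return NULL;
}

int main(void)
{
	pthread_t vcpus[NR_VCPUS], aging;
	int i;

	for (i = 0; i < NR_VCPUS; i++)
		pthread_create(&vcpus[i], NULL, vcpu_thread, NULL);
	pthread_create(&aging, NULL, aging_thread, NULL);

	sleep(3);					/* roughly the test's 3s window */
	atomic_store(&stop, true);

	for (i = 0; i < NR_VCPUS; i++)
		pthread_join(vcpus[i], NULL);
	pthread_join(aging, NULL);
	return 0;
}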
On Thu, Aug 29, 2024 at 05:33:00PM -0700, James Houghton wrote:
> On Mon, Aug 19, 2024 at 1:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > Asking since you had a setup / data earlier on when you were carrying
> > the series. Hopefully with supportive data we can get arm64 to opt-in
> > to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.
>
> I'll keep trying some other approaches I can take for getting similar
> testing that Yu had; it is somewhat difficult for me to reproduce
> those tests (and it really shouldn't be.... sorry).

No need to apologize. Getting good test hardware for arm64 is a complete
chore. Sure would love a functional workstation with cores from this
decade...

> I think it makes most sense for me to drop the arm64 patch for now and
> re-propose it (or something stronger) alongside enabling aging. Does
> that sound ok?

I'm a bit disappointed that we haven't gotten forward progress on the
arm64 patches, but I also recognize this is the direction of travel as
the x86 patches are shaping up.

So yeah, I'm OK with it, but I'd love to get the arm64 side sorted out
soon while the context is still fresh.
On Thu, Aug 29, 2024 at 5:48 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Thu, Aug 29, 2024 at 05:33:00PM -0700, James Houghton wrote:
> > On Mon, Aug 19, 2024 at 1:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > Asking since you had a setup / data earlier on when you were carrying
> > > the series. Hopefully with supportive data we can get arm64 to opt-in
> > > to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.
> >
> > I'll keep trying some other approaches I can take for getting similar
> > testing that Yu had; it is somewhat difficult for me to reproduce
> > those tests (and it really shouldn't be.... sorry).
>
> No need to apologize. Getting good test hardware for arm64 is a complete
> chore. Sure would love a functional workstation with cores from this
> decade...
>
> > I think it makes most sense for me to drop the arm64 patch for now and
> > re-propose it (or something stronger) alongside enabling aging. Does
> > that sound ok?
>
> I'm a bit disappointed that we haven't gotten forward progress on the
> arm64 patches, but I also recognize this is the direction of travel as
> the x86 patches are shaping up.
>
> So yeah, I'm OK with it, but I'd love to get the arm64 side sorted out
> soon while the context is still fresh.

Converting the aging notifiers to holding mmu_lock for read seems like
a pure win and minimal churn. Can we keep that patch in v7 (which
depends on the lockless notifier refactor, i.e. is not completely
stand-alone)? We can revisit enabling MGLRU on arm64 in a subsequent
series.

>
> --
> Thanks,
> Oliver
Hey David,

On Fri, Aug 30, 2024 at 08:33:59AM -0700, David Matlack wrote:
> On Thu, Aug 29, 2024 at 5:48 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Thu, Aug 29, 2024 at 05:33:00PM -0700, James Houghton wrote:
> > > On Mon, Aug 19, 2024 at 1:42 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > > > Asking since you had a setup / data earlier on when you were carrying
> > > > the series. Hopefully with supportive data we can get arm64 to opt-in
> > > > to HAVE_KVM_MMU_NOTIFIER_YOUNG_FAST_ONLY as well.
> > >
> > > I'll keep trying some other approaches I can take for getting similar
> > > testing that Yu had; it is somewhat difficult for me to reproduce
> > > those tests (and it really shouldn't be.... sorry).
> >
> > No need to apologize. Getting good test hardware for arm64 is a complete
> > chore. Sure would love a functional workstation with cores from this
> > decade...
> >
> > > I think it makes most sense for me to drop the arm64 patch for now and
> > > re-propose it (or something stronger) alongside enabling aging. Does
> > > that sound ok?
> >
> > I'm a bit disappointed that we haven't gotten forward progress on the
> > arm64 patches, but I also recognize this is the direction of travel as
> > the x86 patches are shaping up.
> >
> > So yeah, I'm OK with it, but I'd love to get the arm64 side sorted out
> > soon while the context is still fresh.
>
> Converting the aging notifiers to holding mmu_lock for read seems like
> a pure win and minimal churn. Can we keep that patch in v7 (which
> depends on the lockless notifier refactor, i.e. is not completely
> stand-alone)? We can revisit enabling MGLRU on arm64 in a subsequent
> series.

Even though the churn is minimal in LOC, locking changes are
significant.

If one thing has become clear, there are some strong opinions about
arm64 participating in MGLRU w/ the read lock. So it is almost
guaranteed that these read lock changes will eventually get thrown out
in favor of an RCU-protected walker.

Then we're stuck with potentially 3 flavors of locking in kernels that
people actually use, and dealing with breakage that only affects that
intermediate step is gonna be annoying.
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 58f09370d17e..7a1af8141c0e 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -22,6 +22,7 @@ menuconfig KVM
 	select KVM_COMMON
 	select KVM_GENERIC_HARDWARE_ENABLING
 	select KVM_GENERIC_MMU_NOTIFIER
+	select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select KVM_MMIO
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 9e2bbee77491..a24a2a857456 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1319,10 +1319,10 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
 	data->young = true;
 
 	/*
-	 * stage2_age_walker() is always called while holding the MMU lock for
-	 * write, so this will always succeed. Nonetheless, this deliberately
-	 * follows the race detection pattern of the other stage-2 walkers in
-	 * case the locking mechanics of the MMU notifiers is ever changed.
+	 * This walk is not exclusive; the PTE is permitted to change from
+	 * under us. If there is a race to update this PTE, then the GFN is
+	 * most likely young, so failing to clear the AF is likely to be
+	 * inconsequential.
 	 */
 	if (data->mkold && !stage2_try_set_pte(ctx, new))
 		return -EAGAIN;
@@ -1345,10 +1345,13 @@ bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
 	struct kvm_pgtable_walker walker = {
 		.cb	= stage2_age_walker,
 		.arg	= &data,
-		.flags	= KVM_PGTABLE_WALK_LEAF,
+		.flags	= KVM_PGTABLE_WALK_LEAF |
+			  KVM_PGTABLE_WALK_SHARED,
 	};
+	int r;
 
-	WARN_ON(kvm_pgtable_walk(pgt, addr, size, &walker));
+	r = kvm_pgtable_walk(pgt, addr, size, &walker);
+	WARN_ON_ONCE(r && r != -EAGAIN);
 
 	return data.young;
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6981b1bc0946..e37765f6f2a1 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1912,29 +1912,43 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
+	bool young = false;
+
+	read_lock(&kvm->mmu_lock);
 
 	if (!kvm->arch.mmu.pgt)
-		return false;
+		goto out;
 
-	return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
-						   range->start << PAGE_SHIFT,
-						   size, true);
+	young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+						    range->start << PAGE_SHIFT,
+						    size, true);
 	/*
 	 * TODO: Handle nested_mmu structures here using the reverse mapping in
 	 * a later version of patch series.
 	 */
+
+out:
+	read_unlock(&kvm->mmu_lock);
+	return young;
 }
 
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	u64 size = (range->end - range->start) << PAGE_SHIFT;
+	bool young = false;
+
+	read_lock(&kvm->mmu_lock);
 
 	if (!kvm->arch.mmu.pgt)
-		return false;
+		goto out;
 
-	return kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
-						   range->start << PAGE_SHIFT,
-						   size, false);
+	young = kvm_pgtable_stage2_test_clear_young(kvm->arch.mmu.pgt,
+						    range->start << PAGE_SHIFT,
+						    size, false);
+
+out:
+	read_unlock(&kvm->mmu_lock);
+	return young;
 }
 
 phys_addr_t kvm_mmu_get_httbr(void)
Replace the MMU write locks (taken in the memslot iteration loop) with
read locks.

Grabbing the read lock instead of the write lock is safe because the
only requirement we have is that the stage-2 page tables do not get
deallocated while we are walking them. The stage2_age_walker() callback
is safe to race with itself; update the comment to reflect the
synchronization change.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/arm64/kvm/Kconfig       |  1 +
 arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++------
 arch/arm64/kvm/mmu.c         | 30 ++++++++++++++++++++++--------
 3 files changed, 32 insertions(+), 14 deletions(-)