Message ID: 20220321224358.1305530-1-bgardon@google.com (mailing list archive)
Series: KVM: x86/MMU: Optimize disabling dirty logging
On 3/21/22 23:43, Ben Gardon wrote:
> Currently disabling dirty logging with the TDP MMU is extremely slow.
> On a 96 vCPU / 96G VM it takes ~256 seconds to disable dirty logging
> with the TDP MMU, as opposed to ~4 seconds with the legacy MMU. This
> series optimizes TLB flushes and introduces in-place large page
> promotion, to bring the disable dirty log time down to ~3 seconds.
>
> Testing:
> Ran KVM selftests and kvm-unit-tests on an Intel Haswell. This
> series introduced no new failures.

Thanks, looks good.  The one change I'd make is to just place the outcome
of build_tdp_shadow_zero_bits_mask() in a global (say tdp_shadow_zero_check)
at kvm_configure_mmu() time.  The tdp_max_root_level works as a conservative
choice for the second argument of build_tdp_shadow_zero_bits_mask().

No need to do anything though, I'll handle this later in 5.19 time (and
first merge my changes that factor out the constant part of
vcpu->arch.root_mmu initialization, since this is part of the same ideas).

Paolo
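[Editor's note: a minimal self-contained C sketch of the caching pattern Paolo suggests above. The names tdp_shadow_zero_check, build_tdp_shadow_zero_bits_mask() and kvm_configure_mmu() come from the mail; the mask math and the configure_mmu()/vcpu_shadow_zero_check() helpers are toy stand-ins for illustration, not the kernel's actual reserved-bit logic.]

```c
#include <stdint.h>

/*
 * Cached result, computed once at kvm_configure_mmu() time instead of
 * being rebuilt for every vCPU.
 */
static uint64_t tdp_shadow_zero_check;

/*
 * Toy stand-in for build_tdp_shadow_zero_bits_mask(): bits at or above
 * the supported physical-address width must be zero in a TDP SPTE.
 * The root level is unused here; passing tdp_max_root_level is the
 * conservative choice the mail describes.
 */
static uint64_t build_tdp_shadow_zero_bits_mask(int maxphyaddr, int root_level)
{
	(void)root_level;
	return ~((1ULL << maxphyaddr) - 1);
}

/* Analogue of kvm_configure_mmu(): runs once at module setup. */
static void configure_mmu(int maxphyaddr, int tdp_max_root_level)
{
	tdp_shadow_zero_check =
		build_tdp_shadow_zero_bits_mask(maxphyaddr, tdp_max_root_level);
}

/* Per-vCPU setup now just reads the global instead of recomputing. */
static uint64_t vcpu_shadow_zero_check(void)
{
	return tdp_shadow_zero_check;
}
```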
On Mon, Mar 21, 2022 at 03:43:49PM -0700, Ben Gardon wrote:
> Currently disabling dirty logging with the TDP MMU is extremely slow.
> On a 96 vCPU / 96G VM it takes ~256 seconds to disable dirty logging
> with the TDP MMU, as opposed to ~4 seconds with the legacy MMU. This
> series optimizes TLB flushes and introduces in-place large page
> promotion, to bring the disable dirty log time down to ~3 seconds.
>
> Testing:
> Ran KVM selftests and kvm-unit-tests on an Intel Haswell. This
> series introduced no new failures.
>
> Performance:
>
> Without this series, TDP MMU:
>
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
> Test iterations: 2
> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> guest physical test memory offset: 0x3fe7c0000000
> Populate memory time: 4.972184425s
> Enabling dirty logging time: 0.001943807s
>
> Iteration 1 dirty memory time: 0.061862112s
> Iteration 1 get dirty log time: 0.001416413s
> Iteration 1 clear dirty log time: 1.417428057s
> Iteration 2 dirty memory time: 0.664103656s
> Iteration 2 get dirty log time: 0.000676724s
> Iteration 2 clear dirty log time: 1.149387201s
> Disabling dirty logging time: 256.682188868s
> Get dirty log over 2 iterations took 0.002093137s. (Avg 0.001046568s/iteration)
> Clear dirty log over 2 iterations took 2.566815258s. (Avg 1.283407629s/iteration)
>
> Without this series, Legacy MMU:
>
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
> Test iterations: 2
> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> guest physical test memory offset: 0x3fe7c0000000
> Populate memory time: 4.892940915s
> Enabling dirty logging time: 0.001864603s
>
> Iteration 1 dirty memory time: 0.060490391s
> Iteration 1 get dirty log time: 0.001416277s
> Iteration 1 clear dirty log time: 0.323548614s
> Iteration 2 dirty memory time: 29.217064826s
> Iteration 2 get dirty log time: 0.000696202s
> Iteration 2 clear dirty log time: 0.907089084s
> Disabling dirty logging time: 4.246216551s
> Get dirty log over 2 iterations took 0.002112479s. (Avg 0.001056239s/iteration)
> Clear dirty log over 2 iterations took 1.230637698s. (Avg 0.615318849s/iteration)
>
> With this series, TDP MMU:
> (Updated since RFC. Pulling out patches 1-4 could have a performance impact.)
>
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
> Test iterations: 2
> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> guest physical test memory offset: 0x3fe7c0000000
> Populate memory time: 4.878083336s
> Enabling dirty logging time: 0.001874340s
>
> Iteration 1 dirty memory time: 0.054867383s
> Iteration 1 get dirty log time: 0.001368377s
> Iteration 1 clear dirty log time: 1.406960856s
> Iteration 2 dirty memory time: 0.679301083s
> Iteration 2 get dirty log time: 0.000662905s
> Iteration 2 clear dirty log time: 1.138263359s
> Disabling dirty logging time: 3.169381810s
> Get dirty log over 2 iterations took 0.002031282s. (Avg 0.001015641s/iteration)
> Clear dirty log over 2 iterations took 2.545224215s. (Avg 1.272612107s/iteration)
>
> Patch breakdown:
> Patches 1-4 remove the need for a vCPU pointer to make_spte
> Patches 5-8 are small refactors in preparation for in-place lpage promotion
> Patch 9 implements in-place largepage promotion when disabling dirty logging
>
> Changelog:
> RFC -> v1:
> 	Dropped the first 4 patches from the series. Patch 1 was sent
> 	separately, patches 2-4 will be taken over by Sean Christopherson.
> 	Incorporated David Matlack's Reviewed-by.
> v1 -> v2:
> 	Several patches were queued and dropped from this revision.
> 	Incorporated feedback from Peter Xu on the last patch in the series.
> 	Refreshed performance data
> 	Between versions 1 and 2 of this series, disable time without
> 	the TDP MMU went from 45s to 256, a major regression. I was
> 	testing on a skylake before and haswell this time, but that
> 	does not explain the huge performance loss.
>
> Ben Gardon (9):
>   KVM: x86/mmu: Move implementation of make_spte to a helper
>   KVM: x86/mmu: Factor mt_mask out of __make_spte
>   KVM: x86/mmu: Factor shadow_zero_check out of __make_spte
>   KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte
>   KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
>   KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend
>     on vcpu
>   KVM: x86/mmu: Add try_get_mt_mask to x86_ops
>   KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c
>   KVM: x86/mmu: Promote pages in-place when disabling dirty logging

Use () after function names to make it clear you are referring to a
function and not something else. e.g.

  KVM: x86/mmu: Move implementation of make_spte to a helper

becomes

  KVM: x86/mmu: Move implementation of make_spte() to a helper

This applies throughout the series, in commit messages and comments.

>  arch/x86/include/asm/kvm-x86-ops.h |  1 +
>  arch/x86/include/asm/kvm_host.h    |  2 +
>  arch/x86/kvm/mmu/mmu.c             | 21 +++++----
>  arch/x86/kvm/mmu/mmu_internal.h    |  6 +++
>  arch/x86/kvm/mmu/spte.c            | 39 +++++++++++-----
>  arch/x86/kvm/mmu/spte.h            |  6 +++
>  arch/x86/kvm/mmu/tdp_mmu.c         | 73 +++++++++++++++++++++++++++++-
>  arch/x86/kvm/svm/svm.c             |  9 ++++
>  arch/x86/kvm/vmx/vmx.c             | 25 ++++++++--
>  9 files changed, 155 insertions(+), 27 deletions(-)
>
> --
> 2.35.1.894.gb6a874cedc-goog
>
On Fri, Mar 25, 2022, Paolo Bonzini wrote:
> On 3/21/22 23:43, Ben Gardon wrote:
> > Currently disabling dirty logging with the TDP MMU is extremely slow.
> > On a 96 vCPU / 96G VM it takes ~256 seconds to disable dirty logging
> > with the TDP MMU, as opposed to ~4 seconds with the legacy MMU. This
> > series optimizes TLB flushes and introduces in-place large page
> > promotion, to bring the disable dirty log time down to ~3 seconds.
> >
> > Testing:
> > Ran KVM selftests and kvm-unit-tests on an Intel Haswell. This
> > series introduced no new failures.
>
> Thanks, looks good. The one change I'd make is to just place the outcome of
> build_tdp_shadow_zero_bits_mask() in a global (say tdp_shadow_zero_check) at
> kvm_configure_mmu() time. The tdp_max_root_level works as a conservative
> choice for the second argument of build_tdp_shadow_zero_bits_mask().
>
> No need to do anything though, I'll handle this later in 5.19 time (and
> first merge my changes that factor out the constant part of
> vcpu->arch.root_mmu initialization, since this is part of the same ideas).

This fell through the cracks.  Ben is on a long vacation, I'll find my copy of
the Necronomicon and do a bit of resurrection, and address the feedback from v2
along the way.
On 7/12/22 03:37, Sean Christopherson wrote:
> This fell through the cracks.  Ben is on a long vacation, I'll find my copy of
> the Necronomicon and do a bit of resurrection, and address the feedback from v2
> along the way.

This was superseded by the simple patch to zap only the leaves I think?

Paolo
On Thu, Jul 14, 2022, Paolo Bonzini wrote:
> On 7/12/22 03:37, Sean Christopherson wrote:
> > This fell through the cracks.  Ben is on a long vacation, I'll find my copy of
> > the Necronomicon and do a bit of resurrection, and address the feedback from v2
> > along the way.
>
> This was superseded by the simple patch to zap only the leaves I think?

Ah, right you are, commit 5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when
disabling dirty logging").  I got somewhat confused because there's a stale
comment above the inner helper:

	/*
	 * Clear leaf entries which could be replaced by large mappings, for
	 * GFNs within the slot.
	 */

If we drop the "only refcounted struct pages can be huge" requirement, then the
flow becomes much simpler as there's no need to recurse down to the leaves only
to step back up:

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		if (!is_shadow_present_pte(iter.old_spte))
			continue;

		/*
		 * Don't zap leaf SPTEs, if a leaf SPTE could be replaced with
		 * a large page size, then its parent would have been zapped
		 * instead of stepping down.
		 */
		if (is_last_spte(iter.old_spte, iter.level))
			continue;

		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
							      iter.gfn, PG_LEVEL_NUM);
		if (max_mapping_level <= iter.level)
			continue;

		/* Note, a successful atomic zap also does a remote TLB flush. */
		if (tdp_mmu_zap_spte_atomic(kvm, &iter))
			goto retry;
	}
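[Editor's note: a toy C model of why zapping only the non-leaf SPTE is the cheap operation here. The structure and helper names below are invented for illustration and real KVM SPTEs carry far more state; the point is that clearing the one parent entry invalidates the whole 2M range in a single write (followed by a TLB flush), versus one write per 4K leaf.]

```c
#include <stdint.h>

#define PTES_PER_TABLE 512
#define SPTE_PRESENT   (1ULL << 0)

/* Toy lower-level page table: 512 leaf SPTEs covering a 2M region. */
struct toy_pt {
	uint64_t sptes[PTES_PER_TABLE];
};

/* Old approach: recurse down and clear every present leaf individually. */
static int zap_leaf_sptes(struct toy_pt *pt)
{
	int ops = 0;

	for (int i = 0; i < PTES_PER_TABLE; i++) {
		if (pt->sptes[i] & SPTE_PRESENT) {
			pt->sptes[i] = 0;
			ops++;
		}
	}
	return ops;
}

/*
 * New approach: clear the single non-leaf SPTE pointing at the table;
 * the next fault on the range can install a huge page directly.
 */
static int zap_nonleaf_spte(uint64_t *parent_spte)
{
	*parent_spte = 0;
	return 1;
}
```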