[v2,0/9] KVM: x86/MMU: Optimize disabling dirty logging

Message ID: 20220321224358.1305530-1-bgardon@google.com

Message

Ben Gardon March 21, 2022, 10:43 p.m. UTC
Currently disabling dirty logging with the TDP MMU is extremely slow.
On a 96 vCPU / 96G VM it takes ~256 seconds to disable dirty logging
with the TDP MMU, as opposed to ~4 seconds with the legacy MMU. This
series optimizes TLB flushes and introduces in-place large page
promotion, to bring the disable dirty log time down to ~3 seconds.

Testing:
Ran KVM selftests and kvm-unit-tests on an Intel Haswell. This
series introduced no new failures.

Performance:

Without this series, TDP MMU:
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 4.972184425s
Enabling dirty logging time: 0.001943807s

Iteration 1 dirty memory time: 0.061862112s
Iteration 1 get dirty log time: 0.001416413s
Iteration 1 clear dirty log time: 1.417428057s
Iteration 2 dirty memory time: 0.664103656s
Iteration 2 get dirty log time: 0.000676724s
Iteration 2 clear dirty log time: 1.149387201s
Disabling dirty logging time: 256.682188868s
Get dirty log over 2 iterations took 0.002093137s. (Avg 0.001046568s/iteration)
Clear dirty log over 2 iterations took 2.566815258s. (Avg 1.283407629s/iteration)

Without this series, Legacy MMU:
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 4.892940915s
Enabling dirty logging time: 0.001864603s

Iteration 1 dirty memory time: 0.060490391s
Iteration 1 get dirty log time: 0.001416277s
Iteration 1 clear dirty log time: 0.323548614s
Iteration 2 dirty memory time: 29.217064826s
Iteration 2 get dirty log time: 0.000696202s
Iteration 2 clear dirty log time: 0.907089084s
Disabling dirty logging time: 4.246216551s
Get dirty log over 2 iterations took 0.002112479s. (Avg 0.001056239s/iteration)
Clear dirty log over 2 iterations took 1.230637698s. (Avg 0.615318849s/iteration)

With this series, TDP MMU:
(Updated since the RFC. Pulling patches 1-4 out of the series could have a performance impact.)
> ./dirty_log_perf_test -v 96 -s anonymous_hugetlb_1gb
Test iterations: 2
Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
guest physical test memory offset: 0x3fe7c0000000
Populate memory time: 4.878083336s
Enabling dirty logging time: 0.001874340s

Iteration 1 dirty memory time: 0.054867383s
Iteration 1 get dirty log time: 0.001368377s
Iteration 1 clear dirty log time: 1.406960856s
Iteration 2 dirty memory time: 0.679301083s
Iteration 2 get dirty log time: 0.000662905s
Iteration 2 clear dirty log time: 1.138263359s
Disabling dirty logging time: 3.169381810s
Get dirty log over 2 iterations took 0.002031282s. (Avg 0.001015641s/iteration)
Clear dirty log over 2 iterations took 2.545224215s. (Avg 1.272612107s/iteration)

Patch breakdown:
Patches 1-4 remove the need for a vCPU pointer to make_spte
Patches 5-8 are small refactors in preparation for in-place lpage promotion
Patch 9 implements in-place large page promotion when disabling dirty logging (sketched below)
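
For a rough idea of the patch 9 approach, the promotion path looks
something like the sketch below. This is an illustrative outline only,
not the patch as posted: try_promote_lpage() and make_huge_spte() are
hypothetical names, and the signatures are assumptions modeled on the
5.17-era TDP MMU code.

	/*
	 * Hypothetical sketch, not the actual patch: when dirty logging
	 * is disabled, replace a non-leaf SPTE with a huge SPTE built
	 * in place, rather than zapping the subtree and rebuilding it
	 * through vCPU faults.
	 */
	static void try_promote_lpage(struct kvm *kvm,
				      const struct kvm_memory_slot *slot,
				      struct tdp_iter *iter)
	{
		u64 huge_spte;

		/*
		 * Build the huge SPTE without a vCPU on hand; removing
		 * make_spte()'s dependency on a vCPU pointer is exactly
		 * what patches 1-4 are for.
		 */
		if (!make_huge_spte(kvm, slot, iter->gfn, iter->level,
				    &huge_spte))
			return;

		/*
		 * Install atomically; if a vCPU fault races and installs
		 * its own mapping first, losing the race is harmless.
		 */
		tdp_mmu_set_spte_atomic(kvm, iter, huge_spte);
	}

The win over zap-and-refault is that the range stays mapped, now huge,
so vCPUs take no faults to rebuild it after dirty logging is disabled.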

Changelog:
RFC -> v1:
	Dropped the first 4 patches from the series. Patch 1 was sent
	separately, patches 2-4 will be taken over by Sean Christopherson.
	Incorporated David Matlack's Reviewed-by.
v1 -> v2:
	Several patches were queued and dropped from this revision.
	Incorporated feedback from Peter Xu on the last patch in the series.
	Refreshed performance data
		Between versions 1 and 2 of this series, the disable time
		with the TDP MMU (without this series applied) went from
		45s to 256s, a major regression. I was testing on a
		Skylake before and a Haswell this time, but that does not
		explain the huge performance loss.

Ben Gardon (9):
  KVM: x86/mmu: Move implementation of make_spte to a helper
  KVM: x86/mmu: Factor mt_mask out of __make_spte
  KVM: x86/mmu: Factor shadow_zero_check out of __make_spte
  KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte
  KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
  KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend
    on vcpu
  KVM: x86/mmu: Add try_get_mt_mask to x86_ops
  KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c
  KVM: x86/mmu: Promote pages in-place when disabling dirty logging

 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  2 +
 arch/x86/kvm/mmu/mmu.c             | 21 +++++----
 arch/x86/kvm/mmu/mmu_internal.h    |  6 +++
 arch/x86/kvm/mmu/spte.c            | 39 +++++++++++-----
 arch/x86/kvm/mmu/spte.h            |  6 +++
 arch/x86/kvm/mmu/tdp_mmu.c         | 73 +++++++++++++++++++++++++++++-
 arch/x86/kvm/svm/svm.c             |  9 ++++
 arch/x86/kvm/vmx/vmx.c             | 25 ++++++++--
 9 files changed, 155 insertions(+), 27 deletions(-)

Comments

Paolo Bonzini March 25, 2022, noon UTC | #1
On 3/21/22 23:43, Ben Gardon wrote:
> Currently disabling dirty logging with the TDP MMU is extremely slow.
> On a 96 vCPU / 96G VM it takes ~256 seconds to disable dirty logging
> with the TDP MMU, as opposed to ~4 seconds with the legacy MMU. This
> series optimizes TLB flushes and introduces in-place large page
> promotion, to bring the disable dirty log time down to ~3 seconds.
> 
> Testing:
> Ran KVM selftests and kvm-unit-tests on an Intel Haswell. This
> series introduced no new failures.

Thanks, looks good.  The one change I'd make is to just place the 
outcome of build_tdp_shadow_zero_bits_mask() in a global (say 
tdp_shadow_zero_check) at kvm_configure_mmu() time.  The 
tdp_max_root_level works as a conservative choice for the second 
argument of build_tdp_shadow_zero_bits_mask().
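
Concretely, something along these lines (a sketch only; the
rsvd_bits_validate type and the helper's signature are assumed from
patch 5, not tested):

	/* Sketch: compute the TDP reserved-bits check once, globally. */
	static struct rsvd_bits_validate tdp_shadow_zero_check;

	void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
			       int tdp_max_root_level, int tdp_huge_page_level)
	{
		/* ... existing setup unchanged ... */

		/*
		 * tdp_max_root_level is an upper bound on any root level
		 * KVM will use, so the resulting check is conservative
		 * and valid for every VM.
		 */
		build_tdp_shadow_zero_bits_mask(&tdp_shadow_zero_check,
						tdp_max_root_level);
	}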

No need to do anything though; I'll handle this later in 5.19 time (and
first merge my changes that factor out the constant part of
vcpu->arch.root_mmu initialization, since it's part of the same set of ideas).

Paolo
David Matlack March 28, 2022, 5:49 p.m. UTC | #2
On Mon, Mar 21, 2022 at 03:43:49PM -0700, Ben Gardon wrote:
> [...]
> Ben Gardon (9):
>   KVM: x86/mmu: Move implementation of make_spte to a helper
>   KVM: x86/mmu: Factor mt_mask out of __make_spte
>   KVM: x86/mmu: Factor shadow_zero_check out of __make_spte
>   KVM: x86/mmu: Replace vcpu argument with kvm pointer in make_spte
>   KVM: x86/mmu: Factor out the meat of reset_tdp_shadow_zero_bits_mask
>   KVM: x86/mmu: Factor out part of vmx_get_mt_mask which does not depend
>     on vcpu
>   KVM: x86/mmu: Add try_get_mt_mask to x86_ops
>   KVM: x86/mmu: Make kvm_is_mmio_pfn usable outside of spte.c
>   KVM: x86/mmu: Promote pages in-place when disabling dirty logging

Use () after function names to make it clear you are referring to a
function and not something else. e.g.

  KVM: x86/mmu: Move implementation of make_spte to a helper

becomes

  KVM: x86/mmu: Move implementation of make_spte() to a helper

This applies throughout the series, in commit messages and comments.

Sean Christopherson July 12, 2022, 1:37 a.m. UTC | #3
On Fri, Mar 25, 2022, Paolo Bonzini wrote:
> On 3/21/22 23:43, Ben Gardon wrote:
> > [...]
> 
> Thanks, looks good.  The one change I'd make is to just place the outcome of
> build_tdp_shadow_zero_bits_mask() in a global (say tdp_shadow_zero_check) at
> kvm_configure_mmu() time.  The tdp_max_root_level works as a conservative
> choice for the second argument of build_tdp_shadow_zero_bits_mask().
> 
> No need to do anything though, I'll handle this later in 5.19 time (and
> first merge my changes that factor out the constant part of
> vcpu->arch.root_mmu initialization, since this is part of the same ideas).

This fell through the cracks.  Ben is on a long vacation, I'll find my copy of
the Necronomicon and do a bit of resurrection, and address the feedback from v2
along the way.
Paolo Bonzini July 14, 2022, 7:55 a.m. UTC | #4
On 7/12/22 03:37, Sean Christopherson wrote:
> This fell through the cracks.  Ben is on a long vacation, I'll find my copy of
> the Necronomicon and do a bit of resurrection, and address the feedback from v2
> along the way.

This was superseded by the simple patch to zap only the leaves, I think?

Paolo
Sean Christopherson July 14, 2022, 3:27 p.m. UTC | #5
On Thu, Jul 14, 2022, Paolo Bonzini wrote:
> On 7/12/22 03:37, Sean Christopherson wrote:
> > This fell through the cracks.  Ben is on a long vacation, I'll find my copy of
> > the Necronomicon and do a bit of resurrection, and address the feedback from v2
> > along the way.
> 
> This was superseded by the simple patch to zap only the leaves, I think?

Ah, right you are, commit 5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when
disabling dirty logging").  I got somewhat confused because there's a stale comment
above the inner helper:

	/*
	 * Clear leaf entries which could be replaced by large mappings, for
	 * GFNs within the slot.
	 */

If we drop the "only refcounted struct pages can be huge" requirement, then the
flow becomes much simpler as there's no need to recurse down to the leaves only to
step back up:

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		if (!is_shadow_present_pte(iter.old_spte))
			continue;

		/*
		 * Don't zap leaf SPTEs, if a leaf SPTE could be replaced with
		 * a large page size, then its parent would have been zapped
		 * instead of stepping down.
		 */
		if (is_last_spte(iter.old_spte, iter.level))
			continue;

		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
							      iter.gfn, PG_LEVEL_NUM);
		if (max_mapping_level <= iter.level)
			continue;

		/* Note, a successful atomic zap also does a remote TLB flush. */
		if (tdp_mmu_zap_spte_atomic(kvm, &iter))
			goto retry;
	}