
kvm: x86: Improve virtual machine startup performance

Message ID 20220301063756.16817-1-flyingpeng@tencent.com (mailing list archive)
State New, archived
Series kvm: x86: Improve virtual machine startup performance

Commit Message

Hao Peng March 1, 2022, 6:37 a.m. UTC
From: Peng Hao <flyingpeng@tencent.com>

vcpu 0 repeatedly enters and exits SMM during the startup phase, and
kvm_init_mmu is called on each of these transitions. Parts of the mmu
initialization code do not need to be redone after the first
initialization.

Measured on my server, vcpu0 calls kvm_init_mmu more than 600 times
while the virtual machine boots (due to the SMM state switching). This
patch saves about 36 microseconds in total.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
---
 arch/x86/kvm/mmu.h        |  2 +-
 arch/x86/kvm/mmu/mmu.c    | 39 ++++++++++++++++++++++-----------------
 arch/x86/kvm/svm/nested.c |  2 +-
 arch/x86/kvm/vmx/nested.c |  2 +-
 arch/x86/kvm/x86.c        |  2 +-
 5 files changed, 26 insertions(+), 21 deletions(-)

Comments

Sean Christopherson March 1, 2022, 5:54 p.m. UTC | #1
On Tue, Mar 01, 2022, Peng Hao wrote:
>  From: Peng Hao <flyingpeng@tencent.com>
> 
> vcpu 0 repeatedly enters and exits SMM during the startup phase, and
> kvm_init_mmu is called on each of these transitions. Parts of the mmu
> initialization code do not need to be redone after the first
> initialization.
>
> Measured on my server, vcpu0 calls kvm_init_mmu more than 600 times
> while the virtual machine boots (due to the SMM state switching). This
> patch saves about 36 microseconds in total.
> 
> Signed-off-by: Peng Hao <flyingpeng@tencent.com>
> ---
> @@ -5054,7 +5059,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
>  {
>  	kvm_mmu_unload(vcpu);
> -	kvm_init_mmu(vcpu);
> +	kvm_init_mmu(vcpu, false);

This is wrong, kvm_mmu_reset_context() is the "big hammer" and is expected to
unconditionally get the MMU to a known good state.  E.g. failure to initialize
means this code:

	context->shadow_root_level = kvm_mmu_get_tdp_level(vcpu);

will not update the shadow_root_level as expected in response to userspace changing
guest.MAXPHYADDR in such a way that KVM enables/disables 5-level paging.
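
For reference, kvm_mmu_get_tdp_level() derives that level from the
guest's CPUID-reported MAXPHYADDR, roughly as in the sketch below
(paraphrased; the exact code differs by kernel version):

	static int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
	{
		/* Honor an architecture-forced root level, if set. */
		if (tdp_root_level)
			return tdp_root_level;

		/* Use 5-level TDP only when guest.MAXPHYADDR needs it. */
		if (max_tdp_level == 5 && cpuid_maxphyaddr(vcpu) <= 48)
			return 4;

		return max_tdp_level;
	}

i.e. the 4-level vs. 5-level decision tracks whether guest.MAXPHYADDR
exceeds 48 bits, which userspace can change via KVM_SET_CPUID2.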

The SMM transitions definitely need to be fixed, and we're slowly getting there,
but sadly there's no quick fix.
Hao Peng March 2, 2022, 1:30 a.m. UTC | #2
On Wed, Mar 2, 2022 at 1:54 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Mar 01, 2022, Peng Hao wrote:
> >  From: Peng Hao <flyingpeng@tencent.com>
> >
> > vcpu 0 repeatedly enters and exits SMM during the startup phase, and
> > kvm_init_mmu is called on each of these transitions. Parts of the mmu
> > initialization code do not need to be redone after the first
> > initialization.
> >
> > Measured on my server, vcpu0 calls kvm_init_mmu more than 600 times
> > while the virtual machine boots (due to the SMM state switching). This
> > patch saves about 36 microseconds in total.
> >
> > Signed-off-by: Peng Hao <flyingpeng@tencent.com>
> > ---
> > @@ -5054,7 +5059,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
> >  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
> >  {
> >       kvm_mmu_unload(vcpu);
> > -     kvm_init_mmu(vcpu);
> > +     kvm_init_mmu(vcpu, false);
>
> This is wrong, kvm_mmu_reset_context() is the "big hammer" and is expected to
> unconditionally get the MMU to a known good state.  E.g. failure to initialize
> means this code:
>
>         context->shadow_root_level = kvm_mmu_get_tdp_level(vcpu);
>
> will not update the shadow_root_level as expected in response to userspace changing
> guest.MAXPHYADDR in such a way that KVM enables/disables 5-level paging.
>
Thanks for pointing this out. However, apart from shadow_root_level, the
other fields of the context, such as page_fault/sync_page/direct_map, do
not change during the vCPU's lifetime when TDP is enabled.
Is the patch still viable after carefully confirming which fields are
never modified?
Thanks.
> The SMM transitions definitely need to be fixed, and we're slowly getting there,
> but sadly there's no quick fix.
Sean Christopherson March 3, 2022, 1:29 a.m. UTC | #3
On Wed, Mar 02, 2022, Hao Peng wrote:
> On Wed, Mar 2, 2022 at 1:54 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Mar 01, 2022, Peng Hao wrote:
> > >  From: Peng Hao <flyingpeng@tencent.com>
> > >
> > > vcpu 0 repeatedly enters and exits SMM during the startup phase, and
> > > kvm_init_mmu is called on each of these transitions. Parts of the mmu
> > > initialization code do not need to be redone after the first
> > > initialization.
> > >
> > > Measured on my server, vcpu0 calls kvm_init_mmu more than 600 times
> > > while the virtual machine boots (due to the SMM state switching). This
> > > patch saves about 36 microseconds in total.
> > >
> > > Signed-off-by: Peng Hao <flyingpeng@tencent.com>
> > > ---
> > > @@ -5054,7 +5059,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > >  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
> > >  {
> > >       kvm_mmu_unload(vcpu);
> > > -     kvm_init_mmu(vcpu);
> > > +     kvm_init_mmu(vcpu, false);
> >
> > This is wrong, kvm_mmu_reset_context() is the "big hammer" and is expected to
> > unconditionally get the MMU to a known good state.  E.g. failure to initialize
> > means this code:
> >
> >         context->shadow_root_level = kvm_mmu_get_tdp_level(vcpu);
> >
> > will not update the shadow_root_level as expected in response to userspace changing
> > guest.MAXPHYADDR in such a way that KVM enables/disables 5-level paging.
> >
> Thanks for pointing this out. However, apart from shadow_root_level, the
> other fields of the context, such as page_fault/sync_page/direct_map, do
> not change during the vCPU's lifetime when TDP is enabled.
> Is the patch still viable after carefully confirming which fields are
> never modified?

No, passing around the "init" flag is a hack.

But, we can achieve what you want simply by initializing the constant data once
per vCPU.  There's a _lot_ of state that is constant for a given MMU now that KVM
uses separate MMUs for L1 vs. L2 when TDP is enabled.  I should get patches posted
tomorrow, just need to test (famous last words).
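
The shape of that change, i.e. hoisting the constant callbacks out of
kvm_init_mmu() so they are written once when the vCPU's MMUs are created,
might look roughly like the sketch below (illustrative only; the helper
name is hypothetical and this is not the series that ended up being
posted):

	/* Hypothetical one-time setup, e.g. called from kvm_mmu_create(). */
	static void kvm_mmu_init_constant_state(struct kvm_mmu *context)
	{
		context->get_guest_pgd     = get_cr3;
		context->get_pdptr         = kvm_pdptr_read;
		context->inject_page_fault = kvm_inject_page_fault;
	}

kvm_init_mmu() and the role calculators would then only recompute the
role-dependent state (mmu_role, root_level, shadow_root_level, ...) on
each call.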

Also, based on the number of SMM transitions, I'm guessing you're using SeaBIOS.
Have you tried disabling CONFIG_CALL32_SMM, or CONFIG_USE_SMM altogether?  That
might be an even better way to improve performance in your environment.
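
Both of those are SeaBIOS Kconfig options; a firmware built with a
.config fragment along the lines below should avoid most of the SMM
transitions during boot (illustrative only; check the SeaBIOS tree for
the exact option dependencies):

	# SeaBIOS .config fragment (illustrative)
	# CONFIG_USE_SMM is not set
	# CONFIG_CALL32_SMM is not set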

Last question, do you happen to know why eliminating this code shaves 36us?  The
raw writes don't seem like they'd take that long.  Maybe the writes to function
pointers trigger stalls or mispredicts or something?  If you don't have an easy
answer, don't bother investigating, I'm just curious.
Hao Peng March 3, 2022, 2:56 a.m. UTC | #4
On Thu, Mar 3, 2022 at 9:29 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Mar 02, 2022, Hao Peng wrote:
> > On Wed, Mar 2, 2022 at 1:54 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Tue, Mar 01, 2022, Peng Hao wrote:
> > > >  From: Peng Hao <flyingpeng@tencent.com>
> > > >
> > > > vcpu 0 repeatedly enters and exits SMM during the startup phase, and
> > > > kvm_init_mmu is called on each of these transitions. Parts of the mmu
> > > > initialization code do not need to be redone after the first
> > > > initialization.
> > > >
> > > > Measured on my server, vcpu0 calls kvm_init_mmu more than 600 times
> > > > while the virtual machine boots (due to the SMM state switching). This
> > > > patch saves about 36 microseconds in total.
> > > >
> > > > Signed-off-by: Peng Hao <flyingpeng@tencent.com>
> > > > ---
> > > > @@ -5054,7 +5059,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > > >  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
> > > >  {
> > > >       kvm_mmu_unload(vcpu);
> > > > -     kvm_init_mmu(vcpu);
> > > > +     kvm_init_mmu(vcpu, false);
> > >
> > > This is wrong, kvm_mmu_reset_context() is the "big hammer" and is expected to
> > > unconditionally get the MMU to a known good state.  E.g. failure to initialize
> > > means this code:
> > >
> > >         context->shadow_root_level = kvm_mmu_get_tdp_level(vcpu);
> > >
> > > will not update the shadow_root_level as expected in response to userspace changing
> > > guest.MAXPHYADDR in such a way that KVM enables/disables 5-level paging.
> > >
> > Thanks for pointing this out. However, apart from shadow_root_level, the
> > other fields of the context, such as page_fault/sync_page/direct_map, do
> > not change during the vCPU's lifetime when TDP is enabled.
> > Is the patch still viable after carefully confirming which fields are
> > never modified?
>
> No, passing around the "init" flag is a hack.
>
> But, we can achieve what you want simply by initializing the constant data once
> per vCPU.  There's a _lot_ of state that is constant for a given MMU now that KVM
> uses separate MMUs for L1 vs. L2 when TDP is enabled.  I should get patches posted
> tomorrow, just need to test (famous last words).
>
> Also, based on the number of SMM transitions, I'm guessing you're using SeaBIOS.
> Have you tried disabling CONFIG_CALL32_SMM, or CONFIG_USE_SMM altogether?  That
> might be an even better way to improve performance in your environment.
>

Both options are disabled in the guest.
> Last question, do you happen to know why eliminating this code shaves 36us?  The
> raw writes don't seem like they'd take that long.  Maybe the writes to function
> pointers trigger stalls or mispredicts or something?  If you don't have an easy
> answer, don't bother investigating, I'm just curious.

I'm guessing it's because of the cache. Initially I wanted to replace the
individual field assignments with a memcpy; if the modified fields were
contiguous enough, the ERMS/FSRM fast string copies could be used.
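
A minimal sketch of that memcpy idea, assuming (hypothetically) that the
constant fields were grouped into one contiguous sub-structure, which
struct kvm_mmu does not have today, so both names below are made up:

	/* Hypothetical: constant TDP MMU state grouped into one template. */
	static const struct kvm_mmu_const_state tdp_const_state = {
		.page_fault        = kvm_tdp_page_fault,
		.sync_page         = nonpaging_sync_page,
		.get_guest_pgd     = get_cr3,
		.get_pdptr         = kvm_pdptr_read,
		.inject_page_fault = kvm_inject_page_fault,
		.direct_map        = true,
	};

	/* One bulk copy instead of individual stores; with ERMS/FSRM the
	 * kernel's memcpy() can be patched to use a fast "rep movsb".
	 */
	memcpy(&context->const_state, &tdp_const_state, sizeof(tdp_const_state));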
Sean Christopherson March 8, 2022, 2:03 a.m. UTC | #5
On Thu, Mar 03, 2022, Hao Peng wrote:
> On Thu, Mar 3, 2022 at 9:29 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Mar 02, 2022, Hao Peng wrote:
> > > Thanks for pointing this out. However, apart from shadow_root_level, the
> > > other fields of the context, such as page_fault/sync_page/direct_map, do
> > > not change during the vCPU's lifetime when TDP is enabled.
> > > Is the patch still viable after carefully confirming which fields are
> > > never modified?
> >
> > No, passing around the "init" flag is a hack.
> >
> > But, we can achieve what you want simply by initializing the constant data once
> > per vCPU.  There's a _lot_ of state that is constant for a given MMU now that KVM
> > uses separate MMUs for L1 vs. L2 when TDP is enabled.  I should get patches posted
> > tomorrow, just need to test (famous last words).

Famous last words indeed.  Long story short, the patches were mostly easy, but I
wandered deep into a rabbit hole when trying to make ->inject_page_fault() constant
per MMU.  I'll get something posted this week, though exactly what that something is
remains to be seen :-)
Paolo Bonzini March 8, 2022, 8:13 a.m. UTC | #6
On 3/8/22 03:03, Sean Christopherson wrote:
> On Thu, Mar 03, 2022, Hao Peng wrote:
>> On Thu, Mar 3, 2022 at 9:29 AM Sean Christopherson <seanjc@google.com> wrote:
>>>
>>> On Wed, Mar 02, 2022, Hao Peng wrote:
>>>> Thanks for pointing this out. However, apart from shadow_root_level, the
>>>> other fields of the context, such as page_fault/sync_page/direct_map, do
>>>> not change during the vCPU's lifetime when TDP is enabled.
>>>> Is the patch still viable after carefully confirming which fields are
>>>> never modified?
>>>
>>> No, passing around the "init" flag is a hack.
>>>
>>> But, we can achieve what you want simply by initializing the constant data once
>>> per vCPU.  There's a _lot_ of state that is constant for a given MMU now that KVM
>>> uses separate MMUs for L1 vs. L2 when TDP is enabled.  I should get patches posted
>>> tomorrow, just need to test (famous last words).
> 
> Famous last words indeed.  Long story short, the patches were mostly easy, but I
> wandered deep into a rabbit hole when trying to make ->inject_page_fault() constant
> per MMU.  I'll get something posted this week, though exactly what that something is
> remains to be seen :-)

This is exactly what I have posted a few weeks ago:

https://patchew.org/linux/20220221162243.683208-1-pbonzini@redhat.com/

See in particular

   KVM: nVMX/nSVM: do not monkey-patch inject_page_fault callback
   KVM: x86/mmu: initialize constant-value fields just once

Paolo
Sean Christopherson March 8, 2022, 4:04 p.m. UTC | #7
On Tue, Mar 08, 2022, Paolo Bonzini wrote:
> On 3/8/22 03:03, Sean Christopherson wrote:
> > On Thu, Mar 03, 2022, Hao Peng wrote:
> > > On Thu, Mar 3, 2022 at 9:29 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > 
> > > > On Wed, Mar 02, 2022, Hao Peng wrote:
> > > > > Thanks for pointing this out. However, apart from shadow_root_level, the
> > > > > other fields of the context, such as page_fault/sync_page/direct_map, do
> > > > > not change during the vCPU's lifetime when TDP is enabled.
> > > > > Is the patch still viable after carefully confirming which fields are
> > > > > never modified?
> > > > 
> > > > No, passing around the "init" flag is a hack.
> > > > 
> > > > But, we can achieve what you want simply by initializing the constant data once
> > > > per vCPU.  There's a _lot_ of state that is constant for a given MMU now that KVM
> > > > uses separate MMUs for L1 vs. L2 when TDP is enabled.  I should get patches posted
> > > > tomorrow, just need to test (famous last words).
> > 
> > Famous last words indeed.  Long story short, the patches were mostly easy, but I
> > wandered deep into a rabbit hole when trying to make ->inject_page_fault() constant
> > per MMU.  I'll get something posted this week, though exactly what that something is
> > remains to be seen :-)
> 
> This is exactly what I have posted a few weeks ago:
> 
> https://patchew.org/linux/20220221162243.683208-1-pbonzini@redhat.com/

Heh, guess who's woefully behind on reviews...  I'll respond to those threads.
Thanks for the heads up!

> See in particular
> 
>   KVM: nVMX/nSVM: do not monkey-patch inject_page_fault callback
>   KVM: x86/mmu: initialize constant-value fields just once

Patch

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 9ae6168d381e..d263a8ca6d5e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -67,7 +67,7 @@  static __always_inline u64 rsvd_bits(int s, int e)
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
 
-void kvm_init_mmu(struct kvm_vcpu *vcpu);
+void kvm_init_mmu(struct kvm_vcpu *vcpu, bool init);
 void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
 			     unsigned long cr4, u64 efer, gpa_t nested_cr3);
 void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 33794379949e..fedc71d9bee2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4738,7 +4738,7 @@  kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
 	return role;
 }
 
-static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
+static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu, bool init)
 {
 	struct kvm_mmu *context = &vcpu->arch.root_mmu;
 	struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
@@ -4749,14 +4749,17 @@  static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
 		return;
 
 	context->mmu_role.as_u64 = new_role.as_u64;
-	context->page_fault = kvm_tdp_page_fault;
-	context->sync_page = nonpaging_sync_page;
-	context->invlpg = NULL;
-	context->shadow_root_level = kvm_mmu_get_tdp_level(vcpu);
-	context->direct_map = true;
-	context->get_guest_pgd = get_cr3;
-	context->get_pdptr = kvm_pdptr_read;
-	context->inject_page_fault = kvm_inject_page_fault;
+
+	if (init) {
+		context->page_fault = kvm_tdp_page_fault;
+		context->sync_page = nonpaging_sync_page;
+		context->invlpg = NULL;
+		context->shadow_root_level = kvm_mmu_get_tdp_level(vcpu);
+		context->direct_map = true;
+		context->get_guest_pgd = get_cr3;
+		context->get_pdptr = kvm_pdptr_read;
+		context->inject_page_fault = kvm_inject_page_fault;
+	}
 	context->root_level = role_regs_to_root_level(&regs);
 
 	if (!is_cr0_pg(context))
@@ -4924,16 +4927,18 @@  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
 
-static void init_kvm_softmmu(struct kvm_vcpu *vcpu)
+static void init_kvm_softmmu(struct kvm_vcpu *vcpu, bool init)
 {
 	struct kvm_mmu *context = &vcpu->arch.root_mmu;
 	struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
 
 	kvm_init_shadow_mmu(vcpu, &regs);
 
-	context->get_guest_pgd     = get_cr3;
-	context->get_pdptr         = kvm_pdptr_read;
-	context->inject_page_fault = kvm_inject_page_fault;
+	if (init) {
+		context->get_guest_pgd     = get_cr3;
+		context->get_pdptr         = kvm_pdptr_read;
+		context->inject_page_fault = kvm_inject_page_fault;
+	}
 }
 
 static union kvm_mmu_role
@@ -4994,14 +4999,14 @@  static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 	reset_guest_paging_metadata(vcpu, g_context);
 }
 
-void kvm_init_mmu(struct kvm_vcpu *vcpu)
+void kvm_init_mmu(struct kvm_vcpu *vcpu, bool init)
 {
 	if (mmu_is_nested(vcpu))
 		init_kvm_nested_mmu(vcpu);
 	else if (tdp_enabled)
-		init_kvm_tdp_mmu(vcpu);
+		init_kvm_tdp_mmu(vcpu, init);
 	else
-		init_kvm_softmmu(vcpu);
+		init_kvm_softmmu(vcpu, init);
 }
 EXPORT_SYMBOL_GPL(kvm_init_mmu);
 
@@ -5054,7 +5059,7 @@  void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_unload(vcpu);
-	kvm_init_mmu(vcpu);
+	kvm_init_mmu(vcpu, false);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
 
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index f8b7bc04b3e7..66d70a48e35e 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -447,7 +447,7 @@  static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
 	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
 
 	/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
-	kvm_init_mmu(vcpu);
+	kvm_init_mmu(vcpu, true);
 
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b213ca966d41..28ce73da9150 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1101,7 +1101,7 @@  static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
 	kvm_register_mark_available(vcpu, VCPU_EXREG_CR3);
 
 	/* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
-	kvm_init_mmu(vcpu);
+	kvm_init_mmu(vcpu, true);
 
 	return 0;
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dc7eb5fddfd3..fb1e3e945b72 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10895,7 +10895,7 @@  int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu_load(vcpu);
 	kvm_set_tsc_khz(vcpu, max_tsc_khz);
 	kvm_vcpu_reset(vcpu, false);
-	kvm_init_mmu(vcpu);
+	kvm_init_mmu(vcpu, true);
 	vcpu_put(vcpu);
 	return 0;