
[RFC,v4,5/7] KVM: x86: add vCPU scoped toggling for disabled exits

Message ID 20220622004924.155191-6-kechenl@nvidia.com (mailing list archive)
State New, archived
Series KVM: x86: add per-vCPU exits disable capability

Commit Message

Kechen Lu June 22, 2022, 12:49 a.m. UTC
Introduce a vCPU-scoped ioctl for the KVM_CAP_X86_DISABLE_EXITS cap so
that exits can be disabled at per-vCPU granularity instead of only for
the whole guest. This patch enables toggling of the vCPU-scoped exit
controls and aligns the VM-scoped exit control behavior with it. Add a
new KVM request, KVM_REQ_DISABLE_EXITS, to guarantee that the VMCS/VMCB
is updated before vCPU entry, especially when the VM-scoped exits are
toggled.
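
As a rough illustration, userspace could exercise the vCPU-scoped
capability along the following lines. This is a minimal sketch, not
part of the patch itself: it assumes a vCPU fd obtained from
KVM_CREATE_VCPU and uses KVM_X86_DISABLE_EXITS_OVERRIDE as introduced
earlier in this series.

  /* Sketch: disable HLT exits for one vCPU only. */
  #include <linux/kvm.h>
  #include <sys/ioctl.h>
  #include <stdio.h>

  static int disable_hlt_exits(int vcpu_fd)
  {
          struct kvm_enable_cap cap = {
                  .cap = KVM_CAP_X86_DISABLE_EXITS,
                  /* OVERRIDE (added earlier in this series) replaces the
                   * VM-scoped setting for this vCPU. */
                  .args[0] = KVM_X86_DISABLE_EXITS_HLT |
                             KVM_X86_DISABLE_EXITS_OVERRIDE,
          };

          if (ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap) < 0) {
                  perror("KVM_ENABLE_CAP(KVM_CAP_X86_DISABLE_EXITS)");
                  return -1;
          }
          return 0;
  }

Calling this on only a subset of a guest's vCPU fds leaves HLT exits
enabled on the remaining vCPUs, giving the partial-disable configuration
described below.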

In use cases such as a Windows guest running heavy CPU-bound workloads,
disabling HLT VM-exits can mitigate host scheduler context-switch
overhead. Simply disabling HLT exits on all vCPUs brings performance
benefits, but if no pCPUs are reserved for host threads, it can lead to
forced preemption, since the host no longer knows when to schedule other
host threads that want to run. With this patch, HLT exits can be
disabled on only a subset of a guest's vCPUs, which preserves the
performance benefit while remaining resilient to host stressing
workloads running at the same time.

In an experiment running a host stressing workload alongside a Windows
guest with heavy CPU-bound workloads, this approach shows good
resiliency and roughly a 3% performance improvement. E.g. Passmark
running in a Windows guest, with this patch disabling HLT exits on only
half of the vCPUs, still shows a 2.4% higher main score vs. baseline.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Kechen Lu <kechenl@nvidia.com>
---
 Documentation/virt/kvm/api.rst     |  2 +-
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  3 +++
 arch/x86/kvm/svm/svm.c             | 30 ++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c             | 37 ++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c                 | 23 +++++++++++++++----
 6 files changed, 91 insertions(+), 5 deletions(-)

Comments

Sean Christopherson July 20, 2022, 6:41 p.m. UTC | #1
On Tue, Jun 21, 2022, Kechen Lu wrote:
> @@ -5980,6 +5987,8 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
>  int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  			    struct kvm_enable_cap *cap)
>  {
> +	struct kvm_vcpu *vcpu;
> +	unsigned long i;
>  	int r;
>  
>  	if (cap->flags)
> @@ -6036,14 +6045,17 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  			break;
>  
>  		mutex_lock(&kvm->lock);
> -		if (kvm->created_vcpus)
> -			goto disable_exits_unlock;
> +		if (kvm->created_vcpus) {

I retract my comment about using a request, I got ahead of myself.

Don't update vCPUs, the whole point of adding the !kvm->created_vcpus check was
to avoid having to update vCPUs when the per-VM behavior changed.

In other words, keep the restriction and drop the request.

> +			kvm_for_each_vcpu(i, vcpu, kvm) {
> +				kvm_ioctl_disable_exits(vcpu->arch, cap->args[0]);
> +				kvm_make_request(KVM_REQ_DISABLE_EXITS, vcpu);
> +			}
> +		}
> +		mutex_unlock(&kvm->lock);
>  
>  		kvm_ioctl_disable_exits(kvm->arch, cap->args[0]);
>  
>  		r = 0;
> -disable_exits_unlock:
> -		mutex_unlock(&kvm->lock);
>  		break;
>  	case KVM_CAP_MSR_PLATFORM_INFO:
>  		kvm->arch.guest_can_read_msr_platform_info = cap->args[0];
> @@ -10175,6 +10187,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  
>  		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
>  			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> +
> +		if (kvm_check_request(KVM_REQ_DISABLE_EXITS, vcpu))
> +			static_call(kvm_x86_update_disabled_exits)(vcpu);
>  	}
>  
>  	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
> -- 
> 2.32.0
>
Kechen Lu July 20, 2022, 7:04 p.m. UTC | #2
> -----Original Message-----
> From: Sean Christopherson <seanjc@google.com>
> Sent: Wednesday, July 20, 2022 11:42 AM
> To: Kechen Lu <kechenl@nvidia.com>
> Cc: kvm@vger.kernel.org; pbonzini@redhat.com; chao.gao@intel.com;
> vkuznets@redhat.com; Somdutta Roy <somduttar@nvidia.com>; linux-
> kernel@vger.kernel.org
> Subject: Re: [RFC PATCH v4 5/7] KVM: x86: add vCPU scoped toggling for
> disabled exits
> 
> On Tue, Jun 21, 2022, Kechen Lu wrote:
> > @@ -5980,6 +5987,8 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm,
> > struct kvm_irq_level *irq_event,  int kvm_vm_ioctl_enable_cap(struct kvm
> *kvm,
> >                           struct kvm_enable_cap *cap)  {
> > +     struct kvm_vcpu *vcpu;
> > +     unsigned long i;
> >       int r;
> >
> >       if (cap->flags)
> > @@ -6036,14 +6045,17 @@ int kvm_vm_ioctl_enable_cap(struct kvm
> *kvm,
> >                       break;
> >
> >               mutex_lock(&kvm->lock);
> > -             if (kvm->created_vcpus)
> > -                     goto disable_exits_unlock;
> > +             if (kvm->created_vcpus) {
> 
> I retract my comment about using a request, I got ahead of myself.
> 
> Don't update vCPUs, the whole point of adding the !kvm->created_vcpus
> check was to avoid having to update vCPUs when the per-VM behavior
> changed.
> 
> In other words, keep the restriction and drop the request.
> 

I see. If we keep the restriction here and don't update vCPUs when kvm->created_vcpus is true, the per-VM and per-vCPU assumptions would be different? Not sure if I understand right:
For per-VM, we assume the cap can only be enabled before vCPU creation. For per-vCPU, the disabled exits can be toggled at runtime.

If I understand correctly, this also makes sense though.

BR,
Kechen

> > +                     kvm_for_each_vcpu(i, vcpu, kvm) {
> > +                             kvm_ioctl_disable_exits(vcpu->arch, cap->args[0]);
> > +                             kvm_make_request(KVM_REQ_DISABLE_EXITS, vcpu);
> > +                     }
> > +             }
> > +             mutex_unlock(&kvm->lock);
> >
> >               kvm_ioctl_disable_exits(kvm->arch, cap->args[0]);
> >
> >               r = 0;
> > -disable_exits_unlock:
> > -             mutex_unlock(&kvm->lock);
> >               break;
> >       case KVM_CAP_MSR_PLATFORM_INFO:
> >               kvm->arch.guest_can_read_msr_platform_info =
> > cap->args[0]; @@ -10175,6 +10187,9 @@ static int
> > vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >
> >               if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING,
> vcpu))
> >
> > static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> > +
> > +             if (kvm_check_request(KVM_REQ_DISABLE_EXITS, vcpu))
> > +
> > + static_call(kvm_x86_update_disabled_exits)(vcpu);
> >       }
> >
> >       if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
> > --
> > 2.32.0
> >
Sean Christopherson July 20, 2022, 7:30 p.m. UTC | #3
On Wed, Jul 20, 2022, Kechen Lu wrote:
> > > @@ -6036,14 +6045,17 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > >                       break;
> > >
> > >               mutex_lock(&kvm->lock);
> > > -             if (kvm->created_vcpus)
> > > -                     goto disable_exits_unlock;
> > > +             if (kvm->created_vcpus) {
> > 
> > I retract my comment about using a request, I got ahead of myself.
> > 
> > Don't update vCPUs, the whole point of adding the !kvm->created_vcpus
> > check was to avoid having to update vCPUs when the per-VM behavior
> > changed.
> > 
> > In other words, keep the restriction and drop the request.
> > 
> 
> I see. If we keep the restriction here and don't update vCPUs when
> kvm->created_vcpus is true, the per-VM and per-vCPU assumptions would be
> different? Not sure if I understand right:
> For per-VM, we assume the cap can only be enabled before vCPU creation.
> For per-vCPU, the disabled exits can be toggled at runtime.

Yep.  The main reason being that there's no use case for changing per-VM settings
after vCPUs are created.  I.e. we could lift the restriction in the future if a
use case pops up, but until then, keep things simple.

> If I understand correctly, this also makes sense though.

Paging this all back in...

There are two (sane) options for defining KVM's ABI:

  1) KVM combines the per-VM and per-vCPU settings
  2) The per-vCPU settings override the per-VM settings

This series implements (2).

For (1), KVM would need to recheck the per-VM state during the per-vCPU update,
e.g. instead of simply modifying the per-vCPU flags, the vCPU-scoped handler
for KVM_CAP_X86_DISABLE_EXITS would need to merge the incoming settings with the
existing kvm->arch.xxx_in_guest flags.

I like (2) because it's simpler to implement and document (merging state is always
messy) and is more flexible.  E.g. with (1), the only way to have per-vCPU settings
is for userspace to NOT set the per-VM disables and then set disables on a per-vCPU
basis.  Whereas with (2), userspace can set (or not) the per-VM disables and then
override as needed.
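
To make the difference concrete, here is a standalone sketch of the two
options (illustrative only; the flag names and helpers below are
placeholders, not the code in this series):

  /* Option 2 (this series): the per-vCPU setting overrides the per-VM
   * setting.  Option 1 would merge the two instead. */
  #include <stdint.h>
  #include <stdio.h>

  #define DISABLE_EXITS_HLT   (1u << 0)
  #define DISABLE_EXITS_MWAIT (1u << 1)

  static uint32_t resolve_override(uint32_t vm_disabled, uint32_t vcpu_disabled)
  {
          (void)vm_disabled;              /* override ignores the per-VM value */
          return vcpu_disabled;
  }

  static uint32_t resolve_merge(uint32_t vm_disabled, uint32_t vcpu_disabled)
  {
          return vm_disabled | vcpu_disabled;
  }

  int main(void)
  {
          uint32_t vm = DISABLE_EXITS_HLT;        /* per-VM: HLT disabled */
          uint32_t vcpu = DISABLE_EXITS_MWAIT;    /* per-vCPU: only MWAIT */

          /* Override: HLT exits are restored for this vCPU. */
          printf("override: %#x\n", resolve_override(vm, vcpu));
          /* Merge: HLT stays disabled; userspace cannot undo it per-vCPU. */
          printf("merge:    %#x\n", resolve_merge(vm, vcpu));
          return 0;
  }

The merge variant shows why (1) is less flexible: once a per-VM disable
is set, a per-vCPU update can only add disables, never clear them.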
Kechen Lu July 20, 2022, 8:23 p.m. UTC | #4
> -----Original Message-----
> From: Sean Christopherson <seanjc@google.com>
> Sent: Wednesday, July 20, 2022 12:30 PM
> To: Kechen Lu <kechenl@nvidia.com>
> Cc: kvm@vger.kernel.org; pbonzini@redhat.com; chao.gao@intel.com;
> vkuznets@redhat.com; Somdutta Roy <somduttar@nvidia.com>; linux-
> kernel@vger.kernel.org
> Subject: Re: [RFC PATCH v4 5/7] KVM: x86: add vCPU scoped toggling for
> disabled exits
> 
> On Wed, Jul 20, 2022, Kechen Lu wrote:
> > > > @@ -6036,14 +6045,17 @@ int kvm_vm_ioctl_enable_cap(struct kvm
> kvm,
> > > >                       break;
> > > >
> > > >               mutex_lock(&kvm->lock);
> > > > -             if (kvm->created_vcpus)
> > > > -                     goto disable_exits_unlock;
> > > > +             if (kvm->created_vcpus) {
> > >
> > > I retract my comment about using a request, I got ahead of myself.
> > >
> > > Don't update vCPUs, the whole point of adding the
> > > !kvm->created_vcpus check was to avoid having to update vCPUs when
> > > the per-VM behavior changed.
> > >
> > > In other words, keep the restriction and drop the request.
> > >
> >
> > I see. If we keep the restriction here and don't update vCPUs when
> > kvm->created_vcpus is true, the per-VM and per-vCPU assumptions would be
> > different? Not sure if I understand right:
> > For per-VM, we assume the cap can only be enabled before vCPU creation.
> > For per-vCPU, the disabled exits can be toggled at runtime.
> 
> Yep.  The main reason being that there's no use case for changing per-VM
> settings after vCPUs are created.  I.e. we could lift the restriction in the future
> if a use case pops up, but until then, keep things simple.
> 
> > If I understand correctly, this also makes sense though.
> 
> Paging this all back in...
> 
> There are two (sane) options for defining KVM's ABI:
> 
>   1) KVM combines the per-VM and per-vCPU settings
>   2) The per-vCPU settings override the per-VM settings
> 
> This series implements (2).
> 
> For (1), KVM would need to recheck the per-VM state during the per-vCPU
> update, e.g. instead of simply modifying the per-vCPU flags, the vCPU-scoped
> handler for KVM_CAP_X86_DISABLE_EXITS would need to merge the
> incoming settings with the existing kvm->arch.xxx_in_guest flags.
> 
> I like (2) because it's simpler to implement and document (merging state is
> always
> messy) and is more flexible.  E.g. with (1), the only way to have per-vCPU
> settings is for userspace to NOT set the per-VM disables and then set
> disables on a per-vCPU basis.  Whereas with (2), userspace can set (or not)
> the per-VM disables and then override as needed.

Gotcha. Makes sense to me. Thanks for the elaboration!

BR,
Kechen

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 89e13b6783b5..7f614b7d5ad8 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6948,7 +6948,7 @@  longer intercept some instructions for improved latency in some
 workloads, and is suggested when vCPUs are associated to dedicated
 physical CPUs.  More bits can be added in the future; userspace can
 just pass the KVM_CHECK_EXTENSION result to KVM_ENABLE_CAP to disable
-all such vmexits.
+all such vmexits.  The capability is supported at both VM and vCPU scope.
 
 By default, this capability only disables exits.  To re-enable an exit, or to
 override previous settings, userspace can set KVM_X86_DISABLE_EXITS_OVERRIDE,
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index da47f60a4650..c17d417cb3cf 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -128,6 +128,7 @@  KVM_X86_OP(msr_filter_changed)
 KVM_X86_OP(complete_emulated_msr)
 KVM_X86_OP(vcpu_deliver_sipi_vector)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
+KVM_X86_OP(update_disabled_exits)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 573a39bf7a84..86baae62af86 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -105,6 +105,7 @@ 
 	KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_MMU_FREE_OBSOLETE_ROOTS \
 	KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_DISABLE_EXITS		KVM_ARCH_REQ(32)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -1584,6 +1585,8 @@  struct kvm_x86_ops {
 	 * Returns vCPU specific APICv inhibit reasons
 	 */
 	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+	void (*update_disabled_exits)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b32987f54ace..7b3d64b3b901 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4589,6 +4589,33 @@  static void svm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
 	sev_vcpu_deliver_sipi_vector(vcpu, vector);
 }
 
+static void svm_update_disabled_exits(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	struct vmcb_control_area *control = &svm->vmcb->control;
+
+	if (kvm_hlt_in_guest(vcpu))
+		svm_clr_intercept(svm, INTERCEPT_HLT);
+	else
+		svm_set_intercept(svm, INTERCEPT_HLT);
+
+	if (kvm_mwait_in_guest(vcpu)) {
+		svm_clr_intercept(svm, INTERCEPT_MONITOR);
+		svm_clr_intercept(svm, INTERCEPT_MWAIT);
+	} else {
+		svm_set_intercept(svm, INTERCEPT_MONITOR);
+		svm_set_intercept(svm, INTERCEPT_MWAIT);
+	}
+
+	if (kvm_pause_in_guest(vcpu)) {
+		svm_clr_intercept(svm, INTERCEPT_PAUSE);
+	} else {
+		control->pause_filter_count = pause_filter_count;
+		if (pause_filter_thresh)
+			control->pause_filter_thresh = pause_filter_thresh;
+	}
+}
+
 static void svm_vm_destroy(struct kvm *kvm)
 {
 	avic_vm_destroy(kvm);
@@ -4732,7 +4759,10 @@  static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.complete_emulated_msr = svm_complete_emulated_msr,
 
 	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
+
 	.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
+
+	.update_disabled_exits = svm_update_disabled_exits,
 };
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f24c9a357f70..2d000638cc9b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7716,6 +7716,41 @@  static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
 	return supported & BIT(reason);
 }
 
+static void vmx_update_disabled_exits(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (kvm_hlt_in_guest(vcpu))
+		exec_controls_clearbit(vmx, CPU_BASED_HLT_EXITING);
+	else
+		exec_controls_setbit(vmx, CPU_BASED_HLT_EXITING);
+
+	if (kvm_mwait_in_guest(vcpu))
+		exec_controls_clearbit(vmx, CPU_BASED_MWAIT_EXITING |
+			CPU_BASED_MONITOR_EXITING);
+	else
+		exec_controls_setbit(vmx, CPU_BASED_MWAIT_EXITING |
+			CPU_BASED_MONITOR_EXITING);
+
+	if (!kvm_pause_in_guest(vcpu)) {
+		vmcs_write32(PLE_GAP, ple_gap);
+		vmx->ple_window = ple_window;
+		vmx->ple_window_dirty = true;
+	}
+
+	if (kvm_cstate_in_guest(vcpu)) {
+		vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C1_RES, MSR_TYPE_R);
+		vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C3_RESIDENCY, MSR_TYPE_R);
+		vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R);
+		vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R);
+	} else {
+		vmx_enable_intercept_for_msr(vcpu, MSR_CORE_C1_RES, MSR_TYPE_R);
+		vmx_enable_intercept_for_msr(vcpu, MSR_CORE_C3_RESIDENCY, MSR_TYPE_R);
+		vmx_enable_intercept_for_msr(vcpu, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R);
+		vmx_enable_intercept_for_msr(vcpu, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R);
+	}
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.name = "kvm_intel",
 
@@ -7849,6 +7884,8 @@  static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.complete_emulated_msr = kvm_complete_insn_gp,
 
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+	.update_disabled_exits = vmx_update_disabled_exits,
 };
 
 static unsigned int vmx_handle_intel_pt_intr(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fe114e319a89..6165f0b046ed 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5331,6 +5331,13 @@  static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		if (vcpu->arch.pv_cpuid.enforce)
 			kvm_update_pv_runtime(vcpu);
 
+		return 0;
+	case KVM_CAP_X86_DISABLE_EXITS:
+		if (cap->args[0] & ~kvm_get_allowed_disable_exits())
+			return -EINVAL;
+
+		kvm_ioctl_disable_exits(vcpu->arch, cap->args[0]);
+		kvm_make_request(KVM_REQ_DISABLE_EXITS, vcpu);
 		return 0;
 	default:
 		return -EINVAL;
@@ -5980,6 +5987,8 @@  int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
 int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 			    struct kvm_enable_cap *cap)
 {
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
 	int r;
 
 	if (cap->flags)
@@ -6036,14 +6045,17 @@  int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 			break;
 
 		mutex_lock(&kvm->lock);
-		if (kvm->created_vcpus)
-			goto disable_exits_unlock;
+		if (kvm->created_vcpus) {
+			kvm_for_each_vcpu(i, vcpu, kvm) {
+				kvm_ioctl_disable_exits(vcpu->arch, cap->args[0]);
+				kvm_make_request(KVM_REQ_DISABLE_EXITS, vcpu);
+			}
+		}
+		mutex_unlock(&kvm->lock);
 
 		kvm_ioctl_disable_exits(kvm->arch, cap->args[0]);
 
 		r = 0;
-disable_exits_unlock:
-		mutex_unlock(&kvm->lock);
 		break;
 	case KVM_CAP_MSR_PLATFORM_INFO:
 		kvm->arch.guest_can_read_msr_platform_info = cap->args[0];
@@ -10175,6 +10187,9 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
 			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+
+		if (kvm_check_request(KVM_REQ_DISABLE_EXITS, vcpu))
+			static_call(kvm_x86_update_disabled_exits)(vcpu);
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||