Message ID | 20220615011622.136646-6-kechenl@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: x86: add per-vCPU exits disable capability | expand |
>@@ -5980,6 +5987,8 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event, > int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > struct kvm_enable_cap *cap) > { >+ struct kvm_vcpu *vcpu; >+ unsigned long i; > int r; > > if (cap->flags) >@@ -6036,14 +6045,17 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > break; > > mutex_lock(&kvm->lock); >- if (kvm->created_vcpus) >- goto disable_exits_unlock; >+ if (kvm->created_vcpus) { >+ kvm_for_each_vcpu(i, vcpu, kvm) { >+ kvm_ioctl_disable_exits(vcpu->arch, cap->args[0]); >+ static_call(kvm_x86_update_disabled_exits)(vcpu); IMO, this won't work on Intel platforms. Because, to manipulate a vCPU's VMCS, vcpu_load() should be invoked in advance to load the VMCS. Alternatively, you can add a request KVM_REQ_XXX and defer updating VMCS to the next vCPU entry. >+ } >+ } >+ mutex_unlock(&kvm->lock); > > kvm_ioctl_disable_exits(kvm->arch, cap->args[0]); > > r = 0; >-disable_exits_unlock: >- mutex_unlock(&kvm->lock); > break; > case KVM_CAP_MSR_PLATFORM_INFO: > kvm->arch.guest_can_read_msr_platform_info = cap->args[0]; >-- >2.32.0 >
> -----Original Message----- > From: Chao Gao <chao.gao@intel.com> > Sent: Tuesday, June 14, 2022 7:43 PM > To: Kechen Lu <kechenl@nvidia.com> > Cc: kvm@vger.kernel.org; pbonzini@redhat.com; seanjc@google.com; > vkuznets@redhat.com; Somdutta Roy <somduttar@nvidia.com>; linux- > kernel@vger.kernel.org > Subject: Re: [RFC PATCH v3 5/7] KVM: x86: add vCPU scoped toggling for > disabled exits > > External email: Use caution opening links or attachments > > > >@@ -5980,6 +5987,8 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct > >kvm_irq_level *irq_event, int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > > struct kvm_enable_cap *cap) { > >+ struct kvm_vcpu *vcpu; > >+ unsigned long i; > > int r; > > > > if (cap->flags) > >@@ -6036,14 +6045,17 @@ int kvm_vm_ioctl_enable_cap(struct kvm > *kvm, > > break; > > > > mutex_lock(&kvm->lock); > >- if (kvm->created_vcpus) > >- goto disable_exits_unlock; > >+ if (kvm->created_vcpus) { > >+ kvm_for_each_vcpu(i, vcpu, kvm) { > >+ kvm_ioctl_disable_exits(vcpu->arch, cap->args[0]); > >+ > >+ static_call(kvm_x86_update_disabled_exits)(vcpu); > > IMO, this won't work on Intel platforms. Because, to manipulate a vCPU's > VMCS, vcpu_load() should be invoked in advance to load the VMCS. > Alternatively, you can add a request KVM_REQ_XXX and defer updating > VMCS to the next vCPU entry. > I see. Then adding a KVM request for VM-scoped exits toggling case on vmcs bits updating makes sense. Thanks for the suggestion. BR, Kechen > >+ } > >+ } > >+ mutex_unlock(&kvm->lock); > > > > kvm_ioctl_disable_exits(kvm->arch, cap->args[0]); > > > > r = 0; > >-disable_exits_unlock: > >- mutex_unlock(&kvm->lock); > > break; > > case KVM_CAP_MSR_PLATFORM_INFO: > > kvm->arch.guest_can_read_msr_platform_info = > >cap->args[0]; > >-- > >2.32.0 > >
On Wed, Jun 15, 2022, Chao Gao wrote: > >@@ -5980,6 +5987,8 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event, > > int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > > struct kvm_enable_cap *cap) > > { > >+ struct kvm_vcpu *vcpu; > >+ unsigned long i; > > int r; > > > > if (cap->flags) > >@@ -6036,14 +6045,17 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > > break; > > > > mutex_lock(&kvm->lock); > >- if (kvm->created_vcpus) > >- goto disable_exits_unlock; > >+ if (kvm->created_vcpus) { > >+ kvm_for_each_vcpu(i, vcpu, kvm) { > >+ kvm_ioctl_disable_exits(vcpu->arch, cap->args[0]); > >+ static_call(kvm_x86_update_disabled_exits)(vcpu); > > IMO, this won't work on Intel platforms. It's not safe on AMD either because at best the behavior is non-deterministic if the vCPU is already running in the guest, and at worst could cause explosions, e.g. if hardware doesn't like software modifying in-use VMCB state. > Because, to manipulate a vCPU's VMCS, vcpu_load() should be invoked in > advance to load the VMCS. Alternatively, you can add a request KVM_REQ_XXX > and defer updating VMCS to the next vCPU entry. Definitely use a request, doing vcpu_load() from a KVM-scoped ioctl() would be a mess as KVM would need to acquire the per-vCPU lock for every vCPU.
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 89e13b6783b5..7f614b7d5ad8 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6948,7 +6948,7 @@ longer intercept some instructions for improved latency in some workloads, and is suggested when vCPUs are associated to dedicated physical CPUs. More bits can be added in the future; userspace can just pass the KVM_CHECK_EXTENSION result to KVM_ENABLE_CAP to disable -all such vmexits. +all such vmexits. VM scoped and vCPU scoped capability are both supported. By default, this capability only disables exits. To re-enable an exit, or to override previous settings, userspace can set KVM_X86_DISABLE_EXITS_OVERRIDE, diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h index da47f60a4650..c17d417cb3cf 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -128,6 +128,7 @@ KVM_X86_OP(msr_filter_changed) KVM_X86_OP(complete_emulated_msr) KVM_X86_OP(vcpu_deliver_sipi_vector) KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons); +KVM_X86_OP(update_disabled_exits) #undef KVM_X86_OP #undef KVM_X86_OP_OPTIONAL diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 573a39bf7a84..1c9e6067c34f 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1584,6 +1584,8 @@ struct kvm_x86_ops { * Returns vCPU specific APICv inhibit reasons */ unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu); + + void (*update_disabled_exits)(struct kvm_vcpu *vcpu); }; struct kvm_x86_nested_ops { diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index b32987f54ace..7b3d64b3b901 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4589,6 +4589,33 @@ static void svm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector) sev_vcpu_deliver_sipi_vector(vcpu, vector); } +static void svm_update_disabled_exits(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + struct vmcb_control_area *control = &svm->vmcb->control; + + if (kvm_hlt_in_guest(vcpu)) + svm_clr_intercept(svm, INTERCEPT_HLT); + else + svm_set_intercept(svm, INTERCEPT_HLT); + + if (kvm_mwait_in_guest(vcpu)) { + svm_clr_intercept(svm, INTERCEPT_MONITOR); + svm_clr_intercept(svm, INTERCEPT_MWAIT); + } else { + svm_set_intercept(svm, INTERCEPT_MONITOR); + svm_set_intercept(svm, INTERCEPT_MWAIT); + } + + if (kvm_pause_in_guest(vcpu)) { + svm_clr_intercept(svm, INTERCEPT_PAUSE); + } else { + control->pause_filter_count = pause_filter_count; + if (pause_filter_thresh) + control->pause_filter_thresh = pause_filter_thresh; + } +} + static void svm_vm_destroy(struct kvm *kvm) { avic_vm_destroy(kvm); @@ -4732,7 +4759,10 @@ static struct kvm_x86_ops svm_x86_ops __initdata = { .complete_emulated_msr = svm_complete_emulated_msr, .vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector, + .vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons, + + .update_disabled_exits = svm_update_disabled_exits, }; /* diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index f24c9a357f70..2d000638cc9b 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7716,6 +7716,41 @@ static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason) return supported & BIT(reason); } +static void vmx_update_disabled_exits(struct kvm_vcpu *vcpu) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + + if (kvm_hlt_in_guest(vcpu)) + exec_controls_clearbit(vmx, CPU_BASED_HLT_EXITING); + else + exec_controls_setbit(vmx, CPU_BASED_HLT_EXITING); + + if (kvm_mwait_in_guest(vcpu)) + exec_controls_clearbit(vmx, CPU_BASED_MWAIT_EXITING | + CPU_BASED_MONITOR_EXITING); + else + exec_controls_setbit(vmx, CPU_BASED_MWAIT_EXITING | + CPU_BASED_MONITOR_EXITING); + + if (!kvm_pause_in_guest(vcpu)) { + vmcs_write32(PLE_GAP, ple_gap); + vmx->ple_window = ple_window; + vmx->ple_window_dirty = true; + } + + if (kvm_cstate_in_guest(vcpu)) { + vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C1_RES, MSR_TYPE_R); + vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C3_RESIDENCY, MSR_TYPE_R); + vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R); + vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R); + } else { + vmx_enable_intercept_for_msr(vcpu, MSR_CORE_C1_RES, MSR_TYPE_R); + vmx_enable_intercept_for_msr(vcpu, MSR_CORE_C3_RESIDENCY, MSR_TYPE_R); + vmx_enable_intercept_for_msr(vcpu, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R); + vmx_enable_intercept_for_msr(vcpu, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R); + } +} + static struct kvm_x86_ops vmx_x86_ops __initdata = { .name = "kvm_intel", @@ -7849,6 +7884,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .complete_emulated_msr = kvm_complete_insn_gp, .vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector, + + .update_disabled_exits = vmx_update_disabled_exits, }; static unsigned int vmx_handle_intel_pt_intr(void) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7cc8ac550bc7..8123309d097f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5331,6 +5331,13 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu, if (vcpu->arch.pv_cpuid.enforce) kvm_update_pv_runtime(vcpu); + return 0; + case KVM_CAP_X86_DISABLE_EXITS: + if (cap->args[0] & ~kvm_get_allowed_disable_exits()) + return -EINVAL; + + kvm_ioctl_disable_exits(vcpu->arch, cap->args[0]); + static_call(kvm_x86_update_disabled_exits)(vcpu); return 0; default: return -EINVAL; @@ -5980,6 +5987,8 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event, int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap) { + struct kvm_vcpu *vcpu; + unsigned long i; int r; if (cap->flags) @@ -6036,14 +6045,17 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, break; mutex_lock(&kvm->lock); - if (kvm->created_vcpus) - goto disable_exits_unlock; + if (kvm->created_vcpus) { + kvm_for_each_vcpu(i, vcpu, kvm) { + kvm_ioctl_disable_exits(vcpu->arch, cap->args[0]); + static_call(kvm_x86_update_disabled_exits)(vcpu); + } + } + mutex_unlock(&kvm->lock); kvm_ioctl_disable_exits(kvm->arch, cap->args[0]); r = 0; -disable_exits_unlock: - mutex_unlock(&kvm->lock); break; case KVM_CAP_MSR_PLATFORM_INFO: kvm->arch.guest_can_read_msr_platform_info = cap->args[0];
Introduce support of vCPU-scoped ioctl with KVM_CAP_X86_DISABLE_EXITS cap for disabling exits to enable finer-grained VM exits disabling on per vCPU scales instead of whole guest. This patch enabled the vCPU-scoped exits control toggling, also align the VM-scoped exits control behaviors. In use cases like Windows guest running heavy CPU-bound workloads, disabling HLT VM-exits could mitigate host sched ctx switch overhead. Simply HLT disabling on all vCPUs could bring performance benefits, but if no pCPUs reserved for host threads, could happened to the forced preemption as host does not know the time to do the schedule for other host threads want to run. With this patch, we could only disable part of vCPUs HLT exits for one guest, this still keeps performance benefits, and also shows resiliency to host stressing workload running at the same time. In the host stressing workload experiment with Windows guest heavy CPU-bound workloads, it shows good resiliency and having the ~3% performance improvement. E.g. Passmark running in a Windows guest with this patch disabling HLT exits on only half of vCPUs still showing 2.4% higher main score v/s baseline. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kechen Lu <kechenl@nvidia.com> --- Documentation/virt/kvm/api.rst | 2 +- arch/x86/include/asm/kvm-x86-ops.h | 1 + arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/svm/svm.c | 30 ++++++++++++++++++++++++ arch/x86/kvm/vmx/vmx.c | 37 ++++++++++++++++++++++++++++++ arch/x86/kvm/x86.c | 20 ++++++++++++---- 6 files changed, 87 insertions(+), 5 deletions(-)