Message ID | 20181001212534.20073-2-sean.j.christopherson@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: VMX: clean up virtual APIC control handling | expand |
On Mon, Oct 1, 2018 at 2:25 PM, Sean Christopherson <sean.j.christopherson@intel.com> wrote: > Many moons ago, commit 4c9fc8ef5017 ("KVM: VMX: Add module option to > disable flexpriority") introduced kvm-intel.flexpriority as it was > "Useful for debugging". At that time, kvm-intel.flexpriority only > determined whether or not KVM would enable VIRTUALIZE_APIC_ACCESSES. > In short, it was intended as a way to disable virtualization of APIC > accesses for debug purposes. Nowadays, kvm-intel.flexpriority is a > haphazard param that is inconsistently honored by KVM, e.g. it still > controls VIRTUALIZE_APIC_ACCESSES but not TPR_SHADOW, CR8-exiting or > VIRTUALIZE_X2APIC_MODE, and only for non-nested guests. Disabling > the param also announces support for KVM_TPR_ACCESS_REPORTING (via > KVM_CAP_VAPIC), which may be functionally desirable, e.g. Qemu uses > KVM_TPR_ACCESS_REPORTING only to patch MMIO APIC access, but isn't > exactly accurate given its name since KVM may not intercept/report > TPR accesses via CR8 or MSR. > > Remove kvm-intel.flexpriority as its usefulness for debug is dubious > given the current code base, while its existence is confusing and > can complicate code changes and/or lead to new bugs being introduced. > For example, as of commit 8d860bbeedef ("kvm: vmx: Basic APIC > virtualization controls have three settings"), KVM will disable > VIRTUALIZE_APIC_ACCESSES when a nested guest writes APIC_BASE MSR and > kvm-intel.flexpriority=0, whereas previously KVM would allow a nested > guest to enable VIRTUALIZE_APIC_ACCESSES so long as it's supported in > hardware. I.e. KVM now advertises VIRTUALIZE_APIC_ACCESSES to a guest > but doesn't (always) allow setting it when kvm-intel.flexpriority=0, > and may even initially allow the control and then clear it when the > nested guest writes APIC_BASE MSR, which is decidedly odd even if it > doesn't cause functional issues. > > Fixes: 8d860bbeedef ("kvm: vmx: Basic APIC virtualization controls have three settings") > Cc: Jim Mattson <jmattson@google.com> > Cc: stable@vger.kernel.org > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> I'm happy to see this go, but others may prefer to keep it and make it work as one might expect it to (i.e. clearing the module parameter should make kvm behave as if the host didn't advertise the feature). Reviewed-by: Jim Mattson <jmattson@google.com>
On 01/10/2018 23:25, Sean Christopherson wrote: > kvm-intel.flexpriority is a > haphazard param that is inconsistently honored by KVM, e.g. it still > controls VIRTUALIZE_APIC_ACCESSES but not TPR_SHADOW, CR8-exiting or > VIRTUALIZE_X2APIC_MODE This is on purpose: the original purpose of VIRTUALIZE_APIC_ACCESSES was to speed up Windows XP and 2003, which didn't use CR8. Processors without it could already execute newer Windows versions without the overhead of emulating TPR accesses (notice how VIRTUALIZE_APIC_ACCESSES is a secondary execution control, while TPR_SHADOW is a primary control). Likewise, VIRTUALIZE_X2APIC_MODE only deals with MSR reads and writes, not MMIO accesses. We could add a separate knob to turn it off, but I never did it because the code paths are not as different as for flexpriority or, say, !APICv or !vNMI. > , and only for non-nested guests. Not sure if this is a bug, but it is certainly cleaner if flexpriority=0 masks VIRTUALIZE_APIC_ACCESSES from nested guests too. Paolo
On Wed, Oct 3, 2018 at 1:29 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > On 01/10/2018 23:25, Sean Christopherson wrote: >> kvm-intel.flexpriority is a >> haphazard param that is inconsistently honored by KVM, e.g. it still >> controls VIRTUALIZE_APIC_ACCESSES but not TPR_SHADOW, CR8-exiting or >> VIRTUALIZE_X2APIC_MODE > > This is on purpose: the original purpose of VIRTUALIZE_APIC_ACCESSES was > to speed up Windows XP and 2003, which didn't use CR8. Processors > without it could already execute newer Windows versions without the > overhead of emulating TPR accesses (notice how VIRTUALIZE_APIC_ACCESSES > is a secondary execution control, while TPR_SHADOW is a primary control). If the setting is only for specific guests, why is it a kernel parameter rather than a per-VM capability?
On 03/10/2018 14:14, Jim Mattson wrote: > On Wed, Oct 3, 2018 at 1:29 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: >> On 01/10/2018 23:25, Sean Christopherson wrote: >>> kvm-intel.flexpriority is a >>> haphazard param that is inconsistently honored by KVM, e.g. it still >>> controls VIRTUALIZE_APIC_ACCESSES but not TPR_SHADOW, CR8-exiting or >>> VIRTUALIZE_X2APIC_MODE >> >> This is on purpose: the original purpose of VIRTUALIZE_APIC_ACCESSES was >> to speed up Windows XP and 2003, which didn't use CR8. Processors >> without it could already execute newer Windows versions without the >> overhead of emulating TPR accesses (notice how VIRTUALIZE_APIC_ACCESSES >> is a secondary execution control, while TPR_SHADOW is a primary control). > > If the setting is only for specific guests, why is it a kernel > parameter rather than a per-VM capability? Do you mean the flexpriority module parameter or VIRTUALIZE_APIC_ACCESSES? VIRTUALIZE_APIC_ACCESSES does nothing on newer Windows guests, because they use CR8 to write to the TPR. However, it's also not a problem if you enable it. The flexpriority module parameter instead is a kernel parameter because it's really only useful for debugging problems that only happen on old machines (and to a lesser extent, problems that only happen on pre-AVIC AMD machines). Making it a per-VM capability would not be particularly interesting, and it adds weight to the userspace ABI compatibility (userspace ABI for example is the reason why we cannot drop userspace irqchip support). But since these machines still exist and people are using KVM on them, we have to keep the debugging aid around. Paolo
On Wed, Oct 3, 2018 at 5:40 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > The flexpriority module parameter instead is a kernel parameter because > it's really only useful for debugging problems that only happen on old > machines (and to a lesser extent, problems that only happen on pre-AVIC > AMD machines). Making it a per-VM capability would not be particularly > interesting, and it adds weight to the userspace ABI compatibility > (userspace ABI for example is the reason why we cannot drop userspace > irqchip support). But since these machines still exist and people are > using KVM on them, we have to keep the debugging aid around. It seems that the best way to mock up an older CPU that doesn't support FlexPriority is to: (a) intercept RDMSR of IA32_VMX_PROCBASED_CTLS2 and clear bit 0 of %rdx, and (b) intercept VMWRITE to the secondary processor-based VM-execution controls field and emulate VMfail when a 1 is written to bit 0. Pushing the code higher up in the stack complicates things quite a bit, and encourages bizarre behaviors like we have today, where, for instance, FlexPriority support is still enumerated in the IA32_VMX_PROCBASED_CTLS2 value synthesized for L1, and in fact, L1 can still enable FlexPriority for L2. That certainly isn't consistent with the behavior of old machines.
On 03/10/2018 18:36, Jim Mattson wrote: > It seems that the best way to mock up an older CPU that doesn't > support FlexPriority is to: > (a) intercept RDMSR of IA32_VMX_PROCBASED_CTLS2 and clear bit 0 of %rdx, and > (b) intercept VMWRITE to the secondary processor-based VM-execution > controls field and emulate VMfail when a 1 is written to bit 0. Not VMWRITE, but VMLAUNCH/VMRESUME I suppose? The patch that I sent this morning adds exactly these two things to the flexpriority parameter (actually only the MSR has to be changed, because execution controls are checked against the MSR values and changing them fixes (b) as well). But it's not clear, are you suggesting to use nested virtualization (using "intercept" makes me think of nested)? That also introduces the possible bugs (and hidden bugs) from nesting, so it is worse than the module parameter. > Pushing the code higher up in the stack complicates things quite a > bit, and encourages bizarre behaviors like we have today, where, for > instance, FlexPriority support is still enumerated in the > IA32_VMX_PROCBASED_CTLS2 value synthesized for L1, and in fact, L1 can > still enable FlexPriority for L2. That certainly isn't consistent with > the behavior of old machines. Yup, I agree. In general we do have cases where we provide functionality that is not there on older machines, but for flexpriority the behavior is messy and nowhere near real hardware's. In other cases (enable_vnmi for example) it's possibly odd and we may want to change that, but the severity is smaller. Paolo
On Wed, Oct 3, 2018 at 9:54 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > On 03/10/2018 18:36, Jim Mattson wrote: >> It seems that the best way to mock up an older CPU that doesn't >> support FlexPriority is to: >> (a) intercept RDMSR of IA32_VMX_PROCBASED_CTLS2 and clear bit 0 of %rdx, and >> (b) intercept VMWRITE to the secondary processor-based VM-execution >> controls field and emulate VMfail when a 1 is written to bit 0. > > Not VMWRITE, but VMLAUNCH/VMRESUME I suppose? The patch that I sent > this morning adds exactly these two things to the flexpriority parameter > (actually only the MSR has to be changed, because execution controls are > checked against the MSR values and changing them fixes (b) as well). You're right, of course. VMLAUNCH/VMRESUME would be the place to check. > But it's not clear, are you suggesting to use nested virtualization > (using "intercept" makes me think of nested)? That also introduces the > possible bugs (and hidden bugs) from nesting, so it is worse than the > module parameter. No, not at all. I'm suggesting adopting a convention where all RDMSRs of IA32_VMX_PROCBASED_CTLS2 are done through a single function that can mock up an older CPU based on module parameters. Similarly for VMLAUNCH/VMRESUME.
On 03/10/2018 19:39, Jim Mattson wrote: > On Wed, Oct 3, 2018 at 9:54 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: >> But it's not clear, are you suggesting to use nested virtualization >> (using "intercept" makes me think of nested)? That also introduces the >> possible bugs (and hidden bugs) from nesting, so it is worse than the >> module parameter. > > No, not at all. I'm suggesting adopting a convention where all RDMSRs > of IA32_VMX_PROCBASED_CTLS2 are done through a single function that > can mock up an older CPU based on module parameters. Similarly for > VMLAUNCH/VMRESUME. Oh I see. Yeah, that would be nice. Paolo
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9871e649ffef..670f0b0c6806 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1983,10 +1983,6 @@ [KVM,Intel] Enable emulation of invalid guest states Default is 0 (disabled) - kvm-intel.flexpriority= - [KVM,Intel] Disable FlexPriority feature (TPR shadow). - Default is 1 (enabled) - kvm-intel.nested= [KVM,Intel] Enable VMX nesting (nVMX). Default is 0 (disabled) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1d26f3c4985b..e1b8ea9a2bc4 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -78,9 +78,6 @@ module_param_named(vpid, enable_vpid, bool, 0444); static bool __read_mostly enable_vnmi = 1; module_param_named(vnmi, enable_vnmi, bool, S_IRUGO); -static bool __read_mostly flexpriority_enabled = 1; -module_param_named(flexpriority, flexpriority_enabled, bool, S_IRUGO); - static bool __read_mostly enable_ept = 1; module_param_named(ept, enable_ept, bool, S_IRUGO); @@ -1857,7 +1854,7 @@ static inline bool cpu_has_vmx_basic_inout(void) static inline bool cpu_need_virtualize_apic_accesses(struct kvm_vcpu *vcpu) { - return flexpriority_enabled && lapic_in_kernel(vcpu); + return cpu_has_vmx_flexpriority() && lapic_in_kernel(vcpu); } static inline bool cpu_has_vmx_vpid(void) @@ -1924,11 +1921,6 @@ static bool vmx_umip_emulated(void) SECONDARY_EXEC_DESC; } -static inline bool report_flexpriority(void) -{ - return flexpriority_enabled; -} - static inline unsigned nested_cpu_vmx_misc_cr3_count(struct kvm_vcpu *vcpu) { return vmx_misc_cr3_count(to_vmx(vcpu)->nested.msrs.misc_low); @@ -7895,9 +7887,6 @@ static __init int hardware_setup(void) if (!cpu_has_vmx_unrestricted_guest() || !enable_ept) enable_unrestricted_guest = 0; - if (!cpu_has_vmx_flexpriority()) - flexpriority_enabled = 0; - if (!cpu_has_virtual_nmis()) enable_vnmi = 0; @@ -7906,7 +7895,7 @@ static __init int hardware_setup(void) * page upon invalidation. No need to do anything if not * using the APIC_ACCESS_ADDR VMCS field. */ - if (!flexpriority_enabled) + if (!cpu_has_vmx_flexpriority()) kvm_x86_ops->set_apic_access_page_addr = NULL; if (!cpu_has_vmx_tpr_shadow()) @@ -10233,7 +10222,7 @@ static void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu) case LAPIC_MODE_DISABLED: break; case LAPIC_MODE_XAPIC: - if (flexpriority_enabled) { + if (cpu_has_vmx_flexpriority()) { sec_exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; vmx_flush_tlb(vcpu, true); @@ -14007,7 +13996,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = { .check_processor_compatibility = vmx_check_processor_compat, .hardware_enable = hardware_enable, .hardware_disable = hardware_disable, - .cpu_has_accelerated_tpr = report_flexpriority, + .cpu_has_accelerated_tpr = cpu_has_vmx_flexpriority, .has_emulated_msr = vmx_has_emulated_msr, .vm_init = vmx_vm_init,
Many moons ago, commit 4c9fc8ef5017 ("KVM: VMX: Add module option to disable flexpriority") introduced kvm-intel.flexpriority as it was "Useful for debugging". At that time, kvm-intel.flexpriority only determined whether or not KVM would enable VIRTUALIZE_APIC_ACCESSES. In short, it was intended as a way to disable virtualization of APIC accesses for debug purposes. Nowadays, kvm-intel.flexpriority is a haphazard param that is inconsistently honored by KVM, e.g. it still controls VIRTUALIZE_APIC_ACCESSES but not TPR_SHADOW, CR8-exiting or VIRTUALIZE_X2APIC_MODE, and only for non-nested guests. Disabling the param also announces support for KVM_TPR_ACCESS_REPORTING (via KVM_CAP_VAPIC), which may be functionally desirable, e.g. Qemu uses KVM_TPR_ACCESS_REPORTING only to patch MMIO APIC access, but isn't exactly accurate given its name since KVM may not intercept/report TPR accesses via CR8 or MSR. Remove kvm-intel.flexpriority as its usefulness for debug is dubious given the current code base, while its existence is confusing and can complicate code changes and/or lead to new bugs being introduced. For example, as of commit 8d860bbeedef ("kvm: vmx: Basic APIC virtualization controls have three settings"), KVM will disable VIRTUALIZE_APIC_ACCESSES when a nested guest writes APIC_BASE MSR and kvm-intel.flexpriority=0, whereas previously KVM would allow a nested guest to enable VIRTUALIZE_APIC_ACCESSES so long as it's supported in hardware. I.e. KVM now advertises VIRTUALIZE_APIC_ACCESSES to a guest but doesn't (always) allow setting it when kvm-intel.flexpriority=0, and may even initially allow the control and then clear it when the nested guest writes APIC_BASE MSR, which is decidedly odd even if it doesn't cause functional issues. Fixes: 8d860bbeedef ("kvm: vmx: Basic APIC virtualization controls have three settings") Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> --- .../admin-guide/kernel-parameters.txt | 4 ---- arch/x86/kvm/vmx.c | 19 ++++--------------- 2 files changed, 4 insertions(+), 19 deletions(-)