| Message ID | 20230506030435.80262-1-chao.gao@intel.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | KVM: x86: Track supported ARCH_CAPABILITIES in kvm_caps |
On 5/6/2023 11:04 AM, Chao Gao wrote:
> to avoid computing the supported value at runtime every time.
>
> No functional change intended.

the value of kvm_get_arch_capabilities() can be changed due to

	if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
		data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;

and l1tf_vmx_mitigation can be runtime changed by the vmentry_l1d_flush module
param.

We need a detailed analysis that in no real case can the
ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit change at runtime.

> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
>
> A new call site of kvm_get_arch_capabilities() is added by [1]. It should be
> replaced with the cached value in kvm_caps.
>
> [1] https://lore.kernel.org/all/20230504181827.130532-1-mizhang@google.com/
>
>  arch/x86/kvm/x86.c | 5 +++--
>  arch/x86/kvm/x86.h | 1 +
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 523c39a03c00..94aa70ec169c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1670,7 +1670,7 @@ static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
>  {
>  	switch (msr->index) {
>  	case MSR_IA32_ARCH_CAPABILITIES:
> -		msr->data = kvm_get_arch_capabilities();
> +		msr->data = kvm_caps.supported_arch_cap;
>  		break;
>  	case MSR_IA32_PERF_CAPABILITIES:
>  		msr->data = kvm_caps.supported_perf_cap;
> @@ -9523,6 +9523,7 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>  		kvm_caps.max_guest_tsc_khz = max;
>  	}
>  	kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
> +	kvm_caps.supported_arch_cap = kvm_get_arch_capabilities();
>  	kvm_init_msr_lists();
>  	return 0;
>
> @@ -11879,7 +11880,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	if (r)
>  		goto free_guest_fpu;
>
> -	vcpu->arch.arch_capabilities = kvm_get_arch_capabilities();
> +	vcpu->arch.arch_capabilities = kvm_caps.supported_arch_cap;
>  	vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
>  	kvm_xen_init_vcpu(vcpu);
>  	kvm_vcpu_mtrr_init(vcpu);
>
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index c544602d07a3..d3e524bcc169 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -29,6 +29,7 @@ struct kvm_caps {
>  	u64 supported_xcr0;
>  	u64 supported_xss;
>  	u64 supported_perf_cap;
> +	u64 supported_arch_cap;
>  };
>
>  void kvm_spurious_fault(void);
>
> base-commit: b3c98052d46948a8d65d2778c7f306ff38366aac
+Jim

On Thu, May 18, 2023, Xiaoyao Li wrote:
> On 5/6/2023 11:04 AM, Chao Gao wrote:
> > to avoid computing the supported value at runtime every time.
> >
> > No functional change intended.
>
> the value of kvm_get_arch_capabilities() can be changed due to
>
> 	if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
> 		data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
>
> and l1tf_vmx_mitigation can be runtime changed by vmentry_l1d_flush module
> param.

Nice catch!

> We need a detailed analysis that in no real case can
> ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit change runtime.

No, the fact that it _can_ be modified by a writable module param is enough to
make this patch buggy.

I do like snapshotting and then updating the value, even though there's likely
no meaningful performance benefit, as that would provide a place to document
that the "supported" value is dynamic.  Though the fact that it's dynamic is
arguably a bug in its own right, e.g. if userspace isn't careful, a VM can have
vCPUs with different values for ARCH_CAPABILITIES.  But fixing that is probably
a fool's errand.  So I vote to snapshot the value and toggle the
ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit when l1tf_vmx_mitigation is modified.

On a somewhat related topic, what in the absolute #$#$ is going on with
FB_CLEAR_DIS!?!?  I made the mistake of digging into why KVM doesn't advertise
ARCH_CAP_FB_CLEAR_CTRL...

 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
    buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
    entry+exit.

 2. I'm pretty sure conditioning mmio_stale_data_clear on
    kvm_arch_has_assigned_device() is a bug.  AIUI, the vulnerability applies
    to _any_ MMIO accesses.  Assigning a device is necessary to let the device
    DMA into the guest, but it's not necessary to let the guest access MMIO
    addresses, that's done purely via memslots.

 3. Irrespective of whether or not there is a performance benefit, toggling the
    MSR on every entry+exit is completely unnecessary if KVM won't do VERW
    before VM-Enter, i.e. if (!mds_user_clear && !mmio_stale_data_clear), then
    the toggling can be done in vmx_prepare_switch_to_{guest,host}().  This
    probably isn't worth pursuing though, as #4 below is more likely,
    especially since X86_BUG_MSBDS_ONLY is limited to Atom (and MIC, lol) CPUs.

 4. If the host will _never_ do VERW, i.e. #3 + !X86_BUG_MSBDS_ONLY, then KVM
    just needs to context switch the MSR between guests since the value that's
    loaded while running in the host is irrelevant.  E.g. use a percpu cache to
    track the current value.

 5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought
    up, i.e. the host's desired value is effectively static post-boot, and
    barring a buggy configuration (running KVM as a guest), the boot CPU's
    value will be the same as every other CPU.

 6. Performance aside, KVM should not be speculating (ha!) on what the guest
    will and will not do, and should instead honor whatever behavior is
    presented to the guest.  If the guest CPU model indicates that VERW flushes
    buffers, then KVM damn well needs to let VERW flush buffers.

 7. Why on earth did Intel provide a knob that affects both the host and guest,
    since AFAICT the intent of the MSR is purely to suppress FB clearing for an
    unsuspecting (or misconfigured?) guest!?!?!

FWIW, this trainwreck is another reason why I'm not going to look at the
proposed "Intel IA32_SPEC_CTRL Virtualization" crud until external forces
dictate that I do so.  I have zero confidence that a paravirt interface defined
by hardware vendors to fiddle with mitigations will be sane, flexible, and
extensible.

Anyways, can someone from Intel do some basic performance analysis to justify
doing RDMSR + WRMSRx2 on every transition?  Unless someone provides numbers
that show a clear, meaningful benefit to the aggressive toggling, I'm inclined
to have KVM do #4, e.g. end up with something like:

	/* L1D Flush includes CPU buffer clear to mitigate MDS */
	if (static_branch_unlikely(&vmx_l1d_should_flush)) {
		vmx_l1d_flush(vcpu);
	} else if (static_branch_unlikely(&mds_user_clear) ||
		   static_branch_unlikely(&mmio_stale_data_clear)) {
		mds_clear_cpu_buffers();
	} else if (static_branch_unlikely(&kvm_toggle_fb_clear)) {
		bool enable_fb_clear = !!(vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR);

		if (this_cpu_read(kvm_fb_clear_enabled) != enable_fb_clear) {
			u64 mcu_opt_ctrl = host_mcu_opt_ctrl;

			if (enable_fb_clear)
				mcu_opt_ctrl &= ~FB_CLEAR_DIS;
			else
				mcu_opt_ctrl |= FB_CLEAR_DIS;
			native_wrmsrl(MSR_IA32_MCU_OPT_CTRL, mcu_opt_ctrl);
			this_cpu_write(kvm_fb_clear_enabled, enable_fb_clear);
		}
	}
+Pawan, could you share your thoughts on the questions about FB_CLEAR?

On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
>+Jim
>
>On Thu, May 18, 2023, Xiaoyao Li wrote:
>> On 5/6/2023 11:04 AM, Chao Gao wrote:
>> > to avoid computing the supported value at runtime every time.
>> >
>> > No functional change intended.
>>
>> the value of kvm_get_arch_capabilities() can be changed due to
>>
>> 	if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
>> 		data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
>>
>> and l1tf_vmx_mitigation can be runtime changed by vmentry_l1d_flush module
>> param.

Thanks for pointing this out. I noticed l1tf_vmx_mitigation and analyzed
whether it could be changed at runtime. Obviously my analysis was wrong.

>
>Nice catch!
>
>> We need a detailed analysis that in no real case can
>> ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit change runtime.
>
>No, the fact that it _can_ be modified by a writable module param is enough to
>make this patch buggy.
>
>I do like snapshotting and then updating the value, even though there's likely
>no meaningful performance benefit, as that would provide a place to document
>that the "supported" value is dynamic.  Though the fact that it's dynamic is
>arguably a bug in its own right, e.g. if userspace isn't careful, a VM can
>have vCPUs with different values for ARCH_CAPABILITIES.  But fixing that is
>probably a fool's errand.  So

I am not sure if fixing it is fool. There would be some other problem: KVM
enables L1DFLUSH and creates a guest, so ARCH_CAP_SKIP_VMENTRY_L1DFLUSH is
exposed to the guest. If L1DFLUSH is then disabled at runtime in KVM, the
guest doesn't know about this change and won't do an L1D flush when entering
L2. Then L2 may use L1TF to leak some secrets of L1.

>I vote to snapshot the value and toggle the ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit
>when l1tf_vmx_mitigation is modified.

Sure. Will do.

>
>On a somewhat related topic, what in the absolute #$#$ is going on with
>FB_CLEAR_DIS!?!?  I made the mistake of digging into why KVM doesn't advertise
>ARCH_CAP_FB_CLEAR_CTRL...
>
> 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable
>    Fill buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on
>    every entry+exit.
>
> 2. I'm pretty sure conditioning mmio_stale_data_clear on
>    kvm_arch_has_assigned_device() is a bug.  AIUI, the vulnerability applies
>    to _any_ MMIO accesses.  Assigning a device is necessary to let the device
>    DMA into the guest, but it's not necessary to let the guest access MMIO
>    addresses, that's done purely via memslots.
>
> 3. Irrespective of whether or not there is a performance benefit, toggling
>    the MSR on every entry+exit is completely unnecessary if KVM won't do VERW
>    before VM-Enter, i.e. if (!mds_user_clear && !mmio_stale_data_clear), then
>    the toggling can be done in vmx_prepare_switch_to_{guest,host}().  This
>    probably isn't worth pursuing though, as #4 below is more likely,
>    especially since X86_BUG_MSBDS_ONLY is limited to Atom (and MIC, lol) CPUs.
>
> 4. If the host will _never_ do VERW, i.e. #3 + !X86_BUG_MSBDS_ONLY, then KVM
>    just needs to context switch the MSR between guests since the value that's
>    loaded while running in the host is irrelevant.  E.g. use a percpu cache to
>    track the current value.

Agreed. It looks like VERW can be used in CPL3; should we restore the MSR on
returning to userspace, i.e. leverage the uret mechanism?

>
> 5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought
>    up, i.e. the host's desired value is effectively static post-boot, and
>    barring a buggy configuration (running KVM as a guest), the boot CPU's
>    value will be the same as every other CPU.
>
> 6. Performance aside, KVM should not be speculating (ha!) on what the guest
>    will and will not do, and should instead honor whatever behavior is
>    presented to the guest.  If the guest CPU model indicates that VERW
>    flushes buffers, then KVM damn well needs to let VERW flush buffers.
>
> 7. Why on earth did Intel provide a knob that affects both the host and
>    guest, since AFAICT the intent of the MSR is purely to suppress FB
>    clearing for an unsuspecting (or misconfigured?) guest!?!?!

I doubt it is purely for guests. Is there any chance a userspace application
may use VERW? And I don't think the original patch is for misconfigured
guests. IIUC, it is about migrating a guest from a vulnerable host to an
invulnerable host.

>
>FWIW, this trainwreck is another reason why I'm not going to look at the
>proposed "Intel IA32_SPEC_CTRL Virtualization" crud until external forces
>dictate that I do so.  I have zero confidence that a paravirt interface
>defined by hardware vendors to fiddle with mitigations will be sane, flexible,
>and extensible.
>
>Anyways, can someone from Intel do some basic performance analysis to justify
>doing RDMSR + WRMSRx2 on every transition?  Unless someone provides numbers

Pawan, could you help to answer this question?

>that show a clear, meaningful benefit to the aggressive toggling, I'm inclined
>to have KVM do #4, e.g. end up with something like:
>
>	/* L1D Flush includes CPU buffer clear to mitigate MDS */
>	if (static_branch_unlikely(&vmx_l1d_should_flush)) {
>		vmx_l1d_flush(vcpu);
>	} else if (static_branch_unlikely(&mds_user_clear) ||
>		   static_branch_unlikely(&mmio_stale_data_clear)) {
>		mds_clear_cpu_buffers();
>	} else if (static_branch_unlikely(&kvm_toggle_fb_clear)) {
>		bool enable_fb_clear = !!(vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR);
>
>		if (this_cpu_read(kvm_fb_clear_enabled) != enable_fb_clear) {
>			u64 mcu_opt_ctrl = host_mcu_opt_ctrl;
>
>			if (enable_fb_clear)
>				mcu_opt_ctrl &= ~FB_CLEAR_DIS;
>			else
>				mcu_opt_ctrl |= FB_CLEAR_DIS;
>			native_wrmsrl(MSR_IA32_MCU_OPT_CTRL, mcu_opt_ctrl);
>			this_cpu_write(kvm_fb_clear_enabled, enable_fb_clear);
>		}
>	}
On Fri, May 19, 2023, Chao Gao wrote:
> +Pawan, could you share your thoughts on questions about FB_CLEAR?
>
> On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> >I do like snapshotting and then updating the value, even though there's
> >likely no meaningful performance benefit, as that would provide a place to
> >document that the "supported" value is dynamic.  Though the fact that it's
> >dynamic is arguably a bug in its own right, e.g. if userspace isn't careful,
> >a VM can have vCPUs with different values for ARCH_CAPABILITIES.  But fixing
> >that is probably a fool's errand.  So
>
> I am not sure if fixing it is fool. There would be some other problem:

Heh, "fool's errand" is an idiom that means doing something has almost no
chance of succeeding, not that doing something is foolish.

I 100% agree that there's value in presenting a consistent model to the guest,
but there are conflicting requirements in play.  To present a consistent model,
KVM essentially needs to disallow changing the module param after VMs/vCPUs
have been created, but that would prevent userspace from toggling the param
while VMs are running, e.g. in response to a new vulnerability.

The only feasible idea I can think of is to disallow *disabling* the mitigation
while VMs/vCPUs are active.  But then that prevents turning the L1D flush
mitigation back off if some other mitigation is deployed, e.g. via livepatch,
policy update, etc.

That's why I said trying to fix the knob is probably a fool's errand.  AFAICT,
there's no straightforward solution that makes everybody happy. :-/
On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> +Jim
>
> On Thu, May 18, 2023, Xiaoyao Li wrote:
> > On 5/6/2023 11:04 AM, Chao Gao wrote:
> > > to avoid computing the supported value at runtime every time.
> > >
> > > No functional change intended.
> >
> > the value of kvm_get_arch_capabilities() can be changed due to
> >
> > 	if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
> > 		data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
> >
> > and l1tf_vmx_mitigation can be runtime changed by vmentry_l1d_flush module
> > param.
>
> Nice catch!
>
> > We need a detailed analysis that in no real case can
> > ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit change runtime.
>
> No, the fact that it _can_ be modified by a writable module param is enough
> to make this patch buggy.
>
> I do like snapshotting and then updating the value, even though there's
> likely no meaningful performance benefit, as that would provide a place to
> document that the "supported" value is dynamic.  Though the fact that it's
> dynamic is arguably a bug in its own right, e.g. if userspace isn't careful,
> a VM can have vCPUs with different values for ARCH_CAPABILITIES.  But fixing
> that is probably a fool's errand.  So I vote to snapshot the value and toggle
> the ARCH_CAP_SKIP_VMENTRY_L1DFLUSH bit when l1tf_vmx_mitigation is modified.
>
> On a somewhat related topic, what in the absolute #$#$ is going on with
> FB_CLEAR_DIS!?!?  I made the mistake of digging into why KVM doesn't
> advertise ARCH_CAP_FB_CLEAR_CTRL...
>
> 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable
>    Fill buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on
>    every entry+exit.

Unnecessary VERWs in the guest will have a much higher impact than the MSR
read/write at vmentry/exit. On an Icelake system it is pointless for a guest
to incur the VERW penalty when the system is not affected by MDS/TAA and
guests don't need the mitigation for MMIO Stale Data. MSR writes are only done
when the guest is likely to execute unnecessary VERWs (e.g. when the guest
thinks it's running on an older gen CPU).

> 2. I'm pretty sure conditioning mmio_stale_data_clear on
>    kvm_arch_has_assigned_device() is a bug.  AIUI, the vulnerability applies
>    to _any_ MMIO accesses.

The vulnerability applies to MMIO access to devices that don't respect "byte
enable" (which indicates the valid bytes in a transaction), and don't error on
an incorrect read or write size.

>    Assigning a device is necessary to let the device DMA into the guest, but
>    it's not necessary to let the guest access MMIO addresses, that's done
>    purely via memslots.

I will get back on this. The guest would typically need access to an area that
doesn't fail an incorrectly sized MMIO.

> 3. Irrespective of whether or not there is a performance benefit, toggling
>    the MSR on every entry+exit is completely unnecessary if KVM won't do VERW
>    before VM-Enter, i.e. if (!mds_user_clear && !mmio_stale_data_clear), then
>    the toggling can be done in vmx_prepare_switch_to_{guest,host}().  This
>    probably isn't worth pursuing though, as #4 below is more likely,
>    especially since X86_BUG_MSBDS_ONLY is limited to Atom (and MIC, lol) CPUs.
>
> 4. If the host will _never_ do VERW, i.e. #3 + !X86_BUG_MSBDS_ONLY, then

Is it likely that KVM will not do VERW when affected by MMIO Stale Data? If
you mean on hardware that is not vulnerable to MDS and MMIO Stale Data, then
the MSR writes are unnecessary and will be skipped, because FB_CLEAR_DIS is
not available on unaffected hardware.

>    KVM just needs to context switch the MSR between guests since the value
>    that's loaded while running in the host is irrelevant.  E.g. use a percpu
>    cache to track the current value.

I will be happy to avoid the MSR read/write, but it's worth considering that
this MSR can receive more bits that the host may want to toggle; the percpu
cache implementation would then likely change.

> 5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought
>    up, i.e. the host's desired value is effectively static post-boot, and
>    barring a buggy configuration (running KVM as a guest), the boot CPU's
>    value will be the same as every other CPU.

Would the MSR value be the same on every CPU if only some guests have
enumerated FB_CLEAR and others haven't? MSR writes (to disable FB_CLEAR) are
not done when a guest enumerates FB_CLEAR, and enumeration of FB_CLEAR in a
guest depends on its configuration.

> 6. Performance aside, KVM should not be speculating (ha!) on what the guest
>    will and will not do, and should instead honor whatever behavior is
>    presented to the guest.  If the guest CPU model indicates that VERW
>    flushes buffers, then KVM damn well needs to let VERW flush buffers.

The current implementation allows guests to have VERW flush buffers when they
enumerate FB_CLEAR. It only restricts the flush behavior when the guest is
trying to mitigate against a vulnerability (like MDS) on hardware that is not
affected. I guess it's common for guests to be running with an older gen
configuration on newer hardware.

> 7. Why on earth did Intel provide a knob that affects both the host and
>    guest, since AFAICT the intent of the MSR is purely to suppress FB
>    clearing for an unsuspecting (or misconfigured?) guest!?!?!

That's true.
On 5/18/23 10:33, Sean Christopherson wrote:
>
> 2. I'm pretty sure conditioning mmio_stale_data_clear on
>    kvm_arch_has_assigned_device() is a bug.  AIUI, the vulnerability applies
>    to _any_ MMIO accesses.  Assigning a device is necessary to let the device
>    DMA into the guest, but it's not necessary to let the guest access MMIO
>    addresses, that's done purely via memslots.

Just to make sure we're all on the same page: KVM needs mitigations when real,
hardware MMIO is exposed to the guest. None of this has anything to do with
virtio or what guests _normally_ see as devices or MMIO. Right?

But direct device assignment does that "real hardware MMIO" for sure because
it's mapping parts of the PCI address space (which is all MMIO) into the
guest. That's what the kvm_arch_has_assigned_device() check was going for.

But I think you're also saying that, in the end, memory gets exposed to the
guest by KVM userspace setting up a memslot. KVM userspace _could_ have mapped
a piece of MMIO and could just pass that down to a guest without
kvm_arch_has_assigned_device() being involved. That makes the
kvm_arch_has_assigned_device() check useless.

In other words, all guests with kvm_arch_has_assigned_device() need the
mitigation. But there are potentially situations where the guest can see real
hardware MMIO and yet also be !kvm_arch_has_assigned_device().
On Mon, May 22, 2023, Dave Hansen wrote:
> On 5/18/23 10:33, Sean Christopherson wrote:
> >
> > 2. I'm pretty sure conditioning mmio_stale_data_clear on
> >    kvm_arch_has_assigned_device() is a bug.  AIUI, the vulnerability
> >    applies to _any_ MMIO accesses.  Assigning a device is necessary to let
> >    the device DMA into the guest, but it's not necessary to let the guest
> >    access MMIO addresses, that's done purely via memslots.
>
> Just to make sure we're all on the same page: KVM needs mitigations when
> real, hardware MMIO is exposed to the guest.  None of this has anything
> to do with virtio or what guests _normally_ see as devices or MMIO. Right?

Yes. I try to always call MMIO that is handled by a synthetic/virtual/emulated
device "emulated MMIO", specifically to differentiate between the two cases.

> But direct device assignment does that "real hardware MMIO" for sure
> because it's mapping parts of the PCI address space (which is all MMIO)
> into the guest.  That's what the kvm_arch_has_assigned_device() check
> was going for.
>
> But I think you're also saying that, in the end, memory gets exposed to
> the guest by KVM userspace setting up a memslot.  KVM userspace _could_
> have mapped a piece of MMIO and could just pass that down to a guest
> without kvm_arch_has_assigned_device() being involved.  That makes the
> kvm_arch_has_assigned_device() useless.

Yep.

> In other words, all guests with kvm_arch_has_assigned_device() need
> mitigation.

Yes, assuming the guest wants to actually use the device :-)

> But there are potentially situations where the guest can see real hardware
> MMIO and yet also be !kvm_arch_has_assigned_device().

Yes. There may or may not be _legitimate_ scenarios for exposing host MMIO to
the guest without an assigned device, but as far as the mitigation is
concerned, being legitimate or not doesn't matter; all that matters is that
userspace can expose host MMIO to the guest irrespective of VFIO.

FWIW, I think this would be a minimal fix without having to apply the
mitigation blindly. My only concern is that there might be gaps in the
kvm_is_mmio_pfn() heuristic, but if that's the case then KVM likely has other
issues, e.g. would potentially map MMIO with the wrong memtype.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2865c3cb3501..ac3c535ae3b9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1274,6 +1274,7 @@ struct kvm_arch {
 
 	bool apic_access_memslot_enabled;
 	bool apic_access_memslot_inhibited;
+	bool vm_has_passthrough_mmio;
 
 	/* Protects apicv_inhibit_reasons */
 	struct rw_semaphore apicv_update_lock;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index cf2c6426a6fc..83d235488e56 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -189,6 +189,10 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (level > PG_LEVEL_4K)
 		spte |= PT_PAGE_SIZE_MASK;
 
+	if (static_branch_unlikely(&mmio_stale_data_clear) &&
+	    !vcpu->kvm->arch.vm_has_passthrough_mmio && kvm_is_mmio_pfn(pfn))
+		vcpu->kvm->arch.vm_has_passthrough_mmio = true;
+
 	if (shadow_memtype_mask)
 		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
 							 kvm_is_mmio_pfn(pfn));
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 44fb619803b8..9c66ba35af92 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7159,7 +7159,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	else if (static_branch_unlikely(&mds_user_clear))
 		mds_clear_cpu_buffers();
 	else if (static_branch_unlikely(&mmio_stale_data_clear) &&
-		 kvm_arch_has_assigned_device(vcpu->kvm))
+		 to_kvm_vmx(vcpu->kvm)->vm_has_passthrough_mmio)
 		mds_clear_cpu_buffers();
 
 	vmx_disable_fb_clear(vmx);
On Fri, May 19, 2023, Pawan Gupta wrote:
> On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> > I made the mistake of digging into why KVM doesn't advertise
> > ARCH_CAP_FB_CLEAR_CTRL...
> >
> > 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable
> >    Fill buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR
> >    on every entry+exit.
>
> Unnecessary VERWs in guest will have much higher impact than due to MSR
> read/write at vmentry/exit.

Can you provide numbers for something closeish to a real world workload?

> On an Icelake system it is pointless for a guest to incur VERW penalty when
> the system is not affected by MDS/TAA and guests don't need mitigation for
> MMIO Stale Data.  MSR writes are only done when the guest is likely to
> execute unnecessary VERWs (e.g. when the guest thinks it's running on an
> older gen CPU).
>
> > KVM just needs to context switch the MSR between guests since the value
> > that's loaded while running in the host is irrelevant.  E.g. use a percpu
> > cache to
>
> I will be happy to avoid the MSR read/write, but its worth considering
> that this MSR can receive more bits that host may want to toggle, then
> percpu cache implementation would likely change.

Change in and of itself isn't problematic, so long as whatever code we write
won't fall over if/when new bits are added, i.e. doesn't clobber unknown bits.

> > 5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought
> >    up, i.e. the host's desired value is effectively static post-boot, and
> >    barring a buggy configuration (running KVM as a guest), the boot CPU's
> >    value will be the same as every other CPU.
>
> Would the MSR value be same on every CPU, if only some guests have
> enumerated FB_CLEAR and others haven't?

Ignore the guest, I'm talking purely about the host. Specifically, there's no
reason to do a RDMSR to get the host value on every VM-Enter since the host's
value is effectively static post-boot.

> MSR writes (to disable FB_CLEAR) are not done when a guest enumerates
> FB_CLEAR.  Enumeration of FB_CLEAR in guest will depend on its configuration.
>
> > 6. Performance aside, KVM should not be speculating (ha!) on what the guest
> >    will and will not do, and should instead honor whatever behavior is
> >    presented to the guest.  If the guest CPU model indicates that VERW
> >    flushes buffers, then KVM damn well needs to let VERW flush buffers.
>
> The current implementation allows guests to have VERW flush buffers when
> they enumerate FB_CLEAR.  It only restricts the flush behavior when the
> guest is trying to mitigate against a vulnerability (like MDS) on hardware
> that is not affected.  I guess its common for guests to be running with
> older gen configuration on a newer hardware.

Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest
will do things a certain way and should instead honor the "architectural"
definition, in quotes because I realize there probably is no architectural
definition for any of this.

It might be that the code does (unintentionally?) honor the "architecture",
i.e. this code might actually be accurate with respect to when the guest can
expect VERW to flush buffers. But the comment is so, so wrong.

	/*
	 * If guest will not execute VERW, there is no need to set FB_CLEAR_DIS
	 * at VMEntry. Skip the MSR read/write when a guest has no use case to
	 * execute VERW.
	 */
	if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) ||
	    ((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) &&
	     (vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) &&
	     (vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) &&
	     (vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) &&
	     (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO)))
		vmx->disable_fb_clear = false;
On 5/23/2023 1:43 AM, Sean Christopherson wrote:
>>> 6. Performance aside, KVM should not be speculating (ha!) on what the guest
>>>    will and will not do, and should instead honor whatever behavior is
>>>    presented to the guest.  If the guest CPU model indicates that VERW
>>>    flushes buffers, then KVM damn well needs to let VERW flush buffers.
>>
>> The current implementation allows guests to have VERW flush buffers when
>> they enumerate FB_CLEAR.  It only restricts the flush behavior when the
>> guest is trying to mitigate against a vulnerability (like MDS) on hardware
>> that is not affected.  I guess its common for guests to be running with
>> older gen configuration on a newer hardware.
>
> Right, I'm saying that that behavior is wrong.  KVM shouldn't assume the
> guest will do things a certain way and should instead honor the
> "architectural" definition, in quotes because I realize there probably is no
> architectural definition for any of this.
>
> It might be that the code does (unintentionally?) honor the "architecture",
> i.e. this code might actually be accurate with respect to when the guest can
> expect VERW to flush buffers.  But the comment is so, so wrong.

The comment is wrong, and the code is wrong in some cases as well.

If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO,
ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to
the VM, the VM is of the type "affected by MDS". And according to

https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html

if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly
enumerates FB_CLEAR as part of its MD_CLEAR support.

However, the code will leave vmx->disable_fb_clear as 1 if hardware supports
it, so the VERW instruction doesn't clear the fill buffers in the VM, which
conflicts with the "architectural" definition.

>	/*
>	 * If guest will not execute VERW, there is no need to set FB_CLEAR_DIS
>	 * at VMEntry. Skip the MSR read/write when a guest has no use case to
>	 * execute VERW.
>	 */
>	if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) ||
>	    ((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) &&
>	     (vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) &&
>	     (vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) &&
>	     (vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) &&
>	     (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO)))
>		vmx->disable_fb_clear = false;
On Mon, May 22, 2023 at 10:43:49AM -0700, Sean Christopherson wrote: > On Fri, May 19, 2023, Pawan Gupta wrote: > > On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote: > > > I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL... > > > > > > 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill > > > buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every > > > entry+exit. > > > > Unnecessary VERWs in guest will have much higher impact than due to MSR > > read/write at vmentry/exit. > > Can you provide numbers for something closeish to a real world workload? I am collecting the numbers, will update here soon. > > On an Icelake system it is pointless for a guest to incur VERW penalty when > > the system is not affected by MDS/TAA and guests don't need mitigation for > > MMIO Stale Data. MSR writes are only done when the guest is likely to execute > > unnecessary VERWs(e.g. when the guest thinks its running on an older gen > > CPU). > > > > > KVM just needs to context switch the MSR between guests since the value that's > > > loaded while running in the host is irrelevant. E.g. use a percpu cache to > > > > I will be happy to avoid the MSR read/write, but its worth considering > > that this MSR can receive more bits that host may want to toggle, then > > percpu cache implementation would likely change. > > Change in and of itself isn't problematic, so long as whatever code we write won't > fall over if/when new bits are added, i.e. doesn't clobber unknown bits. Ok. > > > 5. MSR_IA32_MCU_OPT_CTRL is not modified by the host after a CPU is brought up, > > > i.e. the host's desired value is effectively static post-boot, and barring > > > a buggy configuration (running KVM as a guest), the boot CPU's value will be > > > the same as every other CPU. > > > > Would the MSR value be same on every CPU, if only some guests have > > enumerated FB_CLEAR and others haven't? 
> >
> Ignore the guest, I'm talking purely about the host. Specifically, there's no
> reason to do a RDMSR to get the host value on every VM-Enter since the host's
> value is effectively static post-boot.

That's right (ignoring a late microcode load adding stuff to the MSR, or msr-tools fiddling).

> > MSR writes (to disable FB_CLEAR) are not done when a guest enumerates
> > FB_CLEAR. Enumeration of FB_CLEAR in guest will depend on its configuration.
> >
> > > 6. Performance aside, KVM should not be speculating (ha!) on what the guest
> > > will and will not do, and should instead honor whatever behavior is presented
> > > to the guest. If the guest CPU model indicates that VERW flushes buffers,
> > > then KVM damn well needs to let VERW flush buffers.
> >
> > The current implementation allows guests to have VERW flush buffers when
> > they enumerate FB_CLEAR. It only restricts the flush behavior when the
> > guest is trying to mitigate against a vulnerability(like MDS) on a
> > hardware that is not affected. I guess its common for guests to be
> > running with older gen configuration on a newer hardware.
>
> Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest
> the guest will do things a certain way and should instead honor the "architectural"
> definition, in quotes because I realize there probably is no architectural
> definition for any of this.

Before MMIO Stale Data, processors that were not affected by MDS/TAA did not clear CPU buffers, even if they enumerated MD_CLEAR. On such processors, guests that deployed VERW (thinking they are vulnerable to MDS) did not clear the CPU buffers. After MMIO Stale Data was discovered, FB_CLEAR_DIS was introduced to restore this behavior.

> It might be that the code does (unintentionally?) honor the "architecture", i.e.
> this code might actually be accurrate with respect to when the guest can expect
> VERW to flush buffers. But the comment is so, so wrong.

Agree, the comment needs to explain this well.
> /*
>  * If guest will not execute VERW, there is no need to set FB_CLEAR_DIS
>  * at VMEntry. Skip the MSR read/write when a guest has no use case to
>  * execute VERW.
>  */
> if ((vcpu->arch.arch_capabilities & ARCH_CAP_FB_CLEAR) ||
>     ((vcpu->arch.arch_capabilities & ARCH_CAP_MDS_NO) &&
>      (vcpu->arch.arch_capabilities & ARCH_CAP_TAA_NO) &&
>      (vcpu->arch.arch_capabilities & ARCH_CAP_PSDP_NO) &&
>      (vcpu->arch.arch_capabilities & ARCH_CAP_FBSDP_NO) &&
>      (vcpu->arch.arch_capabilities & ARCH_CAP_SBDR_SSDP_NO)))
> 	vmx->disable_fb_clear = false;
On Tue, May 23, 2023 at 03:31:44AM +0800, Xiaoyao Li wrote: > On 5/23/2023 1:43 AM, Sean Christopherson wrote: > > > > 6. Performance aside, KVM should not be speculating (ha!) on what the guest > > > > will and will not do, and should instead honor whatever behavior is presented > > > > to the guest. If the guest CPU model indicates that VERW flushes buffers, > > > > then KVM damn well needs to let VERW flush buffers. > > > The current implementation allows guests to have VERW flush buffers when > > > they enumerate FB_CLEAR. It only restricts the flush behavior when the > > > guest is trying to mitigate against a vulnerability(like MDS) on a > > > hardware that is not affected. I guess its common for guests to be > > > running with older gen configuration on a newer hardware. > > Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest > > the guest will do things a certain way and should instead honor the "architectural" > > definition, in quotes because I realize there probably is no architectural > > definition for any of this. > > > > It might be that the code does (unintentionally?) honor the "architecture", i.e. > > this code might actually be accurrate with respect to when the guest can expect > > VERW to flush buffers. But the comment is so, so wrong. > > The comment is wrong and the code is wrong in some case as well. > > If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO, > ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to > VM, the VM is type of "affected by MDS". > > And accroding to the page https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html > > if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly > enumerates FB_CLEAR as part of their MD_CLEAR support. 
This is the excerpt from the link that you mentioned:

"For processors that are affected by MDS and support L1D_FLUSH operations and MD_CLEAR operations, the VERW instruction flushes fill buffers."

You are missing an important piece of information here: "For the processors _affected_ by MDS". On such processors ...

> However, the code will leave vmx->disable_fb_clear as 1 if hardware supports
> it, and VERW intruction doesn't clear FB in the VM, which conflicts
> "architectural" definition.

... Fill buffer clear is not enabled at all:

vmx_setup_fb_clear_ctrl()
{
	u64 msr;

	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) &&
	    !boot_cpu_has_bug(X86_BUG_MDS) &&
	    !boot_cpu_has_bug(X86_BUG_TAA)) {
		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
		if (msr & ARCH_CAP_FB_CLEAR_CTRL)
			vmx_fb_clear_ctrl_available = true;
	}
}
On 5/23/2023 5:23 AM, Pawan Gupta wrote: > On Tue, May 23, 2023 at 03:31:44AM +0800, Xiaoyao Li wrote: >> On 5/23/2023 1:43 AM, Sean Christopherson wrote: >>>>> 6. Performance aside, KVM should not be speculating (ha!) on what the guest >>>>> will and will not do, and should instead honor whatever behavior is presented >>>>> to the guest. If the guest CPU model indicates that VERW flushes buffers, >>>>> then KVM damn well needs to let VERW flush buffers. >>>> The current implementation allows guests to have VERW flush buffers when >>>> they enumerate FB_CLEAR. It only restricts the flush behavior when the >>>> guest is trying to mitigate against a vulnerability(like MDS) on a >>>> hardware that is not affected. I guess its common for guests to be >>>> running with older gen configuration on a newer hardware. >>> Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest >>> the guest will do things a certain way and should instead honor the "architectural" >>> definition, in quotes because I realize there probably is no architectural >>> definition for any of this. >>> >>> It might be that the code does (unintentionally?) honor the "architecture", i.e. >>> this code might actually be accurrate with respect to when the guest can expect >>> VERW to flush buffers. But the comment is so, so wrong. >> >> The comment is wrong and the code is wrong in some case as well. >> >> If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO, >> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to >> VM, the VM is type of "affected by MDS". >> >> And accroding to the page https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html >> >> if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly >> enumerates FB_CLEAR as part of their MD_CLEAR support. 
>
> This is the excerpt from the link that you mentioned:
>
> "For processors that are affected by MDS and support L1D_FLUSH
> operations and MD_CLEAR operations, the VERW instruction flushes fill
> buffers."
>
> You are missing an important information here "For the processors
> _affected_ by MDS". On such processors ...
>
> > However, the code will leave vmx->disable_fb_clear as 1 if hardware supports
> > it, and VERW intruction doesn't clear FB in the VM, which conflicts
> > "architectural" definition.
>
> ... Fill buffer clear is not enabled at all:
>
> vmx_setup_fb_clear_ctrl()
> {
> 	u64 msr;
>
> 	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) &&
> 	    !boot_cpu_has_bug(X86_BUG_MDS) &&
> 	    !boot_cpu_has_bug(X86_BUG_TAA)) {
> 		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
> 		if (msr & ARCH_CAP_FB_CLEAR_CTRL)
> 			vmx_fb_clear_ctrl_available = true;
> 	}
> }

This is the check for bare metal, while the check in vmx_update_fb_clear_dis() is for the guest VM.

For example, if the hardware (host) enumerates ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO, ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO, ARCH_CAP_FB_CLEAR, and ARCH_CAP_FB_CLEAR_CTRL, VERW on this hardware clears the Fill Buffer (if FB_CLEAR_DIS is not enabled in MSR_IA32_MCU_OPT_CTRL). vmx_setup_fb_clear_ctrl() does set vmx_fb_clear_ctrl_available to true.

If a guest is exposed without ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO, ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO and ARCH_CAP_FB_CLEAR, vmx_update_fb_clear_dis() will leave vmx->disable_fb_clear as true. So VERW doesn't clear the Fill Buffer for the guest. But in the view of the guest, it expects VERW to clear the Fill Buffer.
On Tue, May 23, 2023 at 09:00:50AM +0800, Xiaoyao Li wrote: > On 5/23/2023 5:23 AM, Pawan Gupta wrote: > > On Tue, May 23, 2023 at 03:31:44AM +0800, Xiaoyao Li wrote: > > > On 5/23/2023 1:43 AM, Sean Christopherson wrote: > > > > > > 6. Performance aside, KVM should not be speculating (ha!) on what the guest > > > > > > will and will not do, and should instead honor whatever behavior is presented > > > > > > to the guest. If the guest CPU model indicates that VERW flushes buffers, > > > > > > then KVM damn well needs to let VERW flush buffers. > > > > > The current implementation allows guests to have VERW flush buffers when > > > > > they enumerate FB_CLEAR. It only restricts the flush behavior when the > > > > > guest is trying to mitigate against a vulnerability(like MDS) on a > > > > > hardware that is not affected. I guess its common for guests to be > > > > > running with older gen configuration on a newer hardware. > > > > Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest > > > > the guest will do things a certain way and should instead honor the "architectural" > > > > definition, in quotes because I realize there probably is no architectural > > > > definition for any of this. > > > > > > > > It might be that the code does (unintentionally?) honor the "architecture", i.e. > > > > this code might actually be accurrate with respect to when the guest can expect > > > > VERW to flush buffers. But the comment is so, so wrong. > > > > > > The comment is wrong and the code is wrong in some case as well. > > > > > > If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO, > > > ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to > > > VM, the VM is type of "affected by MDS". 
> > > > > > And accroding to the page https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html > > > > > > if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly > > > enumerates FB_CLEAR as part of their MD_CLEAR support. > > > > This is the excerpt from the link that you mentioned: > > > > "For processors that are affected by MDS and support L1D_FLUSH > > operations and MD_CLEAR operations, the VERW instruction flushes fill > > buffers." > > > > You are missing an important information here "For the processors > > _affected_ by MDS". On such processors ... > > > > > However, the code will leave vmx->disable_fb_clear as 1 if hardware supports > > > it, and VERW intruction doesn't clear FB in the VM, which conflicts > > > "architectural" definition. > > > > ... Fill buffer clear is not enabled at all: > > > > vmx_setup_fb_clear_ctrl() > > { > > u64 msr; > > if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) && > > !boot_cpu_has_bug(X86_BUG_MDS) && > > !boot_cpu_has_bug(X86_BUG_TAA)) { > > rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr); > > if (msr & ARCH_CAP_FB_CLEAR_CTRL) > > vmx_fb_clear_ctrl_available = true; > > } > > } > > This is the check of bare metal, while the check in > vmx_update_fb_clear_dis() is of guest VM. > > For example, if the hardware (host) enumerates ARCH_CAP_TAA_NO, > ARCH_CAP_MDS_NO, ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO, > ARCH_CAP_FB_CLEAR, and ARCH_CAP_FB_CLEAR_CTRL, the VERW on this hardware > clears Fill Buffer (if FB_CLEAR_DIS is not enabled in > MSR_IA32_MCU_OPT_CTRL). vmx_setup_fb_clear_ctrl() does set > vmx_fb_clear_ctrl_available to true. > > If a guest is exposed without ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO, > ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO and > ARCH_CAP_FB_CLEAR, vmx_update_fb_clear_dis() will leave > vmx->disable_fb_clear as true. 
> So VERW doesn't clear Fill Buffer for guest.
> But in the view of guset, it expects VERW to clear Fill Buffer.

That is correct, but whether VERW clears the CPU buffers also depends on whether the hardware is affected or not; enumerating MD_CLEAR alone does not guarantee that VERW will flush CPU buffers. This was true even before MMIO Stale Data was discovered.

If the host (hardware) enumerates:

MD_CLEAR | MDS_NO | VERW behavior
---------|--------|-------------------
    1    |   0    | Clears CPU buffers

But on MDS-mitigated hardware (MDS_NO=1), if the guest enumerates:

MD_CLEAR | MDS_NO | VERW behavior
---------|--------|------------------------
    1    |   0    | Not guaranteed to clear
         |        | CPU buffers

After MMIO Stale Data, FB_CLEAR_DIS was introduced to keep this behavior intact (for hardware that is not affected by MDS/TAA). If userspace truly wants the guest to have VERW flush behavior, it can expose FB_CLEAR.

I see your point that from a guest's perspective it is being lied to about VERW behavior. OTOH, I am not sure if it is a good enough reason for mitigated hardware to keep the overhead of clearing micro-architectural buffers for generations of CPUs.
On Mon, May 22, 2023 at 01:54:55PM -0700, Pawan Gupta wrote:
> On Mon, May 22, 2023 at 10:43:49AM -0700, Sean Christopherson wrote:
> > On Fri, May 19, 2023, Pawan Gupta wrote:
> > > On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> > > > I made the mistake of digging into why KVM doesn't advertise ARCH_CAP_FB_CLEAR_CTRL...
> > > >
> > > > 1. I see *nothing* in commit 027bbb884be0 ("KVM: x86/speculation: Disable Fill
> > > >    buffer clear within guests") that justifies 1x RDMSR and 2x WRMSR on every
> > > >    entry+exit.
> > >
> > > Unnecessary VERWs in guest will have much higher impact than due to MSR
> > > read/write at vmentry/exit.
> >
> > Can you provide numbers for something closeish to a real world workload?
>
> I am collecting the numbers, will update here soon.

Looks like avoiding VERW flush behavior in the guest results in a 2-5% performance improvement. Pybench and the CPU-bound sysbench test show a minor improvement when avoiding reading/writing MSR_MCU_OPT_CTRL at VMentry/VMexit.
Baseline : v6.4-rc3
Target   : v6.4-rc3 + No read/write to MSR_MCU_OPT_CTRL at VMentry/VMexit

| Test      | Configuration | Relative |
| --------- | ------------- | -------- |
| nginx     | 200           | 0.977    |
| hackbench | 8 - Process   | 0.975    |
| sysbench  | RAM / Memory  | 0.946    |
| sysbench  | CPU           | 1.002    |
| pybench   | T.F.A.T.T     | 1.007    |

Host configuration (Icelake server):

  CPU family:                      6
  Model:                           106
  Stepping:                        6

  Vulnerability Itlb multihit:     Not affected
  Vulnerability L1tf:              Not affected
  Vulnerability Mds:               Not affected
  Vulnerability Meltdown:          Not affected
  Vulnerability Mmio stale data:   Mitigation; Clear CPU buffers; SMT vulnerable
  Vulnerability Retbleed:          Not affected
  Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Vulnerability Spectre v2:        Vulnerable: eIBRS with unprivileged eBPF
  Vulnerability Srbds:             Not affected
  Vulnerability Tsx async abort:   Not affected

Guest configuration (Skylake Client):

  CPU family:        6
  Model:             94
  Stepping:          3

  Vulnerabilities:
  Itlb multihit:     KVM: Mitigation: VMX unsupported
  L1tf:              Mitigation; PTE Inversion
  Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:          Mitigation; PTI
  Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:          Vulnerable
  Spec store bypass: Vulnerable
  Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:        Mitigation; Retpolines, STIBP disabled, RSB filling, PB RSB-eIBRS Not affected
  Srbds:             Unknown: Dependent on hypervisor status
  Tsx async abort:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
On 5/23/2023 11:34 AM, Pawan Gupta wrote: > On Tue, May 23, 2023 at 09:00:50AM +0800, Xiaoyao Li wrote: >> On 5/23/2023 5:23 AM, Pawan Gupta wrote: >>> On Tue, May 23, 2023 at 03:31:44AM +0800, Xiaoyao Li wrote: >>>> On 5/23/2023 1:43 AM, Sean Christopherson wrote: >>>>>>> 6. Performance aside, KVM should not be speculating (ha!) on what the guest >>>>>>> will and will not do, and should instead honor whatever behavior is presented >>>>>>> to the guest. If the guest CPU model indicates that VERW flushes buffers, >>>>>>> then KVM damn well needs to let VERW flush buffers. >>>>>> The current implementation allows guests to have VERW flush buffers when >>>>>> they enumerate FB_CLEAR. It only restricts the flush behavior when the >>>>>> guest is trying to mitigate against a vulnerability(like MDS) on a >>>>>> hardware that is not affected. I guess its common for guests to be >>>>>> running with older gen configuration on a newer hardware. >>>>> Right, I'm saying that that behavior is wrong. KVM shouldn't assume the guest >>>>> the guest will do things a certain way and should instead honor the "architectural" >>>>> definition, in quotes because I realize there probably is no architectural >>>>> definition for any of this. >>>>> >>>>> It might be that the code does (unintentionally?) honor the "architecture", i.e. >>>>> this code might actually be accurrate with respect to when the guest can expect >>>>> VERW to flush buffers. But the comment is so, so wrong. >>>> >>>> The comment is wrong and the code is wrong in some case as well. >>>> >>>> If none of ARCH_CAP_FB_CLEAR, ARCH_CAP_MDS_NO, ARCH_CAP_TAA_NO, >>>> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO and ARCH_CAP_SBDR_SSDP_NO are exposed to >>>> VM, the VM is type of "affected by MDS". 
>>>> >>>> And accroding to the page https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html >>>> >>>> if the VM enumerates support for both L1D_FLUSH and MD_CLEAR, it implicitly >>>> enumerates FB_CLEAR as part of their MD_CLEAR support. >>> >>> This is the excerpt from the link that you mentioned: >>> >>> "For processors that are affected by MDS and support L1D_FLUSH >>> operations and MD_CLEAR operations, the VERW instruction flushes fill >>> buffers." >>> >>> You are missing an important information here "For the processors >>> _affected_ by MDS". On such processors ... >>> >>>> However, the code will leave vmx->disable_fb_clear as 1 if hardware supports >>>> it, and VERW intruction doesn't clear FB in the VM, which conflicts >>>> "architectural" definition. >>> >>> ... Fill buffer clear is not enabled at all: >>> >>> vmx_setup_fb_clear_ctrl() >>> { >>> u64 msr; >>> if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES) && >>> !boot_cpu_has_bug(X86_BUG_MDS) && >>> !boot_cpu_has_bug(X86_BUG_TAA)) { >>> rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr); >>> if (msr & ARCH_CAP_FB_CLEAR_CTRL) >>> vmx_fb_clear_ctrl_available = true; >>> } >>> } >> >> This is the check of bare metal, while the check in >> vmx_update_fb_clear_dis() is of guest VM. >> >> For example, if the hardware (host) enumerates ARCH_CAP_TAA_NO, >> ARCH_CAP_MDS_NO, ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO, >> ARCH_CAP_FB_CLEAR, and ARCH_CAP_FB_CLEAR_CTRL, the VERW on this hardware >> clears Fill Buffer (if FB_CLEAR_DIS is not enabled in >> MSR_IA32_MCU_OPT_CTRL). vmx_setup_fb_clear_ctrl() does set >> vmx_fb_clear_ctrl_available to true. >> >> If a guest is exposed without ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO, >> ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO and >> ARCH_CAP_FB_CLEAR, vmx_update_fb_clear_dis() will leave >> vmx->disable_fb_clear as true. 
>> So VERW doesn't clear Fill Buffer for guest.
>> But in the view of guset, it expects VERW to clear Fill Buffer.
>
> That is correct, but whether VERW clears the CPU buffers also depends on
> if the hardware is affected or not, enumerating MD_CLEAR solely does not
> guarantee that VERW will flush CPU buffers. This was true even before
> MMIO Stale Data was discovered.
>
> If host(hardware) enumerates:
>
> MD_CLEAR | MDS_NO | VERW behavior
> ---------|--------|-------------------
>     1    |   0    | Clears CPU buffers
>
> But on an MDS mitigated hardware(MDS_NO=1) if guest enumerates:
>
> MD_CLEAR | MDS_NO | VERW behavior
> ---------|--------|-----------------------
>     1    |   0    | Not guaranteed to clear
>                     CPU buffers
>
> After MMIO Stale Data, FB_CLEAR_DIS was introduced to keep this behavior
> intact(for hardware that is not affected by MDS/TAA).

Sorry, I don't understand. What is the behavior?

> If the userspace
> truly wants the guest to have VERW flush behavior, it can export
> FB_CLEAR.
>
> I see your point that from a guest's perspective it is being lied about
> VERW behavior. OTOH, I am not sure if it is a good enough reason for
> mitigated hardware to keep the overhead of clearing micro-architectural
> buffers for generations of CPUs.

The user takes the responsibility because it itself requested the specific feature combination for its guest.
On Thu, May 25, 2023 at 11:42:24PM +0800, Xiaoyao Li wrote:
> On 5/23/2023 11:34 AM, Pawan Gupta wrote:
> > > If a guest is exposed without ARCH_CAP_TAA_NO, ARCH_CAP_MDS_NO,
> > > ARCH_CAP_PSDP_NO, ARCH_CAP_FBSDP_NO, ARCH_CAP_SBDR_SSDP_NO and
> > > ARCH_CAP_FB_CLEAR, vmx_update_fb_clear_dis() will leave
> > > vmx->disable_fb_clear as true. So VERW doesn't clear Fill Buffer for guest.
> > > But in the view of guset, it expects VERW to clear Fill Buffer.
> >
> > That is correct, but whether VERW clears the CPU buffers also depends on
> > if the hardware is affected or not, enumerating MD_CLEAR solely does not
> > guarantee that VERW will flush CPU buffers. This was true even before
> > MMIO Stale Data was discovered.
> >
> > If host(hardware) enumerates:
> >
> > MD_CLEAR | MDS_NO | VERW behavior
> > ---------|--------|-------------------
> >     1    |   0    | Clears CPU buffers
> >
> > But on an MDS mitigated hardware(MDS_NO=1) if guest enumerates:
> >
> > MD_CLEAR | MDS_NO | VERW behavior
> > ---------|--------|-----------------------
> >     1    |   0    | Not guaranteed to clear
> >                     CPU buffers
> >
> > After MMIO Stale Data, FB_CLEAR_DIS was introduced to keep this behavior
> > intact(for hardware that is not affected by MDS/TAA).
>
> Sorry, I don't understand it. What the behavior is?

That on mitigated hardware, VERW may not clear the micro-architectural buffers. There are many micro-architectural buffers; VERW only clears the affected ones. This is indicated in section "Fill Buffer Clearing Operations" of [1].

Some processors may enumerate MD_CLEAR because they overwrite all buffers affected by MDS/TAA, but they do not overwrite fill buffer values. This is because fill buffers are not susceptible to MDS or TAA on those processors. For processors affected by FBSDP where MD_CLEAR may not overwrite fill buffer values, Intel has released microcode updates that enumerate FB_CLEAR so that VERW does overwrite fill buffer values.
> > If the userspace > > truly wants the guest to have VERW flush behavior, it can export > > FB_CLEAR. > > > > I see your point that from a guest's perspective it is being lied about > > VERW behavior. OTOH, I am not sure if it is a good enough reason for > > mitigated hardware to keep the overhead of clearing micro-architectural > > buffers for generations of CPUs. > > User takes the responsiblity because itself requests the specific feature > combination for its guest. As I understand, the MD_CLEAR enumeration on mitigated hardware is done purely for VM migration compatibility. Software is not expected to use VERW on mitigated hardware, below is from MDS documentation [2]: Future processors set the MDS_NO bit in IA32_ARCH_CAPABILITIES to indicate they are not affected by microarchitectural data sampling. Such processors will continue to enumerate the MD_CLEAR bit in CPUID. As none of these data buffers are vulnerable to exposure on such parts, no data buffer overwriting is required or expected for such parts, despite the MD_CLEAR indication. Software should look to the MDS_NO bit to determine whether buffer overwriting mitigations are required. [1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/processor-mmio-stale-data-vulnerabilities.html [2] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/resources/processors-affected-mds.html
On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
>FWIW, this trainwreck is another reason why I'm not going to look at the proposed
>"Intel IA32_SPEC_CTRL Virtualization" crud until external forces dictate that I
>do so. I have zero confidence that a paravirt interface defined by hardware
>vendors to fiddle with mitigations will be sane, flexible, and extensible.

Hi Sean,

Just to confirm we are on the same page:

"Intel IA32_SPEC_CTRL Virtualization" series consists of 3 parts:

1. Expose BHI_CTRL, RRSBA_CTRL to guests. They are hardware mitigations to
   disable BHI and RRSBA behaviors and can be used by both guest/host.

2. Enable IA32_SPEC_CTRL Virtualization which is a VMX feature. This is for KVM
   to effectively lock some bits of IA32_SPEC_CTRL MSR when guests are running.

3. Implement the paravirt interface (the virtual MSRs) for guests to report
   software mitigations in-use. KVM can utilize such information to enable
   hardware mitigations for guests transparently to address software mitigation
   effectiveness issues caused by CPU microarchitecture changes (RRSBA behavior,
   size of branch history table).

As per my understanding, your concerns are primarily focused on #3, the
paravirt interface, rather than the entire series. Am I correct in assuming that
you do not oppose #1 and #2?

You previously mentioned that the paravirt interface was not common [1], and
this time, you expressed an expectation for the interface to be "sane, flexible,
and extensible." To ensure clarity, I want to confirm my interpretation of
your expectations:

1. The interface should not be tied to a specific CPU vendor but instead be
   beneficial for Intel and AMD (and even ARM, and potentially others).

2. The interface should have the capability to solve other issues (e.g.,
   coordinate mitigations in guest/host to address other perf/function issues),
   not limited to software mitigation effectiveness on Intel CPUs.

3.
The interface should be extendable by VMMs rather than relying on hardware vendors rolling out a new spec. This enables VMM developers to propose new ideas to coordinate mitigations in guest/host.

Please let me know if I missed any key points or if any of the above statements do not align with your expectations.

[1]: https://lore.kernel.org/all/Y6Sin1bmLN10yvMw@google.com/
On Mon, May 29, 2023, Chao Gao wrote:
> On Thu, May 18, 2023 at 10:33:15AM -0700, Sean Christopherson wrote:
> >FWIW, this trainwreck is another reason why I'm not going to look at the proposed
> >"Intel IA32_SPEC_CTRL Virtualization" crud until external forces dictate that I
> >do so. I have zero confidence that a paravirt interface defined by hardware
> >vendors to fiddle with mitigations will be sane, flexible, and extensible.
>
> Hi Sean,
>
> Just to confirm we are on the same page:
>
> "Intel IA32_SPEC_CTRL Virtualization" series consists of 3 parts:
>
> 1. Expose BHI_CTRL, RRSBA_CTRL to guests. They are hardware mitigations to
> disable BHI and RRSBA behaviors and can be used by both guest/host.
>
> 2. Enable IA32_SPEC_CTRL Virtualization which is a VMX feature. This is for KVM
> to effectively lock some bits of IA32_SPEC_CTRL MSR when guests are running.
>
> 3. Implement the paravirt interface (the virtual MSRs) for guests to report
> software mitigations in-use. KVM can utilize such information to enable
> hardware mitigations for guests transparently to address software mitigation
> effectiveness issues caused by CPU microarchitecture changes (RRSBA behavior,
> size of branch history table).
>
> As per my understanding, your concerns are primarily focused on #3, the
> paravirt interface, rather than the entire series. Am I correct in assuming that
> you do not oppose #1 and #2?

Yes, correct. I definitely recommend posting #1 and #2 separately from the paravirt interface; I ignored the entire series without realizing there is real hardware support in there.

> You previously mentioned that the paravirt interface was not common [1], and
> this time, you expressed an expectation for the interface to be "sane, flexible,
> and extensible." To ensure clarity, I want to confirm my interpretation of
> your expectations:
>
> 1.
> The interface should not be tied to a specific CPU vendor but instead be
> beneficial for Intel and AMD (and even ARM, and potentially others).
>
> 2. The interface should have the capability to solve other issues (e.g,
> coordinate mitigations in guest/host to address other perf/function issues),
> not limited to software mitigation effectiveness on Intel CPUs.
> 3. The interface should be extendable by VMMs rather than relying on hardware
> vendors rolling out new spec. This enables VMM developers to propose new
> ideas to coordinate mitigations in guest/host.

Ya, that's more or less my opinion. Even more than allowing VMM developers to extend/define the interface, I want the definition of the interface to be a community/collaborative effort. LKML has active representatives from all of the major (known) hypervisors, so it shouldn't be *that* hard to figure out a way to make the interface community driven.

Note that it doesn't necessarily have to be VMM developers, e.g. many of the people that are intimately familiar with the mitigations aren't virtualization folks.

> Please let me know if I missed any key points or if any of the above statements
> do not align with your expectations.
>
> [1]: https://lore.kernel.org/all/Y6Sin1bmLN10yvMw@google.com/
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 523c39a03c00..94aa70ec169c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1670,7 +1670,7 @@ static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
 {
 	switch (msr->index) {
 	case MSR_IA32_ARCH_CAPABILITIES:
-		msr->data = kvm_get_arch_capabilities();
+		msr->data = kvm_caps.supported_arch_cap;
 		break;
 	case MSR_IA32_PERF_CAPABILITIES:
 		msr->data = kvm_caps.supported_perf_cap;
@@ -9523,6 +9523,7 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 		kvm_caps.max_guest_tsc_khz = max;
 	}
 	kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
+	kvm_caps.supported_arch_cap = kvm_get_arch_capabilities();
 	kvm_init_msr_lists();
 	return 0;
 
@@ -11879,7 +11880,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	if (r)
 		goto free_guest_fpu;
 
-	vcpu->arch.arch_capabilities = kvm_get_arch_capabilities();
+	vcpu->arch.arch_capabilities = kvm_caps.supported_arch_cap;
 	vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
 	kvm_xen_init_vcpu(vcpu);
 	kvm_vcpu_mtrr_init(vcpu);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c544602d07a3..d3e524bcc169 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -29,6 +29,7 @@ struct kvm_caps {
 	u64 supported_xcr0;
 	u64 supported_xss;
 	u64 supported_perf_cap;
+	u64 supported_arch_cap;
 };
 
 void kvm_spurious_fault(void);
to avoid computing the supported value at runtime every time.

No functional change intended.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
A new call site of kvm_get_arch_capabilities() is added by [1]. It should be
replaced with the cached value in kvm_caps.

[1] https://lore.kernel.org/all/20230504181827.130532-1-mizhang@google.com/

 arch/x86/kvm/x86.c | 5 +++--
 arch/x86/kvm/x86.h | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

base-commit: b3c98052d46948a8d65d2778c7f306ff38366aac