Message ID | 20220113163833.3831-4-andrew.cooper3@citrix.com (mailing list archive) |
---|---|
State | Superseded |
Series | x86/spec-ctrl: Fix NMI race condition |
On Thu, Jan 13, 2022 at 04:38:33PM +0000, Andrew Cooper wrote: > The logic was based on a mistaken understanding of how NMI blocking on vmexit > works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits. > Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path, > and the guest's value will be clobbered before it is saved. > > Switch to using MSR load/save lists. This causes the guest value to be saved > atomically with respect to NMIs/MCEs/etc. > > First, update vmx_cpuid_policy_changed() to configure the load/save lists at > the same time as configuring the intercepts. This function is always used in > remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole > function, rather than having multiple remote acquisitions of the same VMCS. > > vmx_add_guest_msr() can fail, but only in ways which are entirely fatal to the > guest, so handle failures using domain_crash(). vmx_del_msr() can fail with > -ESRCH but we don't matter, and this path will be taken during domain create > on a system lacking IBRSB. > > Second, update vmx_msr_{read,write}_intercept() to use the load/save lists > rather than vcpu_msrs, and update the comment to describe the new state > location. > > Finally, adjust the entry/exit asm. Drop DO_SPEC_CTRL_ENTRY_FROM_HVM > entirely, and use a shorter code sequence to simply reload Xen's setting from > the top-of-stack block. > > Because the guest values are loaded/saved atomically, we do not need to use > the shadowing logic to cope with late NMIs/etc, so we can omit > DO_SPEC_CTRL_EXIT_TO_GUEST entirely and VMRESUME/VMLAUNCH with Xen's value in > context. Furthermore, we can drop the SPEC_CTRL_ENTRY_FROM_PV too, as there's > no need to switch back to Xen's context on an early failure. > > Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> > --- > CC: Jan Beulich <JBeulich@suse.com> > CC: Roger Pau Monné <roger.pau@citrix.com> > CC: Wei Liu <wl@xen.org> > CC: Jun Nakajima <jun.nakajima@intel.com> > CC: Kevin Tian <kevin.tian@intel.com> > > Needs backporting as far as people can tolerate. > > If the entry/exit logic were in C, I'd ASSERT() that shadow tracking is off, > but this is awkard to arrange in asm. > --- > xen/arch/x86/hvm/vmx/entry.S | 19 ++++++++++------- > xen/arch/x86/hvm/vmx/vmx.c | 36 +++++++++++++++++++++++++++----- > xen/arch/x86/include/asm/msr.h | 10 ++++++++- > xen/arch/x86/include/asm/spec_ctrl_asm.h | 32 ++++------------------------ > 4 files changed, 56 insertions(+), 41 deletions(-) > > diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S > index 30139ae58e9d..297ed8685126 100644 > --- a/xen/arch/x86/hvm/vmx/entry.S > +++ b/xen/arch/x86/hvm/vmx/entry.S > @@ -35,7 +35,14 @@ ENTRY(vmx_asm_vmexit_handler) > > /* SPEC_CTRL_ENTRY_FROM_VMX Req: b=curr %rsp=regs/cpuinfo, Clob: acd */ > ALTERNATIVE "", DO_OVERWRITE_RSB, X86_FEATURE_SC_RSB_HVM > - ALTERNATIVE "", DO_SPEC_CTRL_ENTRY_FROM_HVM, X86_FEATURE_SC_MSR_HVM > + > + .macro restore_spec_ctrl > + mov $MSR_SPEC_CTRL, %ecx > + movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax > + xor %edx, %edx > + wrmsr > + .endm > + ALTERNATIVE "", restore_spec_ctrl, X86_FEATURE_SC_MSR_HVM I assume loading the host value into SPEC_CTRL strictly needs to happen after the RSB overwrite, or else you could use a VM exit host load MSR entry to handle MSR_SPEC_CTRL. Also I don't understand why SPEC_CTRL shadowing is not cleared before loading Xen's value now. > /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. 
*/ > > /* Hardware clears MSR_DEBUGCTL on VMExit. Reinstate it if debugging Xen. */ > @@ -83,7 +90,6 @@ UNLIKELY_END(realmode) > > /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */ > /* SPEC_CTRL_EXIT_TO_VMX Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */ > - ALTERNATIVE "", DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_HVM > ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), X86_FEATURE_SC_VERW_HVM > > mov VCPU_hvm_guest_cr2(%rbx),%rax > @@ -119,12 +125,11 @@ UNLIKELY_END(realmode) > SAVE_ALL > > /* > - * PV variant needed here as no guest code has executed (so > - * MSR_SPEC_CTRL can't have changed value), and NMIs/MCEs are liable > - * to hit (in which case the HVM variant might corrupt things). > + * SPEC_CTRL_ENTRY notes > + * > + * If we end up here, no guest code has executed. We still have Xen's > + * choice of MSR_SPEC_CTRL in context, and the RSB is safe. > */ > - SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo Clob: acd */ > - /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ > > call vmx_vmentry_failure > jmp .Lvmx_process_softirqs > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c > index 28ee6393f11e..d7feb5f9c455 100644 > --- a/xen/arch/x86/hvm/vmx/vmx.c > +++ b/xen/arch/x86/hvm/vmx/vmx.c > @@ -592,6 +592,7 @@ void vmx_update_exception_bitmap(struct vcpu *v) > static void vmx_cpuid_policy_changed(struct vcpu *v) > { > const struct cpuid_policy *cp = v->domain->arch.cpuid; > + int rc = 0; > > if ( opt_hvm_fep || > (v->domain->arch.cpuid->x86_vendor != boot_cpu_data.x86_vendor) ) > @@ -601,17 +602,27 @@ static void vmx_cpuid_policy_changed(struct vcpu *v) > > vmx_vmcs_enter(v); > vmx_update_exception_bitmap(v); > - vmx_vmcs_exit(v); > > /* > * We can safely pass MSR_SPEC_CTRL through to the guest, even if STIBP > * isn't enumerated in hardware, as SPEC_CTRL_STIBP is ignored. > */ > if ( cp->feat.ibrsb ) > + { > vmx_clear_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW); > + > + rc = vmx_add_guest_msr(v, MSR_SPEC_CTRL, 0); > + if ( rc ) > + goto err; > + } > else > + { > vmx_set_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW); > > + /* Can only fail with -ESRCH, and we don't care. */ > + vmx_del_msr(v, MSR_SPEC_CTRL, VMX_MSR_GUEST); > + } > + > /* MSR_PRED_CMD is safe to pass through if the guest knows about it. */ > if ( cp->feat.ibrsb || cp->extd.ibpb ) > vmx_clear_msr_intercept(v, MSR_PRED_CMD, VMX_MSR_RW); > @@ -623,6 +634,15 @@ static void vmx_cpuid_policy_changed(struct vcpu *v) > vmx_clear_msr_intercept(v, MSR_FLUSH_CMD, VMX_MSR_RW); > else > vmx_set_msr_intercept(v, MSR_FLUSH_CMD, VMX_MSR_RW); > + > + err: Nit: I would name this out, since it's used by both the error and the normal exit paths, but that's just my taste. > + vmx_vmcs_exit(v); > + > + if ( rc ) > + { > + printk(XENLOG_G_ERR "MSR load/save list error: %d", rc); > + domain_crash(v->domain); > + } > } > > int vmx_guest_x86_mode(struct vcpu *v) > @@ -3128,7 +3148,6 @@ static int is_last_branch_msr(u32 ecx) > static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content) > { > struct vcpu *curr = current; > - const struct vcpu_msrs *msrs = curr->arch.msrs; > uint64_t tmp; > > HVM_DBG_LOG(DBG_LEVEL_MSR, "ecx=%#x", msr); > @@ -3136,7 +3155,11 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content) > switch ( msr ) > { > case MSR_SPEC_CTRL: /* guest_rdmsr() has already performed #GP checks. 
*/
> -        *msr_content = msrs->spec_ctrl.raw;
> +        if ( vmx_read_guest_msr(curr, msr, msr_content) )
> +        {
> +            ASSERT_UNREACHABLE();
> +            goto gp_fault;
> +        }

AFAICT this is required just for Xen internal callers, as a guest
attempt to access MSR_SPEC_CTRL will never be trapped, or if trapped it
would be because !cp->feat.ibrsb and thus won't get here anyway.

Thanks, Roger.
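For reference, the "MSR load/save lists" under discussion are the VMCS-referenced arrays of 16-byte entries that the CPU walks atomically as part of VM entry and VM exit (the VM-exit MSR-store and VM-entry MSR-load areas). A minimal sketch of one entry follows; the field names mirror Xen's struct vmx_msr_entry and should be treated as an assumption, with the SDM's VM-entry/VM-exit control chapters as the authoritative reference.

    #include <stdint.h>

    /*
     * One MSR load/save list entry (16 bytes): a 32-bit MSR index, 32
     * reserved bits, then the 64-bit MSR value.  vmx_add_guest_msr()
     * appends such an entry so that VMEXIT stores the guest's
     * MSR_SPEC_CTRL here instead of leaving it live in the register.
     */
    struct vmx_msr_entry {
        uint32_t index;    /* MSR number, e.g. MSR_SPEC_CTRL (0x48) */
        uint32_t mbz;      /* reserved, must be zero */
        uint64_t data;     /* value saved on VMEXIT / loaded on VMENTRY */
    };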
On 14/01/2022 12:50, Roger Pau Monné wrote: > On Thu, Jan 13, 2022 at 04:38:33PM +0000, Andrew Cooper wrote: >> The logic was based on a mistaken understanding of how NMI blocking on vmexit >> works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits. >> Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path, >> and the guest's value will be clobbered before it is saved. >> >> Switch to using MSR load/save lists. This causes the guest value to be saved >> atomically with respect to NMIs/MCEs/etc. >> >> First, update vmx_cpuid_policy_changed() to configure the load/save lists at >> the same time as configuring the intercepts. This function is always used in >> remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole >> function, rather than having multiple remote acquisitions of the same VMCS. >> >> vmx_add_guest_msr() can fail, but only in ways which are entirely fatal to the >> guest, so handle failures using domain_crash(). vmx_del_msr() can fail with >> -ESRCH but we don't matter, and this path will be taken during domain create >> on a system lacking IBRSB. >> >> Second, update vmx_msr_{read,write}_intercept() to use the load/save lists >> rather than vcpu_msrs, and update the comment to describe the new state >> location. >> >> Finally, adjust the entry/exit asm. Drop DO_SPEC_CTRL_ENTRY_FROM_HVM >> entirely, and use a shorter code sequence to simply reload Xen's setting from >> the top-of-stack block. >> >> Because the guest values are loaded/saved atomically, we do not need to use >> the shadowing logic to cope with late NMIs/etc, so we can omit >> DO_SPEC_CTRL_EXIT_TO_GUEST entirely and VMRESUME/VMLAUNCH with Xen's value in >> context. Furthermore, we can drop the SPEC_CTRL_ENTRY_FROM_PV too, as there's >> no need to switch back to Xen's context on an early failure. >> >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> >> --- >> CC: Jan Beulich <JBeulich@suse.com> >> CC: Roger Pau Monné <roger.pau@citrix.com> >> CC: Wei Liu <wl@xen.org> >> CC: Jun Nakajima <jun.nakajima@intel.com> >> CC: Kevin Tian <kevin.tian@intel.com> >> >> Needs backporting as far as people can tolerate. >> >> If the entry/exit logic were in C, I'd ASSERT() that shadow tracking is off, >> but this is awkard to arrange in asm. >> --- >> xen/arch/x86/hvm/vmx/entry.S | 19 ++++++++++------- >> xen/arch/x86/hvm/vmx/vmx.c | 36 +++++++++++++++++++++++++++----- >> xen/arch/x86/include/asm/msr.h | 10 ++++++++- >> xen/arch/x86/include/asm/spec_ctrl_asm.h | 32 ++++------------------------ >> 4 files changed, 56 insertions(+), 41 deletions(-) >> >> diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S >> index 30139ae58e9d..297ed8685126 100644 >> --- a/xen/arch/x86/hvm/vmx/entry.S >> +++ b/xen/arch/x86/hvm/vmx/entry.S >> @@ -35,7 +35,14 @@ ENTRY(vmx_asm_vmexit_handler) >> >> /* SPEC_CTRL_ENTRY_FROM_VMX Req: b=curr %rsp=regs/cpuinfo, Clob: acd */ >> ALTERNATIVE "", DO_OVERWRITE_RSB, X86_FEATURE_SC_RSB_HVM >> - ALTERNATIVE "", DO_SPEC_CTRL_ENTRY_FROM_HVM, X86_FEATURE_SC_MSR_HVM >> + >> + .macro restore_spec_ctrl >> + mov $MSR_SPEC_CTRL, %ecx >> + movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax >> + xor %edx, %edx >> + wrmsr >> + .endm >> + ALTERNATIVE "", restore_spec_ctrl, X86_FEATURE_SC_MSR_HVM > I assume loading the host value into SPEC_CTRL strictly needs to > happen after the RSB overwrite, or else you could use a VM exit host > load MSR entry to handle MSR_SPEC_CTRL. 
We could use the host load list, but Intel warned that the lists aren't as efficient as writing it like this, hence why I didn't use them originally when we thought the NMI behaviour was different. Obviously, we're using them now for correctness reasons, which is more important than avoiding them for (unquantified) perf reasons. I've got some plans for changes to how Xen handles MSR_SPEC_CTRL for its own protection, which I hope will bring a substantial efficiency improvements, and those would require us not to use a host load list here. > Also I don't understand why SPEC_CTRL shadowing is not cleared before > loading Xen's value now. Because we now don't set shadowing it in the first place, because of ... > >> /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ >> >> /* Hardware clears MSR_DEBUGCTL on VMExit. Reinstate it if debugging Xen. */ >> @@ -83,7 +90,6 @@ UNLIKELY_END(realmode) >> >> /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */ >> /* SPEC_CTRL_EXIT_TO_VMX Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */ >> - ALTERNATIVE "", DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_HVM >> ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), X86_FEATURE_SC_VERW_HVM ... this hunk. Also, see the note about ASSERT() in the commit message. I'm happy to try and make this clearer in the commit message if you have any suggestions. >> >> mov VCPU_hvm_guest_cr2(%rbx),%rax >> @@ -119,12 +125,11 @@ UNLIKELY_END(realmode) >> SAVE_ALL >> >> /* >> - * PV variant needed here as no guest code has executed (so >> - * MSR_SPEC_CTRL can't have changed value), and NMIs/MCEs are liable >> - * to hit (in which case the HVM variant might corrupt things). >> + * SPEC_CTRL_ENTRY notes >> + * >> + * If we end up here, no guest code has executed. We still have Xen's >> + * choice of MSR_SPEC_CTRL in context, and the RSB is safe. >> */ >> - SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo Clob: acd */ >> - /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ >> >> call vmx_vmentry_failure >> jmp .Lvmx_process_softirqs >> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c >> index 28ee6393f11e..d7feb5f9c455 100644 >> --- a/xen/arch/x86/hvm/vmx/vmx.c >> +++ b/xen/arch/x86/hvm/vmx/vmx.c >> @@ -592,6 +592,7 @@ void vmx_update_exception_bitmap(struct vcpu *v) >> static void vmx_cpuid_policy_changed(struct vcpu *v) >> { >> const struct cpuid_policy *cp = v->domain->arch.cpuid; >> + int rc = 0; >> >> if ( opt_hvm_fep || >> (v->domain->arch.cpuid->x86_vendor != boot_cpu_data.x86_vendor) ) >> @@ -601,17 +602,27 @@ static void vmx_cpuid_policy_changed(struct vcpu *v) >> >> vmx_vmcs_enter(v); >> vmx_update_exception_bitmap(v); >> - vmx_vmcs_exit(v); >> >> /* >> * We can safely pass MSR_SPEC_CTRL through to the guest, even if STIBP >> * isn't enumerated in hardware, as SPEC_CTRL_STIBP is ignored. >> */ >> if ( cp->feat.ibrsb ) >> + { >> vmx_clear_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW); >> + >> + rc = vmx_add_guest_msr(v, MSR_SPEC_CTRL, 0); >> + if ( rc ) >> + goto err; >> + } >> else >> + { >> vmx_set_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW); >> >> + /* Can only fail with -ESRCH, and we don't care. */ >> + vmx_del_msr(v, MSR_SPEC_CTRL, VMX_MSR_GUEST); >> + } >> + >> /* MSR_PRED_CMD is safe to pass through if the guest knows about it. 
*/ >> if ( cp->feat.ibrsb || cp->extd.ibpb ) >> vmx_clear_msr_intercept(v, MSR_PRED_CMD, VMX_MSR_RW); >> @@ -623,6 +634,15 @@ static void vmx_cpuid_policy_changed(struct vcpu *v) >> vmx_clear_msr_intercept(v, MSR_FLUSH_CMD, VMX_MSR_RW); >> else >> vmx_set_msr_intercept(v, MSR_FLUSH_CMD, VMX_MSR_RW); >> + >> + err: > Nit: I would name this out, since it's used by both the error and the > normal exit paths, but that's just my taste. Ok. > >> + vmx_vmcs_exit(v); >> + >> + if ( rc ) >> + { >> + printk(XENLOG_G_ERR "MSR load/save list error: %d", rc); >> + domain_crash(v->domain); >> + } >> } >> >> int vmx_guest_x86_mode(struct vcpu *v) >> @@ -3128,7 +3148,6 @@ static int is_last_branch_msr(u32 ecx) >> static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content) >> { >> struct vcpu *curr = current; >> - const struct vcpu_msrs *msrs = curr->arch.msrs; >> uint64_t tmp; >> >> HVM_DBG_LOG(DBG_LEVEL_MSR, "ecx=%#x", msr); >> @@ -3136,7 +3155,11 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content) >> switch ( msr ) >> { >> case MSR_SPEC_CTRL: /* guest_rdmsr() has already performed #GP checks. */ >> - *msr_content = msrs->spec_ctrl.raw; >> + if ( vmx_read_guest_msr(curr, msr, msr_content) ) >> + { >> + ASSERT_UNREACHABLE(); >> + goto gp_fault; >> + } > AFAICT this is required just for Xen internal callers, as a guest > attempt to access MSR_SPEC_CTRL will never be trapped, or if trapped it > would be because !cp->feat.ibrsb and thus won't get here anyway. We can end up here through FEP, or introspection caring about the MSR, but most importantly, for moving MSR_SPEC_CTRL on migrate. What the ASSERT_UNREACHABLE() is saying is that "if the common code declares not #GP, then MSR_SPEC_CTRL is *definitely* exists in the MSR list". ~Andrew
On Fri, Jan 14, 2022 at 01:08:43PM +0000, Andrew Cooper wrote: > On 14/01/2022 12:50, Roger Pau Monné wrote: > > On Thu, Jan 13, 2022 at 04:38:33PM +0000, Andrew Cooper wrote: > >> The logic was based on a mistaken understanding of how NMI blocking on vmexit > >> works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits. > >> Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path, > >> and the guest's value will be clobbered before it is saved. > >> > >> Switch to using MSR load/save lists. This causes the guest value to be saved > >> atomically with respect to NMIs/MCEs/etc. > >> > >> First, update vmx_cpuid_policy_changed() to configure the load/save lists at > >> the same time as configuring the intercepts. This function is always used in > >> remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole > >> function, rather than having multiple remote acquisitions of the same VMCS. > >> > >> vmx_add_guest_msr() can fail, but only in ways which are entirely fatal to the > >> guest, so handle failures using domain_crash(). vmx_del_msr() can fail with > >> -ESRCH but we don't matter, and this path will be taken during domain create > >> on a system lacking IBRSB. > >> > >> Second, update vmx_msr_{read,write}_intercept() to use the load/save lists > >> rather than vcpu_msrs, and update the comment to describe the new state > >> location. > >> > >> Finally, adjust the entry/exit asm. Drop DO_SPEC_CTRL_ENTRY_FROM_HVM > >> entirely, and use a shorter code sequence to simply reload Xen's setting from > >> the top-of-stack block. > >> > >> Because the guest values are loaded/saved atomically, we do not need to use > >> the shadowing logic to cope with late NMIs/etc, so we can omit > >> DO_SPEC_CTRL_EXIT_TO_GUEST entirely and VMRESUME/VMLAUNCH with Xen's value in > >> context. Furthermore, we can drop the SPEC_CTRL_ENTRY_FROM_PV too, as there's > >> no need to switch back to Xen's context on an early failure. > >> > >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> > >> --- > >> CC: Jan Beulich <JBeulich@suse.com> > >> CC: Roger Pau Monné <roger.pau@citrix.com> > >> CC: Wei Liu <wl@xen.org> > >> CC: Jun Nakajima <jun.nakajima@intel.com> > >> CC: Kevin Tian <kevin.tian@intel.com> > >> > >> Needs backporting as far as people can tolerate. > >> > >> If the entry/exit logic were in C, I'd ASSERT() that shadow tracking is off, > >> but this is awkard to arrange in asm. 
> >> --- > >> xen/arch/x86/hvm/vmx/entry.S | 19 ++++++++++------- > >> xen/arch/x86/hvm/vmx/vmx.c | 36 +++++++++++++++++++++++++++----- > >> xen/arch/x86/include/asm/msr.h | 10 ++++++++- > >> xen/arch/x86/include/asm/spec_ctrl_asm.h | 32 ++++------------------------ > >> 4 files changed, 56 insertions(+), 41 deletions(-) > >> > >> diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S > >> index 30139ae58e9d..297ed8685126 100644 > >> --- a/xen/arch/x86/hvm/vmx/entry.S > >> +++ b/xen/arch/x86/hvm/vmx/entry.S > >> @@ -35,7 +35,14 @@ ENTRY(vmx_asm_vmexit_handler) > >> > >> /* SPEC_CTRL_ENTRY_FROM_VMX Req: b=curr %rsp=regs/cpuinfo, Clob: acd */ > >> ALTERNATIVE "", DO_OVERWRITE_RSB, X86_FEATURE_SC_RSB_HVM > >> - ALTERNATIVE "", DO_SPEC_CTRL_ENTRY_FROM_HVM, X86_FEATURE_SC_MSR_HVM > >> + > >> + .macro restore_spec_ctrl > >> + mov $MSR_SPEC_CTRL, %ecx > >> + movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax > >> + xor %edx, %edx > >> + wrmsr > >> + .endm > >> + ALTERNATIVE "", restore_spec_ctrl, X86_FEATURE_SC_MSR_HVM > > I assume loading the host value into SPEC_CTRL strictly needs to > > happen after the RSB overwrite, or else you could use a VM exit host > > load MSR entry to handle MSR_SPEC_CTRL. > > We could use the host load list, but Intel warned that the lists aren't > as efficient as writing it like this, hence why I didn't use them > originally when we thought the NMI behaviour was different. Obviously, > we're using them now for correctness reasons, which is more important > than avoiding them for (unquantified) perf reasons. > > I've got some plans for changes to how Xen handles MSR_SPEC_CTRL for its > own protection, which I hope will bring a substantial efficiency > improvements, and those would require us not to use a host load list here. Might be worth mentioning that further work will prevent us from using host load list for SPEC_CTRL has a comment here or in the commit message. Using host load lists would also allow us to get rid of X86_FEATURE_SC_MSR_HVM. Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Thanks, Roger.
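To make the alternative being discussed concrete: a host-load-list approach would have VMEXIT itself restore Xen's MSR_SPEC_CTRL value, instead of the WRMSR added to vmx_asm_vmexit_handler. A sketch is below, assuming Xen's existing vmx_add_host_load_msr() wrapper and the default_xen_spec_ctrl setting; as the thread explains, this option is deliberately not taken for performance and future-work reasons.

    /*
     * Sketch of the rejected alternative only: let the VM-exit host
     * load list restore Xen's choice of MSR_SPEC_CTRL.  The helper name
     * and signature are assumptions based on Xen's vmx_add_msr()
     * wrappers.
     */
    static int example_restore_spec_ctrl_via_host_list(struct vcpu *v)
    {
        return vmx_add_host_load_msr(v, MSR_SPEC_CTRL, default_xen_spec_ctrl);
    }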
On 14/01/2022 13:43, Roger Pau Monné wrote: > On Fri, Jan 14, 2022 at 01:08:43PM +0000, Andrew Cooper wrote: >> On 14/01/2022 12:50, Roger Pau Monné wrote: >>> On Thu, Jan 13, 2022 at 04:38:33PM +0000, Andrew Cooper wrote: >>>> The logic was based on a mistaken understanding of how NMI blocking on vmexit >>>> works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits. >>>> Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path, >>>> and the guest's value will be clobbered before it is saved. >>>> >>>> Switch to using MSR load/save lists. This causes the guest value to be saved >>>> atomically with respect to NMIs/MCEs/etc. >>>> >>>> First, update vmx_cpuid_policy_changed() to configure the load/save lists at >>>> the same time as configuring the intercepts. This function is always used in >>>> remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole >>>> function, rather than having multiple remote acquisitions of the same VMCS. >>>> >>>> vmx_add_guest_msr() can fail, but only in ways which are entirely fatal to the >>>> guest, so handle failures using domain_crash(). vmx_del_msr() can fail with >>>> -ESRCH but we don't matter, and this path will be taken during domain create >>>> on a system lacking IBRSB. >>>> >>>> Second, update vmx_msr_{read,write}_intercept() to use the load/save lists >>>> rather than vcpu_msrs, and update the comment to describe the new state >>>> location. >>>> >>>> Finally, adjust the entry/exit asm. Drop DO_SPEC_CTRL_ENTRY_FROM_HVM >>>> entirely, and use a shorter code sequence to simply reload Xen's setting from >>>> the top-of-stack block. >>>> >>>> Because the guest values are loaded/saved atomically, we do not need to use >>>> the shadowing logic to cope with late NMIs/etc, so we can omit >>>> DO_SPEC_CTRL_EXIT_TO_GUEST entirely and VMRESUME/VMLAUNCH with Xen's value in >>>> context. Furthermore, we can drop the SPEC_CTRL_ENTRY_FROM_PV too, as there's >>>> no need to switch back to Xen's context on an early failure. >>>> >>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> >>>> --- >>>> CC: Jan Beulich <JBeulich@suse.com> >>>> CC: Roger Pau Monné <roger.pau@citrix.com> >>>> CC: Wei Liu <wl@xen.org> >>>> CC: Jun Nakajima <jun.nakajima@intel.com> >>>> CC: Kevin Tian <kevin.tian@intel.com> >>>> >>>> Needs backporting as far as people can tolerate. >>>> >>>> If the entry/exit logic were in C, I'd ASSERT() that shadow tracking is off, >>>> but this is awkard to arrange in asm. 
>>>> --- >>>> xen/arch/x86/hvm/vmx/entry.S | 19 ++++++++++------- >>>> xen/arch/x86/hvm/vmx/vmx.c | 36 +++++++++++++++++++++++++++----- >>>> xen/arch/x86/include/asm/msr.h | 10 ++++++++- >>>> xen/arch/x86/include/asm/spec_ctrl_asm.h | 32 ++++------------------------ >>>> 4 files changed, 56 insertions(+), 41 deletions(-) >>>> >>>> diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S >>>> index 30139ae58e9d..297ed8685126 100644 >>>> --- a/xen/arch/x86/hvm/vmx/entry.S >>>> +++ b/xen/arch/x86/hvm/vmx/entry.S >>>> @@ -35,7 +35,14 @@ ENTRY(vmx_asm_vmexit_handler) >>>> >>>> /* SPEC_CTRL_ENTRY_FROM_VMX Req: b=curr %rsp=regs/cpuinfo, Clob: acd */ >>>> ALTERNATIVE "", DO_OVERWRITE_RSB, X86_FEATURE_SC_RSB_HVM >>>> - ALTERNATIVE "", DO_SPEC_CTRL_ENTRY_FROM_HVM, X86_FEATURE_SC_MSR_HVM >>>> + >>>> + .macro restore_spec_ctrl >>>> + mov $MSR_SPEC_CTRL, %ecx >>>> + movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax >>>> + xor %edx, %edx >>>> + wrmsr >>>> + .endm >>>> + ALTERNATIVE "", restore_spec_ctrl, X86_FEATURE_SC_MSR_HVM >>> I assume loading the host value into SPEC_CTRL strictly needs to >>> happen after the RSB overwrite, or else you could use a VM exit host >>> load MSR entry to handle MSR_SPEC_CTRL. >> We could use the host load list, but Intel warned that the lists aren't >> as efficient as writing it like this, hence why I didn't use them >> originally when we thought the NMI behaviour was different. Obviously, >> we're using them now for correctness reasons, which is more important >> than avoiding them for (unquantified) perf reasons. >> >> I've got some plans for changes to how Xen handles MSR_SPEC_CTRL for its >> own protection, which I hope will bring a substantial efficiency >> improvements, and those would require us not to use a host load list here. > Might be worth mentioning that further work will prevent us from using > host load list for SPEC_CTRL has a comment here or in the commit > message. > > Using host load lists would also allow us to get rid of > X86_FEATURE_SC_MSR_HVM. No it wouldn't. I need to reuse this for the AMD side, and it also encodes "admin turned it off via spec-ctrl=". > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Thanks. ~Andrew
On 13/01/2022 16:38, Andrew Cooper wrote: > The logic was based on a mistaken understanding of how NMI blocking on vmexit > works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits. > Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path, > and the guest's value will be clobbered before it is saved. > > Switch to using MSR load/save lists. This causes the guest value to be saved > atomically with respect to NMIs/MCEs/etc. > > First, update vmx_cpuid_policy_changed() to configure the load/save lists at > the same time as configuring the intercepts. This function is always used in > remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole > function, rather than having multiple remote acquisitions of the same VMCS. > > vmx_add_guest_msr() can fail, but only in ways which are entirely fatal to the > guest, so handle failures using domain_crash(). vmx_del_msr() can fail with > -ESRCH but we don't matter, and this path will be taken during domain create > on a system lacking IBRSB. > > Second, update vmx_msr_{read,write}_intercept() to use the load/save lists > rather than vcpu_msrs, and update the comment to describe the new state > location. > > Finally, adjust the entry/exit asm. Drop DO_SPEC_CTRL_ENTRY_FROM_HVM > entirely, and use a shorter code sequence to simply reload Xen's setting from > the top-of-stack block. > > Because the guest values are loaded/saved atomically, we do not need to use > the shadowing logic to cope with late NMIs/etc, so we can omit > DO_SPEC_CTRL_EXIT_TO_GUEST entirely and VMRESUME/VMLAUNCH with Xen's value in > context. Furthermore, we can drop the SPEC_CTRL_ENTRY_FROM_PV too, as there's > no need to switch back to Xen's context on an early failure. > > Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> > --- > CC: Jan Beulich <JBeulich@suse.com> > CC: Roger Pau Monné <roger.pau@citrix.com> > CC: Wei Liu <wl@xen.org> > CC: Jun Nakajima <jun.nakajima@intel.com> > CC: Kevin Tian <kevin.tian@intel.com> > > Needs backporting as far as people can tolerate. > > If the entry/exit logic were in C, I'd ASSERT() that shadow tracking is off, > but this is awkard to arrange in asm. Actually, it's just occurred to me that an ASSERT is actually quite easy here. I'm proposing this additional delta (totally untested). diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S index 297ed8685126..f569c3259b32 100644 --- a/xen/arch/x86/hvm/vmx/entry.S +++ b/xen/arch/x86/hvm/vmx/entry.S @@ -41,6 +41,13 @@ ENTRY(vmx_asm_vmexit_handler) movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax xor %edx, %edx wrmsr + +#ifdef CONFIG_DEBUG + testb $SCF_use_shadow, CPUINFO_spec_ctrl_flags(%rsp) + jz 1f + ASSERT_FAILED("MSR_SPEC_CTRL shadowing active") +1: +#endif .endm ALTERNATIVE "", restore_spec_ctrl, X86_FEATURE_SC_MSR_HVM /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ ~Andrew
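For comparison with the commit message's remark about C, the check the asm delta above is trying to express would be a one-liner in C, roughly as follows (a sketch assuming the struct cpu_info layout and the SCF_use_shadow flag used elsewhere in the spec_ctrl code):

    /*
     * C equivalent of the proposed asm assertion: on the VMX exit path,
     * MSR_SPEC_CTRL shadow tracking must never be active, because the
     * guest value now lives in the MSR load/save list.
     */
    static void assert_no_spec_ctrl_shadowing(const struct cpu_info *info)
    {
        ASSERT(!(info->spec_ctrl_flags & SCF_use_shadow));
    }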
On 14/01/2022 14:14, Andrew Cooper wrote: > On 13/01/2022 16:38, Andrew Cooper wrote: >> The logic was based on a mistaken understanding of how NMI blocking on vmexit >> works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits. >> Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path, >> and the guest's value will be clobbered before it is saved. >> >> Switch to using MSR load/save lists. This causes the guest value to be saved >> atomically with respect to NMIs/MCEs/etc. >> >> First, update vmx_cpuid_policy_changed() to configure the load/save lists at >> the same time as configuring the intercepts. This function is always used in >> remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole >> function, rather than having multiple remote acquisitions of the same VMCS. >> >> vmx_add_guest_msr() can fail, but only in ways which are entirely fatal to the >> guest, so handle failures using domain_crash(). vmx_del_msr() can fail with >> -ESRCH but we don't matter, and this path will be taken during domain create >> on a system lacking IBRSB. >> >> Second, update vmx_msr_{read,write}_intercept() to use the load/save lists >> rather than vcpu_msrs, and update the comment to describe the new state >> location. >> >> Finally, adjust the entry/exit asm. Drop DO_SPEC_CTRL_ENTRY_FROM_HVM >> entirely, and use a shorter code sequence to simply reload Xen's setting from >> the top-of-stack block. >> >> Because the guest values are loaded/saved atomically, we do not need to use >> the shadowing logic to cope with late NMIs/etc, so we can omit >> DO_SPEC_CTRL_EXIT_TO_GUEST entirely and VMRESUME/VMLAUNCH with Xen's value in >> context. Furthermore, we can drop the SPEC_CTRL_ENTRY_FROM_PV too, as there's >> no need to switch back to Xen's context on an early failure. >> >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> >> --- >> CC: Jan Beulich <JBeulich@suse.com> >> CC: Roger Pau Monné <roger.pau@citrix.com> >> CC: Wei Liu <wl@xen.org> >> CC: Jun Nakajima <jun.nakajima@intel.com> >> CC: Kevin Tian <kevin.tian@intel.com> >> >> Needs backporting as far as people can tolerate. >> >> If the entry/exit logic were in C, I'd ASSERT() that shadow tracking is off, >> but this is awkard to arrange in asm. > Actually, it's just occurred to me that an ASSERT is actually quite easy > here. I'm proposing this additional delta (totally untested). > > diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S > index 297ed8685126..f569c3259b32 100644 > --- a/xen/arch/x86/hvm/vmx/entry.S > +++ b/xen/arch/x86/hvm/vmx/entry.S > @@ -41,6 +41,13 @@ ENTRY(vmx_asm_vmexit_handler) > movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax > xor %edx, %edx > wrmsr > + > +#ifdef CONFIG_DEBUG > + testb $SCF_use_shadow, CPUINFO_spec_ctrl_flags(%rsp) > + jz 1f > + ASSERT_FAILED("MSR_SPEC_CTRL shadowing active") > +1: > +#endif > .endm > ALTERNATIVE "", restore_spec_ctrl, X86_FEATURE_SC_MSR_HVM > /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ Irritatingly this doesn't work, because the metadata associated with the ud2 instruction is tied to the compiled position in .altinstr_replacement, not the runtime position after alternatives have run. ~Andrew
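In generic terms (a simplified, hypothetical model, not Xen's actual bug-frame implementation), the failure mode is that the assertion metadata records the address the UD2 had at build time inside .altinstr_replacement, so once the alternatives machinery copies the bytes to their live location, a lookup keyed on the faulting instruction pointer finds nothing:

    #include <stddef.h>

    /* Simplified, hypothetical model of build-time assertion metadata. */
    struct assert_record {
        const void *addr;   /* build-time address of the UD2 */
        const char *msg;    /* assertion text */
    };

    static const struct assert_record *
    find_assert(const struct assert_record *tbl, size_t n, const void *rip)
    {
        for ( size_t i = 0; i < n; i++ )
            if ( tbl[i].addr == rip )
                return &tbl[i];

        /* A relocated alternative faults at an address never recorded. */
        return NULL;
    }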
On 14.01.2022 15:41, Andrew Cooper wrote: > On 14/01/2022 14:14, Andrew Cooper wrote: >> On 13/01/2022 16:38, Andrew Cooper wrote: >>> The logic was based on a mistaken understanding of how NMI blocking on vmexit >>> works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits. >>> Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path, >>> and the guest's value will be clobbered before it is saved. >>> >>> Switch to using MSR load/save lists. This causes the guest value to be saved >>> atomically with respect to NMIs/MCEs/etc. >>> >>> First, update vmx_cpuid_policy_changed() to configure the load/save lists at >>> the same time as configuring the intercepts. This function is always used in >>> remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole >>> function, rather than having multiple remote acquisitions of the same VMCS. >>> >>> vmx_add_guest_msr() can fail, but only in ways which are entirely fatal to the >>> guest, so handle failures using domain_crash(). vmx_del_msr() can fail with >>> -ESRCH but we don't matter, and this path will be taken during domain create >>> on a system lacking IBRSB. >>> >>> Second, update vmx_msr_{read,write}_intercept() to use the load/save lists >>> rather than vcpu_msrs, and update the comment to describe the new state >>> location. >>> >>> Finally, adjust the entry/exit asm. Drop DO_SPEC_CTRL_ENTRY_FROM_HVM >>> entirely, and use a shorter code sequence to simply reload Xen's setting from >>> the top-of-stack block. >>> >>> Because the guest values are loaded/saved atomically, we do not need to use >>> the shadowing logic to cope with late NMIs/etc, so we can omit >>> DO_SPEC_CTRL_EXIT_TO_GUEST entirely and VMRESUME/VMLAUNCH with Xen's value in >>> context. Furthermore, we can drop the SPEC_CTRL_ENTRY_FROM_PV too, as there's >>> no need to switch back to Xen's context on an early failure. >>> >>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> >>> --- >>> CC: Jan Beulich <JBeulich@suse.com> >>> CC: Roger Pau Monné <roger.pau@citrix.com> >>> CC: Wei Liu <wl@xen.org> >>> CC: Jun Nakajima <jun.nakajima@intel.com> >>> CC: Kevin Tian <kevin.tian@intel.com> >>> >>> Needs backporting as far as people can tolerate. >>> >>> If the entry/exit logic were in C, I'd ASSERT() that shadow tracking is off, >>> but this is awkard to arrange in asm. >> Actually, it's just occurred to me that an ASSERT is actually quite easy >> here. I'm proposing this additional delta (totally untested). >> >> diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S >> index 297ed8685126..f569c3259b32 100644 >> --- a/xen/arch/x86/hvm/vmx/entry.S >> +++ b/xen/arch/x86/hvm/vmx/entry.S >> @@ -41,6 +41,13 @@ ENTRY(vmx_asm_vmexit_handler) >> movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax >> xor %edx, %edx >> wrmsr >> + >> +#ifdef CONFIG_DEBUG >> + testb $SCF_use_shadow, CPUINFO_spec_ctrl_flags(%rsp) >> + jz 1f >> + ASSERT_FAILED("MSR_SPEC_CTRL shadowing active") >> +1: >> +#endif >> .endm >> ALTERNATIVE "", restore_spec_ctrl, X86_FEATURE_SC_MSR_HVM >> /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ > > Irritatingly this doesn't work, because the metadata associated with the > ud2 instruction is tied to the compiled position in > .altinstr_replacement, not the runtime position after alternatives have run. Could we have the macro "return" ZF, leaving it to the invocation site of the macro to act on it? 
ALTERNATIVE's first argument could easily be "xor %reg, %reg" to set ZF without much other overhead. Jan
On 17/01/2022 09:21, Jan Beulich wrote: > On 14.01.2022 15:41, Andrew Cooper wrote: >> On 14/01/2022 14:14, Andrew Cooper wrote: >>> On 13/01/2022 16:38, Andrew Cooper wrote: >>>> The logic was based on a mistaken understanding of how NMI blocking on vmexit >>>> works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits. >>>> Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path, >>>> and the guest's value will be clobbered before it is saved. >>>> >>>> Switch to using MSR load/save lists. This causes the guest value to be saved >>>> atomically with respect to NMIs/MCEs/etc. >>>> >>>> First, update vmx_cpuid_policy_changed() to configure the load/save lists at >>>> the same time as configuring the intercepts. This function is always used in >>>> remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole >>>> function, rather than having multiple remote acquisitions of the same VMCS. >>>> >>>> vmx_add_guest_msr() can fail, but only in ways which are entirely fatal to the >>>> guest, so handle failures using domain_crash(). vmx_del_msr() can fail with >>>> -ESRCH but we don't matter, and this path will be taken during domain create >>>> on a system lacking IBRSB. >>>> >>>> Second, update vmx_msr_{read,write}_intercept() to use the load/save lists >>>> rather than vcpu_msrs, and update the comment to describe the new state >>>> location. >>>> >>>> Finally, adjust the entry/exit asm. Drop DO_SPEC_CTRL_ENTRY_FROM_HVM >>>> entirely, and use a shorter code sequence to simply reload Xen's setting from >>>> the top-of-stack block. >>>> >>>> Because the guest values are loaded/saved atomically, we do not need to use >>>> the shadowing logic to cope with late NMIs/etc, so we can omit >>>> DO_SPEC_CTRL_EXIT_TO_GUEST entirely and VMRESUME/VMLAUNCH with Xen's value in >>>> context. Furthermore, we can drop the SPEC_CTRL_ENTRY_FROM_PV too, as there's >>>> no need to switch back to Xen's context on an early failure. >>>> >>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> >>>> --- >>>> CC: Jan Beulich <JBeulich@suse.com> >>>> CC: Roger Pau Monné <roger.pau@citrix.com> >>>> CC: Wei Liu <wl@xen.org> >>>> CC: Jun Nakajima <jun.nakajima@intel.com> >>>> CC: Kevin Tian <kevin.tian@intel.com> >>>> >>>> Needs backporting as far as people can tolerate. >>>> >>>> If the entry/exit logic were in C, I'd ASSERT() that shadow tracking is off, >>>> but this is awkard to arrange in asm. >>> Actually, it's just occurred to me that an ASSERT is actually quite easy >>> here. I'm proposing this additional delta (totally untested). >>> >>> diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S >>> index 297ed8685126..f569c3259b32 100644 >>> --- a/xen/arch/x86/hvm/vmx/entry.S >>> +++ b/xen/arch/x86/hvm/vmx/entry.S >>> @@ -41,6 +41,13 @@ ENTRY(vmx_asm_vmexit_handler) >>> movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax >>> xor %edx, %edx >>> wrmsr >>> + >>> +#ifdef CONFIG_DEBUG >>> + testb $SCF_use_shadow, CPUINFO_spec_ctrl_flags(%rsp) >>> + jz 1f >>> + ASSERT_FAILED("MSR_SPEC_CTRL shadowing active") >>> +1: >>> +#endif >>> .endm >>> ALTERNATIVE "", restore_spec_ctrl, X86_FEATURE_SC_MSR_HVM >>> /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ >> Irritatingly this doesn't work, because the metadata associated with the >> ud2 instruction is tied to the compiled position in >> .altinstr_replacement, not the runtime position after alternatives have run. 
> Could we have the macro "return" ZF, leaving it to the invocation
> site of the macro to act on it? ALTERNATIVE's first argument could
> easily be "xor %reg, %reg" to set ZF without much other overhead.

That's very subtle, and a substantial layering violation. I really
don't think the complexity is worth it.

~Andrew
On 13.01.2022 17:38, Andrew Cooper wrote: > Second, update vmx_msr_{read,write}_intercept() to use the load/save lists > rather than vcpu_msrs, and update the comment to describe the new state > location. Nit: Assuming "the comment" refers to something in the named function, I'm afraid I can't spot any comment being updated there. Perhaps you mean the comment in the declaration of struct vcpu_msrs? > --- a/xen/arch/x86/hvm/vmx/entry.S > +++ b/xen/arch/x86/hvm/vmx/entry.S > @@ -35,7 +35,14 @@ ENTRY(vmx_asm_vmexit_handler) > > /* SPEC_CTRL_ENTRY_FROM_VMX Req: b=curr %rsp=regs/cpuinfo, Clob: acd */ > ALTERNATIVE "", DO_OVERWRITE_RSB, X86_FEATURE_SC_RSB_HVM > - ALTERNATIVE "", DO_SPEC_CTRL_ENTRY_FROM_HVM, X86_FEATURE_SC_MSR_HVM > + > + .macro restore_spec_ctrl > + mov $MSR_SPEC_CTRL, %ecx > + movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax > + xor %edx, %edx > + wrmsr > + .endm > + ALTERNATIVE "", restore_spec_ctrl, X86_FEATURE_SC_MSR_HVM > /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ > > /* Hardware clears MSR_DEBUGCTL on VMExit. Reinstate it if debugging Xen. */ > @@ -83,7 +90,6 @@ UNLIKELY_END(realmode) > > /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */ > /* SPEC_CTRL_EXIT_TO_VMX Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */ > - ALTERNATIVE "", DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_HVM > ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), X86_FEATURE_SC_VERW_HVM Shouldn't you update the "Clob:" remarks here as well? > @@ -119,12 +125,11 @@ UNLIKELY_END(realmode) > SAVE_ALL > > /* > - * PV variant needed here as no guest code has executed (so > - * MSR_SPEC_CTRL can't have changed value), and NMIs/MCEs are liable > - * to hit (in which case the HVM variant might corrupt things). > + * SPEC_CTRL_ENTRY notes > + * > + * If we end up here, no guest code has executed. We still have Xen's > + * choice of MSR_SPEC_CTRL in context, and the RSB is safe. > */ > - SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo Clob: acd */ > - /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ Is "no guest code has executed" actually enough here? If VM entry failed due to a later MSR-load-list entry, SPEC_CTRL could have changed value already, couldn't it? > @@ -601,17 +602,27 @@ static void vmx_cpuid_policy_changed(struct vcpu *v) > > vmx_vmcs_enter(v); > vmx_update_exception_bitmap(v); > - vmx_vmcs_exit(v); > > /* > * We can safely pass MSR_SPEC_CTRL through to the guest, even if STIBP > * isn't enumerated in hardware, as SPEC_CTRL_STIBP is ignored. > */ > if ( cp->feat.ibrsb ) > + { > vmx_clear_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW); > + > + rc = vmx_add_guest_msr(v, MSR_SPEC_CTRL, 0); > + if ( rc ) > + goto err; > + } > else > + { > vmx_set_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW); > > + /* Can only fail with -ESRCH, and we don't care. */ > + vmx_del_msr(v, MSR_SPEC_CTRL, VMX_MSR_GUEST); To be forward-compatible with possible future changes to the function (or load list handling as a whole), how about having an assertion proving what the comment says: rc = vmx_del_msr(v, MSR_SPEC_CTRL, VMX_MSR_GUEST); if ( rc ) { ASSERT(rc == -ESRCH); rc = 0; } ? Jan
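Folding both review suggestions into the function gives roughly the following shape (the label renamed to "out" per Roger, and the vmx_del_msr() result asserted rather than silently ignored per Jan). This is a sketch of the fragment of vmx_cpuid_policy_changed() under discussion, not the final committed form:

    if ( cp->feat.ibrsb )
    {
        vmx_clear_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW);

        rc = vmx_add_guest_msr(v, MSR_SPEC_CTRL, 0);
        if ( rc )
            goto out;
    }
    else
    {
        vmx_set_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW);

        /* Can only fail with -ESRCH, and we don't care. */
        rc = vmx_del_msr(v, MSR_SPEC_CTRL, VMX_MSR_GUEST);
        if ( rc )
        {
            ASSERT(rc == -ESRCH);
            rc = 0;
        }
    }

    /* ... remaining intercept updates ... */

 out:
    vmx_vmcs_exit(v);

    if ( rc )
    {
        printk(XENLOG_G_ERR "MSR load/save list error: %d\n", rc);
        domain_crash(v->domain);
    }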
On 17/01/2022 11:51, Jan Beulich wrote: > On 13.01.2022 17:38, Andrew Cooper wrote: >> @@ -119,12 +125,11 @@ UNLIKELY_END(realmode) >> SAVE_ALL >> >> /* >> - * PV variant needed here as no guest code has executed (so >> - * MSR_SPEC_CTRL can't have changed value), and NMIs/MCEs are liable >> - * to hit (in which case the HVM variant might corrupt things). >> + * SPEC_CTRL_ENTRY notes >> + * >> + * If we end up here, no guest code has executed. We still have Xen's >> + * choice of MSR_SPEC_CTRL in context, and the RSB is safe. >> */ >> - SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo Clob: acd */ >> - /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ > Is "no guest code has executed" actually enough here? If VM entry failed > due to a later MSR-load-list entry, SPEC_CTRL could have changed value > already, couldn't it? No, it can't. See the very start of SDM Chapter 25 "VMEntries" for the distinction between early and late entry failures. (i.e. those which cause execution to continue beyond VMLAUNCH/VMRESUME, and those which cause execution to continue from the vmexit handler.) Early failures are conditions such as `pop %ss; vmlaunch`, and bad host state in the VMCS. Everything pertaining to guest state is late failure, so by the time we get to processing the MSR load/save list, we're definitely not taking this path. ~Andrew
On 17.01.2022 18:23, Andrew Cooper wrote: > On 17/01/2022 11:51, Jan Beulich wrote: >> On 13.01.2022 17:38, Andrew Cooper wrote: >>> @@ -119,12 +125,11 @@ UNLIKELY_END(realmode) >>> SAVE_ALL >>> >>> /* >>> - * PV variant needed here as no guest code has executed (so >>> - * MSR_SPEC_CTRL can't have changed value), and NMIs/MCEs are liable >>> - * to hit (in which case the HVM variant might corrupt things). >>> + * SPEC_CTRL_ENTRY notes >>> + * >>> + * If we end up here, no guest code has executed. We still have Xen's >>> + * choice of MSR_SPEC_CTRL in context, and the RSB is safe. >>> */ >>> - SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo Clob: acd */ >>> - /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ >> Is "no guest code has executed" actually enough here? If VM entry failed >> due to a later MSR-load-list entry, SPEC_CTRL could have changed value >> already, couldn't it? > > No, it can't. > > See the very start of SDM Chapter 25 "VMEntries" for the distinction > between early and late entry failures. (i.e. those which cause > execution to continue beyond VMLAUNCH/VMRESUME, and those which cause > execution to continue from the vmexit handler.) > > Early failures are conditions such as `pop %ss; vmlaunch`, and bad host > state in the VMCS. > > Everything pertaining to guest state is late failure, so by the time we > get to processing the MSR load/save list, we're definitely not taking > this path. I see. This still means that the answer to my 1st question is "yes". In which case I'd like to ask to extend the comment to include "no MSRs have been loaded from the load list" or something equivalent, despite realizing that such an amendment would have helped already before your change. Jan
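One possible wording for the amended comment Jan asks for (illustrative only; the final text is the author's call):

    /*
     * SPEC_CTRL_ENTRY notes
     *
     * If we end up here, the VM entry failed "early" in SDM terms: no
     * guest code has executed and nothing has been loaded from the MSR
     * load list (or any other guest state).  We still have Xen's choice
     * of MSR_SPEC_CTRL in context, and the RSB is safe.
     */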
diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S index 30139ae58e9d..297ed8685126 100644 --- a/xen/arch/x86/hvm/vmx/entry.S +++ b/xen/arch/x86/hvm/vmx/entry.S @@ -35,7 +35,14 @@ ENTRY(vmx_asm_vmexit_handler) /* SPEC_CTRL_ENTRY_FROM_VMX Req: b=curr %rsp=regs/cpuinfo, Clob: acd */ ALTERNATIVE "", DO_OVERWRITE_RSB, X86_FEATURE_SC_RSB_HVM - ALTERNATIVE "", DO_SPEC_CTRL_ENTRY_FROM_HVM, X86_FEATURE_SC_MSR_HVM + + .macro restore_spec_ctrl + mov $MSR_SPEC_CTRL, %ecx + movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax + xor %edx, %edx + wrmsr + .endm + ALTERNATIVE "", restore_spec_ctrl, X86_FEATURE_SC_MSR_HVM /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ /* Hardware clears MSR_DEBUGCTL on VMExit. Reinstate it if debugging Xen. */ @@ -83,7 +90,6 @@ UNLIKELY_END(realmode) /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */ /* SPEC_CTRL_EXIT_TO_VMX Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */ - ALTERNATIVE "", DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_SC_MSR_HVM ALTERNATIVE "", __stringify(verw CPUINFO_verw_sel(%rsp)), X86_FEATURE_SC_VERW_HVM mov VCPU_hvm_guest_cr2(%rbx),%rax @@ -119,12 +125,11 @@ UNLIKELY_END(realmode) SAVE_ALL /* - * PV variant needed here as no guest code has executed (so - * MSR_SPEC_CTRL can't have changed value), and NMIs/MCEs are liable - * to hit (in which case the HVM variant might corrupt things). + * SPEC_CTRL_ENTRY notes + * + * If we end up here, no guest code has executed. We still have Xen's + * choice of MSR_SPEC_CTRL in context, and the RSB is safe. */ - SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo Clob: acd */ - /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */ call vmx_vmentry_failure jmp .Lvmx_process_softirqs diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index 28ee6393f11e..d7feb5f9c455 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -592,6 +592,7 @@ void vmx_update_exception_bitmap(struct vcpu *v) static void vmx_cpuid_policy_changed(struct vcpu *v) { const struct cpuid_policy *cp = v->domain->arch.cpuid; + int rc = 0; if ( opt_hvm_fep || (v->domain->arch.cpuid->x86_vendor != boot_cpu_data.x86_vendor) ) @@ -601,17 +602,27 @@ static void vmx_cpuid_policy_changed(struct vcpu *v) vmx_vmcs_enter(v); vmx_update_exception_bitmap(v); - vmx_vmcs_exit(v); /* * We can safely pass MSR_SPEC_CTRL through to the guest, even if STIBP * isn't enumerated in hardware, as SPEC_CTRL_STIBP is ignored. */ if ( cp->feat.ibrsb ) + { vmx_clear_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW); + + rc = vmx_add_guest_msr(v, MSR_SPEC_CTRL, 0); + if ( rc ) + goto err; + } else + { vmx_set_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW); + /* Can only fail with -ESRCH, and we don't care. */ + vmx_del_msr(v, MSR_SPEC_CTRL, VMX_MSR_GUEST); + } + /* MSR_PRED_CMD is safe to pass through if the guest knows about it. 
*/ if ( cp->feat.ibrsb || cp->extd.ibpb ) vmx_clear_msr_intercept(v, MSR_PRED_CMD, VMX_MSR_RW); @@ -623,6 +634,15 @@ static void vmx_cpuid_policy_changed(struct vcpu *v) vmx_clear_msr_intercept(v, MSR_FLUSH_CMD, VMX_MSR_RW); else vmx_set_msr_intercept(v, MSR_FLUSH_CMD, VMX_MSR_RW); + + err: + vmx_vmcs_exit(v); + + if ( rc ) + { + printk(XENLOG_G_ERR "MSR load/save list error: %d", rc); + domain_crash(v->domain); + } } int vmx_guest_x86_mode(struct vcpu *v) @@ -3128,7 +3148,6 @@ static int is_last_branch_msr(u32 ecx) static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content) { struct vcpu *curr = current; - const struct vcpu_msrs *msrs = curr->arch.msrs; uint64_t tmp; HVM_DBG_LOG(DBG_LEVEL_MSR, "ecx=%#x", msr); @@ -3136,7 +3155,11 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content) switch ( msr ) { case MSR_SPEC_CTRL: /* guest_rdmsr() has already performed #GP checks. */ - *msr_content = msrs->spec_ctrl.raw; + if ( vmx_read_guest_msr(curr, msr, msr_content) ) + { + ASSERT_UNREACHABLE(); + goto gp_fault; + } break; case MSR_IA32_SYSENTER_CS: @@ -3336,7 +3359,6 @@ void vmx_vlapic_msr_changed(struct vcpu *v) static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content) { struct vcpu *v = current; - struct vcpu_msrs *msrs = v->arch.msrs; const struct cpuid_policy *cp = v->domain->arch.cpuid; HVM_DBG_LOG(DBG_LEVEL_MSR, "ecx=%#x, msr_value=%#"PRIx64, msr, msr_content); @@ -3346,7 +3368,11 @@ static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content) uint64_t rsvd, tmp; case MSR_SPEC_CTRL: /* guest_wrmsr() has already performed #GP checks. */ - msrs->spec_ctrl.raw = msr_content; + if ( vmx_write_guest_msr(v, msr, msr_content) ) + { + ASSERT_UNREACHABLE(); + goto gp_fault; + } return X86EMUL_OKAY; case MSR_IA32_SYSENTER_CS: diff --git a/xen/arch/x86/include/asm/msr.h b/xen/arch/x86/include/asm/msr.h index 0b2176a9bc53..5f851958992b 100644 --- a/xen/arch/x86/include/asm/msr.h +++ b/xen/arch/x86/include/asm/msr.h @@ -287,7 +287,15 @@ extern struct msr_policy raw_msr_policy, /* Container object for per-vCPU MSRs */ struct vcpu_msrs { - /* 0x00000048 - MSR_SPEC_CTRL */ + /* + * 0x00000048 - MSR_SPEC_CTRL + * + * For PV guests, this holds the guest kernel value. It is accessed on + * every entry/exit path. + * + * For VT-x guests, the guest value is held in the MSR guest load/save + * list. + */ struct { uint32_t raw; } spec_ctrl; diff --git a/xen/arch/x86/include/asm/spec_ctrl_asm.h b/xen/arch/x86/include/asm/spec_ctrl_asm.h index 18ecfcd70375..ddce8b33fd17 100644 --- a/xen/arch/x86/include/asm/spec_ctrl_asm.h +++ b/xen/arch/x86/include/asm/spec_ctrl_asm.h @@ -42,9 +42,10 @@ * path, or late in the exit path after restoring the guest value. This * will corrupt the guest value. * - * Factor 1 is dealt with by relying on NMIs/MCEs being blocked immediately - * after VMEXIT. The VMEXIT-specific code reads MSR_SPEC_CTRL and updates - * current before loading Xen's MSR_SPEC_CTRL setting. + * Factor 1 is dealt with: + * - On VMX by using MSR load/save lists to have vmentry/exit atomically + * load/save the guest value. Xen's value is loaded in regular code, and + * there is no need to use the shadow logic (below). * * Factor 2 is harder. We maintain a shadow_spec_ctrl value, and a use_shadow * boolean in the per cpu spec_ctrl_flags. 
The synchronous use is: @@ -126,31 +127,6 @@ #endif .endm -.macro DO_SPEC_CTRL_ENTRY_FROM_HVM -/* - * Requires %rbx=current, %rsp=regs/cpuinfo - * Clobbers %rax, %rcx, %rdx - * - * The common case is that a guest has direct access to MSR_SPEC_CTRL, at - * which point we need to save the guest value before setting IBRS for Xen. - * Unilaterally saving the guest value is shorter and faster than checking. - */ - mov $MSR_SPEC_CTRL, %ecx - rdmsr - - /* Stash the value from hardware. */ - mov VCPU_arch_msrs(%rbx), %rdx - mov %eax, VCPUMSR_spec_ctrl_raw(%rdx) - xor %edx, %edx - - /* Clear SPEC_CTRL shadowing *before* loading Xen's value. */ - andb $~SCF_use_shadow, CPUINFO_spec_ctrl_flags(%rsp) - - /* Load Xen's intended value. */ - movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax - wrmsr -.endm - .macro DO_SPEC_CTRL_ENTRY maybexen:req /* * Requires %rsp=regs (also cpuinfo if !maybexen)
The logic was based on a mistaken understanding of how NMI blocking on vmexit
works.  NMIs are only blocked for EXIT_REASON_NMI, and not for general exits.
Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path,
and the guest's value will be clobbered before it is saved.

Switch to using MSR load/save lists.  This causes the guest value to be saved
atomically with respect to NMIs/MCEs/etc.

First, update vmx_cpuid_policy_changed() to configure the load/save lists at
the same time as configuring the intercepts.  This function is always used in
remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole
function, rather than having multiple remote acquisitions of the same VMCS.

vmx_add_guest_msr() can fail, but only in ways which are entirely fatal to the
guest, so handle failures using domain_crash().  vmx_del_msr() can fail with
-ESRCH but we don't care, and this path will be taken during domain create on
a system lacking IBRSB.

Second, update vmx_msr_{read,write}_intercept() to use the load/save lists
rather than vcpu_msrs, and update the comment to describe the new state
location.

Finally, adjust the entry/exit asm.  Drop DO_SPEC_CTRL_ENTRY_FROM_HVM
entirely, and use a shorter code sequence to simply reload Xen's setting from
the top-of-stack block.

Because the guest values are loaded/saved atomically, we do not need to use
the shadowing logic to cope with late NMIs/etc, so we can omit
DO_SPEC_CTRL_EXIT_TO_GUEST entirely and VMRESUME/VMLAUNCH with Xen's value in
context.  Furthermore, we can drop the SPEC_CTRL_ENTRY_FROM_PV too, as there's
no need to switch back to Xen's context on an early failure.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Jun Nakajima <jun.nakajima@intel.com>
CC: Kevin Tian <kevin.tian@intel.com>

Needs backporting as far as people can tolerate.

If the entry/exit logic were in C, I'd ASSERT() that shadow tracking is off,
but this is awkward to arrange in asm.
---
 xen/arch/x86/hvm/vmx/entry.S             | 19 ++++++++++-------
 xen/arch/x86/hvm/vmx/vmx.c               | 36 +++++++++++++++++++++++++++-----
 xen/arch/x86/include/asm/msr.h           | 10 ++++++++-
 xen/arch/x86/include/asm/spec_ctrl_asm.h | 32 ++++------------------------
 4 files changed, 56 insertions(+), 41 deletions(-)
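As a usage note on the "new state location" mentioned above: after this change, an HVM (VT-x) vCPU's guest MSR_SPEC_CTRL value lives in the MSR load/save list rather than in v->arch.msrs->spec_ctrl.raw, so callers fetch it along the following lines (a sketch, assuming the vmx_read_guest_msr() helper used by the patch):

    /*
     * Sketch: fetch a VT-x vCPU's guest MSR_SPEC_CTRL from the MSR
     * load/save list.  The lookup can only fail if MSR_SPEC_CTRL was
     * never added to the list, i.e. the policy lacks IBRSB.
     */
    static uint64_t example_guest_spec_ctrl(const struct vcpu *v)
    {
        uint64_t val = 0;

        if ( vmx_read_guest_msr(v, MSR_SPEC_CTRL, &val) )
            ASSERT_UNREACHABLE();

        return val;
    }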