Message ID | 20230511040857.6094-14-weijiang.yang@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Enable CET Virtualization |
On 5/11/2023 12:08 PM, Yang Weijiang wrote: > Add support for emulating read and write accesses to CET MSRs. > CET MSRs are universally "special" as they are either context switched > via dedicated VMCS fields or via XSAVES, i.e. no additional in-memory > tracking is needed, but emulated reads/writes are more expensive. > > Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> > --- > arch/x86/kernel/fpu/core.c | 1 + > arch/x86/kvm/vmx/vmx.c | 18 ++++++++++++++++++ > arch/x86/kvm/x86.c | 20 ++++++++++++++++++++ > arch/x86/kvm/x86.h | 31 +++++++++++++++++++++++++++++++ > 4 files changed, 70 insertions(+) > ... > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index b6eec9143129..2e3a39c9297c 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -13630,6 +13630,26 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size, > } > EXPORT_SYMBOL_GPL(kvm_sev_es_string_io); > > +bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr) > +{ > + if (!kvm_cet_user_supported()) > + return false; > + > + if (msr->host_initiated) > + return true; > + > + if (!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && > + !guest_cpuid_has(vcpu, X86_FEATURE_IBT)) > + return false; > + > + if (msr->index == MSR_IA32_PL3_SSP && > + !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK)) > + return false; It may be better to merge the two if statements into one to avoid calling guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) twice. e.g, if (!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && (!guest_cpuid_has(vcpu, X86_FEATURE_IBT) || msr->index == MSR_IA32_PL3_SSP)) return false; > + > + return true; > +} > +EXPORT_SYMBOL_GPL(kvm_cet_is_msr_accessible); > + > EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry); > EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit); > EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio); > ...
On 5/23/2023 4:21 PM, Binbin Wu wrote: > > > On 5/11/2023 12:08 PM, Yang Weijiang wrote: >> Add support for emulating read and write accesses to CET MSRs. >> CET MSRs are universally "special" as they are either context switched >> via dedicated VMCS fields or via XSAVES, i.e. no additional in-memory >> tracking is needed, but emulated reads/writes are more expensive. [...] >> + >> + if (!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && >> + !guest_cpuid_has(vcpu, X86_FEATURE_IBT)) >> + return false; >> + >> + if (msr->index == MSR_IA32_PL3_SSP && >> + !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK)) >> + return false; > It may be better to merge the two if statements into one to avoid > calling guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) twice. > Yeah, it sounds good to me, thanks! > e.g, > > if (!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && > (!guest_cpuid_has(vcpu, X86_FEATURE_IBT) || msr->index == > MSR_IA32_PL3_SSP)) > return false; > > >> + >> + return true; >> +} >> +EXPORT_SYMBOL_GPL(kvm_cet_is_msr_accessible); >> + >> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry); >> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit); >> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio); >> > ...
On Thu, May 11, 2023, Yang Weijiang wrote: > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index c872a5aafa50..0ccaa467d7d3 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -2093,6 +2093,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > else > msr_info->data = vmx->pt_desc.guest.addr_a[index / 2]; > break; > + case MSR_IA32_U_CET: > + case MSR_IA32_PL3_SSP: > + if (!kvm_cet_is_msr_accessible(vcpu, msr_info)) > + return 1; > + kvm_get_xsave_msr(msr_info); > + break; Please put as much MSR handling in x86.c as possible. We quite obviously know that AMD support is coming along, there's no reason to duplicate all of this code. And unless I'm missing something, John's series misses several #GP checks, e.g. for MSR_IA32_S_CET reserved bits, which means that providing a common implementation would actually fix bugs. For MSRs that require vendor input and/or handling, please follow what was recently done for MSR_IA32_CR_PAT, where the common bits are handled in common code, and vendor code does its updates. The divergent alignment between AMD and Intel could get annoying, but I'm sure we can figure out a solution. > case MSR_IA32_DEBUGCTLMSR: > msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL); > break; > @@ -2405,6 +2411,18 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > else > vmx->pt_desc.guest.addr_a[index / 2] = data; > break; > + case MSR_IA32_U_CET: > + case MSR_IA32_PL3_SSP: > + if (!kvm_cet_is_msr_accessible(vcpu, msr_info)) > + return 1; > + if (is_noncanonical_address(data, vcpu)) > + return 1; > + if (msr_index == MSR_IA32_U_CET && (data & GENMASK(9, 6))) > + return 1; > + if (msr_index == MSR_IA32_PL3_SSP && (data & GENMASK(2, 0))) Please #define reserved bits, ideally using the inverse of the valid masks. And for SSP, it might be better to do IS_ALIGNED(data, 8) (or 4, pending my question about the SDM's wording). Side topic, what on earth does the SDM mean by this?!? The linear address written must be aligned to 8 bytes and bits 2:0 must be 0 (hardware requires bits 1:0 to be 0). I know Intel retroactively changed the alignment requirements, but the above is nonsensical. If ucode prevents writing bits 2:0, who cares what hardware requires? > + return 1; > + kvm_set_xsave_msr(msr_info); > + break; > case MSR_IA32_PERF_CAPABILITIES: > if (data && !vcpu_to_pmu(vcpu)->version) > return 1; > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index b6eec9143129..2e3a39c9297c 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -13630,6 +13630,26 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size, > } > EXPORT_SYMBOL_GPL(kvm_sev_es_string_io); > > +bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr) > +{ > + if (!kvm_cet_user_supported()) This feels wrong. KVM should differentiate between SHSTK and IBT in the host. E.g. if running in a VM with SHSTK but not IBT, or vice versa, KVM should allow writes to non-existent MSRs. I.e. this looks wrong: /* * If SHSTK and IBT are available in KVM, clear CET user bit in * kvm_caps.supported_xss so that kvm_cet_user_supported() returns * false when called. */ if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) && !kvm_cpu_cap_has(X86_FEATURE_IBT)) kvm_caps.supported_xss &= ~XFEATURE_MASK_CET_USER; and by extension, all dependent code is also wrong. IIRC, there's a virtualization hole, but I don't see any reason why KVM has to make the hole even bigger. 
> + return false; > + > + if (msr->host_initiated) > + return true; > + > + if (!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && > + !guest_cpuid_has(vcpu, X86_FEATURE_IBT)) > + return false; > + > + if (msr->index == MSR_IA32_PL3_SSP && > + !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK)) I probably asked this long ago, but if I did I since forgot. Is it really just PL3_SSP that depends on SHSTK? I would expect all shadow stack MSRs to depend on SHSTK. > @@ -546,5 +557,25 @@ int kvm_sev_es_mmio_read(struct kvm_vcpu *vcpu, gpa_t src, unsigned int bytes, > int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size, > unsigned int port, void *data, unsigned int count, > int in); > +bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr); > + > +/* > + * We've already loaded guest MSRs in __msr_io() after check the MSR index. Please avoid pronouns > + * In case vcpu has been preempted, we need to disable preemption, check vCPU. And this doesn't make any sense. The "vCPU" being preempted doesn't matter, it's KVM, i.e. the task that's accessing vCPU state that cares about preemption. I *think* what you're trying to say is that preemption needs to be disabled to ensure that the guest values are resident. > + * and reload the guest fpu states before read/write xsaves-managed MSRs. > + */ > +static inline void kvm_get_xsave_msr(struct msr_data *msr_info) > +{ > + fpregs_lock_and_load(); KVM already has helpers that do exactly this, and they have far better names for KVM: kvm_fpu_get() and kvm_fpu_put(). Can you convert kvm_fpu_get() to fpregs_lock_and_load() and use those isntead? And if the extra consistency checks in fpregs_lock_and_load() fire, we definitely want to know, as it means we probably have bugs in KVM.
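As a concrete illustration of the "#define reserved bits, ideally using the inverse of the valid masks" suggestion, the MSR_IA32_U_CET check could look roughly like the snippet below. The macro names are illustrative and not taken from the series; the layout assumed is bits 5:0 for the control bits, bits 11:10 for SUPPRESS/TRACKER, bits 63:12 for the legacy bitmap base, leaving bits 9:6 reserved.

#define CET_US_CTRL_BITS	(GENMASK_ULL(5, 0) | GENMASK_ULL(11, 10))
#define CET_US_BITMAP_BASE	GENMASK_ULL(63, 12)
/* Everything that is neither a control bit nor the bitmap base, i.e. bits 9:6. */
#define CET_US_RESERVED_BITS	(~(CET_US_CTRL_BITS | CET_US_BITMAP_BASE))

	case MSR_IA32_U_CET:
		if (!kvm_cet_is_msr_accessible(vcpu, msr_info))
			return 1;
		if (is_noncanonical_address(data, vcpu))
			return 1;
		if (data & CET_US_RESERVED_BITS)
			return 1;
		kvm_set_xsave_msr(msr_info);
		break;

Building the reserved mask as the inverse of the valid bits keeps the #GP check in sync if additional control bits are ever defined.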
On 6/24/2023 7:53 AM, Sean Christopherson wrote: > On Thu, May 11, 2023, Yang Weijiang wrote: >> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c >> index c872a5aafa50..0ccaa467d7d3 100644 >> --- a/arch/x86/kvm/vmx/vmx.c >> +++ b/arch/x86/kvm/vmx/vmx.c >> @@ -2093,6 +2093,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) >> else >> msr_info->data = vmx->pt_desc.guest.addr_a[index / 2]; >> break; >> + case MSR_IA32_U_CET: >> + case MSR_IA32_PL3_SSP: >> + if (!kvm_cet_is_msr_accessible(vcpu, msr_info)) >> + return 1; >> + kvm_get_xsave_msr(msr_info); >> + break; > Please put as much MSR handling in x86.c as possible. We quite obviously know > that AMD support is coming along, there's no reason to duplicate all of this code. > And unless I'm missing something, John's series misses several #GP checks, e.g. > for MSR_IA32_S_CET reserved bits, which means that providing a common implementation > would actually fix bugs. OK, will move the common part to x86.c > > For MSRs that require vendor input and/or handling, please follow what was > recently done for MSR_IA32_CR_PAT, where the common bits are handled in common > code, and vendor code does its updates. > > The divergent alignment between AMD and Intel could get annoying, but I'm sure > we can figure out a solution. Got it, will refer to the PAT handling. > >> case MSR_IA32_DEBUGCTLMSR: >> msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL); >> break; >> @@ -2405,6 +2411,18 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) >> else >> vmx->pt_desc.guest.addr_a[index / 2] = data; >> break; >> + case MSR_IA32_U_CET: >> + case MSR_IA32_PL3_SSP: >> + if (!kvm_cet_is_msr_accessible(vcpu, msr_info)) >> + return 1; >> + if (is_noncanonical_address(data, vcpu)) >> + return 1; >> + if (msr_index == MSR_IA32_U_CET && (data & GENMASK(9, 6))) >> + return 1; >> + if (msr_index == MSR_IA32_PL3_SSP && (data & GENMASK(2, 0))) > Please #define reserved bits, ideally using the inverse of the valid masks. And > for SSP, it might be better to do IS_ALIGNED(data, 8) (or 4, pending my question > about the SDM's wording). OK. > > Side topic, what on earth does the SDM mean by this?!? > > The linear address written must be aligned to 8 bytes and bits 2:0 must be 0 > (hardware requires bits 1:0 to be 0). > > I know Intel retroactively changed the alignment requirements, but the above > is nonsensical. If ucode prevents writing bits 2:0, who cares what hardware > requires? So do I ;-/ > >> + return 1; >> + kvm_set_xsave_msr(msr_info); >> + break; >> case MSR_IA32_PERF_CAPABILITIES: >> if (data && !vcpu_to_pmu(vcpu)->version) >> return 1; >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >> index b6eec9143129..2e3a39c9297c 100644 >> --- a/arch/x86/kvm/x86.c >> +++ b/arch/x86/kvm/x86.c >> @@ -13630,6 +13630,26 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size, >> } >> EXPORT_SYMBOL_GPL(kvm_sev_es_string_io); >> >> +bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr) >> +{ >> + if (!kvm_cet_user_supported()) > This feels wrong. KVM should differentiate between SHSTK and IBT in the host. > E.g. if running in a VM with SHSTK but not IBT, or vice versa, KVM should allow > writes to non-existent MSRs. I don't follow you, in this case, which part KVM is on behalf of? guest or user space? > I.e. 
this looks wrong: > > /* > * If SHSTK and IBT are available in KVM, clear CET user bit in > * kvm_caps.supported_xss so that kvm_cet_user_supported() returns > * false when called. > */ > if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) && > !kvm_cpu_cap_has(X86_FEATURE_IBT)) > kvm_caps.supported_xss &= ~XFEATURE_MASK_CET_USER; The comment is wrong, it should be "are not available in KVM". My intent is, if both features are not available in KVM, then clear the precondition bit so that all dependent checks will fail quickly. > > and by extension, all dependent code is also wrong. IIRC, there's a virtualization > hole, but I don't see any reason why KVM has to make the hole even bigger. Do you mean the issue that both SHSTK and IBT share one control MSR? i.e., U_CET/S_CET? > >> + return false; >> + >> + if (msr->host_initiated) >> + return true; >> + >> + if (!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && >> + !guest_cpuid_has(vcpu, X86_FEATURE_IBT)) >> + return false; >> + >> + if (msr->index == MSR_IA32_PL3_SSP && >> + !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK)) > I probably asked this long ago, but if I did I since forgot. Is it really just > PL3_SSP that depends on SHSTK? I would expect all shadow stack MSRs to depend > on SHSTK. All PL{0,1,2,3}_SSP plus INT_SSP_TAB msr depend on SHSTK. In patch 21, I added more MSRs in this helper. >> @@ -546,5 +557,25 @@ int kvm_sev_es_mmio_read(struct kvm_vcpu *vcpu, gpa_t src, unsigned int bytes, >> int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size, >> unsigned int port, void *data, unsigned int count, >> int in); >> +bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr); >> + >> +/* >> + * We've already loaded guest MSRs in __msr_io() after check the MSR index. > Please avoid pronouns OK. >> + * In case vcpu has been preempted, we need to disable preemption, check > vCPU. And this doesn't make any sense. The "vCPU" being preempted doesn't matter, > it's KVM, i.e. the task that's accessing vCPU state that cares about preemption. > I *think* what you're trying to say is that preemption needs to be disabled to > ensure that the guest values are resident. Sorry the comment is broken, I meant to say between kvm_load_guest_fpu() and the place to use this helper, the vCPU could have been preempted, so need to reload fpu with fpregs_lock_and_load() and disable preemption now before access MSR. >> + * and reload the guest fpu states before read/write xsaves-managed MSRs. >> + */ >> +static inline void kvm_get_xsave_msr(struct msr_data *msr_info) >> +{ >> + fpregs_lock_and_load(); > KVM already has helpers that do exactly this, and they have far better names for > KVM: kvm_fpu_get() and kvm_fpu_put(). Can you convert kvm_fpu_get() to > fpregs_lock_and_load() and use those isntead? And if the extra consistency checks > in fpregs_lock_and_load() fire, we definitely want to know, as it means we probably > have bugs in KVM. Do you want me to do some experiments to make sure the WARN() in fpregs_lock_and load() would be triggered or not? If no WARN() trigger, then replace fpregs_lock_and_load()/fpregs_unlock() with kvm_fpu_get()/ kvm_fpu_put()?
On Mon, Jun 26, 2023, Weijiang Yang wrote: > > On 6/24/2023 7:53 AM, Sean Christopherson wrote: > > On Thu, May 11, 2023, Yang Weijiang wrote: > > Side topic, what on earth does the SDM mean by this?!? > > > > The linear address written must be aligned to 8 bytes and bits 2:0 must be 0 > > (hardware requires bits 1:0 to be 0). > > > > I know Intel retroactively changed the alignment requirements, but the above > > is nonsensical. If ucode prevents writing bits 2:0, who cares what hardware > > requires? > > So do I ;-/ Can you follow-up with someone to get clarification? If writing bit 2 with '1' does not #GP despite the statement that it "must be aligned", then KVM shouldn't injected a #GP on that case. > > > + return 1; > > > + kvm_set_xsave_msr(msr_info); > > > + break; > > > case MSR_IA32_PERF_CAPABILITIES: > > > if (data && !vcpu_to_pmu(vcpu)->version) > > > return 1; > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > index b6eec9143129..2e3a39c9297c 100644 > > > --- a/arch/x86/kvm/x86.c > > > +++ b/arch/x86/kvm/x86.c > > > @@ -13630,6 +13630,26 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size, > > > } > > > EXPORT_SYMBOL_GPL(kvm_sev_es_string_io); > > > +bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr) > > > +{ > > > + if (!kvm_cet_user_supported()) > > This feels wrong. KVM should differentiate between SHSTK and IBT in the host. > > E.g. if running in a VM with SHSTK but not IBT, or vice versa, KVM should allow > > writes to non-existent MSRs. > > I don't follow you, in this case, which part KVM is on behalf of? guest or > user space? Sorry, typo. KVM *shouldn't* allow writes to non-existent MSRs. > > I.e. this looks wrong: > > > > /* > > * If SHSTK and IBT are available in KVM, clear CET user bit in > > * kvm_caps.supported_xss so that kvm_cet_user_supported() returns > > * false when called. > > */ > > if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) && > > !kvm_cpu_cap_has(X86_FEATURE_IBT)) > > kvm_caps.supported_xss &= ~XFEATURE_MASK_CET_USER; > > The comment is wrong, it should be "are not available in KVM". My intent is,� > if both features are not available in KVM, then clear the precondition bit so > that all dependent checks will fail quickly. Checking kvm_caps.supported_xss.CET_USER is worthless in 99% of the cases though. Unless I'm missing something, the only time it's useful is for CR4.CET, which doesn't differentiate between SHSTK and IBT. For everything else that KVM cares about, at some point KVM needs to precisely check for SHSTK and IBT support anyways > > and by extension, all dependent code is also wrong. IIRC, there's a virtualization > > hole, but I don't see any reason why KVM has to make the hole even bigger. > > Do you mean the issue that both SHSTK and IBT share one control MSR? i.e., > U_CET/S_CET? I mean that passing through PLx_SSP if the host has IBT but *not* SHSTK is wrong. > > > + return false; > > > + > > > + if (msr->host_initiated) > > > + return true; > > > + > > > + if (!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && > > > + !guest_cpuid_has(vcpu, X86_FEATURE_IBT)) > > > + return false; > > > + > > > + if (msr->index == MSR_IA32_PL3_SSP && > > > + !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK)) > > I probably asked this long ago, but if I did I since forgot. Is it really just > > PL3_SSP that depends on SHSTK? I would expect all shadow stack MSRs to depend > > on SHSTK. > > All PL{0,1,2,3}_SSP plus INT_SSP_TAB msr depend on SHSTK. In patch 21, I > added more MSRs in this helper. 
Sure, except that patch 21 never adds handling for PL{0,1,2}_SSP. I see: if (!kvm_cet_user_supported() && !(kvm_cpu_cap_has(X86_FEATURE_IBT) || kvm_cpu_cap_has(X86_FEATURE_SHSTK))) return false; if (msr->host_initiated) return true; if (!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && !guest_cpuid_has(vcpu, X86_FEATURE_IBT)) return false; /* The synthetic MSR is for userspace access only. */ if (msr->index == MSR_KVM_GUEST_SSP) return false; if (msr->index == MSR_IA32_U_CET) return true; if (msr->index == MSR_IA32_S_CET) return guest_cpuid_has(vcpu, X86_FEATURE_IBT) || kvm_cet_kernel_shstk_supported(); if (msr->index == MSR_IA32_INT_SSP_TAB) return guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && kvm_cet_kernel_shstk_supported(); if (msr->index == MSR_IA32_PL3_SSP && !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK)) return false; mask = (msr->index == MSR_IA32_PL3_SSP) ? XFEATURE_MASK_CET_USER : XFEATURE_MASK_CET_KERNEL; return !!(kvm_caps.supported_xss & mask); Which means that KVM will allow guest accesses to PL{0,1,2}_SSP regardless of whether or not X86_FEATURE_SHSTK is enumerated to the guest. And the above is also wrong for host_initiated writes to SHSTK MSRs. E.g. if KVM is running on a CPU that has IBT but not SHSTK, then userspace can write to MSRs that do not exist. Maybe this confusion is just a symptom of the series not providing proper Supervisor Shadow Stack support, but that's still a poor excuse for posting broken code. I suspect you tried to get too fancy. I don't see any reason to ever care about kvm_caps.supported_xss beyond emulating writes to XSS itself. Just require that both CET_USER and CET_KERNEL are supported in XSS to allow IBT or SHSTK, i.e. let X86_FEATURE_IBT and X86_FEATURE_SHSTK speak for themselves. That way, this can simply be: bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr) { if (is_shadow_stack_msr(...)) if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK)) return false; return msr->host_initiated || guest_cpuid_has(vcpu, X86_FEATURE_SHSTK); } if (!kvm_cpu_cap_has(X86_FEATURE_IBT) && !kvm_cpu_cap_has(X86_FEATURE_SHSTK)) return false; return msr->host_initiated || guest_cpuid_has(vcpu, X86_FEATURE_IBT) || guest_cpuid_has(vcpu, X86_FEATURE_SHSTK); } > > > + * and reload the guest fpu states before read/write xsaves-managed MSRs. > > > + */ > > > +static inline void kvm_get_xsave_msr(struct msr_data *msr_info) > > > +{ > > > + fpregs_lock_and_load(); > > KVM already has helpers that do exactly this, and they have far better names for > > KVM: kvm_fpu_get() and kvm_fpu_put(). Can you convert kvm_fpu_get() to > > fpregs_lock_and_load() and use those isntead? And if the extra consistency checks > > in fpregs_lock_and_load() fire, we definitely want to know, as it means we probably > > have bugs in KVM. > > Do you want me to do some experiments to make sure the WARN()� in > fpregs_lock_and load() would be triggered or not? Yes, though I shouldn't have to clarify that. The well-documented (as of now) expectation is that any code that someone posts is tested, unless explicitly stated otherwise. I.e. you should not have to ask if you should verify the WARN doesn't trigger, because you should be doing that for all code you post. > If no WARN() trigger, then replace fpregs_lock_and_load()/fpregs_unlock() > with kvm_fpu_get()/ > > kvm_fpu_put()? Yes.
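Written out as a compilable sketch with the braces balanced, the helper proposed above would look something like the following; is_shadow_stack_msr() is an assumed new helper that would need to cover all of the shadow-stack MSRs (the PLx_SSP MSRs, MSR_IA32_INT_SSP_TAB and the synthetic guest SSP MSR).

bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
	/* Shadow-stack MSRs are gated purely on SHSTK support. */
	if (is_shadow_stack_msr(msr->index)) {
		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
			return false;

		return msr->host_initiated ||
		       guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
	}

	/* U_CET/S_CET are shared by IBT and SHSTK. */
	if (!kvm_cpu_cap_has(X86_FEATURE_IBT) &&
	    !kvm_cpu_cap_has(X86_FEATURE_SHSTK))
		return false;

	return msr->host_initiated ||
	       guest_cpuid_has(vcpu, X86_FEATURE_IBT) ||
	       guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
}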
On 6/27/2023 5:15 AM, Sean Christopherson wrote: > On Mon, Jun 26, 2023, Weijiang Yang wrote: >> On 6/24/2023 7:53 AM, Sean Christopherson wrote: >>> On Thu, May 11, 2023, Yang Weijiang wrote: >>> Side topic, what on earth does the SDM mean by this?!? >>> >>> The linear address written must be aligned to 8 bytes and bits 2:0 must be 0 >>> (hardware requires bits 1:0 to be 0). >>> >>> I know Intel retroactively changed the alignment requirements, but the above >>> is nonsensical. If ucode prevents writing bits 2:0, who cares what hardware >>> requires? >> So do I ;-/ > Can you follow-up with someone to get clarification? If writing bit 2 with '1' > does not #GP despite the statement that it "must be aligned", then KVM shouldn't > injected a #GP on that case. OK, will consult someone and get back to this thread. > >>>> + return 1; >>>> + kvm_set_xsave_msr(msr_info); >>>> + break; >>>> case MSR_IA32_PERF_CAPABILITIES: >>>> if (data && !vcpu_to_pmu(vcpu)->version) >>>> return 1; >>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>>> index b6eec9143129..2e3a39c9297c 100644 >>>> --- a/arch/x86/kvm/x86.c >>>> +++ b/arch/x86/kvm/x86.c >>>> @@ -13630,6 +13630,26 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size, >>>> } >>>> EXPORT_SYMBOL_GPL(kvm_sev_es_string_io); >>>> +bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr) >>>> +{ >>>> + if (!kvm_cet_user_supported()) >>> This feels wrong. KVM should differentiate between SHSTK and IBT in the host. >>> E.g. if running in a VM with SHSTK but not IBT, or vice versa, KVM should allow >>> writes to non-existent MSRs. >> I don't follow you, in this case, which part KVM is on behalf of? guest or >> user space? > Sorry, typo. KVM *shouldn't* allow writes to non-existent MSRs. > >>> I.e. this looks wrong: >>> >>> /* >>> * If SHSTK and IBT are available in KVM, clear CET user bit in >>> * kvm_caps.supported_xss so that kvm_cet_user_supported() returns >>> * false when called. >>> */ >>> if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) && >>> !kvm_cpu_cap_has(X86_FEATURE_IBT)) >>> kvm_caps.supported_xss &= ~XFEATURE_MASK_CET_USER; >> The comment is wrong, it should be "are not available in KVM". My intent is,� >> if both features are not available in KVM, then clear the precondition bit so >> that all dependent checks will fail quickly. > Checking kvm_caps.supported_xss.CET_USER is worthless in 99% of the cases though. > Unless I'm missing something, the only time it's useful is for CR4.CET, which > doesn't differentiate between SHSTK and IBT. For everything else that KVM cares > about, at some point KVM needs to precisely check for SHSTK and IBT support > anyways I will tweak the patches and do precise checks based on the available features to guest. >>> and by extension, all dependent code is also wrong. IIRC, there's a virtualization >>> hole, but I don't see any reason why KVM has to make the hole even bigger. >> Do you mean the issue that both SHSTK and IBT share one control MSR? i.e., >> U_CET/S_CET? > I mean that passing through PLx_SSP if the host has IBT but *not* SHSTK is wrong. Understood. > >>>> + return false; >>>> + >>>> + if (msr->host_initiated) >>>> + return true; >>>> + >>>> + if (!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && >>>> + !guest_cpuid_has(vcpu, X86_FEATURE_IBT)) >>>> + return false; >>>> + >>>> + if (msr->index == MSR_IA32_PL3_SSP && >>>> + !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK)) >>> I probably asked this long ago, but if I did I since forgot. 
Is it really just >>> PL3_SSP that depends on SHSTK? I would expect all shadow stack MSRs to depend >>> on SHSTK. >> All PL{0,1,2,3}_SSP plus INT_SSP_TAB msr depend on SHSTK. In patch 21, I >> added more MSRs in this helper. > Sure, except that patch 21 never adds handling for PL{0,1,2}_SSP. I see: > > if (!kvm_cet_user_supported() && > !(kvm_cpu_cap_has(X86_FEATURE_IBT) || > kvm_cpu_cap_has(X86_FEATURE_SHSTK))) > return false; > > if (msr->host_initiated) > return true; > > if (!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && > !guest_cpuid_has(vcpu, X86_FEATURE_IBT)) > return false; > > /* The synthetic MSR is for userspace access only. */ > if (msr->index == MSR_KVM_GUEST_SSP) > return false; > > if (msr->index == MSR_IA32_U_CET) > return true; > > if (msr->index == MSR_IA32_S_CET) > return guest_cpuid_has(vcpu, X86_FEATURE_IBT) || > kvm_cet_kernel_shstk_supported(); > > if (msr->index == MSR_IA32_INT_SSP_TAB) > return guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && > kvm_cet_kernel_shstk_supported(); > > if (msr->index == MSR_IA32_PL3_SSP && > !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK)) > return false; > > mask = (msr->index == MSR_IA32_PL3_SSP) ? XFEATURE_MASK_CET_USER : > XFEATURE_MASK_CET_KERNEL; > return !!(kvm_caps.supported_xss & mask); > > Which means that KVM will allow guest accesses to PL{0,1,2}_SSP regardless of > whether or not X86_FEATURE_SHSTK is enumerated to the guest. Hmm, the check of X86_FEATURE_SHSTK is missing in this case. > > And the above is also wrong for host_initiated writes to SHSTK MSRs. E.g. if KVM > is running on a CPU that has IBT but not SHSTK, then userspace can write to MSRs > that do not exist. > > Maybe this confusion is just a symptom of the series not providing proper > Supervisor Shadow Stack support, but that's still a poor excuse for posting > broken code. > > I suspect you tried to get too fancy. I don't see any reason to ever care about > kvm_caps.supported_xss beyond emulating writes to XSS itself. Just require that > both CET_USER and CET_KERNEL are supported in XSS to allow IBT or SHSTK, i.e. let > X86_FEATURE_IBT and X86_FEATURE_SHSTK speak for themselves. That way, this can > simply be: You're right, kvm_cet_user_supported() is overused. Let me recap to see if I understand correctly: 1. Check both CET_USER and CET_KERNEL are supported in XSS before advertise SHSTK is supported in KVM and expose it to guest, the reason is once SHSTK is exposed to guest, KVM should support both modes to honor arch integrity. 2. Check CET_USER is supported before advertise IBT is supported in KVM and expose IBT, the reason is, user IBT(MSR_U_CET) depends on CET_USER bit while kernel IBT(MSR_S_CET) doesn't. > > bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr) > { > if (is_shadow_stack_msr(...)) > if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK)) > return false; > > return msr->host_initiated || > guest_cpuid_has(vcpu, X86_FEATURE_SHSTK); > } > > if (!kvm_cpu_cap_has(X86_FEATURE_IBT) && > !kvm_cpu_cap_has(X86_FEATURE_SHSTK)) > return false; Move above checks to the beginning? > > return msr->host_initiated || > guest_cpuid_has(vcpu, X86_FEATURE_IBT) || > guest_cpuid_has(vcpu, X86_FEATURE_SHSTK); > } > >>>> + * and reload the guest fpu states before read/write xsaves-managed MSRs. >>>> + */ >>>> +static inline void kvm_get_xsave_msr(struct msr_data *msr_info) >>>> +{ >>>> + fpregs_lock_and_load(); >>> KVM already has helpers that do exactly this, and they have far better names for >>> KVM: kvm_fpu_get() and kvm_fpu_put(). 
Can you convert kvm_fpu_get() to >>> fpregs_lock_and_load() and use those isntead? And if the extra consistency checks >>> in fpregs_lock_and_load() fire, we definitely want to know, as it means we probably >>> have bugs in KVM. >> Do you want me to do some experiments to make sure the WARN()� in >> fpregs_lock_and load() would be triggered or not? > Yes, though I shouldn't have to clarify that. The well-documented (as of now) > expectation is that any code that someone posts is tested, unless explicitly > stated otherwise. I.e. you should not have to ask if you should verify the WARN > doesn't trigger, because you should be doing that for all code you post. Surely I will do tests based on the change. > >> If no WARN() trigger, then replace fpregs_lock_and_load()/fpregs_unlock() >> with kvm_fpu_get()/ >> >> kvm_fpu_put()? > Yes.
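A sketch of the suggested conversion, assuming kvm_fpu_get() in arch/x86/kvm/fpu.h is first reworked to call fpregs_lock_and_load() (and thereby pick up its consistency checks) instead of open-coding the lock-and-restore sequence:

static inline void kvm_get_xsave_msr(struct msr_data *msr_info)
{
	kvm_fpu_get();		/* locks fpregs and restores the guest FPU state if needed */
	rdmsrl(msr_info->index, msr_info->data);
	kvm_fpu_put();		/* fpregs_unlock() */
}

static inline void kvm_set_xsave_msr(struct msr_data *msr_info)
{
	kvm_fpu_get();
	wrmsrl(msr_info->index, msr_info->data);
	kvm_fpu_put();
}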
On Tue, Jun 27, 2023, Weijiang Yang wrote: > > On 6/27/2023 5:15 AM, Sean Christopherson wrote: > > And the above is also wrong for host_initiated writes to SHSTK MSRs. E.g. if KVM > > is running on a CPU that has IBT but not SHSTK, then userspace can write to MSRs > > that do not exist. > > > > Maybe this confusion is just a symptom of the series not providing proper > > Supervisor Shadow Stack support, but that's still a poor excuse for posting > > broken code. > > > > I suspect you tried to get too fancy. I don't see any reason to ever care about > > kvm_caps.supported_xss beyond emulating writes to XSS itself. Just require that > > both CET_USER and CET_KERNEL are supported in XSS to allow IBT or SHSTK, i.e. let > > X86_FEATURE_IBT and X86_FEATURE_SHSTK speak for themselves. That way, this can > > simply be: > > You're right, kvm_cet_user_supported() is overused. > > Let me recap to see if I understand correctly: > > 1. Check both CET_USER and CET_KERNEL are supported in XSS before advertise > SHSTK is supported > > in KVM and expose it to guest, the reason is once SHSTK is exposed to guest, > KVM should support both modes to honor arch integrity. > > 2. Check CET_USER is supported before advertise IBT is supported in KVM� and > expose IBT, the reason is, user IBT(MSR_U_CET) depends on CET_USER bit while > kernel IBT(MSR_S_CET) doesn't. IBT can also used by the kernel... Just require that both CET_USER and CET_KERNEL are supported to advertise IBT or SHSTK. I don't see why this is needs to be any more complex than that. > > bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr) > > { > > if (is_shadow_stack_msr(...)) > > if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK)) > > return false; > > > > return msr->host_initiated || > > guest_cpuid_has(vcpu, X86_FEATURE_SHSTK); > > } > > > > if (!kvm_cpu_cap_has(X86_FEATURE_IBT) && > > !kvm_cpu_cap_has(X86_FEATURE_SHSTK)) > > return false; > > Move above checks to the beginning? Why? The is_shadow_stack_msr() would still have to recheck X86_FEATURE_SHSTK, so hoisting the checks to the top would be doing unnecessary work. > > return msr->host_initiated || > > guest_cpuid_has(vcpu, X86_FEATURE_IBT) || > > guest_cpuid_has(vcpu, X86_FEATURE_SHSTK); > > }
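Applied to KVM's capability setup, the "require both" rule could be as simple as the sketch below; where exactly this lands and whether a helper is introduced is up to the series.

	/* Don't advertise SHSTK or IBT unless both CET xfeatures can be virtualized. */
	if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) !=
	    (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
		kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
		kvm_cpu_cap_clear(X86_FEATURE_IBT);
	}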
On 6/27/2023 10:55 PM, Sean Christopherson wrote: > On Tue, Jun 27, 2023, Weijiang Yang wrote: >> On 6/27/2023 5:15 AM, Sean Christopherson wrote: >>> And the above is also wrong for host_initiated writes to SHSTK MSRs. E.g. if KVM >>> is running on a CPU that has IBT but not SHSTK, then userspace can write to MSRs >>> that do not exist. >>> >>> Maybe this confusion is just a symptom of the series not providing proper >>> Supervisor Shadow Stack support, but that's still a poor excuse for posting >>> broken code. >>> >>> I suspect you tried to get too fancy. I don't see any reason to ever care about >>> kvm_caps.supported_xss beyond emulating writes to XSS itself. Just require that >>> both CET_USER and CET_KERNEL are supported in XSS to allow IBT or SHSTK, i.e. let >>> X86_FEATURE_IBT and X86_FEATURE_SHSTK speak for themselves. That way, this can >>> simply be: >> You're right, kvm_cet_user_supported() is overused. >> >> Let me recap to see if I understand correctly: >> >> 1. Check both CET_USER and CET_KERNEL are supported in XSS before advertise >> SHSTK is supported >> >> in KVM and expose it to guest, the reason is once SHSTK is exposed to guest, >> KVM should support both modes to honor arch integrity. >> >> 2. Check CET_USER is supported before advertise IBT is supported in KVM� and >> expose IBT, the reason is, user IBT(MSR_U_CET) depends on CET_USER bit while >> kernel IBT(MSR_S_CET) doesn't. > IBT can also used by the kernel... > > Just require that both CET_USER and CET_KERNEL are supported to advertise IBT > or SHSTK. I don't see why this is needs to be any more complex than that. The arch control for user/kernel mode CET is the big source of complexity of the helpers. Currently, CET_USER bit manages IA32_U_CET and IA32_PL3_SSP. And CET_KERNEL bit manages PL{0,1,2}_SSP, but architectural control/enable of IBT(user or kernel) is through IA32_{U,S}_CET, the former is XSAVE-managed, but the latter is not. Checking both before enable the features would make things much easier, but looks like CET_KERNEL check for kernel IBT is excessive, just want to get your opinion on this. Thanks! > >>> bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr) >>> { >>> if (is_shadow_stack_msr(...)) >>> if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK)) >>> return false; >>> >>> return msr->host_initiated || >>> guest_cpuid_has(vcpu, X86_FEATURE_SHSTK); >>> } >>> >>> if (!kvm_cpu_cap_has(X86_FEATURE_IBT) && >>> !kvm_cpu_cap_has(X86_FEATURE_SHSTK)) >>> return false; >> Move above checks to the beginning? > Why? The is_shadow_stack_msr() would still have to recheck X86_FEATURE_SHSTK, > so hoisting the checks to the top would be doing unnecessary work. Yeah, just considered from readability perspective for the change, but it does introduce unnecessary check. Will follow it. > >>> return msr->host_initiated || >>> guest_cpuid_has(vcpu, X86_FEATURE_IBT) || >>> guest_cpuid_has(vcpu, X86_FEATURE_SHSTK); >>> }
>> + case MSR_IA32_PL3_SSP: >> + if (!kvm_cet_is_msr_accessible(vcpu, msr_info)) >> + return 1; >> + if (is_noncanonical_address(data, vcpu)) >> + return 1; >> + if (msr_index == MSR_IA32_U_CET && (data & GENMASK(9, 6))) >> + return 1; >> + if (msr_index == MSR_IA32_PL3_SSP && (data & GENMASK(2, 0))) > Please #define reserved bits, ideally using the inverse of the valid masks. And > for SSP, it might be better to do IS_ALIGNED(data, 8) (or 4, pending my question > about the SDM's wording). > > Side topic, what on earth does the SDM mean by this?!? > > The linear address written must be aligned to 8 bytes and bits 2:0 must be 0 > (hardware requires bits 1:0 to be 0). > > I know Intel retroactively changed the alignment requirements, but the above > is nonsensical. If ucode prevents writing bits 2:0, who cares what hardware > requires? Hi, Sean, Regarding the alignment check, I got update from Gil: ================================================== The WRMSR instruction to load IA32_PL[0-3]_SSP will #GP if the value to be loaded sets either bit 0 or bit 1. It does not check bit 2. IDT event delivery, when changing to rings 0-2 will load SSP from the MSR corresponding to the new ring. These transitions check that bits 2:0 of the new value are all zero and will generate a nested fault if any of those bits are set. (Far CALL using a call gate also checks this if changing CPL.) For a VMM that is emulating a WRMSR by a guest OS (because it was intercepting writes to that MSR), it suffices to perform the same checks as the CPU would (i.e., only bits 1:0): • If the VMM sees bits 1:0 clear, it can perform the write on the part of the guest OS. If the guest OS later encounters a #GP during IDT event delivery (because bit 2 is set), it is its own fault. • If the VMM sets either bit 0 or bit 1 set, it should inject a #GP into the guest, as that is what the CPU would do in this case. For an OS that is writing to the MSRs to set up shadow stacks, it should WRMSR the base addresses of those stacks. Because of the token-based architecture used for supervisor shadow stacks (for rings 0-2), the base addresses of those stacks should be 8-byte aligned (clearing bits 2:0). Thus, the values that an OS writes to the corresponding MSRs should clear bits 2:0. (Of course, most OS’s will use only the MSR for ring 0, as most OS’s do not use rings 1 and 2.) In contrast, the IA32_PL3_SSP MSR holds the current SSP for user software. When a user thread is created, I suppose it may reference the base of the user shadow stack. For a 32-bit app, that needs to be 4-byte aligned (bits 1:0 clear); for a 64-bit app, it may be necessary for it to be 8-byte aligned (bits 2:0) clear. Once the user thread is executing, the CPU will load IA32_PL3_SSP with the user’s value of SSP on every exception and interrupt to ring 0. The value at that time may be 4-byte or 8-byte aligned, depending on how the user thread is using the shadow stack. On context switches, the OS should WRMSR whatever value was saved (by RDMSR) the last time there was a context switch away from the incoming thread. The OS should not need to inspect or change this value. =================================================== Based on his feedback, I think VMM needs to check bits 1:0 when write the SSP MSRs. Is it?
There is a small typo above (which came from me originally): Where it says, "If the VMM sets either bit 0 or bit 1 set, it should inject a #GP" - it should be "If the VMM _sees_ ...". - Gil
On Fri, Jul 07, 2023, Weijiang Yang wrote: > > Side topic, what on earth does the SDM mean by this?!? > > > > The linear address written must be aligned to 8 bytes and bits 2:0 must be 0 > > (hardware requires bits 1:0 to be 0). > > > > I know Intel retroactively changed the alignment requirements, but the above > > is nonsensical. If ucode prevents writing bits 2:0, who cares what hardware > > requires? > > Hi, Sean, > > Regarding the alignment check, I got update from Gil: > > ================================================== > > The WRMSR instruction to load IA32_PL[0-3]_SSP will #GP if the value to be > loaded sets either bit 0 or bit 1. It does not check bit 2. > IDT event delivery, when changing to rings 0-2 will load SSP from the MSR > corresponding to the new ring. These transitions check that bits 2:0 of the > new value are all zero and will generate a nested fault if any of those bits > are set. (Far CALL using a call gate also checks this if changing CPL.) > > For a VMM that is emulating a WRMSR by a guest OS (because it was > intercepting writes to that MSR), it suffices to perform the same checks as > the CPU would (i.e., only bits 1:0): > • If the VMM sees bits 1:0 clear, it can perform the write on the part of > the guest OS. If the guest OS later encounters a #GP during IDT event > delivery (because bit 2 is set), it is its own fault. > • If the VMM sets either bit 0 or bit 1 set, it should inject a #GP into > the guest, as that is what the CPU would do in this case. > > For an OS that is writing to the MSRs to set up shadow stacks, it should > WRMSR the base addresses of those stacks. Because of the token-based > architecture used for supervisor shadow stacks (for rings 0-2), the base > addresses of those stacks should be 8-byte aligned (clearing bits 2:0). > Thus, the values that an OS writes to the corresponding MSRs should clear > bits 2:0. > > (Of course, most OS’s will use only the MSR for ring 0, as most OS’s do not > use rings 1 and 2.) > > In contrast, the IA32_PL3_SSP MSR holds the current SSP for user software. > When a user thread is created, I suppose it may reference the base of the > user shadow stack. For a 32-bit app, that needs to be 4-byte aligned (bits > 1:0 clear); for a 64-bit app, it may be necessary for it to be 8-byte > aligned (bits 2:0) clear. > > Once the user thread is executing, the CPU will load IA32_PL3_SSP with the > user’s value of SSP on every exception and interrupt to ring 0. The value > at that time may be 4-byte or 8-byte aligned, depending on how the user > thread is using the shadow stack. On context switches, the OS should WRMSR > whatever value was saved (by RDMSR) the last time there was a context switch > away from the incoming thread. The OS should not need to inspect or change > this value. > > =================================================== > > Based on his feedback, I think VMM needs to check bits 1:0 when write the > SSP MSRs. Is it? Yep, KVM should only check bits 1:0 when emulating WRMSR. KVM doesn't emulate event delivery except for Real Mode, and I don't see that ever changing. So to "handle" the #GP during event delivery case, KVM just needs to propagate the "bad" value into guest context, which KVM needs to do anyways. Thanks for following up on this!
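In code, that conclusion reduces the write-time SSP check to bits 1:0, e.g. (a sketch, not the final form of the patch):

	case MSR_IA32_PL3_SSP:
		if (!kvm_cet_is_msr_accessible(vcpu, msr_info))
			return 1;
		if (is_noncanonical_address(data, vcpu))
			return 1;
		/*
		 * WRMSR only #GPs if bit 0 or bit 1 is set; bit 2 is checked
		 * by the CPU at IDT event delivery, so a "bad" bit 2 is simply
		 * propagated into guest context.
		 */
		if (!IS_ALIGNED(data, 4))
			return 1;
		kvm_set_xsave_msr(msr_info);
		break;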
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index f851558b673f..b4e28487882c 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -770,6 +770,7 @@ void fpregs_lock_and_load(void) if (test_thread_flag(TIF_NEED_FPU_LOAD)) fpregs_restore_userregs(); } +EXPORT_SYMBOL_GPL(fpregs_lock_and_load); #ifdef CONFIG_X86_DEBUG_FPU /* diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index c872a5aafa50..0ccaa467d7d3 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2093,6 +2093,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) else msr_info->data = vmx->pt_desc.guest.addr_a[index / 2]; break; + case MSR_IA32_U_CET: + case MSR_IA32_PL3_SSP: + if (!kvm_cet_is_msr_accessible(vcpu, msr_info)) + return 1; + kvm_get_xsave_msr(msr_info); + break; case MSR_IA32_DEBUGCTLMSR: msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL); break; @@ -2405,6 +2411,18 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) else vmx->pt_desc.guest.addr_a[index / 2] = data; break; + case MSR_IA32_U_CET: + case MSR_IA32_PL3_SSP: + if (!kvm_cet_is_msr_accessible(vcpu, msr_info)) + return 1; + if (is_noncanonical_address(data, vcpu)) + return 1; + if (msr_index == MSR_IA32_U_CET && (data & GENMASK(9, 6))) + return 1; + if (msr_index == MSR_IA32_PL3_SSP && (data & GENMASK(2, 0))) + return 1; + kvm_set_xsave_msr(msr_info); + break; case MSR_IA32_PERF_CAPABILITIES: if (data && !vcpu_to_pmu(vcpu)->version) return 1; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b6eec9143129..2e3a39c9297c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -13630,6 +13630,26 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size, } EXPORT_SYMBOL_GPL(kvm_sev_es_string_io); +bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr) +{ + if (!kvm_cet_user_supported()) + return false; + + if (msr->host_initiated) + return true; + + if (!guest_cpuid_has(vcpu, X86_FEATURE_SHSTK) && + !guest_cpuid_has(vcpu, X86_FEATURE_IBT)) + return false; + + if (msr->index == MSR_IA32_PL3_SSP && + !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK)) + return false; + + return true; +} +EXPORT_SYMBOL_GPL(kvm_cet_is_msr_accessible); + EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit); EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio); diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 2ba7c7fc4846..93afa7631735 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -2,6 +2,7 @@ #ifndef ARCH_X86_KVM_X86_H #define ARCH_X86_KVM_X86_H +#include <asm/fpu/api.h> #include <linux/kvm_host.h> #include <asm/fpu/xstate.h> #include <asm/mce.h> @@ -370,6 +371,16 @@ static inline bool kvm_mpx_supported(void) == (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR); } +/* + * Guest CET user mode states depend on host XSAVES/XRSTORS to save/restore + * when vCPU enter/exit user space. If host doesn't support CET user bit in + * XSS msr, then treat this case as KVM doesn't support CET user mode. 
+ */ +static inline bool kvm_cet_user_supported(void) +{ + return !!(kvm_caps.supported_xss & XFEATURE_MASK_CET_USER); +} + extern unsigned int min_timer_period_us; extern bool enable_vmware_backdoor; @@ -546,5 +557,25 @@ int kvm_sev_es_mmio_read(struct kvm_vcpu *vcpu, gpa_t src, unsigned int bytes, int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size, unsigned int port, void *data, unsigned int count, int in); +bool kvm_cet_is_msr_accessible(struct kvm_vcpu *vcpu, struct msr_data *msr); + +/* + * We've already loaded guest MSRs in __msr_io() after check the MSR index. + * In case vcpu has been preempted, we need to disable preemption, check + * and reload the guest fpu states before read/write xsaves-managed MSRs. + */ +static inline void kvm_get_xsave_msr(struct msr_data *msr_info) +{ + fpregs_lock_and_load(); + rdmsrl(msr_info->index, msr_info->data); + fpregs_unlock(); +} + +static inline void kvm_set_xsave_msr(struct msr_data *msr_info) +{ + fpregs_lock_and_load(); + wrmsrl(msr_info->index, msr_info->data); + fpregs_unlock(); +} #endif