Message ID: 20230511040857.6094-1-weijiang.yang@intel.com (mailing list archive)
Series: Enable CET Virtualization
On Thu, May 11, 2023, Yang Weijiang wrote:
> The last patch is introduced to support supervisor SHSTK but the feature is
> not enabled on Intel platform for now, the main purpose of this patch is to
> facilitate AMD folks to enable the feature.

I am beyond confused by the SDM's wording of CET_SSS.

First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
more appropriate phrasing).

  Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
  shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
  will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
  Architectures Software Developer's Manual, Volume 1).

But then it says that VMMs shouldn't set the bit.

  When emulating the CPUID instruction, a virtual-machine monitor should return
  this bit as 0 if those pushes can cause VM exits.

Based on the Xen code (which is sadly a far better source of information than the
SDM), I *think* that what the SDM is trying to say is that VMMs should not set
CET_SSS if VM-Exits can occur ***and*** the bit is not set on the host CPU.  Because
if the SDM really means "VMMs should never set the bit", then what on earth is the
point of the bit.

> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
> doesn't fully support CET supervisor SHSTK, the enabling work is left for
> the future.

Why?  If my interpretation of the SDM is correct, then all the pieces are there.

> Executed all KVM-unit-test cases and KVM selftests against this series, all
> test cases passed except the vmx test, the failure is due to CR4_CET bit
> testing in test_vmxon_bad_cr().  After adding the CR4_CET bit to the skip
> list, the test passed.  I'll send a patch to fix this issue later.

Your cover letter from v2 back in April said the same thing.  Why hasn't the patch
been posted?  And what exactly is the issue?  IIUC, setting CR4.CET with
MSR_IA32_S_CET=0 and MSR_IA32_U_CET=0 should be a nop, which suggests that there's
a KVM bug.  And if that's the case, the next obvious question is, why are you
posting known buggy code?
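For reference, a minimal sketch (not from the thread or the series; the helper
name and macro are illustrative) of how software could probe the CET_SSS
enumeration discussed above, i.e. CPUID.(EAX=07H,ECX=1H):EDX[bit 18]:

  #include <cpuid.h>
  #include <stdbool.h>

  #define CET_SSS_BIT	18	/* CPUID.(EAX=07H,ECX=1H):EDX[18] */

  static bool cpu_has_cet_sss(void)
  {
  	unsigned int eax, ebx, ecx, edx;

  	/* Leaf 0x7, subleaf 1; __get_cpuid_count() returns 0 if the leaf
  	 * is out of range for this CPU. */
  	if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
  		return false;

  	return edx & (1u << CET_SSS_BIT);
  }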
On Thu, Jun 15, 2023, Sean Christopherson wrote:
> Your cover letter from v2 back in April said the same thing.  Why hasn't the patch
> been posted?  And what exactly is the issue?  IIUC, setting CR4.CET with
> MSR_IA32_S_CET=0 and MSR_IA32_U_CET=0 should be a nop, which suggests that there's
> a KVM bug.  And if that's the case, the next obvious question is, why are you
> posting known buggy code?

Ah, is the problem that the test doesn't set CR0.WP as required by CR4.CET=1?
On 6/16/2023 8:00 AM, Sean Christopherson wrote:
> On Thu, Jun 15, 2023, Sean Christopherson wrote:
>> Your cover letter from v2 back in April said the same thing.  Why hasn't the patch
>> been posted?  And what exactly is the issue?
>> [...]
> Ah, is the problem that the test doesn't set CR0.WP as required by CR4.CET=1?

Thanks for taking time to review this series!

Yes, it's because the CR0.WP bit is not set while CR4.CET is being set.  The
check is imposed by patch 12.  I'll add the fixup patch together with the next
version.
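A minimal sketch of the kind of fixup being discussed, assuming
kvm-unit-tests-style read_cr0()/write_cr0()/read_cr4()/write_cr4() helpers;
the constants are defined locally here for illustration:

  /* Illustrative only: architecturally, CR4.CET can only be set while
   * CR0.WP=1 (setting CET with WP=0 generates #GP), so the test must
   * raise WP before enabling CET. */
  #define X86_CR0_WP	(1ul << 16)
  #define X86_CR4_CET	(1ul << 23)	/* defined here for illustration */

  static void enable_cr4_cet(void)
  {
  	write_cr0(read_cr0() | X86_CR0_WP);	/* must precede CR4.CET=1 */
  	write_cr4(read_cr4() | X86_CR4_CET);
  }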
On 6/16/2023 7:30 AM, Sean Christopherson wrote:
> On Thu, May 11, 2023, Yang Weijiang wrote:
>> The last patch is introduced to support supervisor SHSTK but the feature is
>> not enabled on Intel platform for now, the main purpose of this patch is to
>> facilitate AMD folks to enable the feature.
> I am beyond confused by the SDM's wording of CET_SSS.
> [...]
> Based on the Xen code (which is sadly a far better source of information than the
> SDM), I *think* that what the SDM is trying to say is that VMMs should not set
> CET_SSS if VM-Exits can occur ***and*** the bit is not set on the host CPU.  Because
> if the SDM really means "VMMs should never set the bit", then what on earth is the
> point of the bit.

I need to double check the vague description.

From my understanding, on the bare metal side, if the bit is 1, the OS can
enable SSS if pushes won't cause page faults.  But for the VM case, it's not
recommended (regardless of the bit state) to set the bit, as VM-Exits caused by
guest SSS pushes cannot be fully excluded.

In other words, the bit is mainly for bare metal guidance now.

>> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
>> doesn't fully support CET supervisor SHSTK, the enabling work is left for
>> the future.
> Why?  If my interpretation of the SDM is correct, then all the pieces are there.

My assumption is, VM supervisor SHSTK depends on bare metal kernel support, as
the PL0_SSP MSR is backed by XSAVES via IA32_XSS bit 12 (CET_S), but this part
of the support is not there in Rick's native series.  And also based on the
above SDM description, I don't want to add the support blindly now.

> [...]
On Fri, Jun 16, 2023, Weijiang Yang wrote:
> On 6/16/2023 7:30 AM, Sean Christopherson wrote:
> > On Thu, May 11, 2023, Yang Weijiang wrote:
> > [...]
> I need to double check the vague description.
>
> From my understanding, on the bare metal side, if the bit is 1, the OS can
> enable SSS if pushes won't cause page faults.  But for the VM case, it's not
> recommended (regardless of the bit state) to set the bit, as VM-Exits caused by
> guest SSS pushes cannot be fully excluded.
>
> In other words, the bit is mainly for bare metal guidance now.
>
> > > In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
> > > doesn't fully support CET supervisor SHSTK, the enabling work is left for
> > > the future.
> > Why?  If my interpretation of the SDM is correct, then all the pieces are there.

...

> And also based on the above SDM description, I don't want to add the support
> blindly now.

*sigh*

I got filled in on the details offlist.

1) In the next version of this series, please rework it to reincorporate Supervisor
   Shadow Stack support into the main series, i.e. pretend Intel's implementation
   isn't horribly flawed.  KVM can't guarantee that a VM-Exit won't occur, i.e.
   can't advertise CET_SSS, but I want the baseline support to be implemented,
   otherwise the series as a whole is a big confusing mess with unanswered
   questions left, right, and center.  And more importantly, architecturally SSS
   exists if X86_FEATURE_SHSTK is enumerated, i.e. the guest should be allowed to
   utilize SSS if it so chooses, with the obvious caveat that there's a non-zero
   chance the guest risks death by doing so.  Or if userspace can ensure no
   VM-Exit will occur, which is difficult but feasible (ignoring #MC), e.g. by
   statically partitioning memory, prefaulting all memory in guest firmware, and
   not dirty logging SSS pages.  In such an extreme setup, userspace can
   enumerate CET_SSS to the guest, and KVM should support that.

2) Add the below patch to document exactly why KVM doesn't advertise CET_SSS.
   While Intel is apparently ok with treating KVM developers like mushrooms, I
   am not.
---
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 16 Jun 2023 10:04:37 -0700
Subject: [PATCH] KVM: x86: Explicitly document that KVM must not advertise
 CET_SSS

Explicitly call out that KVM must NOT advertise CET_SSS to userspace,
i.e. must not tell userspace and thus the guest that it is safe for the
guest to enable Supervisor Shadow Stacks (SSS).

Intel's implementation of SSS is fatally flawed for virtualized
environments, as despite wording in the SDM that suggests otherwise,
Intel CPUs' handling of shadow stack switches is NOT fully atomic.  Only
the check-and-update of the supervisor shadow stack token's busy bit is
atomic.  Per the SDM:

  If the far CALL or event delivery pushes a stack frame after the token
  is acquired and any of the pushes causes a fault or VM exit, the
  processor will revert to the old shadow stack and the busy bit in the
  new shadow stack's token remains set.

Or more bluntly, any fault or VM-Exit that occurs when pushing to the
shadow stack after the busy bit is set is fatal to the kernel, i.e. to
the guest in KVM's case.  The (guest) kernel can protect itself against
faults, e.g. by ensuring that the shadow stack always has a valid mapping,
but a guest kernel obviously has no control over, or even knowledge of,
VM-Exits due to host activity.

To help software determine when it is safe to use SSS, Intel defined
CPUID.0x7.1.EDX bit 18 (CET_SSS) and updated Intel CPUs to enumerate
CET_SSS, i.e. bare metal Intel CPUs advertise to software that it is safe
to enable SSS:

  If CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated as 1, it is
  sufficient for an operating system to ensure that none of the pushes can
  cause a page fault.

But CET_SSS also comes with a major caveat that is kinda sorta documented
in the SDM:

  When emulating the CPUID instruction, a virtual-machine monitor should
  return this bit as 0 if those pushes can cause VM exits.

In other words, CET_SSS (bit 18) does NOT enumerate that the underlying
CPU prevents VM-Exits, only that the environment in which the software is
running will not generate VM-Exits.  I.e. CET_SSS is a stopgap to stem the
bleeding and allow kernels to enable SSS, not an indication that the
underlying CPU is immune to the VM-Exit problem.

And unfortunately, KVM itself effectively has zero chance of ensuring that
a shadow stack switch can't trigger a VM-Exit, e.g. KVM zaps *all* SPTEs
when any memslot is deleted, enabling dirty logging write-protects SPTEs,
etc.  A sufficiently motivated userspace can, at least in theory, provide
a safe environment for SSS, e.g. by statically partitioning and
prefaulting (in guest firmware) all memory, disabling PML, never
write-protecting guest shadow stacks, etc.  But such a setup is far, far
beyond typical KVM deployments.

Note, AMD CPUs have a similar erratum, but AMD CPUs *DO* perform the full
shadow stack switch atomically so long as the stack is mapped WB and does
not cross a page boundary, i.e. a "normal" KVM setup and a well-behaved
guest play nice with SSS without additional shenanigans.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/cpuid.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 1e3ee96c879b..ecf4a68aaa08 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -658,7 +658,15 @@ void kvm_set_cpu_caps(void)
 	);
 
 	kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX,
-		F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI)
+		F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) |
+
+		/*
+		 * Do NOT advertise CET_SSS, i.e. do not tell userspace and the
+		 * guest that it is safe to use Supervisor Shadow Stacks under
+		 * KVM when running on Intel CPUs.  KVM itself cannot guarantee
+		 * that a VM-Exit won't occur during a shadow stack update.
+		 */
+		0 /* F(CET_SSS) */
 	);
 
 	kvm_cpu_cap_mask(CPUID_D_1_EAX,

base-commit: 9305c14847719870e9e08294034861360577ce08
--
On 6/17/2023 1:56 AM, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Weijiang Yang wrote:
> [...]
>> And also based on the above SDM description, I don't want to add the support
>> blindly now.
> *sigh*
>
> I got filled in on the details offlist.
>
> 1) In the next version of this series, please rework it to reincorporate Supervisor
>    Shadow Stack support into the main series, i.e. pretend Intel's implementation
>    isn't horribly flawed.

Let me make it clear, you want me to do two things:

1) Add Supervisor Shadow Stack state support (i.e., XSS bit 12 (CET_S)) into the
   kernel so that the host can support guest Supervisor Shadow Stack MSRs in the
   guest/host FPU context switch.

2) Add Supervisor Shadow Stack support into the KVM part so that the guest OS is
   able to use SSS, with the attendant risk.

Is that correct?

> KVM can't guarantee that a VM-Exit won't occur, i.e. can't advertise CET_SSS,
> but I want the baseline support to be implemented, otherwise the series as a
> whole is a big confusing mess with unanswered questions left, right, and center.
> [...]
> In such an extreme setup, userspace can enumerate CET_SSS to the guest, and KVM
> should support that.

Makes sense, provide the support but take the risk on your own.

> 2) Add the below patch to document exactly why KVM doesn't advertise CET_SSS.
>    While Intel is apparently ok with treating KVM developers like mushrooms, I
>    am not.

Will add it, thanks a lot for the detailed changelog!

> [...]
On Mon, Jun 19, 2023, Weijiang Yang wrote:
> On 6/17/2023 1:56 AM, Sean Christopherson wrote:
> [...]
> > 1) In the next version of this series, please rework it to reincorporate Supervisor
> >    Shadow Stack support into the main series, i.e. pretend Intel's implementation
> >    isn't horribly flawed.
>
> Let me make it clear, you want me to do two things:
>
> 1) Add Supervisor Shadow Stack state support (i.e., XSS bit 12 (CET_S)) into the
>    kernel so that the host can support guest Supervisor Shadow Stack MSRs in the
>    guest/host FPU context switch.

If that's necessary for correct functionality, yes.

> 2) Add Supervisor Shadow Stack support into the KVM part so that the guest OS is
>    able to use SSS, with the attendant risk.

Yes.  Architecturally, if KVM advertises X86_FEATURE_SHSTK, then KVM needs to
provide both User and Supervisor support.  CET_SSS doesn't change the
architecture, it's little more than a hint.  And even if the guest follows the
SDM's recommendation to not enable shadow stacks, a clever kernel can still
utilize SSS assets, e.g. use the MSRs as scratch registers.
On 6/24/2023 4:51 AM, Sean Christopherson wrote:
> On Mon, Jun 19, 2023, Weijiang Yang wrote:
> [...]
> Yes.  Architecturally, if KVM advertises X86_FEATURE_SHSTK, then KVM needs to
> provide both User and Supervisor support.  CET_SSS doesn't change the
> architecture, it's little more than a hint.  And even if the guest follows the
> SDM's recommendation to not enable shadow stacks, a clever kernel can still
> utilize SSS assets, e.g. use the MSRs as scratch registers.

Understood, thanks!
> *sigh*
>
> I got filled in on the details offlist.
>
> 1) In the next version of this series, please rework it to reincorporate Supervisor
>    Shadow Stack support into the main series, i.e. pretend Intel's implementation
>    isn't horribly flawed.
> [...]
> 2) Add the below patch to document exactly why KVM doesn't advertise CET_SSS.
>    While Intel is apparently ok with treating KVM developers like mushrooms, I
>    am not.
> [...]
> base-commit: 9305c14847719870e9e08294034861360577ce08

Hi, Sean,

Gil reminded me that the SDM's CET SSS related topics were updated recently
(June release):

======================================================================
Section 17.2.3 (Supervisor Shadow Stack Token) in Volume 1 of the SDM:

  If the far CALL or event delivery pushes a stack frame after the token is
  acquired and any of the pushes causes a fault or VM exit, the processor will
  revert to the old shadow stack and the busy bit in the new shadow stack's
  token remains set.  The new shadow stack is said to be prematurely busy.
  Software should enable supervisor shadow stacks only if it is certain that
  this situation cannot occur.  If CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is
  enumerated as 1, it is sufficient for an operating system to ensure that
  none of the pushes can cause a page fault.

Volume 2A, on CPUID.(EAX=07H,ECX=1H):EDX[bit 18]:

  CET_SSS.  If 1, indicates that an operating system can enable supervisor
  shadow stacks as long as it ensures that a supervisor shadow stack cannot
  become prematurely busy due to page faults (see Section 17.2.3 of the Intel®
  64 and IA-32 Architectures Software Developer's Manual, Volume 1).

  When emulating the CPUID instruction, a virtual-machine monitor (VMM) should
  return this bit as 1 only if it ensures that VM exits cannot cause a guest
  supervisor shadow stack to appear to be prematurely busy.  Such a VMM could
  set the "prematurely busy shadow stack" VM-exit control and use the
  additional information that it provides.

Volume 3C: new "prematurely busy shadow stack" VM-exit control.
======================================================================

And Gil told me additional information was planned to be released later in the
summer.  Maybe you need to modify the above changelog a bit per the update.
Given the updated parts are a technical forecast, I don't plan to implement them
in this series and will still enumerate CET_SSS == 0 for the guest.  What are
your thoughts?
On Mon, Jul 10, 2023, Weijiang Yang wrote:
> Maybe you need to modify the above changelog a bit per the update.

Ya, I'll make sure the changelog gets updated before CET support is merged,
though feel free to post the next version without waiting for the new changelog.

> Given the updated parts are a technical forecast, I don't plan to implement
> them in this series and will still enumerate CET_SSS == 0 for the guest.
> What are your thoughts?

Yes, definitely punt the shadow-stack fixup to future enabling work.
On 7/11/2023 6:18 AM, Sean Christopherson wrote:
> On Mon, Jul 10, 2023, Weijiang Yang wrote:
>> Maybe you need to modify the above changelog a bit per the update.
> Ya, I'll make sure the changelog gets updated before CET support is merged,
> though feel free to post the next version without waiting for the new changelog.

Sure, thanks!

>> Given the updated parts are a technical forecast, I don't plan to implement
>> them in this series and will still enumerate CET_SSS == 0 for the guest.
>> What are your thoughts?
> Yes, definitely punt the shadow-stack fixup to future enabling work.

Got it.
On 6/24/2023 4:51 AM, Sean Christopherson wrote:
> On Mon, Jun 19, 2023, Weijiang Yang wrote:
>> [...]
>> Let me make it clear, you want me to do two things:
>>
>> 1) Add Supervisor Shadow Stack state support (i.e., XSS bit 12 (CET_S)) into the
>>    kernel so that the host can support guest Supervisor Shadow Stack MSRs in the
>>    guest/host FPU context switch.
> If that's necessary for correct functionality, yes.

Hi, Sean,

I held off posting the new version and want to sync up with you on this point
to avoid surprising you.

After discussing adding the patch to the kernel with Rick and Chao, we reached
the below conclusions on doing so:

The pros:
 - Super easy to implement for KVM.
 - Automatically avoids saving and restoring this data when the vmexit
   is handled within KVM.

The cons:
 - Unnecessarily restores XFEATURE_CET_KERNEL when switching to a
   non-KVM task's userspace.
 - Forces allocating space for this state on all tasks, whether or not
   they use KVM, and with likely zero users today and in the near future.
 - Complicates the FPU optimization thinking by including things in the FPU
   that can have no effect on userspace.

Given the above reasons, I implemented guest CET supervisor state management in
KVM instead of adding a kernel patch for it.

Below are the 3 KVM patches to support it:

Patch 1: Save/reload guest CET supervisor states when necessary:

=======================================================================

commit 16147ede75dee29583b7d42a6621d10d55b63595
Author: Yang Weijiang <weijiang.yang@intel.com>
Date:   Tue Jul 11 02:26:17 2023 -0400

    KVM:x86: Make guest supervisor states as non-XSAVE managed

    Save and reload guest CET supervisor states, i.e., PL{0,1,2}_SSP,
    when vCPU context is being swapped before and after userspace <->
    kernel entry, and also do the same operation when the vCPU is
    sched-in or sched-out.

    Enable CET supervisor state management only in KVM because:

    1) Currently, supervisor SHSTK is not enabled on the host side; only
       KVM needs to care about the guest's supervisor SHSTK states.

    2) Enabling them in the kernel FPU state framework has global effects
       on all threads on the host kernel, but the majority of the threads
       don't use CET supervisor states, and it requires additional storage
       in the thread FPU state area.

    Add a new helper kvm_arch_sched_out() for that purpose.  Adding the
    support in kvm_arch_vcpu_put/load() without the new helper looks
    possible, but the put/load functions are also called in
    vcpu_put()/load(), and the latter are heavily used in KVM, so adding
    the new helper makes the implementation clearer.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 7e7e19ef6993..98235cb3d258 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1023,6 +1023,7 @@ void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu);
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
 
 void kvm_arm_init_debug(void);
 void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);

diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 957121a495f0..56c5e85ba5a3 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -893,6 +893,7 @@ static inline void kvm_arch_free_memslot(struct kvm *kvm,
 					 struct kvm_memory_slot *slot) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..11587d953bf6 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -880,6 +880,7 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}

diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index ee0acccb1d3b..6ff4a04fe0f2 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -244,6 +244,7 @@ struct kvm_vcpu_arch {
 
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 2bbc3d54959d..d1750a6a86cf 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -1033,6 +1033,7 @@ extern int kvm_s390_gisc_unregister(struct kvm *kvm, u32 gisc);
 
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_free_memslot(struct kvm *kvm,
 					 struct kvm_memory_slot *slot) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e2c549f147a5..7d9cfb7e2fe8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 	trace_kvm_fpu(0);
 }
 
+static void kvm_save_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
+{
+	preempt_disable();
+	if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
+		rdmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
+		rdmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
+		rdmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
+		wrmsrl(MSR_IA32_PL0_SSP, 0);
+		wrmsrl(MSR_IA32_PL1_SSP, 0);
+		wrmsrl(MSR_IA32_PL2_SSP, 0);
+	}
+	preempt_enable();
+}
+
+static void kvm_reload_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
+{
+	preempt_disable();
+	if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
+		wrmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
+		wrmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
+		wrmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
+	}
+	preempt_enable();
+}
+
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 {
 	struct kvm_queued_exception *ex = &vcpu->arch.exception;
@@ -11222,6 +11247,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	kvm_sigset_activate(vcpu);
 	kvm_run->flags = 0;
 	kvm_load_guest_fpu(vcpu);
+	kvm_reload_cet_supervisor_ssp(vcpu);
 
 	kvm_vcpu_srcu_read_lock(vcpu);
 	if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
@@ -11310,6 +11336,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	r = vcpu_run(vcpu);
 out:
+	kvm_save_cet_supervisor_ssp(vcpu);
 	kvm_put_guest_fpu(vcpu);
 	if (kvm_run->kvm_valid_regs)
 		store_regs(vcpu);
@@ -12398,9 +12425,17 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 		pmu->need_cleanup = true;
 		kvm_make_request(KVM_REQ_PMU, vcpu);
 	}
+
+	kvm_reload_cet_supervisor_ssp(vcpu);
+
 	static_call(kvm_x86_sched_in)(vcpu, cpu);
 }
 
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu)
+{
+	kvm_save_cet_supervisor_ssp(vcpu);
+}
+
 void kvm_arch_free_vm(struct kvm *kvm)
 {
 	kfree(to_kvm_hv(kvm)->hv_pa_pg);

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d90331f16db1..b3032a5f0641 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1423,6 +1423,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu);
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu);
 
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 66c1447d3c7f..42f28e8905e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5885,6 +5885,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 {
 	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
 
+	kvm_arch_sched_out(vcpu, 0);
 	if (current->on_rq) {
 		WRITE_ONCE(vcpu->preempted, true);
 		WRITE_ONCE(vcpu->ready, true);

Patch 2: optimization patch for the above one:

===================================================================

commit ae5fe7c81cc3b93193758d1b7b4ab74a92a51dad
Author: Yang Weijiang <weijiang.yang@intel.com>
Date:   Fri Jul 14 20:03:52 2023 -0400

    KVM:x86: Optimize CET supervisor SSP save/reload

    Make PL{0,1,2}_SSP write-intercepted to detect whether the guest is
    using these MSRs.  Disable the intercept for the MSRs if they're
    written with non-zero values.  KVM does save/reload for the MSRs
    only if they're used by the guest.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 69cbc9d9b277..c50b555234fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -748,6 +748,7 @@ struct kvm_vcpu_arch {
 	bool tpr_access_reporting;
 	bool xsaves_enabled;
 	bool xfd_no_write_intercept;
+	bool cet_sss_active;
 	u64 ia32_xss;
 	u64 microcode_version;
 	u64 arch_capabilities;

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 90ce1c7d3fd7..21c89d200c88 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2156,6 +2156,18 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
 	return debugctl;
 }
 
+static void vmx_disable_write_intercept_sss_msr(struct kvm_vcpu *vcpu)
+{
+	if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
+					  MSR_TYPE_RW, false);
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
+					  MSR_TYPE_RW, false);
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
+					  MSR_TYPE_RW, false);
+	}
+}
+
 /*
  * Writes msr value into the appropriate "register".
  * Returns 0 on success, non-0 otherwise.
@@ -2427,7 +2439,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 #define VMX_CET_CONTROL_MASK	(~GENMASK_ULL(9,6))
 #define LEG_BITMAP_BASE(data)	((data) >> 12)
 	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
-		return kvm_set_msr_common(vcpu, msr_info);
+		if (kvm_set_msr_common(vcpu, msr_info))
+			return 1;
+		/*
+		 * Write to the base SSP MSRs should happen ahead of toggling
+		 * of IA32_S_CET.SH_STK_EN bit.
+		 */
+		if (!msr_info->host_initiated &&
+		    msr_index != MSR_IA32_PL3_SSP && data) {
+			vmx_disable_write_intercept_sss_msr(vcpu);
+			wrmsrl(msr_index, data);
+		}
 		break;
 	case MSR_IA32_U_CET:
 	case MSR_IA32_S_CET:
@@ -7773,12 +7795,17 @@ static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
 					  MSR_TYPE_RW, false);
 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
 					  MSR_TYPE_RW, false);
+		/*
+		 * Supervisor shadow stack MSRs are intercepted until
+		 * they're written by the guest; this is designed to
+		 * optimize the save/restore overhead.
+		 */
 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
-					  MSR_TYPE_RW, false);
+					  MSR_TYPE_R, false);
 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
-					  MSR_TYPE_RW, false);
+					  MSR_TYPE_R, false);
 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
-					  MSR_TYPE_RW, false);
+					  MSR_TYPE_R, false);
 		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cab31dbb2bec..06dc5111da3b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4049,8 +4049,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!IS_ALIGNED(data, 4))
 			return 1;
 		if (msr == MSR_IA32_PL0_SSP || msr == MSR_IA32_PL1_SSP ||
-		    msr == MSR_IA32_PL2_SSP)
+		    msr == MSR_IA32_PL2_SSP) {
+			if (!msr_info->host_initiated && data)
+				vcpu->arch.cet_sss_active = true;
 			vcpu->arch.cet_s_ssp[msr - MSR_IA32_PL0_SSP] = data;
+		}
 		else if (msr == MSR_IA32_PL3_SSP)
 			kvm_set_xsave_msr(msr_info);
 		break;
@@ -11250,7 +11253,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	kvm_sigset_activate(vcpu);
 	kvm_run->flags = 0;
 	kvm_load_guest_fpu(vcpu);
-	kvm_reload_cet_supervisor_ssp(vcpu);
+	if (vcpu->arch.cet_sss_active)
+		kvm_reload_cet_supervisor_ssp(vcpu);
 
 	kvm_vcpu_srcu_read_lock(vcpu);
 	if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
@@ -11339,7 +11343,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 	r = vcpu_run(vcpu);
 out:
-	kvm_save_cet_supervisor_ssp(vcpu);
+	if (vcpu->arch.cet_sss_active)
+		kvm_save_cet_supervisor_ssp(vcpu);
 	kvm_put_guest_fpu(vcpu);
 	if (kvm_run->kvm_valid_regs)
 		store_regs(vcpu);
@@ -12428,15 +12433,16 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 		pmu->need_cleanup = true;
 		kvm_make_request(KVM_REQ_PMU, vcpu);
 	}
-
-	kvm_reload_cet_supervisor_ssp(vcpu);
+	if (vcpu->arch.cet_sss_active)
+		kvm_reload_cet_supervisor_ssp(vcpu);
 
 	static_call(kvm_x86_sched_in)(vcpu, cpu);
 }
 
 void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu)
 {
-	kvm_save_cet_supervisor_ssp(vcpu);
+	if (vcpu->arch.cet_sss_active)
+		kvm_save_cet_supervisor_ssp(vcpu);
 }
 
 void kvm_arch_free_vm(struct kvm *kvm)

=============================================================

Patch 3: support guest CET supervisor xstate bit:

commit 2708b3c959db56fb9243f9a157884c2120b8810c
Author: Yang Weijiang <weijiang.yang@intel.com>
Date:   Sat Jul 15 20:56:37 2023 -0400

    KVM:x86: Enable guest CET supervisor xstate bit support

    Add the S_CET bit to kvm_caps.supported_xss so that the guest can
    enumerate the feature in CPUID(0xd,1).ECX.  The guest S_CET xstate
    bit is specially handled, i.e., it can be exposed without the
    related enabling on the host side, because KVM manually
    saves/reloads the guest's supervisor SHSTK SSPs and the current XSS
    swap logic for host/guest also supports doing so; thus it's safe to
    enable the bit without host support.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2653e5eb54ee..071bcdedc530 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -228,7 +228,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
 				| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
 				| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
 
-#define KVM_SUPPORTED_XSS	(XFEATURE_MASK_CET_USER)
+#define KVM_SUPPORTED_XSS	(XFEATURE_MASK_CET_USER | \
+				 XFEATURE_MASK_CET_KERNEL)
 
 u64 __read_mostly host_efer;
 EXPORT_SYMBOL_GPL(host_efer);
@@ -9639,6 +9640,7 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 	if (boot_cpu_has(X86_FEATURE_XSAVES)) {
 		rdmsrl(MSR_IA32_XSS, host_xss);
 		kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;
+		kvm_caps.supported_xss |= XFEATURE_MASK_CET_KERNEL;
 	}
 
 	kvm_init_pmu_capability(ops->pmu_ops);

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f8f042c91728..df187d7c3e74 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -362,7 +362,7 @@ static inline bool kvm_mpx_supported(void)
 		== (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
 }
 
-#define CET_XSTATE_MASK (XFEATURE_MASK_CET_USER)
+#define CET_XSTATE_MASK (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)
 /*
  * Shadow Stack and Indirect Branch Tracking feature enabling depends on
  * whether host side CET user xstate bit is supported or not.

=================================================================

What are your thoughts on the solution?  Is it appropriate for KVM?

Thanks!

[...]
On Mon, Jul 17, 2023, Weijiang Yang wrote:
>
> On 6/24/2023 4:51 AM, Sean Christopherson wrote:
> > > 1)Add Supervisor Shadow Stack� state support(i.e., XSS.bit12(CET_S)) into
> > > kernel so that host can support guest Supervisor Shadow Stack MSRs in g/h FPU
> > > context switch.
> > If that's necessary for correct functionality, yes.

...

> the Pros:
> �- Super easy to implement for KVM.
> �- Automatically avoids saving and restoring this data when the vmexit
> �� is handled within KVM.
>
> the Cons:
> �- Unnecessarily restores XFEATURE_CET_KERNEL when switching to
> �� non-KVM task's userspace.
> �- Forces allocating space for this state on all tasks, whether or not
> �� they use KVM, and with likely zero users today and the near future.
> �- Complicates the FPU optimization thinking by including things that
> �� can have no affect on userspace in the FPU
>
> Given above reasons, I implemented guest CET supervisor states management
> in KVM instead of adding a kernel patch for it.
>
> Below are 3 KVM patches to support it:
>
> Patch 1: Save/reload guest CET supervisor states when necessary:
> [...]
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e2c549f147a5..7d9cfb7e2fe8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
> ������� trace_kvm_fpu(0);
> �}
>
> +static void kvm_save_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
> +{
> +������ preempt_disable();
> +������ if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
> +�������������� rdmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
> +�������������� rdmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
> +�������������� rdmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
> +�������������� wrmsrl(MSR_IA32_PL0_SSP, 0);
> +�������������� wrmsrl(MSR_IA32_PL1_SSP, 0);
> +�������������� wrmsrl(MSR_IA32_PL2_SSP, 0);
> +������ }
> +������ preempt_enable();
> +}
> +
> +static void kvm_reload_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
> +{
> +������ preempt_disable();
> +������ if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
> +�������������� wrmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
> +�������������� wrmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
> +�������������� wrmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
> +������ }
> +������ preempt_enable();
> +}

My understanding is that PL[0-2]_SSP are used only on transitions to the
corresponding privilege level from a *different* privilege level.  That means
KVM should be able to utilize the user_return_msr framework to load the host
values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
per-task, on every context switch.

But note my original wording: **If that's necessary**

If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
needs to guarantee that it doesn't leak values between guests.  But that should
Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.

And regardless of what mechanism ends up managing the SSP MSRs, it should only
ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
never consume PL{1,2}_SSP.

Am I missing something?
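A rough sketch of the user_return_msr idea, assuming KVM's existing
kvm_add_user_return_msr()/kvm_set_user_return_msr() helpers and the
cet_s_ssp array from the patches above; the function names and wiring here
are illustrative, not a tested patch:

  /* Hypothetical: route MSR_IA32_PL0_SSP through KVM's user-return-MSR
   * framework so the host value is restored lazily on the next return
   * to userspace rather than on every vCPU put. */
  static int pl0_ssp_uret_slot = -1;

  /* At vendor module init, after kvm_cpu_caps are set up: */
  static void example_init_pl0_ssp_slot(void)
  {
  	if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
  		pl0_ssp_uret_slot = kvm_add_user_return_msr(MSR_IA32_PL0_SSP);
  }

  /* On the VM-Enter path: write the guest value now; the framework
   * restores the saved host value on return to userspace. */
  static void example_load_guest_pl0_ssp(struct kvm_vcpu *vcpu)
  {
  	if (pl0_ssp_uret_slot >= 0)
  		kvm_set_user_return_msr(pl0_ssp_uret_slot,
  					vcpu->arch.cet_s_ssp[0], -1ull);
  }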
On Wed, Jul 19, 2023, Sean Christopherson wrote:
> On Mon, Jul 17, 2023, Weijiang Yang wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index e2c549f147a5..7d9cfb7e2fe8 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
> > ������� trace_kvm_fpu(0);
> > �}

Huh.  After a bit of debugging, the mangling is due to mutt's default for
send_charset being

  "us-ascii:iso-8859-1:utf-8"

and selecting iso-8859-1 instead of utf-8 as the encoding despite the original
mail being utf-8.  In this case, mutt ran afoul of nbsp (u+00a0).

AFAICT, the solution is to essentially tell mutt to never try to use iso-8859-1
for sending mail:

  set send_charset="us-ascii:utf-8"
On Wed, Jul 19, 2023 at 12:41:47PM -0700, Sean Christopherson wrote:
> My understanding is that PL[0-2]_SSP are used only on transitions to the
> corresponding privilege level from a *different* privilege level.  That means
> KVM should be able to utilize the user_return_msr framework to load the host
> values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
> have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> per-task, on every context switch.
>
> But note my original wording: **If that's necessary**
>
> If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
> needs to guarantee that it doesn't leak values between guests.  But that should
> Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
>
> And regardless of what mechanism ends up managing the SSP MSRs, it should only
> ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e.
> will never consume PL{1,2}_SSP.

To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.
On 7/20/2023 3:41 AM, Sean Christopherson wrote:
> [...]
> My understanding is that PL[0-2]_SSP are used only on transitions to the
> corresponding privilege level from a *different* privilege level.  That means
> KVM should be able to utilize the user_return_msr framework to load the host
> values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
> have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> per-task, on every context switch.
>
> But note my original wording: **If that's necessary**

Thanks! I think host SSS enabling won't happen in the short term, so handling
the guest supervisor states on the KVM side is doable.

> If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
> needs to guarantee that it doesn't leak values between guests.  But that should
> Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.

Yes, this handling is covered by the new version.

> And regardless of what mechanism ends up managing the SSP MSRs, it should only
> ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e.
> will never consume PL{1,2}_SSP.
>
> Am I missing something?

I think guest PL{0,1,2}_SSP can be handled as a bundle (instead of handling
each MSR separately) to keep the handling simple, because the guest can be a
non-Linux system and, as you said before, the MSRs could even be used as
scratch registers.  But for the host side, since it's Linux, I can omit
reloading/resetting host PL{1,2}_SSP when the vCPU thread is preempted; see
the sketch below.

I will post the new version to the community if the above is only a minor
divergence.
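To illustrate that split, a rough sketch (not the actual next-version patch;
host_pl0_ssp is a placeholder for however the host value ends up being
tracked, and guest_can_use()/cet_s_ssp[] are from the patch quoted earlier):

/*
 * Illustrative sketch only: save the guest PL{0,1,2}_SSP bundle on
 * vCPU put, but restore a host value only for PL0_SSP, because Linux
 * never runs anything at CPL1/2 and so never consumes PL{1,2}_SSP.
 */
static u64 host_pl0_ssp;       /* placeholder for the host's PL0_SSP */

static void kvm_put_guest_cet_ssp(struct kvm_vcpu *vcpu)
{
        preempt_disable();
        if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
                rdmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
                rdmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
                rdmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
                /* Only PL0_SSP can ever be consumed by the host. */
                wrmsrl(MSR_IA32_PL0_SSP, host_pl0_ssp);
        }
        preempt_enable();
}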
On 7/20/2023 4:26 AM, Sean Christopherson wrote:
> On Wed, Jul 19, 2023, Sean Christopherson wrote:
>> On Mon, Jul 17, 2023, Weijiang Yang wrote:
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index e2c549f147a5..7d9cfb7e2fe8 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu
>>> *vcpu)
>>> ������� trace_kvm_fpu(0);
>>> �}
> Huh.  After a bit of debugging, the mangling is due to mutt's default for send_charset
> being
>
>   "us-ascii:iso-8859-1:utf-8"
>
> and selecting iso-8859-1 instead of utf-8 as the encoding despite the original
> mail being utf-8.  In this case, mutt ran afoul of nbsp (u+00a0).
>
> AFAICT, the solution is to essentially tell mutt to never try to use iso-8859-1
> for sending mail
>
>   set send_charset="us-ascii:utf-8"

It made me feel a bit guilty, as I thought it might have resulted from wrong
settings in my email system :-)
> > My understanding is that PL[0-2]_SSP are used only on transitions to the
> > corresponding privilege level from a *different* privilege level.  That means
> > KVM should be able to utilize the user_return_msr framework to load the host
> > values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
> > have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> > userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> > per-task, on every context switch.
> >
> > But note my original wording: **If that's necessary**
> >
> > If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> > IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
> > needs to guarantee that it doesn't leak values between guests.  But that should
> > Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> > guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
> >
> > And regardless of what mechanism ends up managing the SSP MSRs, it should only
> > ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e.
> > will never consume PL{1,2}_SSP.
>
> To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.

Trying to understand more what prevents SSS from being enabled pre-FRED: is it
the better #CP exception handling with other nested exceptions?  Won't the
same problems (to some extent) occur with user-mode shadow stacks (and, in the
guest case, SSS inside a VM)?

Thanks,
Pankaj
On Thu, Jul 20, 2023 at 07:26:04AM +0200, Pankaj Gupta wrote:
> > > My understanding is that PL[0-2]_SSP are used only on transitions to the
> > > corresponding privilege level from a *different* privilege level.  That means
> > > KVM should be able to utilize the user_return_msr framework to load the host
> > > values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
> > > have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> > > userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> > > per-task, on every context switch.
> > >
> > > But note my original wording: **If that's necessary**
> > >
> > > If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> > > IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
> > > needs to guarantee that it doesn't leak values between guests.  But that should
> > > Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> > > guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
> > >
> > > And regardless of what mechanism ends up managing the SSP MSRs, it should only
> > > ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e.
> > > will never consume PL{1,2}_SSP.
> >
> > To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.
>
> Trying to understand more what prevents SSS from being enabled pre-FRED: is it
> the better #CP exception handling with other nested exceptions?

SSS took the syscall gap and made it worse -- as in *way* worse.

To top it off, the whole SSS busy bit thing is fundamentally
incompatible with how we manage to survive nested exceptions in NMI
context.

Basically, the whole x86 exception / stack switching logic was already
borderline impossible (consider taking an MCE in the early NMI path
where we set up, but have not finished, the re-entrancy stuff), and SSS
pushed it over the edge and set it on fire.

And NMI isn't the only problem, the various new virt exceptions #VC and
#HV are on their own already near impossible, adding SSS again pushes
the whole thing into clear insanity.

There's a good exposition of the whole trainwreck by Andrew here:

  https://www.youtube.com/watch?v=qcORS8CN0ow

(that is, sorry for the youtube link, but Google is failing me in
finding the actual Google Doc that talk is based on, or even the slide
deck :/)

FRED solves all that by:

 - removing the stack gap; cs/ip/ss/sp/ssp/gs will all be switched
   atomically and consistently for every transition.

 - removing the non-reentrant IST mechanism and replacing it with stack
   levels

 - adding an explicit NMI latch

 - re-organising the actual shadow stacks and doing away with that busy
   bit thing (I need to re-read the FRED spec on this detail again).

Crazy as we are, we're not touching legacy/IDT SSS with a ten foot pole,
sorry.
> > > Trying to understand more what prevents SSS from being enabled pre-FRED:
> > > is it the better #CP exception handling with other nested exceptions?
> >
> > SSS took the syscall gap and made it worse -- as in *way* worse.
> >
> > To top it off, the whole SSS busy bit thing is fundamentally
> > incompatible with how we manage to survive nested exceptions in NMI
> > context.
> >
> > Basically, the whole x86 exception / stack switching logic was already
> > borderline impossible (consider taking an MCE in the early NMI path
> > where we set up, but have not finished, the re-entrancy stuff), and SSS
> > pushed it over the edge and set it on fire.

ah I see. SSS takes it to the next level.

> > And NMI isn't the only problem, the various new virt exceptions #VC and
> > #HV are on their own already near impossible, adding SSS again pushes
> > the whole thing into clear insanity.
> >
> > There's a good exposition of the whole trainwreck by Andrew here:
> >
> >   https://www.youtube.com/watch?v=qcORS8CN0ow
> >
> > (that is, sorry for the youtube link, but Google is failing me in
> > finding the actual Google Doc that talk is based on, or even the slide
> > deck :/)

I think I got the link:
https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?pli=1

> > FRED solves all that by:
> >
> > - removing the stack gap; cs/ip/ss/sp/ssp/gs will all be switched
> >   atomically and consistently for every transition.
> >
> > - removing the non-reentrant IST mechanism and replacing it with stack
> >   levels
> >
> > - adding an explicit NMI latch
> >
> > - re-organising the actual shadow stacks and doing away with that busy
> >   bit thing (I need to re-read the FRED spec on this detail again).

Thank you for explaining. I will also study the FRED spec and the
corresponding kernel patches posted on the mailing list.

> > Crazy as we are, we're not touching legacy/IDT SSS with a ten foot pole,
> > sorry.

ya, interesting.

Best regards,
Pankaj
On 20/07/2023 9:03 am, Peter Zijlstra wrote:
> On Thu, Jul 20, 2023 at 07:26:04AM +0200, Pankaj Gupta wrote:
>>>> My understanding is that PL[0-2]_SSP are used only on transitions to the
>>>> corresponding privilege level from a *different* privilege level.  That means
>>>> KVM should be able to utilize the user_return_msr framework to load the host
>>>> values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
>>>> have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
>>>> userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
>>>> per-task, on every context switch.
>>>>
>>>> But note my original wording: **If that's necessary**
>>>>
>>>> If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
>>>> IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
>>>> needs to guarantee that it doesn't leak values between guests.  But that should
>>>> Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
>>>> guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
>>>>
>>>> And regardless of what mechanism ends up managing the SSP MSRs, it should only
>>>> ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e.
>>>> will never consume PL{1,2}_SSP.
>>> To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.
>> Trying to understand more what prevents SSS from being enabled pre-FRED: is it
>> the better #CP exception handling with other nested exceptions?
> SSS

Careful with SSS for "supervisor shadow stacks".

Because there's a brand new CET_SSS CPUID bit to cover the (mis)feature
where shstk supervisor tokens can be *prematurely busy*.

(11/10 masterful wordsmithing, because it does lull you into the
impression that this isn't WTF^2 levels of crazy)

> took the syscall gap and made it worse -- as in *way* worse.

More impressively, it created a sysenter gap where there wasn't one
previously.

> To top it off, the whole SSS busy bit thing is fundamentally
> incompatible with how we manage to survive nested exceptions in NMI
> context.

To be clear, this is about the regular supervisor shadow stack busy
bits, not the CET_SSS premature-busy problem.

> Basically, the whole x86 exception / stack switching logic was already
> borderline impossible (consider taking an MCE in the early NMI path
> where we set up, but have not finished, the re-entrancy stuff), and SSS
> pushed it over the edge and set it on fire.
>
> And NMI isn't the only problem, the various new virt exceptions #VC and
> #HV are on their own already near impossible, adding SSS again pushes
> the whole thing into clear insanity.
>
> There's a good exposition of the whole trainwreck by Andrew here:
>
>   https://www.youtube.com/watch?v=qcORS8CN0ow
>
> (that is, sorry for the youtube link, but Google is failing me in
> finding the actual Google Doc that talk is based on, or even the slide
> deck :/)

https://docs.google.com/presentation/d/10vWC02kpy4QneI43qsT3worfF_e3sbAE3Ifr61Sq3dY/edit?usp=sharing
is the slide deck.

I'm very glad I put a "only accurate as of $PRESENTATION_DATE"
disclaimer on slide 14.  It makes the whole presentation still
technically correct.

FRED is now at draft 5, and importantly shstk tokens have been removed.
They've been replaced with an alternative MSR-based mechanism, mostly
for performance reasons, but a consequence is that the prematurely-busy
bug can't happen.

~Andrew
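For completeness, the VMM side of hiding CET_SSS from guests amounts to
masking a single CPUID bit.  A sketch, assuming the bit lives in
CPUID.(EAX=7,ECX=1):EDX[18] as the SDM describes (the constant and helper
below are illustrative, not existing kernel symbols):

/* CET_SSS: CPUID.(EAX=7,ECX=1):EDX bit 18, per the SDM. */
#define CET_SSS_BIT    18U

/*
 * Illustrative helper: a VMM that cannot rule out VM-exits on
 * supervisor shadow-stack pushes reports CET_SSS as 0 to its guests,
 * regardless of the host CPU's value.
 */
static inline u32 vmm_sanitize_leaf7_1_edx(u32 host_edx)
{
        return host_edx & ~(1U << CET_SSS_BIT);
}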