mbox series

[v3,00/21] Enable CET Virtualization

Message ID 20230511040857.6094-1-weijiang.yang@intel.com (mailing list archive)
Headers show
Series Enable CET Virtualization | expand

Message

Yang, Weijiang May 11, 2023, 4:08 a.m. UTC
Control-flow Enforcement Technology (CET) is a CPU feature used to prevent
Return/Jump-Oriented Programming (ROP/JOP) attacks. CET introduces a new
exception type, Control Protection (#CP), and two sub-features(SHSTK,IBT)
to defend against ROP/JOP style control-flow subversion attacks.

Shadow Stack (SHSTK):
  A shadow stack is a second stack used exclusively for control transfer
  operations. The shadow stack is separate from the data/normal stack and
  can be enabled individually in user and kernel mode. When shadow stack
  is enabled, CALL pushes the return address on both the data and shadow
  stack. RET pops the return address from both stacks and compares them.
  If the return addresses from the two stacks do not match, the processor
  generates a #CP.

Indirect Branch Tracking (IBT):
  IBT adds a new instruction, ENDBRANCH, to mark valid target addresses of
  indirect branches (CALL, JMP etc...). If an indirect branch is executed
  and the next instruction is _not_ an ENDBRANCH, the processor generates a
  #CP. These instruction behaves as a NOP on platforms that doesn't support
  CET.


Dependency:
--------------------------------------------------------------------------
The first 5 patches are taken over from CET native series [1] in linux-next.
They're prerequisites for enabling guest user mode SHSTK. Patch this full
series before build host kernel for guest CET testing. Also apply CET enabling
patches in [2] to build qualified QEMU. These kernel dependent patches will
be enclosed in KVM series until CET native series is merged in mainline tree.


Implementation:
--------------------------------------------------------------------------
Historically, the early KVM patches can support both user SHSTK and IBT,
and most of the early patches are carried forward with changes in this new
series. And with kernel IBT feature merged in 5.18, a new patch was added
to support the feature in guest. The last patch is introduced to support
supervisor SHSTK but the feature is not enabled on Intel platform for now,
the main purpose of this patch is to facilitate AMD folks to enable the
feature.

In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
doesn't fully support CET supervisor SHSTK, the enabling work is left for
the future.

Supported CET sub-features:

                  |
    User SHSTK    |    User IBT      (user mode)
--------------------------------------------------
    s-SHSTK (X)   |    Kernel IBT    (kernel mode)
                  |

Guest user mode SHSTK/IBT relies on host side XSAVES support(XSS[bit 11])
to swap CET states. Guest kernel IBT doesn't have dependency on host XSAVES.
The supervisor SHSTK relies on host side XSAVES support(XSS[bit 12]) for
supervisor mode CET states save/restore.

This version removed unnecessary checks of host CET enabling status before
expose CET features to guest, making guest CET enabling apart from host.
By doing so, it's expected to be more friendly to cloud computing scenarios.


CET states management:
--------------------------------------------------------------------------
CET user mode states, MSR_IA32_{U_CET,PL3_SSP} depends on {XSAVES,XRSTORS}
instructions to swap guest/host context when vm-exit/vm-entry happens. 
On vm-exit, the guest CET states are stored to guest fpu area and host user
mode states are loaded from thread/process context before vCPU returns to
userspace, vice-versa on vm-entry. See details in kvm_{load|put}_guest_fpu().
So the user mode state validity depends on host side U_CET bit set in MSR_XSS.

CET supervisor mode states are grouped into two categories - XSAVES dependent
and non-dependent, the former includes MSR_IA32_PL{0,1,2}_SSP, the later
consists of MSR_IA32_S_CET and MSR_IA32_INTR_SSP_TBL. The XSAVES dependent
MSR's save/restore depends on S_CET bit set in MSR_XSS. Since native series
doesn't enable S_CET support, these s-SHSTK shadow stack pointers are invalid.

New VMCS fields, {GUEST|HOST}_{S_CET,SSP,INTR_SSP_TABL}, are introduced for
guest/host non-XSAVES managed states switch. When CET entry/exit load bits are
set, guest/host MSR_IA32_{S_CET,INTR_SSP_TBL,SSP} are loaded from these fields
at vm-exit/entry. With these new fields, current guest kernel IBT enabling
doesn't depend on S_CET bit in XSS, i.e., host {XSAVES|XRSTORS} support.


Tests:
--------------------------------------------------------------------------
This series passed basic CET user shadow stack test and kernel IBT test in
L1 and L2 guest. It also works with CET KVM-unit-test application.

Executed all KVM-unit-test cases and KVM selftests against this series, all
test cases passed except the vmx test, the failure is due to CR4_CET bit
testing in test_vmxon_bad_cr(). After add CR4_CET bit to skip list, the test
passed. I'll send a patch to fix this issue later.


To run user shadow stack test and kernel IBT test in VM, you need an CET
capable platform, e.g., Sapphire Rapids server, and follow below steps to
build host/guest kernel properly:

1. Build host kernel. Patch this series to kernel tree and build kernel.

2. Build guest kernel. Patch CET native series to kernel tree and opt-in
CONFIG_X86_KERNEL_IBT and CONFIG_X86_USER_SHADOW_STACK options. Build with
CET enabled gcc versions(>= 8.5.0).

3. Use patched QEMU to launch a VM.

Check kernel selftest test_shadow_stack_64 output:

[INFO]  new_ssp = 7f8c82100ff8, *new_ssp = 7f8c82101001
[INFO]  changing ssp from 7f8c82900ff0 to 7f8c82100ff8
[INFO]  ssp is now 7f8c82101000
[OK]    Shadow stack pivot
[OK]    Shadow stack faults
[INFO]  Corrupting shadow stack
[INFO]  Generated shadow stack violation successfully
[OK]    Shadow stack violation test
[INFO]  Gup read -> shstk access success
[INFO]  Gup write -> shstk access success
[INFO]  Violation from normal write
[INFO]  Gup read -> write access success
[INFO]  Violation from normal write
[INFO]  Gup write -> write access success
[INFO]  Cow gup write -> write access success
[OK]    Shadow gup test
[INFO]  Violation from shstk access
[OK]    mprotect() test
[SKIP]  Userfaultfd unavailable.
[OK]    32 bit test


Check kernel IBT with dmesg | grep CET:

CET detected: Indirect Branch Tracking enabled

--------------------------------------------------------------------------
Changes in v3:
1. Moved MSR access check helper to x86 common file. [Mike]
2. Modified cover letter, commit logs and code per review comments. [PeterZ, Binbin, Rick]
3. Fixed an issue on host MSR_IA32_S_CET reload at vm-exit.
5. Rebase on kvm-x86/next [4].


[1]: linux-next: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/?h=next-20230420
[2]: QEMU patch: https://lore.kernel.org/all/20230421041227.90915-1-weijiang.yang@intel.com/
[3]: v2 patchset: https://lore.kernel.org/all/20230421134615.62539-1-weijiang.yang@intel.com/
[4]: Rebase branch: https://github.com/kvm-x86/linux.git, commit: 5c291b93e5d6 (tag: kvm-x86-next-2023.04.26)


Rick Edgecombe (5):
  x86/shstk: Add Kconfig option for shadow stack
  x86/cpufeatures: Add CPU feature flags for shadow stacks
  x86/cpufeatures: Enable CET CR4 bit for shadow stack
  x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  x86/fpu: Add helper for modifying xstate

Sean Christopherson (2):
  KVM:x86: Report XSS as to-be-saved if there are supported features
  KVM:x86: Load guest FPU state when accessing xsaves-managed MSRs

Yang Weijiang (14):
  KVM:x86: Refresh CPUID on write to guest MSR_IA32_XSS
  KVM:x86: Init kvm_caps.supported_xss with supported feature bits
  KVM:x86: Add #CP support in guest exception classification
  KVM:VMX: Introduce CET VMCS fields and control bits
  KVM:x86: Add fault checks for guest CR4.CET setting
  KVM:VMX: Emulate reads and writes to CET MSRs
  KVM:VMX: Add a synthetic MSR to allow userspace to access GUEST_SSP
  KVM:x86: Report CET MSRs as to-be-saved if CET is supported
  KVM:x86: Save/Restore GUEST_SSP to/from SMM state save area
  KVM:VMX: Pass through user CET MSRs to the guest
  KVM:x86: Enable CET virtualization for VMX and advertise to userspace
  KVM:nVMX: Enable user CET support for nested VMX
  KVM:x86: Enable kernel IBT support for guest
  KVM:x86: Support CET supervisor shadow stack MSR access

 arch/x86/Kconfig                         |  24 +++++
 arch/x86/Kconfig.assembler               |   5 +
 arch/x86/include/asm/cpufeatures.h       |   2 +
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/asm/fpu/api.h           |   9 ++
 arch/x86/include/asm/fpu/types.h         |  16 ++-
 arch/x86/include/asm/fpu/xstate.h        |   6 +-
 arch/x86/include/asm/kvm_host.h          |   3 +-
 arch/x86/include/asm/vmx.h               |   8 ++
 arch/x86/include/uapi/asm/kvm.h          |   1 +
 arch/x86/include/uapi/asm/kvm_para.h     |   1 +
 arch/x86/kernel/cpu/common.c             |  35 +++++--
 arch/x86/kernel/cpu/cpuid-deps.c         |   1 +
 arch/x86/kernel/fpu/core.c               |  19 ++++
 arch/x86/kernel/fpu/xstate.c             |  90 ++++++++--------
 arch/x86/kvm/cpuid.c                     |  19 +++-
 arch/x86/kvm/cpuid.h                     |   6 ++
 arch/x86/kvm/smm.c                       |  20 ++++
 arch/x86/kvm/vmx/capabilities.h          |   4 +
 arch/x86/kvm/vmx/nested.c                |  29 +++++-
 arch/x86/kvm/vmx/vmcs12.c                |   6 ++
 arch/x86/kvm/vmx/vmcs12.h                |  14 ++-
 arch/x86/kvm/vmx/vmx.c                   | 124 ++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h                   |   6 +-
 arch/x86/kvm/x86.c                       | 122 ++++++++++++++++++++--
 arch/x86/kvm/x86.h                       |  47 ++++++++-
 26 files changed, 543 insertions(+), 82 deletions(-)


base-commit: 5c291b93e5d665380dbecc6944973583f9565ee5

Comments

Sean Christopherson June 15, 2023, 11:30 p.m. UTC | #1
On Thu, May 11, 2023, Yang Weijiang wrote:
> The last patch is introduced to support supervisor SHSTK but the feature is
> not enabled on Intel platform for now, the main purpose of this patch is to
> facilitate AMD folks to enable the feature.

I am beyond confused by the SDM's wording of CET_SSS.

First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
more appropriate phrasing).

  Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
  shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
  will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
  Architectures Software Developer’s Manual, Volume 1).

But then it says says VMMs shouldn't set the bit.

  When emulating the CPUID instruction, a virtual-machine monitor should return
  this bit as 0 if those pushes can cause VM exits.

Based on the Xen code (which is sadly a far better source of information than the
SDM), I *think* that what the SDM is trying to say is that VMMs should not set
CET_SS if VM-Exits can occur ***and*** the bit is not set in the host CPU.  Because
if the SDM really means "VMMs should never set the bit", then what on earth is the
point of the bit.

> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
> doesn't fully support CET supervisor SHSTK, the enabling work is left for
> the future.

Why?  If my interpretation of the SDM is correct, then all the pieces are there.

> Executed all KVM-unit-test cases and KVM selftests against this series, all
> test cases passed except the vmx test, the failure is due to CR4_CET bit
> testing in test_vmxon_bad_cr(). After add CR4_CET bit to skip list, the test
> passed. I'll send a patch to fix this issue later.

Your cover letter from v2 back in April said the same thing.  Why hasn't the patch
been posted?  And what exactly is the issue?  IIUC, setting CR4.CET with
MSR_IA32_S_CET=0 and MSR_IA32_U_CET=0 should be a nop, which suggests that there's
a KVM bug.  And if that's the case, the next obvious questions is, why are you
posting known buggy code?
Sean Christopherson June 16, 2023, midnight UTC | #2
On Thu, Jun 15, 2023, Sean Christopherson wrote:
> Your cover letter from v2 back in April said the same thing.  Why hasn't the patch
> been posted?  And what exactly is the issue?  IIUC, setting CR4.CET with
> MSR_IA32_S_CET=0 and MSR_IA32_U_CET=0 should be a nop, which suggests that there's
> a KVM bug.  And if that's the case, the next obvious questions is, why are you
> posting known buggy code?

Ah, is the problem that the test doesn't set CR0.WP as required by CR4.CET=1?
Yang, Weijiang June 16, 2023, 1 a.m. UTC | #3
On 6/16/2023 8:00 AM, Sean Christopherson wrote:
> On Thu, Jun 15, 2023, Sean Christopherson wrote:
>> Your cover letter from v2 back in April said the same thing.  Why hasn't the patch
>> been posted?  And what exactly is the issue?  IIUC, setting CR4.CET with
>> MSR_IA32_S_CET=0 and MSR_IA32_U_CET=0 should be a nop, which suggests that there's
>> a KVM bug.  And if that's the case, the next obvious questions is, why are you
>> posting known buggy code?
> Ah, is the problem that the test doesn't set CR0.WP as required by CR4.CET=1?

Thanks for taking time to review this series!

Yes, due to CR0.WP bit is not set while CR4.CET is being set.

The check is imposed by patch-12.

I'll add the fixup patch together with next the version.
Yang, Weijiang June 16, 2023, 8:25 a.m. UTC | #4
On 6/16/2023 7:30 AM, Sean Christopherson wrote:
> On Thu, May 11, 2023, Yang Weijiang wrote:
>> The last patch is introduced to support supervisor SHSTK but the feature is
>> not enabled on Intel platform for now, the main purpose of this patch is to
>> facilitate AMD folks to enable the feature.
> I am beyond confused by the SDM's wording of CET_SSS.
>
> First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
> more appropriate phrasing).
>
>    Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
>    shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
>    will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
>    Architectures Software Developer’s Manual, Volume 1).
>
> But then it says says VMMs shouldn't set the bit.
>
>    When emulating the CPUID instruction, a virtual-machine monitor should return
>    this bit as 0 if those pushes can cause VM exits.
>
> Based on the Xen code (which is sadly a far better source of information than the
> SDM), I *think* that what the SDM is trying to say is that VMMs should not set
> CET_SS if VM-Exits can occur ***and*** the bit is not set in the host CPU.  Because
> if the SDM really means "VMMs should never set the bit", then what on earth is the
> point of the bit.

I need to double check for the vague description.

 From my understanding, on bare metal side, if the bit is 1, OS can 
enable SSS if pushes won't cause

page fault. But for VM case, it's not recommended(regardless of the bit 
state) to set the bit as vm-exits

caused by guest SSS pushes cannot be fully excluded.

In other word, the bit is mainly for bare metal guidance now.

>> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
>> doesn't fully support CET supervisor SHSTK, the enabling work is left for
>> the future.
> Why?  If my interpretation of the SDM is correct, then all the pieces are there.

My assumption is,  VM supervisor SHSTK depends bare metal kernel support 
as PL0_SSP MSR is

backed by XSAVES via IA32_XSS:bit12(CET_S), but this part of support is 
not there in Rick's native series.

And also based on above SDM description, I don't want to add the support 
blindly now.

> [...]
Sean Christopherson June 16, 2023, 5:56 p.m. UTC | #5
On Fri, Jun 16, 2023, Weijiang Yang wrote:
> 
> On 6/16/2023 7:30 AM, Sean Christopherson wrote:
> > On Thu, May 11, 2023, Yang Weijiang wrote:
> > > The last patch is introduced to support supervisor SHSTK but the feature is
> > > not enabled on Intel platform for now, the main purpose of this patch is to
> > > facilitate AMD folks to enable the feature.
> > I am beyond confused by the SDM's wording of CET_SSS.
> > 
> > First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
> > more appropriate phrasing).
> > 
> >    Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
> >    shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
> >    will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
> >    Architectures Software Developer’s Manual, Volume 1).
> > 
> > But then it says says VMMs shouldn't set the bit.
> > 
> >    When emulating the CPUID instruction, a virtual-machine monitor should return
> >    this bit as 0 if those pushes can cause VM exits.
> > 
> > Based on the Xen code (which is sadly a far better source of information than the
> > SDM), I *think* that what the SDM is trying to say is that VMMs should not set
> > CET_SS if VM-Exits can occur ***and*** the bit is not set in the host CPU.  Because
> > if the SDM really means "VMMs should never set the bit", then what on earth is the
> > point of the bit.
> 
> I need to double check for the vague description.
> 
> From my understanding, on bare metal side, if the bit is 1, OS can enable
> SSS if pushes won't cause page fault. But for VM case, it's not recommended
> (regardless of the bit state) to set the bit as vm-exits caused by guest SSS
> pushes cannot be fully excluded.
> 
> In other word, the bit is mainly for bare metal guidance now.
> 
> > > In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
> > > doesn't fully support CET supervisor SHSTK, the enabling work is left for
> > > the future.
> > Why?  If my interpretation of the SDM is correct, then all the pieces are there.

...

> And also based on above SDM description, I don't want to add the support
> blindly now.

*sigh*

I got filled in on the details offlist.

1) In the next version of this series, please rework it to reincorporate Supervisor
   Shadow Stack support into the main series, i.e. pretend Intel's implemenation
   isn't horribly flawed.  KVM can't guarantee that a VM-Exit won't occur, i.e.
   can't advertise CET_SS, but I want the baseline support to be implemented,
   otherwise the series as a whole is a big confusing mess with unanswered question
   left, right, and center.  And more importantly, architecturally SSS exists if
   X86_FEATURE_SHSTK is enumerated, i.e. the guest should be allowed to utilize
   SSS if it so chooses, with the obvious caveat that there's a non-zero chance
   the guest risks death by doing so.  Or if userspace can ensure no VM-Exit will
   occur, which is difficult but feasible (ignoring #MC), e.g. by statically
   partitioning memory, prefaulting all memory in guest firmware, and not dirty
   logging SSS pages.  In such an extreme setup, userspace can enumerate CET_SSS
   to the guest, and KVM should support that.
 
2) Add the below patch to document exactly why KVM doesn't advertise CET_SSS.
   While Intel is apparently ok with treating KVM developers like mushrooms, I
   am not.

---
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 16 Jun 2023 10:04:37 -0700
Subject: [PATCH] KVM: x86: Explicitly document that KVM must not advertise
 CET_SSS

Explicitly call out that KVM must NOT advertise CET_SSS to userspace,
i.e. must not tell userspace and thus the guest that it is safe for the
guest to enable Supervisor Shadow Stacks (SSS).

Intel's implementation of SSS is fatally flawed for virtualized
environments, as despite wording in the SDM that suggests otherwise,
Intel CPUs' handling of shadow stack switches are NOT fully atomic.  Only
the check-and-update of the supervisor shadow stack token's busy bit is
atomic.  Per the SDM:

  If the far CALL or event delivery pushes a stack frame after the token
  is acquired and any of the pushes causes a fault or VM exit, the
  processor will revert to the old shadow stack and the busy bit in the
  new shadow stack's token remains set.

Or more bluntly, any fault or VM-Exit that occurs when pushing to the
shadow stack after the busy bit is set is fatal to the kernel, i.e. to
the guest in KVM's case.  The (guest) kernel can protect itself against
faults, e.g. by ensuring that the shadow stack always has a valid mapping,
but a guest kernel obviously has no control over, or even knowledge of,
VM-Exits due to host activity.

To help software determine when it is safe to use SSS, Intel defined
CPUID.0x7.1.EDX bit (CET_SSS) and updated Intel CPUs to enumerate CET_SS,
i.e. bare metal Intel CPUs advertise to software that it is safe to enable
SSS.

  If CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated as 1, it is
  sufficient for an operating system to ensure that none of the pushes can
  cause a page fault.

But CET_SS also comes with an major caveat that is kinda sorta documented
in the SDM:

  When emulating the CPUID instruction, a virtual-machine monitor should
  return this bit as 0 if those pushes can cause VM exits.

In other words, CET_SSS (bit 18) does NOT enumerate that the underlying
CPU prevents VM-Exits, only that the environment in which the software is
running will not generate VM-Exits.  I.e. CET_SSS is a stopgap to stem the
bleeding and allow kernels to enable SSS, not an indication that the
underlying CPU is immune to the VM-Exit problem.

And unfortunately, KVM itself effectively has zero chance of ensuring that
a shadow stack switch can't trigger a VM-Exit, e.g. KVM zaps *all* SPTEs
when any memslot is deleted, enabling dirty logging write-protects SPTEs,
etc.  A sufficiently motivated userspace can, at least in theory, provide
a safe environment for SSS, e.g. by statically partitioning and
prefaulting (in guest firmware) all memory, disabling PML, never
write-protecting guest shadow stacks, etc.  But such a setup is far, far
beyond typical KVM deployments.

Note, AMD CPUs have a similar erratum, but AMD CPUs *DO* perform the full
shadow stack switch atomically so long as the stack is mapped WB and does
not cross a page boundary, i.e. a "normal" KVM setup and a well-behaved
guest play nice with SSS without additional shenanigans.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/cpuid.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 1e3ee96c879b..ecf4a68aaa08 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -658,7 +658,15 @@ void kvm_set_cpu_caps(void)
 	);
 
 	kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX,
-		F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI)
+		F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) |
+
+		/*
+		 * Do NOT advertise CET_SSS, i.e. do not tell userspace and the
+		 * guest that it is safe to use Supervisor Shadow Stacks under
+		 * KVM when running on Intel CPUs.  KVM itself cannot guarantee
+		 * that a VM-Exit won't occur during a shadow stack update.
+		 */
+		0 /* F(CET_SSS) */
 	);
 
 	kvm_cpu_cap_mask(CPUID_D_1_EAX,

base-commit: 9305c14847719870e9e08294034861360577ce08
--
Yang, Weijiang June 19, 2023, 6:41 a.m. UTC | #6
On 6/17/2023 1:56 AM, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Weijiang Yang wrote:
>> On 6/16/2023 7:30 AM, Sean Christopherson wrote:
>>> On Thu, May 11, 2023, Yang Weijiang wrote:
>>>> The last patch is introduced to support supervisor SHSTK but the feature is
>>>> not enabled on Intel platform for now, the main purpose of this patch is to
>>>> facilitate AMD folks to enable the feature.
>>> I am beyond confused by the SDM's wording of CET_SSS.
>>>
>>> First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
>>> more appropriate phrasing).
>>>
>>>     Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
>>>     shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
>>>     will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
>>>     Architectures Software Developer’s Manual, Volume 1).
>>>
>>> But then it says says VMMs shouldn't set the bit.
>>>
>>>     When emulating the CPUID instruction, a virtual-machine monitor should return
>>>     this bit as 0 if those pushes can cause VM exits.
>>>
>>> Based on the Xen code (which is sadly a far better source of information than the
>>> SDM), I *think* that what the SDM is trying to say is that VMMs should not set
>>> CET_SS if VM-Exits can occur ***and*** the bit is not set in the host CPU.  Because
>>> if the SDM really means "VMMs should never set the bit", then what on earth is the
>>> point of the bit.
>> I need to double check for the vague description.
>>
>>  From my understanding, on bare metal side, if the bit is 1, OS can enable
>> SSS if pushes won't cause page fault. But for VM case, it's not recommended
>> (regardless of the bit state) to set the bit as vm-exits caused by guest SSS
>> pushes cannot be fully excluded.
>>
>> In other word, the bit is mainly for bare metal guidance now.
>>
>>>> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
>>>> doesn't fully support CET supervisor SHSTK, the enabling work is left for
>>>> the future.
>>> Why?  If my interpretation of the SDM is correct, then all the pieces are there.
> ...
>
>> And also based on above SDM description, I don't want to add the support
>> blindly now.
> *sigh*
>
> I got filled in on the details offlist.
>
> 1) In the next version of this series, please rework it to reincorporate Supervisor
>     Shadow Stack support into the main series, i.e. pretend Intel's implemenation
>     isn't horribly flawed.

Let me make it clear, you want me to do two things:

1)Add Supervisor Shadow Stack  state support(i.e., XSS.bit12(CET_S)) 
into kernel so that host can

support guest Supervisor Shadow Stack MSRs in g/h FPU context switch.

2) Add Supervisor Shadow stack support into KVM part so that guest OS is 
able to use SSS with risk.

is it correct?

> KVM can't guarantee that a VM-Exit won't occur, i.e.
>     can't advertise CET_SS, but I want the baseline support to be implemented,
>     otherwise the series as a whole is a big confusing mess with unanswered question
>     left, right, and center.  And more importantly, architecturally SSS exists if
>     X86_FEATURE_SHSTK is enumerated, i.e. the guest should be allowed to utilize
>     SSS if it so chooses, with the obvious caveat that there's a non-zero chance
>     the guest risks death by doing so.  Or if userspace can ensure no VM-Exit will
>     occur, which is difficult but feasible (ignoring #MC), e.g. by statically
>     partitioning memory, prefaulting all memory in guest firmware, and not dirty
>     logging SSS pages.  In such an extreme setup, userspace can enumerate CET_SSS
>     to the guest, and KVM should support that.

Make sense, provide support but take risk on your own.

>   
> 2) Add the below patch to document exactly why KVM doesn't advertise CET_SSS.
>     While Intel is apparently ok with treating KVM developers like mushrooms, I
>     am not.

will add it, thanks a lot for detailed change logs!

>
> ---
> From: Sean Christopherson <seanjc@google.com>
> Date: Fri, 16 Jun 2023 10:04:37 -0700
> Subject: [PATCH] KVM: x86: Explicitly document that KVM must not advertise
>   CET_SSS
>
> Explicitly call out that KVM must NOT advertise CET_SSS to userspace,
> i.e. must not tell userspace and thus the guest that it is safe for the
> guest to enable Supervisor Shadow Stacks (SSS).
>
> Intel's implementation of SSS is fatally flawed for virtualized
> environments, as despite wording in the SDM that suggests otherwise,
> Intel CPUs' handling of shadow stack switches are NOT fully atomic.  Only
> the check-and-update of the supervisor shadow stack token's busy bit is
> atomic.  Per the SDM:
>
>    If the far CALL or event delivery pushes a stack frame after the token
>    is acquired and any of the pushes causes a fault or VM exit, the
>    processor will revert to the old shadow stack and the busy bit in the
>    new shadow stack's token remains set.
>
> Or more bluntly, any fault or VM-Exit that occurs when pushing to the
> shadow stack after the busy bit is set is fatal to the kernel, i.e. to
> the guest in KVM's case.  The (guest) kernel can protect itself against
> faults, e.g. by ensuring that the shadow stack always has a valid mapping,
> but a guest kernel obviously has no control over, or even knowledge of,
> VM-Exits due to host activity.
>
> To help software determine when it is safe to use SSS, Intel defined
> CPUID.0x7.1.EDX bit (CET_SSS) and updated Intel CPUs to enumerate CET_SS,
> i.e. bare metal Intel CPUs advertise to software that it is safe to enable
> SSS.
>
>    If CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated as 1, it is
>    sufficient for an operating system to ensure that none of the pushes can
>    cause a page fault.
>
> But CET_SS also comes with an major caveat that is kinda sorta documented
> in the SDM:
>
>    When emulating the CPUID instruction, a virtual-machine monitor should
>    return this bit as 0 if those pushes can cause VM exits.
>
> In other words, CET_SSS (bit 18) does NOT enumerate that the underlying
> CPU prevents VM-Exits, only that the environment in which the software is
> running will not generate VM-Exits.  I.e. CET_SSS is a stopgap to stem the
> bleeding and allow kernels to enable SSS, not an indication that the
> underlying CPU is immune to the VM-Exit problem.
>
> And unfortunately, KVM itself effectively has zero chance of ensuring that
> a shadow stack switch can't trigger a VM-Exit, e.g. KVM zaps *all* SPTEs
> when any memslot is deleted, enabling dirty logging write-protects SPTEs,
> etc.  A sufficiently motivated userspace can, at least in theory, provide
> a safe environment for SSS, e.g. by statically partitioning and
> prefaulting (in guest firmware) all memory, disabling PML, never
> write-protecting guest shadow stacks, etc.  But such a setup is far, far
> beyond typical KVM deployments.
>
> Note, AMD CPUs have a similar erratum, but AMD CPUs *DO* perform the full
> shadow stack switch atomically so long as the stack is mapped WB and does
> not cross a page boundary, i.e. a "normal" KVM setup and a well-behaved
> guest play nice with SSS without additional shenanigans.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/cpuid.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1e3ee96c879b..ecf4a68aaa08 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -658,7 +658,15 @@ void kvm_set_cpu_caps(void)
>   	);
>   
>   	kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX,
> -		F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI)
> +		F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) |
> +
> +		/*
> +		 * Do NOT advertise CET_SSS, i.e. do not tell userspace and the
> +		 * guest that it is safe to use Supervisor Shadow Stacks under
> +		 * KVM when running on Intel CPUs.  KVM itself cannot guarantee
> +		 * that a VM-Exit won't occur during a shadow stack update.
> +		 */
> +		0 /* F(CET_SSS) */
>   	);
>   
>   	kvm_cpu_cap_mask(CPUID_D_1_EAX,
>
> base-commit: 9305c14847719870e9e08294034861360577ce08
Sean Christopherson June 23, 2023, 8:51 p.m. UTC | #7
On Mon, Jun 19, 2023, Weijiang Yang wrote:
> 
> On 6/17/2023 1:56 AM, Sean Christopherson wrote:
> > On Fri, Jun 16, 2023, Weijiang Yang wrote:
> > > On 6/16/2023 7:30 AM, Sean Christopherson wrote:
> > > > On Thu, May 11, 2023, Yang Weijiang wrote:
> > > > > The last patch is introduced to support supervisor SHSTK but the feature is
> > > > > not enabled on Intel platform for now, the main purpose of this patch is to
> > > > > facilitate AMD folks to enable the feature.
> > > > I am beyond confused by the SDM's wording of CET_SSS.
> > > > 
> > > > First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
> > > > more appropriate phrasing).
> > > > 
> > > >     Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
> > > >     shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
> > > >     will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
> > > >     Architectures Software Developer’s Manual, Volume 1).
> > > > 
> > > > But then it says says VMMs shouldn't set the bit.
> > > > 
> > > >     When emulating the CPUID instruction, a virtual-machine monitor should return
> > > >     this bit as 0 if those pushes can cause VM exits.
> > > > 
> > > > Based on the Xen code (which is sadly a far better source of information than the
> > > > SDM), I *think* that what the SDM is trying to say is that VMMs should not set
> > > > CET_SS if VM-Exits can occur ***and*** the bit is not set in the host CPU.  Because
> > > > if the SDM really means "VMMs should never set the bit", then what on earth is the
> > > > point of the bit.
> > > I need to double check for the vague description.
> > > 
> > >  From my understanding, on bare metal side, if the bit is 1, OS can enable
> > > SSS if pushes won't cause page fault. But for VM case, it's not recommended
> > > (regardless of the bit state) to set the bit as vm-exits caused by guest SSS
> > > pushes cannot be fully excluded.
> > > 
> > > In other word, the bit is mainly for bare metal guidance now.
> > > 
> > > > > In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
> > > > > doesn't fully support CET supervisor SHSTK, the enabling work is left for
> > > > > the future.
> > > > Why?  If my interpretation of the SDM is correct, then all the pieces are there.
> > ...
> > 
> > > And also based on above SDM description, I don't want to add the support
> > > blindly now.
> > *sigh*
> > 
> > I got filled in on the details offlist.
> > 
> > 1) In the next version of this series, please rework it to reincorporate Supervisor
> >     Shadow Stack support into the main series, i.e. pretend Intel's implemenation
> >     isn't horribly flawed.
> 
> Let me make it clear, you want me to do two things:
> 
> 1)Add Supervisor Shadow Stack  state support(i.e., XSS.bit12(CET_S)) into
> kernel so that host can support guest Supervisor Shadow Stack MSRs in g/h FPU
> context switch.

If that's necessary for correct functionality, yes.

> 2) Add Supervisor Shadow stack support into KVM part so that guest OS is
> able to use SSS with risk.

Yes.  Architecturally, if KVM advertises X86_FEATURE_SHSTK, then KVM needs to
provide both User and Supervisor support.  CET_SSS doesn't change the architecture,
it's little more than a hint.  And even if the guest follows SDM's recommendation
to not enable shadow stacks, a clever kernel can still utilize SSS assets, e.g. use
the MSRs as scratch registers.
Yang, Weijiang June 26, 2023, 6:46 a.m. UTC | #8
On 6/24/2023 4:51 AM, Sean Christopherson wrote:
> On Mon, Jun 19, 2023, Weijiang Yang wrote:
>> On 6/17/2023 1:56 AM, Sean Christopherson wrote:
>>> On Fri, Jun 16, 2023, Weijiang Yang wrote:
>>>> On 6/16/2023 7:30 AM, Sean Christopherson wrote:
>>>>> On Thu, May 11, 2023, Yang Weijiang wrote:
>>>>>> The last patch is introduced to support supervisor SHSTK but the feature is
>>>>>> not enabled on Intel platform for now, the main purpose of this patch is to
>>>>>> facilitate AMD folks to enable the feature.
>>>>> I am beyond confused by the SDM's wording of CET_SSS.
>>>>>
>>>>> First, it says that CET_SSS says the CPU isn't buggy (or maybe "less buggy" is
>>>>> more appropriate phrasing).
>>>>>
>>>>>      Bit 18: CET_SSS. If 1, indicates that an operating system can enable supervisor
>>>>>      shadow stacks as long as it ensures that certain supervisor shadow-stack pushes
>>>>>      will not cause page faults (see Section 17.2.3 of the Intel® 64 and IA-32
>>>>>      Architectures Software Developer’s Manual, Volume 1).
>>>>>
>>>>> But then it says says VMMs shouldn't set the bit.
>>>>>
>>>>>      When emulating the CPUID instruction, a virtual-machine monitor should return
>>>>>      this bit as 0 if those pushes can cause VM exits.
>>>>>
>>>>> Based on the Xen code (which is sadly a far better source of information than the
>>>>> SDM), I *think* that what the SDM is trying to say is that VMMs should not set
>>>>> CET_SS if VM-Exits can occur ***and*** the bit is not set in the host CPU.  Because
>>>>> if the SDM really means "VMMs should never set the bit", then what on earth is the
>>>>> point of the bit.
>>>> I need to double check for the vague description.
>>>>
>>>>   From my understanding, on bare metal side, if the bit is 1, OS can enable
>>>> SSS if pushes won't cause page fault. But for VM case, it's not recommended
>>>> (regardless of the bit state) to set the bit as vm-exits caused by guest SSS
>>>> pushes cannot be fully excluded.
>>>>
>>>> In other word, the bit is mainly for bare metal guidance now.
>>>>
>>>>>> In summary, this new series enables CET user SHSTK/IBT and kernel IBT, but
>>>>>> doesn't fully support CET supervisor SHSTK, the enabling work is left for
>>>>>> the future.
>>>>> Why?  If my interpretation of the SDM is correct, then all the pieces are there.
>>> ...
>>>
>>>> And also based on above SDM description, I don't want to add the support
>>>> blindly now.
>>> *sigh*
>>>
>>> I got filled in on the details offlist.
>>>
>>> 1) In the next version of this series, please rework it to reincorporate Supervisor
>>>      Shadow Stack support into the main series, i.e. pretend Intel's implemenation
>>>      isn't horribly flawed.
>> Let me make it clear, you want me to do two things:
>>
>> 1)Add Supervisor Shadow Stack  state support(i.e., XSS.bit12(CET_S)) into
>> kernel so that host can support guest Supervisor Shadow Stack MSRs in g/h FPU
>> context switch.
> If that's necessary for correct functionality, yes.
>
>> 2) Add Supervisor Shadow stack support into KVM part so that guest OS is
>> able to use SSS with risk.
> Yes.  Architecturally, if KVM advertises X86_FEATURE_SHSTK, then KVM needs to
> provide both User and Supervisor support.  CET_SSS doesn't change the architecture,
> it's little more than a hint.  And even if the guest follows SDM's recommendation
> to not enable shadow stacks, a clever kernel can still utilize SSS assets, e.g. use
> the MSRs as scratch registers.

Understood, thanks!
Yang, Weijiang July 10, 2023, 12:28 a.m. UTC | #9
> *sigh*
>
> I got filled in on the details offlist.
>
> 1) In the next version of this series, please rework it to reincorporate Supervisor
>     Shadow Stack support into the main series, i.e. pretend Intel's implemenation
>     isn't horribly flawed.  KVM can't guarantee that a VM-Exit won't occur, i.e.
>     can't advertise CET_SS, but I want the baseline support to be implemented,
>     otherwise the series as a whole is a big confusing mess with unanswered question
>     left, right, and center.  And more importantly, architecturally SSS exists if
>     X86_FEATURE_SHSTK is enumerated, i.e. the guest should be allowed to utilize
>     SSS if it so chooses, with the obvious caveat that there's a non-zero chance
>     the guest risks death by doing so.  Or if userspace can ensure no VM-Exit will
>     occur, which is difficult but feasible (ignoring #MC), e.g. by statically
>     partitioning memory, prefaulting all memory in guest firmware, and not dirty
>     logging SSS pages.  In such an extreme setup, userspace can enumerate CET_SSS
>     to the guest, and KVM should support that.
>   
> 2) Add the below patch to document exactly why KVM doesn't advertise CET_SSS.
>     While Intel is apparently ok with treating KVM developers like mushrooms, I
>     am not.
>
> ---
> From: Sean Christopherson<seanjc@google.com>
> Date: Fri, 16 Jun 2023 10:04:37 -0700
> Subject: [PATCH] KVM: x86: Explicitly document that KVM must not advertise
>   CET_SSS
>
> Explicitly call out that KVM must NOT advertise CET_SSS to userspace,
> i.e. must not tell userspace and thus the guest that it is safe for the
> guest to enable Supervisor Shadow Stacks (SSS).
>
> Intel's implementation of SSS is fatally flawed for virtualized
> environments, as despite wording in the SDM that suggests otherwise,
> Intel CPUs' handling of shadow stack switches are NOT fully atomic.  Only
> the check-and-update of the supervisor shadow stack token's busy bit is
> atomic.  Per the SDM:
>
>    If the far CALL or event delivery pushes a stack frame after the token
>    is acquired and any of the pushes causes a fault or VM exit, the
>    processor will revert to the old shadow stack and the busy bit in the
>    new shadow stack's token remains set.
>
> Or more bluntly, any fault or VM-Exit that occurs when pushing to the
> shadow stack after the busy bit is set is fatal to the kernel, i.e. to
> the guest in KVM's case.  The (guest) kernel can protect itself against
> faults, e.g. by ensuring that the shadow stack always has a valid mapping,
> but a guest kernel obviously has no control over, or even knowledge of,
> VM-Exits due to host activity.
>
> To help software determine when it is safe to use SSS, Intel defined
> CPUID.0x7.1.EDX bit (CET_SSS) and updated Intel CPUs to enumerate CET_SS,
> i.e. bare metal Intel CPUs advertise to software that it is safe to enable
> SSS.
>
>    If CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated as 1, it is
>    sufficient for an operating system to ensure that none of the pushes can
>    cause a page fault.
>
> But CET_SS also comes with an major caveat that is kinda sorta documented
> in the SDM:
>
>    When emulating the CPUID instruction, a virtual-machine monitor should
>    return this bit as 0 if those pushes can cause VM exits.
>
> In other words, CET_SSS (bit 18) does NOT enumerate that the underlying
> CPU prevents VM-Exits, only that the environment in which the software is
> running will not generate VM-Exits.  I.e. CET_SSS is a stopgap to stem the
> bleeding and allow kernels to enable SSS, not an indication that the
> underlying CPU is immune to the VM-Exit problem.
>
> And unfortunately, KVM itself effectively has zero chance of ensuring that
> a shadow stack switch can't trigger a VM-Exit, e.g. KVM zaps *all* SPTEs
> when any memslot is deleted, enabling dirty logging write-protects SPTEs,
> etc.  A sufficiently motivated userspace can, at least in theory, provide
> a safe environment for SSS, e.g. by statically partitioning and
> prefaulting (in guest firmware) all memory, disabling PML, never
> write-protecting guest shadow stacks, etc.  But such a setup is far, far
> beyond typical KVM deployments.
>
> Note, AMD CPUs have a similar erratum, but AMD CPUs *DO* perform the full
> shadow stack switch atomically so long as the stack is mapped WB and does
> not cross a page boundary, i.e. a "normal" KVM setup and a well-behaved
> guest play nice with SSS without additional shenanigans.
>
> Signed-off-by: Sean Christopherson<seanjc@google.com>
> ---
>   arch/x86/kvm/cpuid.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1e3ee96c879b..ecf4a68aaa08 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -658,7 +658,15 @@ void kvm_set_cpu_caps(void)
>   	);
>   
>   	kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX,
> -		F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI)
> +		F(AVX_VNNI_INT8) | F(AVX_NE_CONVERT) | F(PREFETCHITI) |
> +
> +		/*
> +		 * Do NOT advertise CET_SSS, i.e. do not tell userspace and the
> +		 * guest that it is safe to use Supervisor Shadow Stacks under
> +		 * KVM when running on Intel CPUs.  KVM itself cannot guarantee
> +		 * that a VM-Exit won't occur during a shadow stack update.
> +		 */
> +		0 /* F(CET_SSS) */
>   	);
>   
>   	kvm_cpu_cap_mask(CPUID_D_1_EAX,
>
> base-commit: 9305c14847719870e9e08294034861360577ce08

Hi, Sean,

Gil reminded me SDM has been updated CET SSS related topics 
recently(June release):

======================================================================

Section 17.2.3 (Supervisor Shadow Stack Token) in Volume 1 of the SDM:
     If the far CALL or event delivery pushes a stack frame after the 
token is acquired and any of the pushes causes a
     fault or VM exit, the processor will revert to the old shadow stack 
and the busy bit in the new shadow stack's token
     remains set. The new shadow stack is said to be prematurely busy. 
Software should enable supervisor shadow
     stacks only if it is certain that this situation cannot occur. If 
CPUID.(EAX=07H,ECX=1H):EDX[bit 18] is enumerated
     as 1, it is sufficient for an operating system to ensure that none 
of the pushes can cause a page fault.

Volume 2A: CPUID.(EAX=07H,ECX=1H):EDX[bit 18] as follows:
     CET_SSS. If 1, indicates that an operating system can enable 
supervisor shadow stacks as long as
     it ensures that a supervisor shadow stack cannot become prematurely 
busy due to page faults (see Section
     17.2.3 of the Intel® 64 and IA-32 Architectures Software 
Developer’s Manual, Volume 1). When
     emulating the CPUID instruction, a virtual-machine monitor (VMM) 
should return this bit as 1 only if it
     ensures that VM exits cannot cause a guest supervisor shadow stack 
to appear to be prematurely busy.
     Such a VMM could set the “prematurely busy shadow stack” VM-exit 
control and use the additional information
     that it provides.

Volume 3C: new “prematurely busy shadow stack” VM-exit control.

========================================================================

And Gil told me additional information was planed to be released later 
in the summer.

Maybe you need modify above changelog a bit per the update.

Given the updated parts are technical forecast, I don't plan to 
implement it in this series and still enumerate

CET_SSS ==0 for guest. What's your thoughts?
Sean Christopherson July 10, 2023, 10:18 p.m. UTC | #10
On Mon, Jul 10, 2023, Weijiang Yang wrote:
> Maybe you need modify above changelog a bit per the update.

Ya, I'll make sure the changelog gets updated before CET support is merged, though
feel free to post the next version without waiting for new changelog.

> Given the updated parts are technical forecast, I don't plan to implement it
> in this series and still enumerate
> 
> CET_SSS ==0 for guest. What's your thoughts?

Yes, definitely punt shadow-stack fixup to future enabling work.
Yang, Weijiang July 11, 2023, 1:24 a.m. UTC | #11
On 7/11/2023 6:18 AM, Sean Christopherson wrote:
> On Mon, Jul 10, 2023, Weijiang Yang wrote:
>> Maybe you need modify above changelog a bit per the update.
> Ya, I'll make sure the changelog gets updated before CET support is merged, though
> feel free to post the next version without waiting for new changelog.

Sure, thanks!

>> Given the updated parts are technical forecast, I don't plan to implement it
>> in this series and still enumerate
>>
>> CET_SSS ==0 for guest. What's your thoughts?
> Yes, definitely punt shadow-stack fixup to future enabling work.
Got it.
Yang, Weijiang July 17, 2023, 7:44 a.m. UTC | #12
On 6/24/2023 4:51 AM, Sean Christopherson wrote:
> On Mon, Jun 19, 2023, Weijiang Yang wrote:
>> On 6/17/2023 1:56 AM, Sean Christopherson wrote:
>>> On Fri, Jun 16, 2023, Weijiang Yang wrote:
>>>> On 6/16/2023 7:30 AM, Sean Christopherson wrote:
>>>>> On Thu, May 11, 2023, Yang Weijiang wrote:
>>>>>
[...]
>> Let me make it clear, you want me to do two things:
>>
>> 1)Add Supervisor Shadow Stack  state support(i.e., XSS.bit12(CET_S)) into
>> kernel so that host can support guest Supervisor Shadow Stack MSRs in g/h FPU
>> context switch.
> If that's necessary for correct functionality, yes.

Hi, Sean,

I held off posting the new version and want to sync up with you on this 
point to avoid

surprising you.

After discussed adding the patch in kernel with Rick and Chao, we got 
blow conclusions

on doing so:

the Pros:
  - Super easy to implement for KVM.
  - Automatically avoids saving and restoring this data when the vmexit
    is handled within KVM.

the Cons:
  - Unnecessarily restores XFEATURE_CET_KERNEL when switching to
    non-KVM task's userspace.
  - Forces allocating space for this state on all tasks, whether or not
    they use KVM, and with likely zero users today and the near future.
  - Complicates the FPU optimization thinking by including things that
    can have no affect on userspace in the FPU

Given above reasons, I implemented guest CET supervisor states management

in KVM instead of adding a kernel patch for it.

Below are 3 KVM patches to support it:

Patch 1: Save/reload guest CET supervisor states when necessary:

=======================================================================

commit 16147ede75dee29583b7d42a6621d10d55b63595
Author: Yang Weijiang <weijiang.yang@intel.com>
Date:   Tue Jul 11 02:26:17 2023 -0400

     KVM:x86: Make guest supervisor states as non-XSAVE managed

     Save and reload guest CET supervisor states, i.e.,PL{0,1,2}_SSP,
     when vCPU context is being swapped before and after userspace
     <->kernel entry, also do the same operation when vCPU is sched-in
     or sched-out.

     Enabling CET supervisor state management only in KVM due to:
     1) Currently, suervisor SHSTK is not enabled on host side, only
     KVM needs to care about the guest's suervisor SHSTK states.
     2) Enabling them in kernel FPU state framework has global effects
     to all threads on host kernel, but the majority of the threads
     are free to CET supervisor states. And it requires additional
     storage size of thread FPU state area.

     Add a new helper kvm_arch_sched_out() for that purpose. Adding
     the support in kvm_arch_vcpu_put/load() without the new helper
     looks possible, but the put/load functions are also called in
     vcpu_put()/load(), the latter are heavily used in KVM, so adding
     new helper makes the implementation clearer.

     Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 7e7e19ef6993..98235cb3d258 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1023,6 +1023,7 @@ void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu);

  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
  void kvm_arm_init_debug(void);
  void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
diff --git a/arch/mips/include/asm/kvm_host.h 
b/arch/mips/include/asm/kvm_host.h
index 957121a495f0..56c5e85ba5a3 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -893,6 +893,7 @@ static inline void kvm_arch_free_memslot(struct kvm 
*kvm,
                                          struct kvm_memory_slot *slot) {}
  static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
  static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
  static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..11587d953bf6 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -880,6 +880,7 @@ static inline void kvm_arch_sync_events(struct kvm 
*kvm) {}
  static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
  static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
  static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
  static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}

diff --git a/arch/riscv/include/asm/kvm_host.h 
b/arch/riscv/include/asm/kvm_host.h
index ee0acccb1d3b..6ff4a04fe0f2 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -244,6 +244,7 @@ struct kvm_vcpu_arch {

  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
  #define KVM_ARCH_WANT_MMU_NOTIFIER

diff --git a/arch/s390/include/asm/kvm_host.h 
b/arch/s390/include/asm/kvm_host.h
index 2bbc3d54959d..d1750a6a86cf 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -1033,6 +1033,7 @@ extern int kvm_s390_gisc_unregister(struct kvm 
*kvm, u32 gisc);

  static inline void kvm_arch_sync_events(struct kvm *kvm) {}
  static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
  static inline void kvm_arch_free_memslot(struct kvm *kvm,
                                          struct kvm_memory_slot *slot) {}
  static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e2c549f147a5..7d9cfb7e2fe8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu 
*vcpu)
         trace_kvm_fpu(0);
  }

+static void kvm_save_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
+{
+       preempt_disable();
+       if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
+               rdmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
+               rdmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
+               rdmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
+               wrmsrl(MSR_IA32_PL0_SSP, 0);
+               wrmsrl(MSR_IA32_PL1_SSP, 0);
+               wrmsrl(MSR_IA32_PL2_SSP, 0);
+       }
+       preempt_enable();
+}
+
+static void kvm_reload_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
+{
+       preempt_disable();
+       if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
+               wrmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
+               wrmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
+               wrmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
+       }
+       preempt_enable();
+}
+
  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
  {
         struct kvm_queued_exception *ex = &vcpu->arch.exception;
@@ -11222,6 +11247,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
         kvm_sigset_activate(vcpu);
         kvm_run->flags = 0;
         kvm_load_guest_fpu(vcpu);
+       kvm_reload_cet_supervisor_ssp(vcpu);

         kvm_vcpu_srcu_read_lock(vcpu);
         if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
@@ -11310,6 +11336,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
         r = vcpu_run(vcpu);

  out:
+       kvm_save_cet_supervisor_ssp(vcpu);
         kvm_put_guest_fpu(vcpu);
         if (kvm_run->kvm_valid_regs)
                 store_regs(vcpu);
@@ -12398,9 +12425,17 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, 
int cpu)
                 pmu->need_cleanup = true;
                 kvm_make_request(KVM_REQ_PMU, vcpu);
         }
+
+       kvm_reload_cet_supervisor_ssp(vcpu);
+
         static_call(kvm_x86_sched_in)(vcpu, cpu);
  }
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu)
+{
+       kvm_save_cet_supervisor_ssp(vcpu);
+}
+
  void kvm_arch_free_vm(struct kvm *kvm)
  {
         kfree(to_kvm_hv(kvm)->hv_pa_pg);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d90331f16db1..b3032a5f0641 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1423,6 +1423,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct 
kvm_vcpu *vcpu,
  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu);

  void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu);

  void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 66c1447d3c7f..42f28e8905e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5885,6 +5885,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
  {
         struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

+       kvm_arch_sched_out(vcpu, 0);
         if (current->on_rq) {
                 WRITE_ONCE(vcpu->preempted, true);
                 WRITE_ONCE(vcpu->ready, true);

Patch 2: optimization patch for above one:

===================================================================

commit ae5fe7c81cc3b93193758d1b7b4ab74a92a51dad
Author: Yang Weijiang <weijiang.yang@intel.com>
Date:   Fri Jul 14 20:03:52 2023 -0400

     KVM:x86: Optimize CET supervisor SSP save/reload

     Make PL{0,1,2}_SSP as write-intercepted to detect whether
     guest is using these MSRs. Disable intercept to the MSRs
     if they're written with non-zero values. KVM does save/
     reload for the MSRs only if they're used by guest.

     Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>

diff --git a/arch/x86/include/asm/kvm_host.h 
b/arch/x86/include/asm/kvm_host.h
index 69cbc9d9b277..c50b555234fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -748,6 +748,7 @@ struct kvm_vcpu_arch {
         bool tpr_access_reporting;
         bool xsaves_enabled;
         bool xfd_no_write_intercept;
+       bool cet_sss_active;
         u64 ia32_xss;
         u64 microcode_version;
         u64 arch_capabilities;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 90ce1c7d3fd7..21c89d200c88 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2156,6 +2156,18 @@ static u64 vmx_get_supported_debugctl(struct 
kvm_vcpu *vcpu, bool host_initiated
         return debugctl;
  }

+static void vmx_disable_write_intercept_sss_msr(struct kvm_vcpu *vcpu)
+{
+       if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
+               vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
+                               MSR_TYPE_RW, false);
+               vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
+                               MSR_TYPE_RW, false);
+               vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
+                               MSR_TYPE_RW, false);
+       }
+}
+
  /*
   * Writes msr value into the appropriate "register".
   * Returns 0 on success, non-0 otherwise.
@@ -2427,7 +2439,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, 
struct msr_data *msr_info)
  #define VMX_CET_CONTROL_MASK   (~GENMASK_ULL(9,6))
  #define LEG_BITMAP_BASE(data)  ((data) >> 12)
         case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
-               return kvm_set_msr_common(vcpu, msr_info);
+               if (kvm_set_msr_common(vcpu, msr_info))
+                       return 1;
+               /*
+                * Write to the base SSP MSRs should happen ahead of 
toggling
+                * of IA32_S_CET.SH_STK_EN bit.
+                */
+               if (!msr_info->host_initiated &&
+                   msr_index != MSR_IA32_PL3_SSP && data) {
+                       vmx_disable_write_intercept_sss_msr(vcpu);
+                       wrmsrl(msr_index, data);
+               }
                 break;
         case MSR_IA32_U_CET:
         case MSR_IA32_S_CET:
@@ -7773,12 +7795,17 @@ static void 
vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
                                 MSR_TYPE_RW, false);
                 vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
                                 MSR_TYPE_RW, false);
+               /*
+                * Supervisor shadow stack MSRs are intercepted until
+                * they're written by guest, this is designed to
+                * optimize the save/restore overhead.
+                */
                 vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
-                               MSR_TYPE_RW, false);
+                               MSR_TYPE_R, false);
                 vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
-                               MSR_TYPE_RW, false);
+                               MSR_TYPE_R, false);
                 vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
-                               MSR_TYPE_RW, false);
+                               MSR_TYPE_R, false);
                 vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cab31dbb2bec..06dc5111da3b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4049,8 +4049,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, 
struct msr_data *msr_info)
                 if (!IS_ALIGNED(data, 4))
                         return 1;
                 if (msr == MSR_IA32_PL0_SSP || msr == MSR_IA32_PL1_SSP ||
-                   msr == MSR_IA32_PL2_SSP)
+                   msr == MSR_IA32_PL2_SSP) {
+                       if (!msr_info->host_initiated && data)
+                               vcpu->arch.cet_sss_active = true;
                         vcpu->arch.cet_s_ssp[msr - MSR_IA32_PL0_SSP] = 
data;
+               }
                 else if (msr == MSR_IA32_PL3_SSP)
                         kvm_set_xsave_msr(msr_info);
                 break;
@@ -11250,7 +11253,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
         kvm_sigset_activate(vcpu);
         kvm_run->flags = 0;
         kvm_load_guest_fpu(vcpu);
-       kvm_reload_cet_supervisor_ssp(vcpu);
+       if (vcpu->arch.cet_sss_active)
+               kvm_reload_cet_supervisor_ssp(vcpu);

         kvm_vcpu_srcu_read_lock(vcpu);
         if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
@@ -11339,7 +11343,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
         r = vcpu_run(vcpu);

  out:
-       kvm_save_cet_supervisor_ssp(vcpu);
+       if (vcpu->arch.cet_sss_active)
+               kvm_save_cet_supervisor_ssp(vcpu);
         kvm_put_guest_fpu(vcpu);
         if (kvm_run->kvm_valid_regs)
                 store_regs(vcpu);
         kvm_vcpu_srcu_read_lock(vcpu);
         if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
@@ -11339,7 +11343,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
         r = vcpu_run(vcpu);

  out:
-       kvm_save_cet_supervisor_ssp(vcpu);
+       if (vcpu->arch.cet_sss_active)
+               kvm_save_cet_supervisor_ssp(vcpu);
         kvm_put_guest_fpu(vcpu);
         if (kvm_run->kvm_valid_regs)
                 store_regs(vcpu);
@@ -12428,15 +12433,16 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, 
int cpu)
                 pmu->need_cleanup = true;
                 kvm_make_request(KVM_REQ_PMU, vcpu);
         }
-
-       kvm_reload_cet_supervisor_ssp(vcpu);
+       if (vcpu->arch.cet_sss_active)
+               kvm_reload_cet_supervisor_ssp(vcpu);

         static_call(kvm_x86_sched_in)(vcpu, cpu);
  }

  void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu)
  {
-       kvm_save_cet_supervisor_ssp(vcpu);
+       if (vcpu->arch.cet_sss_active)
+               kvm_save_cet_supervisor_ssp(vcpu);
  }

  void kvm_arch_free_vm(struct kvm *kvm)

=============================================================

Patch 3: support guest CET supervisor xstate bit:

commit 2708b3c959db56fb9243f9a157884c2120b8810c
Author: Yang Weijiang <weijiang.yang@intel.com>
Date:   Sat Jul 15 20:56:37 2023 -0400

     KVM:x86: Enable guest CET supervisor xstate bit support

     Add S_CET bit in kvm_caps.supported_xss so that guest can enumerate
     the feature in CPUID(0xd,1).ECX.

     Guest S_CET xstate bit is specially handled, i.e., it can be exposed
     without related enabling on host side, because KVM manually 
saves/reloads
     guest supervisor SHSTK SSPs and current XSS swap logic for 
host/guest aslo
     supports doing so, thus it's safe to enable the bit without host 
support.

     Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2653e5eb54ee..071bcdedc530 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -228,7 +228,8 @@ static struct kvm_user_return_msrs __percpu 
*user_return_msrs;
                                 | XFEATURE_MASK_BNDCSR | 
XFEATURE_MASK_AVX512 \
                                 | XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)

-#define KVM_SUPPORTED_XSS      (XFEATURE_MASK_CET_USER)
+#define KVM_SUPPORTED_XSS      (XFEATURE_MASK_CET_USER | \
+                                XFEATURE_MASK_CET_KERNEL)

  u64 __read_mostly host_efer;
  EXPORT_SYMBOL_GPL(host_efer);
@@ -9639,6 +9640,7 @@ static int __kvm_x86_vendor_init(struct 
kvm_x86_init_ops *ops)
         if (boot_cpu_has(X86_FEATURE_XSAVES)) {
                 rdmsrl(MSR_IA32_XSS, host_xss);
                 kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;
+               kvm_caps.supported_xss |= XFEATURE_MASK_CET_KERNEL;
         }

         kvm_init_pmu_capability(ops->pmu_ops);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f8f042c91728..df187d7c3e74 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -362,7 +362,7 @@ static inline bool kvm_mpx_supported(void)
                 == (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
  }

-#define CET_XSTATE_MASK (XFEATURE_MASK_CET_USER)
+#define CET_XSTATE_MASK (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)
  /*
   * Shadow Stack and Indirect Branch Tracking feature enabling depends on
   * whether host side CET user xstate bit is supported or not.

=================================================================

What's your thoughts on the solution? Is it appropriate for KVM?

Thanks!

[...]
Sean Christopherson July 19, 2023, 7:41 p.m. UTC | #13
On Mon, Jul 17, 2023, Weijiang Yang wrote:
> 
> On 6/24/2023 4:51 AM, Sean Christopherson wrote:
> > > 1)Add Supervisor Shadow Stack� state support(i.e., XSS.bit12(CET_S)) into
> > > kernel so that host can support guest Supervisor Shadow Stack MSRs in g/h FPU
> > > context switch.
> > If that's necessary for correct functionality, yes.

...

> the Pros:
> �- Super easy to implement for KVM.
> �- Automatically avoids saving and restoring this data when the vmexit
> �� is handled within KVM.
> 
> the Cons:
> �- Unnecessarily restores XFEATURE_CET_KERNEL when switching to
> �� non-KVM task's userspace.
> �- Forces allocating space for this state on all tasks, whether or not
> �� they use KVM, and with likely zero users today and the near future.
> �- Complicates the FPU optimization thinking by including things that
> �� can have no affect on userspace in the FPU
> 
> Given above reasons, I implemented guest CET supervisor states management
> in KVM instead of adding a kernel patch for it.
> 
> Below are 3 KVM patches to support it:
> 
> Patch 1: Save/reload guest CET supervisor states when necessary:
> 
> =======================================================================
> 
> commit 16147ede75dee29583b7d42a6621d10d55b63595
> Author: Yang Weijiang <weijiang.yang@intel.com>
> Date:�� Tue Jul 11 02:26:17 2023 -0400
> 
> ��� KVM:x86: Make guest supervisor states as non-XSAVE managed
> 
> ��� Save and reload guest CET supervisor states, i.e.,PL{0,1,2}_SSP,
> ��� when vCPU context is being swapped before and after userspace
> ��� <->kernel entry, also do the same operation when vCPU is sched-in
> ��� or sched-out.

...

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e2c549f147a5..7d9cfb7e2fe8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu
> *vcpu)
> ������� trace_kvm_fpu(0);
> �}
> 
> +static void kvm_save_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
> +{
> +������ preempt_disable();
> +������ if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
> +�������������� rdmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
> +�������������� rdmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
> +�������������� rdmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
> +�������������� wrmsrl(MSR_IA32_PL0_SSP, 0);
> +�������������� wrmsrl(MSR_IA32_PL1_SSP, 0);
> +�������������� wrmsrl(MSR_IA32_PL2_SSP, 0);
> +������ }
> +������ preempt_enable();
> +}
> +
> +static void kvm_reload_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
> +{
> +������ preempt_disable();
> +������ if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
> +�������������� wrmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
> +�������������� wrmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
> +�������������� wrmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
> +������ }
> +������ preempt_enable();
> +}

My understanding is that PL[0-2]_SSP are used only on transitions to the
corresponding privilege level from a *different* privilege level.  That means
KVM should be able to utilize the user_return_msr framework to load the host
values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
per-task, on every context switch.

But note my original wording: **If that's necessary**

If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
needs to guarantee that it doesn't leak values between guests.  But that should
Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.

And regardless of what the mechanism ends up managing SSP MSRs, it should only
ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
never consume PL{1,2}_SSP.

Am I missing something?
Sean Christopherson July 19, 2023, 8:26 p.m. UTC | #14
On Wed, Jul 19, 2023, Sean Christopherson wrote:
> On Mon, Jul 17, 2023, Weijiang Yang wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index e2c549f147a5..7d9cfb7e2fe8 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu
> > *vcpu)
> > ������� trace_kvm_fpu(0);
> > �}

Huh.  After a bit of debugging, the mangling is due to mutt's default for send_charset
being

  "us-ascii:iso-8859-1:utf-8"

and selecting iso-8859-1 instead of utf-8 as the encoding despite the original
mail being utf-8.  In this case, mutt ran afoul of nbsp (u+00a0).

AFAICT, the solution is to essentially tell mutt to never try to use iso-8859-1
for sending mail

  set send_charset="us-ascii:utf-8"
Peter Zijlstra July 19, 2023, 8:36 p.m. UTC | #15
On Wed, Jul 19, 2023 at 12:41:47PM -0700, Sean Christopherson wrote:

> My understanding is that PL[0-2]_SSP are used only on transitions to the
> corresponding privilege level from a *different* privilege level.  That means
> KVM should be able to utilize the user_return_msr framework to load the host
> values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
> have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> per-task, on every context switch.
> 
> But note my original wording: **If that's necessary**
> 
> If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
> needs to guarantee that it doesn't leak values between guests.  But that should
> Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
> 
> And regardless of what the mechanism ends up managing SSP MSRs, it should only
> ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
> never consume PL{1,2}_SSP.

To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.
Yang, Weijiang July 20, 2023, 1:55 a.m. UTC | #16
On 7/20/2023 3:41 AM, Sean Christopherson wrote:
> [...]
> My understanding is that PL[0-2]_SSP are used only on transitions to the
> corresponding privilege level from a *different* privilege level.  That means
> KVM should be able to utilize the user_return_msr framework to load the host
> values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
> have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> per-task, on every context switch.
>
> But note my original wording: **If that's necessary**

Thanks!

I think host SSS enabling won't happen in short-term, handling the guest 
supervisor

states in KVM side is doable.

>
> If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
> needs to guarantee that it doesn't leak values between guests.  But that should
> Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.

Yes, these handling have been covered by the new version.

>
> And regardless of what the mechanism ends up managing SSP MSRs, it should only
> ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
> never consume PL{1,2}_SSP.
>
> Am I missing something?

I think, guest PL{0,1,2}_SSP can be handled as a bundle to make the 
handling easy(instead of handling each

separately) because guest can be non-Linux systems, as you said before 
they could even be used as scratch registers.

But for host side as it's Linux, I can omit reloading/resetting host 
PL{1,2}_SSP when vCPU thread is preempted.

I will post new version to community if above is minor divergence.
Yang, Weijiang July 20, 2023, 1:58 a.m. UTC | #17
On 7/20/2023 4:26 AM, Sean Christopherson wrote:
> On Wed, Jul 19, 2023, Sean Christopherson wrote:
>> On Mon, Jul 17, 2023, Weijiang Yang wrote:
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index e2c549f147a5..7d9cfb7e2fe8 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu
>>> *vcpu)
>>> ������� trace_kvm_fpu(0);
>>> �}
> Huh.  After a bit of debugging, the mangling is due to mutt's default for send_charset
> being
>
>    "us-ascii:iso-8859-1:utf-8"
>
> and selecting iso-8859-1 instead of utf-8 as the encoding despite the original
> mail being utf-8.  In this case, mutt ran afoul of nbsp (u+00a0).
>
> AFAICT, the solution is to essentially tell mutt to never try to use iso-8859-1
> for sending mail
>
>    set send_charset="us-ascii:utf-8"

It made me feel a bit guilty as I thought it could be resulted from 
wrong settings of my email system :-)
Pankaj Gupta July 20, 2023, 5:26 a.m. UTC | #18
> > My understanding is that PL[0-2]_SSP are used only on transitions to the
> > corresponding privilege level from a *different* privilege level.  That means
> > KVM should be able to utilize the user_return_msr framework to load the host
> > values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
> > have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> > userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> > per-task, on every context switch.
> >
> > But note my original wording: **If that's necessary**
> >
> > If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> > IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
> > needs to guarantee that it doesn't leak values between guests.  But that should
> > Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> > guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
> >
> > And regardless of what the mechanism ends up managing SSP MSRs, it should only
> > ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
> > never consume PL{1,2}_SSP.
>
> To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.

Trying to understand more what prevents SSS to enable in pre FRED, Is
it better #CP exception
handling with other nested exceptions?

Won't same problems (to some extent) happen in user-mode shadow stack
(and in case of guest, SSS inside VM)?

Thanks,
Pankaj
Peter Zijlstra July 20, 2023, 8:03 a.m. UTC | #19
On Thu, Jul 20, 2023 at 07:26:04AM +0200, Pankaj Gupta wrote:
> > > My understanding is that PL[0-2]_SSP are used only on transitions to the
> > > corresponding privilege level from a *different* privilege level.  That means
> > > KVM should be able to utilize the user_return_msr framework to load the host
> > > values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
> > > have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
> > > userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
> > > per-task, on every context switch.
> > >
> > > But note my original wording: **If that's necessary**
> > >
> > > If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
> > > IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
> > > needs to guarantee that it doesn't leak values between guests.  But that should
> > > Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
> > > guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
> > >
> > > And regardless of what the mechanism ends up managing SSP MSRs, it should only
> > > ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
> > > never consume PL{1,2}_SSP.
> >
> > To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.
> 
> Trying to understand more what prevents SSS to enable in pre FRED, Is
> it better #CP exception
> handling with other nested exceptions?

SSS took the syscall gap and made it worse -- as in *way* worse.

To top it off, the whole SSS busy bit thing is fundamentally
incompatible with how we manage to survive nested exceptions in NMI
context.

Basically, the whole x86 exception / stack switching logic was already
borderline impossible (consider taking an MCE in the early NMI path
where we set up, but have not finished, the re-entrancy stuff), and
pushed it over the edge and set it on fire.

And NMI isn't the only problem, the various new virt exceptions #VC and
#HV are on their own already near impossible, adding SSS again pushes
the whole thing into clear insanity.

There's a good exposition of the whole trainwreck by Andrew here:

  https://www.youtube.com/watch?v=qcORS8CN0ow

(that is, sorry for the youtube link, but Google is failing me in
finding the actual Google Doc that talk is based on, or even the slide
deck :/)



FRED solves all that by:

 - removing the stack gap, cc/ip/ss/sp/ssp/gs will all be switched
   atomically and consistently for every transition.

 - removing the non-reentrant IST mechanism and replacing it with stack
   levels

 - adding an explicit NMI latch

 - re-organising the actual shadow stacks and doing away with that busy
   bit thing (I need to re-read the FRED spec on this detail again).



Crazy as we are, we're not touching legacy/IDT SSS with a ten foot pole,
sorry.
Peter Zijlstra July 20, 2023, 8:09 a.m. UTC | #20
On Thu, Jul 20, 2023 at 10:03:58AM +0200, Peter Zijlstra wrote:

> > Trying to understand more what prevents SSS to enable in pre FRED, Is
> > it better #CP exception
> > handling with other nested exceptions?
> 
> SSS took the syscall gap and made it worse -- as in *way* worse.
> 
> To top it off, the whole SSS busy bit thing is fundamentally
> incompatible with how we manage to survive nested exceptions in NMI
> context.
> 
> Basically, the whole x86 exception / stack switching logic was already
> borderline impossible (consider taking an MCE in the early NMI path
> where we set up, but have not finished, the re-entrancy stuff), and

SSS

> pushed it over the edge and set it on fire.
> 
> And NMI isn't the only problem, the various new virt exceptions #VC and
> #HV are on their own already near impossible, adding SSS again pushes
> the whole thing into clear insanity.
> 
> There's a good exposition of the whole trainwreck by Andrew here:
> 
>   https://www.youtube.com/watch?v=qcORS8CN0ow
> 
> (that is, sorry for the youtube link, but Google is failing me in
> finding the actual Google Doc that talk is based on, or even the slide
> deck :/)
> 
> 
> 
> FRED solves all that by:
> 
>  - removing the stack gap, cc/ip/ss/sp/ssp/gs will all be switched
>    atomically and consistently for every transition.
> 
>  - removing the non-reentrant IST mechanism and replacing it with stack
>    levels
> 
>  - adding an explicit NMI latch
> 
>  - re-organising the actual shadow stacks and doing away with that busy
>    bit thing (I need to re-read the FRED spec on this detail again).
> 
> 
> 
> Crazy as we are, we're not touching legacy/IDT SSS with a ten foot pole,
> sorry.
Pankaj Gupta July 20, 2023, 9:14 a.m. UTC | #21
> > > Trying to understand more what prevents SSS to enable in pre FRED, Is
> > > it better #CP exception
> > > handling with other nested exceptions?
> >
> > SSS took the syscall gap and made it worse -- as in *way* worse.
> >
> > To top it off, the whole SSS busy bit thing is fundamentally
> > incompatible with how we manage to survive nested exceptions in NMI
> > context.
> >
> > Basically, the whole x86 exception / stack switching logic was already
> > borderline impossible (consider taking an MCE in the early NMI path
> > where we set up, but have not finished, the re-entrancy stuff), and
>
> SSS
>
> > pushed it over the edge and set it on fire.

ah I see. SSS takes it to the next level.

> >
> > And NMI isn't the only problem, the various new virt exceptions #VC and
> > #HV are on their own already near impossible, adding SSS again pushes
> > the whole thing into clear insanity.
> >
> > There's a good exposition of the whole trainwreck by Andrew here:
> >
> >   https://www.youtube.com/watch?v=qcORS8CN0ow
> >
> > (that is, sorry for the youtube link, but Google is failing me in
> > finding the actual Google Doc that talk is based on, or even the slide
> > deck :/)

I think I got the link:
https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?pli=1

> >
> >
> >
> > FRED solves all that by:
> >
> >  - removing the stack gap, cc/ip/ss/sp/ssp/gs will all be switched
> >    atomically and consistently for every transition.
> >
> >  - removing the non-reentrant IST mechanism and replacing it with stack
> >    levels
> >
> >  - adding an explicit NMI latch
> >
> >  - re-organising the actual shadow stacks and doing away with that busy
> >    bit thing (I need to re-read the FRED spec on this detail again).
> >

Thank you for explaining. I will also study the FRED spec and
corresponding kernel
patches posted in the mailing list.
> >
> >
> > Crazy as we are, we're not touching legacy/IDT SSS with a ten foot pole,
> > sorry.

ya, interesting.

Best regards,
Pankaj
Andrew Cooper July 20, 2023, 10:46 a.m. UTC | #22
On 20/07/2023 9:03 am, Peter Zijlstra wrote:
> On Thu, Jul 20, 2023 at 07:26:04AM +0200, Pankaj Gupta wrote:
>>>> My understanding is that PL[0-2]_SSP are used only on transitions to the
>>>> corresponding privilege level from a *different* privilege level.  That means
>>>> KVM should be able to utilize the user_return_msr framework to load the host
>>>> values.  Though if Linux ever supports SSS, I'm guessing the core kernel will
>>>> have some sort of mechanism to defer loading MSR_IA32_PL0_SSP until an exit to
>>>> userspace, e.g. to avoid having to write PL0_SSP, which will presumably be
>>>> per-task, on every context switch.
>>>>
>>>> But note my original wording: **If that's necessary**
>>>>
>>>> If nothing in the host ever consumes those MSRs, i.e. if SSS is NOT enabled in
>>>> IA32_S_CET, then running host stuff with guest values should be ok.  KVM only
>>>> needs to guarantee that it doesn't leak values between guests.  But that should
>>>> Just Work, e.g. KVM should load the new vCPU's values if SHSTK is exposed to the
>>>> guest, and intercept (to inject #GP) if SHSTK is not exposed to the guest.
>>>>
>>>> And regardless of what the mechanism ends up managing SSP MSRs, it should only
>>>> ever touch PL0_SSP, because Linux never runs anything at CPL1 or CPL2, i.e. will
>>>> never consume PL{1,2}_SSP.
>>> To clarify, Linux will only use SSS in FRED mode -- FRED removes CPL1,2.
>> Trying to understand more what prevents SSS to enable in pre FRED, Is
>> it better #CP exception
>> handling with other nested exceptions?
> SSS 

Careful with SSS for "supervisor shadow stacks".   Because there's a
brand new CET_SSS CPUID bit to cover the (mis)feature where shstk
supervisor tokens can be *prematurely busy*.

(11/10 masterful wordsmithing, because it does lull you into the
impression that this isn't WTF^2 levels of crazy)

> took the syscall gap and made it worse -- as in *way* worse.

More impressively, it created a sysenter gap where there wasn't one
previously.

> To top it off, the whole SSS busy bit thing is fundamentally
> incompatible with how we manage to survive nested exceptions in NMI
> context.

To be clear, this is supervisor shadow stack regular busy bits, not the
CET_SSS premature busy problem.

>
> Basically, the whole x86 exception / stack switching logic was already
> borderline impossible (consider taking an MCE in the early NMI path
> where we set up, but have not finished, the re-entrancy stuff), and
> pushed it over the edge and set it on fire.
>
> And NMI isn't the only problem, the various new virt exceptions #VC and
> #HV are on their own already near impossible, adding SSS again pushes
> the whole thing into clear insanity.
>
> There's a good exposition of the whole trainwreck by Andrew here:
>
>   https://www.youtube.com/watch?v=qcORS8CN0ow
>
> (that is, sorry for the youtube link, but Google is failing me in
> finding the actual Google Doc that talk is based on, or even the slide
> deck :/)

https://docs.google.com/presentation/d/10vWC02kpy4QneI43qsT3worfF_e3sbAE3Ifr61Sq3dY/edit?usp=sharing
is the slide deck.

I'm very glad I put a "only accurate as of $PRESENTATION_DATE"
disclaimer on slide 14.  It makes the whole presentation still
technically correct.

FRED is now at draft 5, and importantly shstk tokens have been removed. 
They've been replaced with alternative MSR-based mechanism, mostly for
performance reasons but a consequence is that the prematurely busy bug
can't happen.

~Andrew