diff mbox series

[09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

Message ID 20240904030751.117579-10-rick.p.edgecombe@intel.com (mailing list archive)
State New, archived
Headers show
Series TDX MMU Part 2 | expand

Commit Message

Rick Edgecombe Sept. 4, 2024, 3:07 a.m. UTC
From: Yuan Yao <yuan.yao@intel.com>

The TDX module uses locks internally to protect its resources.  If it fails to
acquire a lock, it returns a TDX_OPERAND_BUSY error immediately instead of
spinning, because of its execution time limitations.

The TDX SEAMCALL API reference describes which resources each SEAMCALL uses,
so it is known which SEAMCALLs can contend on which resources.  The VMM can
avoid contention inside the TDX module by serializing contentious SEAMCALLs
with, for example, a spinlock.  Because the OS knows its own process scheduling
and scalability better, a lock at the OS/VMM layer would work better than
simply retrying TDX SEAMCALLs.

The TDH.MEM.* SEAMCALLs, except for TDH.MEM.TRACK, operate on the secure EPT
tree, and the TDX module internally tries to acquire the lock of the secure EPT
tree.  They return TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT when they fail to get
the lock.  TDX KVM allows the SEPT callbacks to return an error so that the TDP
MMU layer can retry.

Retry the TDH.MEM.* SEAMCALLs on this error because it is a rare event caused
by the zero-step attack mitigation.

Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Updates from seamcall overhaul (Kai)

v19:
 - fix typo TDG.VP.ENTER => TDH.VP.ENTER,
   TDX_OPRRAN_BUSY => TDX_OPERAND_BUSY
 - drop the description on TDH.VP.ENTER as this patch doesn't touch
   TDH.VP.ENTER
---
 arch/x86/kvm/vmx/tdx_ops.h | 48 ++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 7 deletions(-)

Comments

Kai Huang Sept. 6, 2024, 1:41 a.m. UTC | #1
On 4/09/2024 3:07 pm, Rick Edgecombe wrote:
> From: Yuan Yao <yuan.yao@intel.com>
> 
> TDX module internally uses locks to protect internal resources.  It tries
> to acquire the locks.  If it fails to obtain the lock, it returns
> TDX_OPERAND_BUSY error without spin because its execution time limitation.
> 
> TDX SEAMCALL API reference describes what resources are used.  It's known
> which TDX SEAMCALL can cause contention with which resources.  VMM can
> avoid contention inside the TDX module by avoiding contentious TDX SEAMCALL
> with, for example, spinlock.  Because OS knows better its process
> scheduling and its scalability, a lock at OS/VMM layer would work better
> than simply retrying TDX SEAMCALLs.
> 
> TDH.MEM.* API except for TDH.MEM.TRACK operates on a secure EPT tree and
> the TDX module internally tries to acquire the lock of the secure EPT tree.
> They return TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT in case of failure to
> get the lock.  TDX KVM allows sept callbacks to return error so that TDP
> MMU layer can retry.
> 
> Retry TDX TDH.MEM.* API on the error because the error is a rare event
> caused by zero-step attack mitigation.

The last paragraph seems like it can be improved:

It seems to say that "TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT" can only be 
caused by zero-step attack detection/mitigation, which isn't true based on 
the previous paragraph.

In fact, I think this patch can be dropped:

1) The TDH_MEM_xx()s can return BUSY due to the nature of the TDP MMU, but all 
the callers of TDH_MEM_xx()s are already explicitly retrying, as can be seen 
by looking at the patch "KVM: TDX: Implement hooks to propagate changes of TDP 
MMU mirror page table" -- they either return PF_RETRY to let the fault happen 
again or explicitly loop until no BUSY is returned.  So I am not sure why we 
need to "loop SEAMCALL_RETRY_MAX (16) times" in the common code.

2) TDH_VP_ENTER explicitly retries immediately for such case:

         /* See the comment of tdx_seamcall_sept(). */
         if (unlikely(vp_enter_ret == TDX_ERROR_SEPT_BUSY))
                 return EXIT_FASTPATH_REENTER_GUEST;


3) That means the _ONLY_ reason to retry in the common code for 
TDH_MEM_xx()s is to mitigate the zero-step attack by reducing the number of 
times the guest faults on the same instruction.

I don't think we need to handle zero-step attack mitigation in the first 
TDX support submission.  So I think we can just remove this patch.

> 
> Signed-off-by: Yuan Yao <yuan.yao@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>   - Updates from seamcall overhaul (Kai)
> 
> v19:
>   - fix typo TDG.VP.ENTER => TDH.VP.ENTER,
>     TDX_OPRRAN_BUSY => TDX_OPERAND_BUSY
>   - drop the description on TDH.VP.ENTER as this patch doesn't touch
>     TDH.VP.ENTER
> ---
>   arch/x86/kvm/vmx/tdx_ops.h | 48 ++++++++++++++++++++++++++++++++------
>   1 file changed, 41 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
> index 0363d8544f42..8ca3e252a6ed 100644
> --- a/arch/x86/kvm/vmx/tdx_ops.h
> +++ b/arch/x86/kvm/vmx/tdx_ops.h
> @@ -31,6 +31,40 @@
>   #define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8)	\
>   	pr_tdx_error_N(__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8)
>   
> +/*
> + * TDX module acquires its internal lock for resources.  It doesn't spin to get
> + * locks because of its restrictions of allowed execution time.  Instead, it
> + * returns TDX_OPERAND_BUSY with an operand id.
> + *
> + * Multiple VCPUs can operate on SEPT.  Also with zero-step attack mitigation,
> + * TDH.VP.ENTER may rarely acquire SEPT lock and release it when zero-step
> + * attack is suspected.  It results in TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT
> + * with TDH.MEM.* operation.  Note: TDH.MEM.TRACK is an exception.
> + *
> + * Because TDP MMU uses read lock for scalability, spin lock around SEAMCALL
> + * spoils TDP MMU effort.  Retry several times with the assumption that SEPT
> + * lock contention is rare.  But don't loop forever to avoid lockup.  Let TDP
> + * MMU retry.
> + */
> +#define TDX_ERROR_SEPT_BUSY    (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)
> +
> +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
> +{
> +#define SEAMCALL_RETRY_MAX     16
> +	struct tdx_module_args args_in;
> +	int retry = SEAMCALL_RETRY_MAX;
> +	u64 ret;
> +
> +	do {
> +		args_in = *in;
> +		ret = seamcall_ret(op, in);
> +	} while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0);
> +
> +	*in = args_in;
> +
> +	return ret;
> +}
> +
>   static inline u64 tdh_mng_addcx(struct kvm_tdx *kvm_tdx, hpa_t addr)
>   {
>   	struct tdx_module_args in = {
> @@ -55,7 +89,7 @@ static inline u64 tdh_mem_page_add(struct kvm_tdx *kvm_tdx, gpa_t gpa,
>   	u64 ret;
>   
>   	clflush_cache_range(__va(hpa), PAGE_SIZE);
> -	ret = seamcall_ret(TDH_MEM_PAGE_ADD, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_PAGE_ADD, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
> @@ -76,7 +110,7 @@ static inline u64 tdh_mem_sept_add(struct kvm_tdx *kvm_tdx, gpa_t gpa,
>   
>   	clflush_cache_range(__va(page), PAGE_SIZE);
>   
> -	ret = seamcall_ret(TDH_MEM_SEPT_ADD, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_SEPT_ADD, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
> @@ -93,7 +127,7 @@ static inline u64 tdh_mem_sept_remove(struct kvm_tdx *kvm_tdx, gpa_t gpa,
>   	};
>   	u64 ret;
>   
> -	ret = seamcall_ret(TDH_MEM_SEPT_REMOVE, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_SEPT_REMOVE, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
> @@ -123,7 +157,7 @@ static inline u64 tdh_mem_page_aug(struct kvm_tdx *kvm_tdx, gpa_t gpa, hpa_t hpa
>   	u64 ret;
>   
>   	clflush_cache_range(__va(hpa), PAGE_SIZE);
> -	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_PAGE_AUG, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
> @@ -140,7 +174,7 @@ static inline u64 tdh_mem_range_block(struct kvm_tdx *kvm_tdx, gpa_t gpa,
>   	};
>   	u64 ret;
>   
> -	ret = seamcall_ret(TDH_MEM_RANGE_BLOCK, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_RANGE_BLOCK, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
> @@ -335,7 +369,7 @@ static inline u64 tdh_mem_page_remove(struct kvm_tdx *kvm_tdx, gpa_t gpa,
>   	};
>   	u64 ret;
>   
> -	ret = seamcall_ret(TDH_MEM_PAGE_REMOVE, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
> @@ -361,7 +395,7 @@ static inline u64 tdh_mem_range_unblock(struct kvm_tdx *kvm_tdx, gpa_t gpa,
>   	};
>   	u64 ret;
>   
> -	ret = seamcall_ret(TDH_MEM_RANGE_UNBLOCK, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_RANGE_UNBLOCK, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
Paolo Bonzini Sept. 9, 2024, 3:25 p.m. UTC | #2
On 9/4/24 05:07, Rick Edgecombe wrote:
> +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
> +{
> +#define SEAMCALL_RETRY_MAX     16

How is the 16 determined?  Also, is the lock per-VM or global?

Thanks,

Paolo

> +	struct tdx_module_args args_in;
> +	int retry = SEAMCALL_RETRY_MAX;
> +	u64 ret;
> +
> +	do {
> +		args_in = *in;
> +		ret = seamcall_ret(op, in);
> +	} while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0);
> +
> +	*in = args_in;
> +
> +	return ret;
> +}
> +
Rick Edgecombe Sept. 9, 2024, 8:22 p.m. UTC | #3
On Mon, 2024-09-09 at 17:25 +0200, Paolo Bonzini wrote:
> On 9/4/24 05:07, Rick Edgecombe wrote:
> > +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
> > +{
> > +#define SEAMCALL_RETRY_MAX     16
> 
> How is the 16 determined?  Also, is the lock per-VM or global?

The lock being considered here is per-TD, but TDX_OPERAND_BUSY in general can be
for other locks. I'm not sure where the 16 came from, maybe Yuan or Isaku can
share the history. In any case, there seem to be some problems with this patch
or its justification.

Regarding the zero-step mitigation, the TDX Module has a mitigation for an
attack where a malicious VMM causes repeated private EPT violations for the same
GPA. When this happens TDH.VP.ENTER will fail to enter the guest. Regardless of
zero-step detection, these SEPT related SEAMCALLs will exit with the checked
error code if they contend the mentioned lock. If there was some other (non-
zero-step related) contention for this lock and KVM tries to re-enter the TD too
many times without resolving an EPT violation, it might inadvertently trigger
the zero-step mitigation. I *think* this patch is trying to say not to worry
about this case, and do a simple retry loop instead to handle the contention.

But why 16 retries would be sufficient, I can't find a reason for. Getting this
required retry logic right is important because some failures
(TDH.MEM.RANGE.BLOCK) can lead to KVM_BUG_ON()s.

Per the docs, in general the VMM is supposed to retry SEAMCALLs that return
TDX_OPERAND_BUSY. I think we need to revisit the general question of which
SEAMCALLs we should be retrying and how many times/how long. The other
consideration is that KVM already has per-VM locking, that would prevent
contention for some of the locks. So depending on internal details KVM may not
need to do any retries in some cases.
Rick Edgecombe Sept. 9, 2024, 8:25 p.m. UTC | #4
On Fri, 2024-09-06 at 13:41 +1200, Huang, Kai wrote:
> 3) That means the _ONLY_ reason to retry in the common code for 
> TDH_MEM_xx()s is to mitigate zero-step attack by reducing the times of 
> letting guest to fault on the same instruction.

My read of the zero-step mitigation is that it is implemented in the TDX module
(which makes sense since it is defending against VMMs). There is some optional
ability for the guest to request notification, but the host defense is always in
place. Is that your understanding?

> 
> I don't think we need to handle zero-step attack mitigation in the first 
> TDX support submission.  So I think we can just remove this patch.

Thanks for highlighting the weirdness here. I think it needs more investigation.
Sean Christopherson Sept. 9, 2024, 9:11 p.m. UTC | #5
On Mon, Sep 09, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-09-09 at 17:25 +0200, Paolo Bonzini wrote:
> > On 9/4/24 05:07, Rick Edgecombe wrote:
> > > +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
> > > +{
> > > +#define SEAMCALL_RETRY_MAX     16
> > 
> > How is the 16 determined?  Also, is the lock per-VM or global?
> 
> The lock being considered here is per-TD, but TDX_OPERAND_BUSY in general can be
> for other locks. I'm not sure where the 16 came from, maybe Yuan or Isaku can
> share the history. In any case, there seems to be some problems with this patch
> or justification.
> 
> Regarding the zero-step mitigation, the TDX Module has a mitigation for an
> attack where a malicious VMM causes repeated private EPT violations for the same
> GPA. When this happens TDH.VP.ENTER will fail to enter the guest. Regardless of
> zero-step detection, these SEPT related SEAMCALLs will exit with the checked
> error code if they contend the mentioned lock. If there was some other (non-
> zero-step related) contention for this lock and KVM tries to re-enter the TD too
> many times without resolving an EPT violation, it might inadvertently trigger
> the zero-step mitigation. I *think* this patch is trying to say not to worry
> about this case, and do a simple retry loop instead to handle the contention.
> 
> But why 16 retries would be sufficient, I can't find a reason for. Getting this
> required retry logic right is important because some failures
> (TDH.MEM.RANGE.BLOCK) can lead to KVM_BUG_ON()s.

I (somewhat indirectly) raised this as an issue in v11, and at a (very quick)
glance, nothing has changed to alleviate my concerns.

In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL, ever.  For
its operations, I'm pretty sure the only sane approach is for KVM to ensure there
will be no contention.  And if the TDX module's single-step protection spuriously
kicks in, KVM exits to userspace.  If the TDX module can't/doesn't/won't communicate
that it's mitigating single-step, e.g. so that KVM can forward the information
to userspace, then that's a TDX module problem to solve.

> Per the docs, in general the VMM is supposed to retry SEAMCALLs that return
> TDX_OPERAND_BUSY.

IMO, that's terrible advice.  SGX has similar behavior, where the xucode "module"
signals #GP if there's a conflict.  #GP is obviously far, far worse as it lacks
the precision that would help software understand exactly what went wrong, but I
think one of the better decisions we made with the SGX driver was to have a
"zero tolerance" policy where the driver would _never_ retry due to a potential
resource conflict, i.e. that any conflict in the module would be treated as a
kernel bug.

> I think we need to revisit the general question of which
> SEAMCALLs we should be retrying and how many times/how long. The other
> consideration is that KVM already has per-VM locking, that would prevent
> contention for some of the locks. So depending on internal details KVM may not
> need to do any retries in some cases.

Yes, and if KVM can't avoid conflict/retry, then before we go any further, I want
to know exactly why that is the case.
Sean Christopherson Sept. 9, 2024, 9:23 p.m. UTC | #6
On Mon, Sep 09, 2024, Sean Christopherson wrote:
> On Mon, Sep 09, 2024, Rick P Edgecombe wrote:
> > On Mon, 2024-09-09 at 17:25 +0200, Paolo Bonzini wrote:
> > > On 9/4/24 05:07, Rick Edgecombe wrote:
> > > > +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
> > > > +{
> > > > +#define SEAMCALL_RETRY_MAX     16
> > > 
> > > How is the 16 determined?  Also, is the lock per-VM or global?
> > 
> > The lock being considered here is per-TD, but TDX_OPERAND_BUSY in general can be
> > for other locks. I'm not sure where the 16 came from, maybe Yuan or Isaku can
> > share the history. In any case, there seems to be some problems with this patch
> > or justification.
> > 
> > Regarding the zero-step mitigation, the TDX Module has a mitigation for an
> > attack where a malicious VMM causes repeated private EPT violations for the same
> > GPA. When this happens TDH.VP.ENTER will fail to enter the guest. Regardless of
> > zero-step detection, these SEPT related SEAMCALLs will exit with the checked
> > error code if they contend the mentioned lock. If there was some other (non-
> > zero-step related) contention for this lock and KVM tries to re-enter the TD too
> > many times without resolving an EPT violation, it might inadvertently trigger
> > the zero-step mitigation. I *think* this patch is trying to say not to worry
> > about this case, and do a simple retry loop instead to handle the contention.
> > 
> > But why 16 retries would be sufficient, I can't find a reason for. Getting this
> > required retry logic right is important because some failures
> > (TDH.MEM.RANGE.BLOCK) can lead to KVM_BUG_ON()s.
> 
> I (somewhat indirectly) raised this as an issue in v11, and at a (very quick)
> glance, nothing has changed to alleviate my concerns.

Gah, went out of my way to find the thread and then forgot to post the link:

https://lore.kernel.org/all/Y8m34OEVBfL7Q4Ns@google.com

> In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL, ever.  For
> its operations, I'm pretty sure the only sane approach is for KVM to ensure there
> will be no contention.  And if the TDX module's single-step protection spuriously
> kicks in, KVM exits to userspace.  If the TDX module can't/doesn't/won't communicate
> that it's mitigating single-step, e.g. so that KVM can forward the information
> to userspace, then that's a TDX module problem to solve.
> 
> > Per the docs, in general the VMM is supposed to retry SEAMCALLs that return
> > TDX_OPERAND_BUSY.
> 
> IMO, that's terrible advice.  SGX has similar behavior, where the xucode "module"
> signals #GP if there's a conflict.  #GP is obviously far, far worse as it lacks
> the precision that would help software understand exactly what went wrong, but I
> think one of the better decisions we made with the SGX driver was to have a
> "zero tolerance" policy where the driver would _never_ retry due to a potential
> resource conflict, i.e. that any conflict in the module would be treated as a
> kernel bug.
> 
> > I think we need to revisit the general question of which
> > SEAMCALLs we should be retrying and how many times/how long. The other
> > consideration is that KVM already has per-VM locking, that would prevent
> > contention for some of the locks. So depending on internal details KVM may not
> > need to do any retries in some cases.
> 
> Yes, and if KVM can't avoid conflict/retry, then before we go any further, I want
> to know exactly why that is the case.
Rick Edgecombe Sept. 9, 2024, 10:34 p.m. UTC | #7
On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote:
> > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL, ever. 
> > For
> > its operations, I'm pretty sure the only sane approach is for KVM to ensure
> > there
> > will be no contention.  And if the TDX module's single-step protection
> > spuriously
> > kicks in, KVM exits to userspace.  If the TDX module can't/doesn't/won't
> > communicate
> > that it's mitigating single-step, e.g. so that KVM can forward the
> > information
> > to userspace, then that's a TDX module problem to solve.
> > 
> > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that
> > > return
> > > TDX_OPERAND_BUSY.
> > 
> > IMO, that's terrible advice.  SGX has similar behavior, where the xucode
> > "module"
> > signals #GP if there's a conflict.  #GP is obviously far, far worse as it
> > lacks
> > the precision that would help software understand exactly what went wrong,
> > but I
> > think one of the better decisions we made with the SGX driver was to have a
> > "zero tolerance" policy where the driver would _never_ retry due to a
> > potential
> > resource conflict, i.e. that any conflict in the module would be treated as
> > a
> > kernel bug.

Thanks for the analysis. The direction seems reasonable to me for this lock in
particular. We need to do some analysis on how much the existing mmu_lock can
protect us. Maybe sprinkle some asserts for documentation purposes.

For the general case of TDX_OPERAND_BUSY, there might be one wrinkle. The guest
side operations can take the locks too. From "Base Architecture Specification":
"
Host-Side (SEAMCALL) Operation
------------------------------
The host VMM is expected to retry host-side operations that fail with a
TDX_OPERAND_BUSY status. The host priority mechanism helps guarantee that at
most after a limited time (the longest guest-side TDX module flow) there will be
no contention with a guest TD attempting to acquire access to the same resource.

Lock operations process the HOST_PRIORITY bit as follows:
   - A SEAMCALL (host-side) function that fails to acquire a lock sets the lock’s
   HOST_PRIORITY bit and returns a TDX_OPERAND_BUSY status to the host VMM. It is
   the host VMM’s responsibility to re-attempt the SEAMCALL function until is
   succeeds; otherwise, the HOST_PRIORITY bit remains set, preventing the guest TD
   from acquiring the lock.
   - A SEAMCALL (host-side) function that succeeds to acquire a lock clears the
   lock’s HOST_PRIORITY bit.
   
Guest-Side (TDCALL) Operation
-----------------------------
A TDCALL (guest-side) function that attempt to acquire a lock fails if
HOST_PRIORITY is set to 1; a TDX_OPERAND_BUSY status is returned to the guest.
The guest is expected to retry the operation.

Guest-side TDCALL flows that acquire a host priority lock have an upper bound on
the host-side latency for that lock; once a lock is acquired, the flow either
releases within a fixed upper time bound, or periodically monitor the
HOST_PRIORITY flag to see if the host is attempting to acquire the lock.
"

So KVM can't fully prevent TDX_OPERAND_BUSY with KVM side locks, because it is
involved in sorting out contention between the guest as well. We need to double
check this, but I *think* this HOST_PRIORITY bit doesn't come into play for the
functionality we need to exercise for base support.

The thing that makes me nervous about a retry-based solution is the potential for
some kind of deadlock-like pattern. Just to gather your opinion, if there was some
SEAMCALL contention that couldn't be locked around from KVM, but came with some
strong well described guarantees, would a retry loop be hard NAK still?
Sean Christopherson Sept. 9, 2024, 11:58 p.m. UTC | #8
On Mon, Sep 09, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote:
> > > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL,
> > > ever.  For its operations, I'm pretty sure the only sane approach is for
> > > KVM to ensure there will be no contention.  And if the TDX module's
> > > single-step protection spuriously kicks in, KVM exits to userspace.  If
> > > the TDX module can't/doesn't/won't communicate that it's mitigating
> > > single-step, e.g. so that KVM can forward the information to userspace,
> > > then that's a TDX module problem to solve.
> > > 
> > > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that
> > > > return TDX_OPERAND_BUSY.
> > > 
> > > IMO, that's terrible advice.  SGX has similar behavior, where the xucode
> > > "module" signals #GP if there's a conflict.  #GP is obviously far, far
> > > worse as it lacks the precision that would help software understand
> > > exactly what went wrong, but I think one of the better decisions we made
> > > with the SGX driver was to have a "zero tolerance" policy where the
> > > driver would _never_ retry due to a potential resource conflict, i.e.
> > > that any conflict in the module would be treated as a kernel bug.
> 
> Thanks for the analysis. The direction seems reasonable to me for this lock in
> particular. We need to do some analysis on how much the existing mmu_lock can
> protects us. 

I would operate under the assumption that it provides SEPT no meaningful protection.
I think I would even go so far as to say that it is a _requirement_ that mmu_lock
does NOT provide the ordering required by SEPT, because I do not want to take on
any risk (due to SEPT constraints) that would limit KVM's ability to do things
while holding mmu_lock for read.

> Maybe sprinkle some asserts for documentation purposes.

Not sure I understand, assert on what?

> For the general case of TDX_OPERAND_BUSY, there might be one wrinkle. The guest
> side operations can take the locks too. From "Base Architecture Specification":
> "
> Host-Side (SEAMCALL) Operation
> ------------------------------
> The host VMM is expected to retry host-side operations that fail with a
> TDX_OPERAND_BUSY status. The host priority mechanism helps guarantee that at
> most after a limited time (the longest guest-side TDX module flow) there will be
> no contention with a guest TD attempting to acquire access to the same resource.
> 
> Lock operations process the HOST_PRIORITY bit as follows:
>    - A SEAMCALL (host-side) function that fails to acquire a lock sets the lock’s
>    HOST_PRIORITY bit and returns a TDX_OPERAND_BUSY status to the host VMM. It is
>    the host VMM’s responsibility to re-attempt the SEAMCALL function until it
>    succeeds; otherwise, the HOST_PRIORITY bit remains set, preventing the guest TD
>    from acquiring the lock.
>    - A SEAMCALL (host-side) function that succeeds to acquire a lock clears the
>    lock’s HOST_PRIORITY bit.

*sigh*

> Guest-Side (TDCALL) Operation
> -----------------------------
> A TDCALL (guest-side) function that attempt to acquire a lock fails if
> HOST_PRIORITY is set to 1; a TDX_OPERAND_BUSY status is returned to the guest.
> The guest is expected to retry the operation.
> 
> Guest-side TDCALL flows that acquire a host priority lock have an upper bound on
> the host-side latency for that lock; once a lock is acquired, the flow either
> releases within a fixed upper time bound, or periodically monitor the
> HOST_PRIORITY flag to see if the host is attempting to acquire the lock.
> "
> 
> So KVM can't fully prevent TDX_OPERAND_BUSY with KVM side locks, because it is
> involved in sorting out contention between the guest as well. We need to double
> check this, but I *think* this HOST_PRIORITY bit doesn't come into play for the
> functionality we need to exercise for base support.
> 
> The thing that makes me nervous about retry based solution is the potential for
> some kind deadlock like pattern. Just to gather your opinion, if there was some
> SEAMCALL contention that couldn't be locked around from KVM, but came with some
> strong well described guarantees, would a retry loop be hard NAK still?

I don't know.  It would depend on what operations can hit BUSY, and what the
alternatives are.  E.g. if we can narrow down the retry paths to a few select
cases where it's (a) expected, (b) unavoidable, and (c) has minimal risk of
deadlock, then maybe that's the least awful option.

What I don't think KVM should do is blindly retry N number of times, because
then there are effectively no rules whatsoever.  E.g. if KVM is tearing down a
VM then KVM should assert on immediate success.  And if KVM is handling a fault
on behalf of a vCPU, then KVM can and should resume the guest and let it retry.
Ugh, but that would likely trigger the annoying "zero-step mitigation" crap.

What does this actually mean in practice?  What's the threshold, is the VM-Enter
error uniquely identifiable, and can KVM rely on HOST_PRIORITY to be set if KVM
runs afoul of the zero-step mitigation?

  After a pre-determined number of such EPT violations occur on the same instruction,
  the TDX module starts tracking the GPAs that caused Secure EPT faults and fails
  further host VMM attempts to enter the TD VCPU unless previously faulting private
  GPAs are properly mapped in the Secure EPT.

If HOST_PRIORITY is set, then one idea would be to resume the guest if there's
SEPT contention on a fault, and then _if_ the zero-step mitigation is triggered,
kick all vCPUs (via IPI) to ensure that the contended SEPT entry is unlocked and
can't be re-locked by the guest.  That would allow KVM to guarantee forward
progress without an arbitrary retry loop in the TDP MMU.

Similarly, if KVM needs to zap a SPTE and hits BUSY, kick all vCPUs to ensure the
one and only retry is guaranteed to succeed.
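
For illustration, the "kick, then retry exactly once" idea could look roughly
like the sketch below.  Sketch only: tdx_seamcall_sept_kick() is a made-up
name, it reuses seamcall_ret()/TDX_ERROR_SEPT_BUSY from this patch, and it
assumes KVM_REQ_OUTSIDE_GUEST_MODE is an acceptable way to kick vCPUs out of
the guest.

	static u64 tdx_seamcall_sept_kick(struct kvm *kvm, u64 op,
					  struct tdx_module_args *in)
	{
		struct tdx_module_args saved = *in;
		u64 ret;

		ret = seamcall_ret(op, in);
		if (ret != TDX_ERROR_SEPT_BUSY)
			return ret;

		/*
		 * Kick all vCPUs out of the guest so that any guest-side flow
		 * holding the S-EPT lock completes; HOST_PRIORITY then keeps
		 * the guest from re-taking it, so a single retry suffices.
		 */
		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);

		*in = saved;
		return seamcall_ret(op, in);
	}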
Rick Edgecombe Sept. 10, 2024, 12:50 a.m. UTC | #9
On Mon, 2024-09-09 at 16:58 -0700, Sean Christopherson wrote:
> On Mon, Sep 09, 2024, Rick P Edgecombe wrote:
> > On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote:
> > > > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL,
> > > > ever.  For its operations, I'm pretty sure the only sane approach is for
> > > > KVM to ensure there will be no contention.  And if the TDX module's
> > > > single-step protection spuriously kicks in, KVM exits to userspace.  If
> > > > the TDX module can't/doesn't/won't communicate that it's mitigating
> > > > single-step, e.g. so that KVM can forward the information to userspace,
> > > > then that's a TDX module problem to solve.
> > > > 
> > > > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that
> > > > > return TDX_OPERAND_BUSY.
> > > > 
> > > > IMO, that's terrible advice.  SGX has similar behavior, where the xucode
> > > > "module" signals #GP if there's a conflict.  #GP is obviously far, far
> > > > worse as it lacks the precision that would help software understand
> > > > exactly what went wrong, but I think one of the better decisions we made
> > > > with the SGX driver was to have a "zero tolerance" policy where the
> > > > driver would _never_ retry due to a potential resource conflict, i.e.
> > > > that any conflict in the module would be treated as a kernel bug.
> > 
> > Thanks for the analysis. The direction seems reasonable to me for this lock
> > in
> > particular. We need to do some analysis on how much the existing mmu_lock
> > can
> > protects us. 
> 
> I would operate under the assumption that it provides SEPT no meaningful
> protection.
> I think I would even go so far as to say that it is a _requirement_ that
> mmu_lock
> does NOT provide the ordering required by SEPT, because I do not want to take
> on
> any risk (due to SEPT constraints) that would limit KVM's ability to do things
> while holding mmu_lock for read.

Ok. Not sure, but I think you are saying not to add any extra acquisitions of
mmu_lock.

> > 
> > Maybe sprinkle some asserts for documentation purposes.
> 
> Not sure I understand, assert on what?

Please ignore. For the asserts, I was imagining mmu_lock acquisitions in core
MMU code might already protect the non-zero-step TDX_OPERAND_BUSY cases, and we
could somehow explain this in code. But it seems less likely.

[snip]
> 
> I don't know.  It would depend on what operations can hit BUSY, and what the
> alternatives are.  E.g. if we can narrow down the retry paths to a few select
> cases where it's (a) expected, (b) unavoidable, and (c) has minimal risk of
> deadlock, then maybe that's the least awful option.
> 
> What I don't think KVM should do is blindly retry N number of times, because
> then there are effectively no rules whatsoever.

Complete agreement.

>   E.g. if KVM is tearing down a
> VM then KVM should assert on immediate success.  And if KVM is handling a
> fault
> on behalf of a vCPU, then KVM can and should resume the guest and let it
> retry.
> Ugh, but that would likely trigger the annoying "zero-step mitigation" crap.
> 
> What does this actually mean in practice?  What's the threshold, is the VM-
> Enter
> error uniquely identifiable, and can KVM rely on HOST_PRIORITY to be set if
> KVM
> runs afoul of the zero-step mitigation?
> 
>   After a pre-determined number of such EPT violations occur on the same
> instruction,
>   the TDX module starts tracking the GPAs that caused Secure EPT faults and
> fails
>   further host VMM attempts to enter the TD VCPU unless previously faulting
> private
>   GPAs are properly mapped in the Secure EPT.
> 
> If HOST_PRIORITY is set, then one idea would be to resume the guest if there's
> SEPT contention on a fault, and then _if_ the zero-step mitigation is
> triggered,
> kick all vCPUs (via IPI) to ensure that the contended SEPT entry is unlocked
> and
> can't be re-locked by the guest.  That would allow KVM to guarantee forward
> progress without an arbitrary retry loop in the TDP MMU.
> 
> Similarly, if KVM needs to zap a SPTE and hits BUSY, kick all vCPUs to ensure
> the
> one and only retry is guaranteed to succeed.

Ok so not against retry loops, just against magic number retry loops with no
explanation that can be found. Makes sense.

Until we answer some of the questions (i.e. HOST_PRIORITY exposure), it's hard
to say. We need to check some stuff on our end.
Sean Christopherson Sept. 10, 2024, 1:46 a.m. UTC | #10
On Tue, Sep 10, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-09-09 at 16:58 -0700, Sean Christopherson wrote:
> > On Mon, Sep 09, 2024, Rick P Edgecombe wrote:
> > > On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote:
> > > > > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL,
> > > > > ever.  For its operations, I'm pretty sure the only sane approach is for
> > > > > KVM to ensure there will be no contention.  And if the TDX module's
> > > > > single-step protection spuriously kicks in, KVM exits to userspace.  If
> > > > > the TDX module can't/doesn't/won't communicate that it's mitigating
> > > > > single-step, e.g. so that KVM can forward the information to userspace,
> > > > > then that's a TDX module problem to solve.
> > > > > 
> > > > > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that
> > > > > > return TDX_OPERAND_BUSY.
> > > > > 
> > > > > IMO, that's terrible advice.  SGX has similar behavior, where the xucode
> > > > > "module" signals #GP if there's a conflict.  #GP is obviously far, far
> > > > > worse as it lacks the precision that would help software understand
> > > > > exactly what went wrong, but I think one of the better decisions we made
> > > > > with the SGX driver was to have a "zero tolerance" policy where the
> > > > > driver would _never_ retry due to a potential resource conflict, i.e.
> > > > > that any conflict in the module would be treated as a kernel bug.
> > > 
> > > Thanks for the analysis. The direction seems reasonable to me for this lock
> > > in
> > > particular. We need to do some analysis on how much the existing mmu_lock
> > > can
> > > protects us. 
> > 
> > I would operate under the assumption that it provides SEPT no meaningful
> > protection.
> > I think I would even go so far as to say that it is a _requirement_ that
> > mmu_lock
> > does NOT provide the ordering required by SEPT, because I do not want to take
> > on
> > any risk (due to SEPT constraints) that would limit KVM's ability to do things
> > while holding mmu_lock for read.
> 
> Ok. Not sure, but I think you are saying not to add any extra acquisitions of
> mmu_lock.

No new write_lock.  If read_lock is truly needed, no worries.  But SEPT needing
a write_lock is likely a hard "no", as the TDP MMU's locking model depends
heavily on vCPUs being readers.  E.g. the TDP MMU has _much_ coarser granularity
than core MM, but it works because almost everything is done while holding
mmu_lock for read.

> Until we answer some of the questions (i.e. HOST_PRIORITY exposure), it's hard
> to say. We need to check some stuff on our end.

Ya, agreed.
Paolo Bonzini Sept. 10, 2024, 1:15 p.m. UTC | #11
On 9/9/24 23:11, Sean Christopherson wrote:
> In general, I am_very_  opposed to blindly retrying an SEPT SEAMCALL, ever.  For
> its operations, I'm pretty sure the only sane approach is for KVM to ensure there
> will be no contention.  And if the TDX module's single-step protection spuriously
> kicks in, KVM exits to userspace.  If the TDX module can't/doesn't/won't communicate
> that it's mitigating single-step, e.g. so that KVM can forward the information
> to userspace, then that's a TDX module problem to solve.

In principle I agree but we also need to be pragmatic.  Exiting to 
userspace may not be practical in all flows, for example.

First of all, we can add a spinlock around affected seamcalls.  This way 
we know that "busy" errors must come from the guest and have set 
HOST_PRIORITY.  It is still kinda bad that guests can force the VMM to 
loop, but the VMM can always say enough is enough.  In other words, 
let's assume that a limit of 16 is probably appropriate but we can also 
increase the limit and crash the VM if things become ridiculous.

Something like this:

	static u32 max = 16;
	int retry = 0;
	spin_lock(&kvm->arch.seamcall_lock);
	for (;;) {
		args_in = *in;
		ret = seamcall_ret(op, in);
		/* Done unless the TDX module reports S-EPT lock contention. */
		if (ret != TDX_ERROR_SEPT_BUSY)
			break;
		if (++retry == 1) {
			/* protected by the same seamcall_lock */
			kvm->stat.retried_seamcalls++;
		} else if (retry == READ_ONCE(max)) {
			pr_warn("Exceeded %d retries for S-EPT operation\n", max);
			if (KVM_BUG_ON(retry == 1024, kvm)) {
				pr_err("Crashing due to lock contention in the TDX module\n");
				break;
			}
			cmpxchg(&max, retry, retry * 2);
		}
	}
	spin_unlock(&kvm->arch.seamcall_lock);

This way we can do some testing and figure out a useful limit.

For zero step detection, my reading is that it's TDH.VP.ENTER that 
fails; not any of the MEM seamcalls.  For that one to be resolved, it 
should be enough to take and release the mmu_lock back to back, which 
ensures that all pending critical sections have completed (that is, 
"write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then 
loop.  Adding a vCPU stat for that one is a good idea, too.

Paolo
Sean Christopherson Sept. 10, 2024, 1:57 p.m. UTC | #12
On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> On 9/9/24 23:11, Sean Christopherson wrote:
> > In general, I am_very_  opposed to blindly retrying an SEPT SEAMCALL, ever.  For
> > its operations, I'm pretty sure the only sane approach is for KVM to ensure there
> > will be no contention.  And if the TDX module's single-step protection spuriously
> > kicks in, KVM exits to userspace.  If the TDX module can't/doesn't/won't communicate
> > that it's mitigating single-step, e.g. so that KVM can forward the information
> > to userspace, then that's a TDX module problem to solve.
> 
> In principle I agree but we also need to be pragmatic.  Exiting to userspace
> may not be practical in all flows, for example.
> 
> First of all, we can add a spinlock around affected seamcalls.

No, because that defeats the purpose of having mmu_lock be a rwlock.

> This way we know that "busy" errors must come from the guest and have set
> HOST_PRIORITY.
 
We should be able to achieve that without a VM-wide spinlock.  My thought (from
v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep
it set until the SEAMCALL completes.
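
Very roughly, and purely as a sketch of the idea rather than the actual TDP MMU
plumbing (the helper name and locals here are illustrative), that would mean
freezing the entry, doing the SEAMCALL, and only then publishing the final
value:

	static int tdx_aug_frozen(u64 *sptep, u64 old_spte, u64 new_spte,
				  struct tdx_module_args *in)
	{
		u64 err;

		/* "Lock" the S-EPT entry; bail if another host task owns it. */
		if (!try_cmpxchg64(sptep, &old_spte, FROZEN_SPTE))
			return -EBUSY;

		err = tdx_seamcall_sept(TDH_MEM_PAGE_AUG, in);

		/* Unfreeze only once the TDX module has released its locks. */
		WRITE_ONCE(*sptep, err ? old_spte : new_spte);

		return err ? -EBUSY : 0;
	}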

> It is still kinda bad that guests can force the VMM to loop, but the VMM can
> always say enough is enough.  In other words, let's assume that a limit of
> 16 is probably appropriate but we can also increase the limit and crash the
> VM if things become ridiculous.
> 
> Something like this:
> 
> 	static u32 max = 16;
> 	int retry = 0;
> 	spin_lock(&kvm->arch.seamcall_lock);
> 	for (;;) {
> 		args_in = *in;
> 		ret = seamcall_ret(op, in);
> 		if (++retry == 1) {
> 			/* protected by the same seamcall_lock */
> 			kvm->stat.retried_seamcalls++;
> 		} else if (retry == READ_ONCE(max)) {
> 			pr_warn("Exceeded %d retries for S-EPT operation\n", max);
> 			if (KVM_BUG_ON(kvm, retry == 1024)) {
> 				pr_err("Crashing due to lock contention in the TDX module\n");
> 				break;
> 			}
> 			cmpxchg(&max, retry, retry * 2);
> 		}
> 	}
> 	spin_unlock(&kvm->arch.seamcall_lock);
> 
> This way we can do some testing and figure out a useful limit.

2 :-)

One try that guarantees no other host task is accessing the S-EPT entry, and a
second try after blasting IPI to kick vCPUs to ensure no guest-side task has
locked the S-EPT entry.

My concern with an arbitrary retry loop is that we'll essentially propagate the
TDX module issues to the broader kernel.  Each of those SEAMCALLs is slooow, so
retrying even ~20 times could exceed the system's tolerances for scheduling, RCU,
etc...

> For zero step detection, my reading is that it's TDH.VP.ENTER that fails;
> not any of the MEM seamcalls.  For that one to be resolved, it should be
> enough to do take and release the mmu_lock back to back, which ensures that
> all pending critical sections have completed (that is,
> "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then
> loop.  Adding a vCPU stat for that one is a good idea, too.

As above and in my discussion with Rick, I would prefer to kick vCPUs to force
forward progress, especially for the zero-step case.  If KVM gets to the point
where it has retried TDH.VP.ENTER on the same fault so many times that zero-step
kicks in, then it's time to kick and wait, not keep retrying blindly.

There is still risk of a hang, e.g. if a CPU fails to respond to the IPI, but
that's a possibility that always exists.  Kicking vCPUs allows KVM to know with
100% certainty that a SEAMCALL should succeed.

Hrm, the wrinkle is that if we want to guarantee success, the vCPU kick would
need to happen when the SPTE is frozen, to ensure some other host task doesn't
"steal" the lock.
Paolo Bonzini Sept. 10, 2024, 3:16 p.m. UTC | #13
On Tue, Sep 10, 2024 at 3:58 PM Sean Christopherson <seanjc@google.com> wrote:
> On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> No, because that defeats the purpose of having mmu_lock be a rwlock.

But if this part of the TDX module is wrapped in a single big
try_lock, there's no difference in spinning around busy seamcalls, or
doing spin_lock(&kvm->arch.seamcall_lock). All of them hit contention
in the same way.  With respect to FROZEN_SPTE...

> > This way we know that "busy" errors must come from the guest and have set
> > HOST_PRIORITY.
>
> We should be able to achieve that without a VM-wide spinlock.  My thought (from
> v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep
> it set until the SEAMCALL completes.

Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3,
which documents that the TDX module returns TDX_OPERAND_BUSY on a
CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough
to prevent contention in the TDX module.

If we want to be a bit more optimistic, let's do something more
sophisticated, like only take the lock after the first busy reply. But
the spinlock is the easiest way to completely remove host-induced
TDX_OPERAND_BUSY, and only have to deal with guest-induced ones.

> > It is still kinda bad that guests can force the VMM to loop, but the VMM can
> > always say enough is enough.  In other words, let's assume that a limit of
> > 16 is probably appropriate but we can also increase the limit and crash the
> > VM if things become ridiculous.
>
> 2 :-)
>
> One try that guarantees no other host task is accessing the S-EPT entry, and a
> second try after blasting IPI to kick vCPUs to ensure no guest-side task has
> locked the S-EPT entry.

Fair enough. Though in principle it is possible to race and have the
vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.
So I would make it 5 or so just to be safe.

> My concern with an arbitrary retry loop is that we'll essentially propagate the
> TDX module issues to the broader kernel.  Each of those SEAMCALLs is slooow, so
> retrying even ~20 times could exceed the system's tolerances for scheduling, RCU,
> etc...

How slow are the failed ones? The number of retries is essentially the
cost of successful seamcall / cost of busy seamcall.

If HOST_PRIORITY works, even a not-small-but-not-huge number of
retries would be better than the IPIs. IPIs are not cheap either.

> > For zero step detection, my reading is that it's TDH.VP.ENTER that fails;
> > not any of the MEM seamcalls.  For that one to be resolved, it should be
> > enough to do take and release the mmu_lock back to back, which ensures that
> > all pending critical sections have completed (that is,
> > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then
> > loop.  Adding a vCPU stat for that one is a good idea, too.
>
> As above and in my discussion with Rick, I would prefer to kick vCPUs to force
> forward progress, especially for the zero-step case.  If KVM gets to the point
> where it has retried TDH.VP.ENTER on the same fault so many times that zero-step
> kicks in, then it's time to kick and wait, not keep retrying blindly.

Wait, zero-step detection should _not_ affect TDH.MEM latency. Only
TDH.VP.ENTER is delayed. If it is delayed to the point of failing, we
can do write_lock/write_unlock() in the vCPU entry path.

My issue is that, even if we could make it a bit better by looking at
the TDX module source code, we don't have enough information to make a
good choice.  For now we should start with something _easy_, even if
it may not be the greatest.

Paolo
Sean Christopherson Sept. 10, 2024, 3:57 p.m. UTC | #14
On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> On Tue, Sep 10, 2024 at 3:58 PM Sean Christopherson <seanjc@google.com> wrote:
> > On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> > No, because that defeats the purpose of having mmu_lock be a rwlock.
> 
> But if this part of the TDX module is wrapped in a single big
> try_lock, there's no difference in spinning around busy seamcalls, or
> doing spin_lock(&kvm->arch.seamcall_lock). All of them hit contention
> in the same way.  With respect to FROZEN_SPTE...
>
> > > This way we know that "busy" errors must come from the guest and have set
> > > HOST_PRIORITY.
> >
> > We should be able to achieve that without a VM-wide spinlock.  My thought (from
> > v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep
> > it set until the SEAMCALL completes.
> 
> Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3,
> which documents that the TDX module returns TDX_OPERAND_BUSY on a
> CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough
> to prevent contention in the TDX module.

Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per-VM
lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read mode.

So for the operations that KVM can do in parallel, the locking should effectively
be per-entry.  Because KVM will never throw away an entire S-EPT root, zapping
SPTEs will need to be done while holding mmu_lock for write, i.e. KVM shouldn't
have problems with host tasks competing for the TDX module's VM-wide lock.
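
If that expectation holds, it could even be documented in the zapping path with
something like the below (illustrative only, not code from this series):

	/* Zapping S-EPT entries is expected to be fully serialized by mmu_lock. */
	lockdep_assert_held_write(&kvm->mmu_lock);

	ret = tdx_seamcall_sept(TDH_MEM_RANGE_BLOCK, &in);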

> If we want to be a bit more optimistic, let's do something more
> sophisticated, like only take the lock after the first busy reply. But
> the spinlock is the easiest way to completely remove host-induced
> TDX_OPERAND_BUSY, and only have to deal with guest-induced ones.

I am not convinced that's necessary or a good idea.  I worry that doing so would
just kick the can down the road, and potentially make the problems harder to solve,
e.g. because we'd have to worry about regressing existing setups.

> > > It is still kinda bad that guests can force the VMM to loop, but the VMM can
> > > always say enough is enough.  In other words, let's assume that a limit of
> > > 16 is probably appropriate but we can also increase the limit and crash the
> > > VM if things become ridiculous.
> >
> > 2 :-)
> >
> > One try that guarantees no other host task is accessing the S-EPT entry, and a
> > second try after blasting IPI to kick vCPUs to ensure no guest-side task has
> > locked the S-EPT entry.
> 
> Fair enough. Though in principle it is possible to race and have the
> vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.

My limit of '2' is predicated on the lock being a "host priority" lock, i.e. that
kicking vCPUs would ensure the lock has been dropped and can't be re-acquired by
the guest.

> So I would make it 5 or so just to be safe.
> 
> > My concern with an arbitrary retry loop is that we'll essentially propagate the
> > TDX module issues to the broader kernel.  Each of those SEAMCALLs is slooow, so
> > retrying even ~20 times could exceed the system's tolerances for scheduling, RCU,
> > etc...
> 
> How slow are the failed ones? The number of retries is essentially the
> cost of successful seamcall / cost of busy seamcall.

I haven't measured, but would be surprised if it's less than 2000 cycles.

> If HOST_PRIORITY works, even a not-small-but-not-huge number of
> retries would be better than the IPIs. IPIs are not cheap either.

Agreed, but we also need to account for the operations that are conflicting.
E.g. if KVM is trying to zap a S-EPT that the guest is accessing, then busy waiting
for the to-be-zapped S-EPT entry to be available doesn't make much sense.

> > > For zero step detection, my reading is that it's TDH.VP.ENTER that fails;
> > > not any of the MEM seamcalls.  For that one to be resolved, it should be
> > > enough to do take and release the mmu_lock back to back, which ensures that
> > > all pending critical sections have completed (that is,
> > > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then
> > > loop.  Adding a vCPU stat for that one is a good idea, too.
> >
> > As above and in my discussion with Rick, I would prefer to kick vCPUs to force
> > forward progress, especially for the zero-step case.  If KVM gets to the point
> > where it has retried TDH.VP.ENTER on the same fault so many times that zero-step
> > kicks in, then it's time to kick and wait, not keep retrying blindly.
> 
> Wait, zero-step detection should _not_ affect TDH.MEM latency. Only
> TDH.VP.ENTER is delayed.

Blocked, not delayed.  Yes, it's TDH.VP.ENTER that "fails", but to get past
TDH.VP.ENTER, KVM needs to resolve the underlying fault, i.e. needs to guarantee
forward progress for TDH.MEM (or whatever the operations are called).

Though I wonder, are there any operations guest/host operations that can conflict
if the vCPU is faulting?  Maybe this particular scenario is a complete non-issue.

> If it is delayed to the point of failing, we can do write_lock/write_unlock()
> in the vCPU entry path.

I was thinking that KVM could set a flag (another synthetic error code bit?) to
tell the page fault handler that it needs to kick vCPUs.  But as above, it might
be unnecessary.

> My issue is that, even if we could make it a bit better by looking at
> the TDX module source code, we don't have enough information to make a
> good choice.  For now we should start with something _easy_, even if
> it may not be the greatest.

I am not opposed to an easy/simple solution, but I am very much opposed to
implementing a retry loop without understanding _exactly_ when and why it's
needed.
Rick Edgecombe Sept. 10, 2024, 4:28 p.m. UTC | #15
On Tue, 2024-09-10 at 08:57 -0700, Sean Christopherson wrote:
> > Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3,
> > which documents that the TDX module returns TDX_OPERAND_BUSY on a
> > CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough
> > to prevent contention in the TDX module.
> 
> Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per-VM
> lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read
> mode.

AUG does take other locks as exclusive:
https://github.com/intel/tdx-module/blob/tdx_1.5/src/vmm_dispatcher/api_calls/tdh_mem_page_aug.c

I count 5 locks in total as well. I think trying to mirror the locking in KVM
will be an uphill battle.

> 
> So for the operations that KVM can do in parallel, the locking should
> effectively
> be per-entry.  Because KVM will never throw away an entire S-EPT root, zapping
> SPTEs will need to be done while holding mmu_lock for write, i.e. KVM
> shouldn't
> have problems with host tasks competing for the TDX module's VM-wide lock.
> 
> > If we want to be a bit more optimistic, let's do something more
> > sophisticated, like only take the lock after the first busy reply. But
> > the spinlock is the easiest way to completely remove host-induced
> > TDX_OPERAND_BUSY, and only have to deal with guest-induced ones.
> 
> I am not convinced that's necessary or a good idea.  I worry that doing so
> would
> just kick the can down the road, and potentially make the problems harder to
> solve,
> e.g. because we'd have to worry about regressing existing setups.
> 
> > > > It is still kinda bad that guests can force the VMM to loop, but the VMM
> > > > can
> > > > always say enough is enough.  In other words, let's assume that a limit
> > > > of
> > > > 16 is probably appropriate but we can also increase the limit and crash
> > > > the
> > > > VM if things become ridiculous.
> > > 
> > > 2 :-)
> > > 
> > > One try that guarantees no other host task is accessing the S-EPT entry,
> > > and a
> > > second try after blasting IPI to kick vCPUs to ensure no guest-side task
> > > has
> > > locked the S-EPT entry.
> > 
> > Fair enough. Though in principle it is possible to race and have the
> > vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.
> 
> My limit of '2' is predicated on the lock being a "host priority" lock, i.e.
> that
> kicking vCPUs would ensure the lock has been dropped and can't be re-acquired
> by
> the guest.

So kicking would be to try to break loose any deadlock we encountered? It sounds
like the kind of kludge that could be hard to remove.

> 
> > So I would make it 5 or so just to be safe.
> > 
> > > My concern with an arbitrary retry loop is that we'll essentially
> > > propagate the
> > > TDX module issues to the broader kernel.  Each of those SEAMCALLs is
> > > slooow, so
> > > retrying even ~20 times could exceed the system's tolerances for
> > > scheduling, RCU,
> > > etc...
> > 
> > How slow are the failed ones? The number of retries is essentially the
> > cost of successful seamcall / cost of busy seamcall.
> 
> I haven't measured, but would be surprised if it's less than 2000 cycles.
> 
> > If HOST_PRIORITY works, even a not-small-but-not-huge number of
> > retries would be better than the IPIs. IPIs are not cheap either.
> 
> Agreed, but we also need to account for the operations that are conflicting.
> E.g. if KVM is trying to zap a S-EPT that the guest is accessing, then busy
> waiting
> for the to-be-zapped S-EPT entry to be available doesn't make much sense.
> 
> > > > For zero step detection, my reading is that it's TDH.VP.ENTER that
> > > > fails;
> > > > not any of the MEM seamcalls.  For that one to be resolved, it should be
> > > > enough to do take and release the mmu_lock back to back, which ensures
> > > > that
> > > > all pending critical sections have completed (that is,
> > > > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then
> > > > loop.  Adding a vCPU stat for that one is a good idea, too.
> > > 
> > > As above and in my discussion with Rick, I would prefer to kick vCPUs to
> > > force
> > > forward progress, especially for the zero-step case.  If KVM gets to the
> > > point
> > > where it has retried TDH.VP.ENTER on the same fault so many times that
> > > zero-step
> > > kicks in, then it's time to kick and wait, not keep retrying blindly.
> > 
> > Wait, zero-step detection should _not_ affect TDH.MEM latency. Only
> > TDH.VP.ENTER is delayed.
> 
> Blocked, not delayed.  Yes, it's TDH.VP.ENTER that "fails", but to get past
> TDH.VP.ENTER, KVM needs to resolve the underlying fault, i.e. needs to
> guarantee
> forward progress for TDH.MEM (or whatever the operations are called).
> 
> Though I wonder, are there any operations guest/host operations that can
> conflict
> if the vCPU is faulting?  Maybe this particular scenario is a complete non-
> issue.
> 
> > If it is delayed to the point of failing, we can do
> > write_lock/write_unlock()
> > in the vCPU entry path.
> 
> I was thinking that KVM could set a flag (another synthetic error code bit?)
> to
> tell the page fault handler that it needs to kick vCPUs.  But as above, it
> might
> be unnecessary.
> 
> > My issue is that, even if we could make it a bit better by looking at
> > the TDX module source code, we don't have enough information to make a
> > good choice.  For now we should start with something _easy_, even if
> > it may not be the greatest.
> 
> I am not opposed to an easy/simple solution, but I am very much opposed to
> implementing a retry loop without understanding _exactly_ when and why it's
> needed.

I'd like to explore letting KVM do the retries (i.e. EPT fault loop) a bit more.
We can verify that we can survive zero-step in this case. After all, zero-step
doesn't kill the TD, just generates an EPT violation exit. So we would just need
to verify that the EPT violation getting generated would result in KVM
eventually fixing whatever zero-step is requiring.

Then we would have to handle BUSY in each SEAMCALL call chain, which currently
we don't. Like the zapping case. If we ended up needing a retry loop for limited
cases like that, at least it would be more limited.
Sean Christopherson Sept. 10, 2024, 5:42 p.m. UTC | #16
On Tue, Sep 10, 2024, Rick P Edgecombe wrote:
> On Tue, 2024-09-10 at 08:57 -0700, Sean Christopherson wrote:
> > > Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3,
> > > which documents that the TDX module returns TDX_OPERAND_BUSY on a
> > > CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough
> > > to prevent contention in the TDX module.
> > 
> > Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per-VM
> > lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read
> > mode.
> 
> AUG does take other locks as exclusive:
> https://github.com/intel/tdx-module/blob/tdx_1.5/src/vmm_dispatcher/api_calls/tdh_mem_page_aug.c

Only a lock on the underlying physical page.  guest_memfd should prevent mapping
the same HPA into multiple GPAs, and FROZEN_SPTE should prevent two vCPUs from
concurrently AUGing the same GPA+HPA.

> I count 5 locks in total as well. I think trying to mirror the locking in KVM
> will be an uphill battle.

I don't want to mirror the locking, I want to understand and document the
expectations and rules.  "Throw 16 noodles and hope one sticks" is not a recipe
for success.

> > So for the operations that KVM can do in parallel, the locking should
> > effectively be per-entry.  Because KVM will never throw away an entire
> > S-EPT root, zapping SPTEs will need to be done while holding mmu_lock for
> > write, i.e. KVM shouldn't have problems with host tasks competing for the
> > TDX module's VM-wide lock.
> > 
> > > If we want to be a bit more optimistic, let's do something more
> > > sophisticated, like only take the lock after the first busy reply. But
> > > the spinlock is the easiest way to completely remove host-induced
> > > TDX_OPERAND_BUSY, and only have to deal with guest-induced ones.
> > 
> > I am not convinced that's necessary or a good idea.  I worry that doing so
> > would just kick the can down the road, and potentially make the problems
> > harder to solve, e.g. because we'd have to worry about regressing existing
> > setups.
> > 
> > > > It is still kinda bad that guests can force the VMM to loop, but the
> > > > VMM can always say enough is enough.  In other words, let's assume
> > > > that a limit of 16 is probably appropriate but we can also increase
> > > > the limit and crash the VM if things become ridiculous.
> > > > 
> > > > 2 :-)
> > > > 
> > > > One try that guarantees no other host task is accessing the S-EPT
> > > > entry, and a second try after blasting IPI to kick vCPUs to ensure no
> > > > guest-side task has locked the S-EPT entry.
> > > 
> > > Fair enough. Though in principle it is possible to race and have the
> > > vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.
> > 
> > My limit of '2' is predicated on the lock being a "host priority" lock,
> > i.e.  that kicking vCPUs would ensure the lock has been dropped and can't
> > be re-acquired by the guest.
> 
> So kicking would be to try to break loose any deadlock we encountered? It sounds
> like the kind of kludge that could be hard to remove.

No, the intent of the kick would be to wait for vCPUs to exit, which in turn
guarantees that any locks held by vCPUs have been dropped.  Again, this idea is
predicated on the lock being "host priority", i.e. that vCPUs can't re-take the
lock before KVM.
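
Roughly the shape I have in mind, purely as an illustration (the helper, its
callback signature, and the status-class mask below are all made up for this
mail, not code from this series):

#define TDX_STATUS_CLASS_MASK	0xffffffff00000000ULL

/* Assumes the BUSY class is reported in the upper status bits. */
static bool tdx_status_is_busy(u64 err)
{
	return (err & TDX_STATUS_CLASS_MASK) == TDX_OPERAND_BUSY;
}

/* Illustrative only: at most two attempts, with a kick in between. */
static u64 tdx_seamcall_kick_and_retry(struct kvm *kvm,
				       u64 (*seamcall)(void *arg), void *arg)
{
	u64 err = seamcall(arg);

	if (!tdx_status_is_busy(err))
		return err;

	/*
	 * IPI all vCPUs and wait until they are outside the guest.  Because
	 * the lock is "host priority", the guest can't re-take it before the
	 * retry, so the second attempt is expected to succeed.
	 */
	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);

	return seamcall(arg);
}

Callers would only wrap flows where the lock really is host priority,
otherwise the "one kick is enough" assumption falls apart.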
Kai Huang Sept. 11, 2024, 1:17 a.m. UTC | #17
>> Host-Side (SEAMCALL) Operation
>> ------------------------------
>> The host VMM is expected to retry host-side operations that fail with a
>> TDX_OPERAND_BUSY status. The host priority mechanism helps guarantee that at
>> most after a limited time (the longest guest-side TDX module flow) there will be
>> no contention with a guest TD attempting to acquire access to the same resource.
>>
>> Lock operations process the HOST_PRIORITY bit as follows:
>>     - A SEAMCALL (host-side) function that fails to acquire a lock sets the lock’s
>>     HOST_PRIORITY bit and returns a TDX_OPERAND_BUSY status to the host VMM. It is
>>     the host VMM’s responsibility to re-attempt the SEAMCALL function until it
>>     succeeds; otherwise, the HOST_PRIORITY bit remains set, preventing the guest TD
>>     from acquiring the lock.
>>     - A SEAMCALL (host-side) function that succeeds to acquire a lock clears the
>>     lock’s HOST_PRIORITY bit.
> 
> *sigh*
> 
>> Guest-Side (TDCALL) Operation
>> -----------------------------
>> A TDCALL (guest-side) function that attempts to acquire a lock fails if
>> HOST_PRIORITY is set to 1; a TDX_OPERAND_BUSY status is returned to the guest.
>> The guest is expected to retry the operation.
>>
>> Guest-side TDCALL flows that acquire a host priority lock have an upper bound on
>> the host-side latency for that lock; once a lock is acquired, the flow either
>> releases it within a fixed upper time bound, or periodically monitors the
>> HOST_PRIORITY flag to see if the host is attempting to acquire the lock.
>> "
>>
>> So KVM can't fully prevent TDX_OPERAND_BUSY with KVM-side locks, because the
>> locking also has to sort out contention with the guest. We need to double
>> check this, but I *think* this HOST_PRIORITY bit doesn't come into play for the
>> functionality we need to exercise for base support.
>>
>> The thing that makes me nervous about a retry-based solution is the potential
>> for some kind of deadlock-like pattern. Just to gather your opinion, if there
>> was some SEAMCALL contention that couldn't be locked around from KVM, but came
>> with some strong, well-described guarantees, would a retry loop still be a
>> hard NAK?
> 
> I don't know.  It would depend on what operations can hit BUSY, and what the
> alternatives are.  E.g. if we can narrow down the retry paths to a few select
> cases where it's (a) expected, (b) unavoidable, and (c) has minimal risk of
> deadlock, then maybe that's the least awful option.
> 
> What I don't think KVM should do is blindly retry N number of times, because
> then there are effectively no rules whatsoever.  E.g. if KVM is tearing down a
> VM then KVM should assert on immediate success.  And if KVM is handling a fault
> on behalf of a vCPU, then KVM can and should resume the guest and let it retry.
> Ugh, but that would likely trigger the annoying "zero-step mitigation" crap.
> 
> What does this actually mean in practice?  What's the threshold, 

FWIW, the limit in the public TDX module code is 6:

   #define STEPPING_EPF_THRESHOLD 6   // Threshold of confidence in detecting
                                      // EPT fault-based stepping in progress

We might be able to change it to a larger value, but we would need to
understand why that is necessary.

> is the VM-Enter
> error uniquely identifiable, 

When zero-step mitigation is active in the module, TDH.VP.ENTER tries to 
grab the SEPT lock thus it can fail with SEPT BUSY error.  But if it 
does grab the lock successfully, it exits to VMM with EPT violation on 
that GPA immediately.

In other words, TDH.VP.ENTER returning SEPT BUSY means "zero-step 
mitigation" must have been active.  A normal EPT violation _COULD_ mean 
mitigation is already active, but AFAICT we don't have a way to tell 
that in the EPT violation.

> and can KVM rely on HOST_PRIORITY to be set if KVM
> runs afoul of the zero-step mitigation?

I think HOST_PRIORITY is always set if SEPT SEAMCALLs fails with BUSY.

> 
>    After a pre-determined number of such EPT violations occur on the same instruction,
>    the TDX module starts tracking the GPAs that caused Secure EPT faults and fails
>    further host VMM attempts to enter the TD VCPU unless previously faulting private
>    GPAs are properly mapped in the Secure EPT.
> 
> If HOST_PRIORITY is set, then one idea would be to resume the guest if there's
> SEPT contention on a fault, and then _if_ the zero-step mitigation is triggered,
> kick all vCPUs (via IPI) to ensure that the contended SEPT entry is unlocked and
> can't be re-locked by the guest.  That would allow KVM to guarantee forward
> progress without an arbitrary retry loop in the TDP MMU.

I think this should work.

It doesn't seem we can tell whether the zero-step mitigation is active in an
EPT violation TDEXIT, or when a SEPT SEAMCALL fails with SEPT BUSY.  But when
any SEPT SEAMCALL fails with SEPT BUSY, if we just kick all vCPUs and make
them wait until the next retry is done (which must succeed, otherwise it is
an illegal error), then this should handle both contention from the guest and
the zero-step mitigation.

> 
> Similarly, if KVM needs to zap a SPTE and hits BUSY, kick all vCPUs to ensure the
> one and only retry is guaranteed to succeed.

Yeah seems so.
Rick Edgecombe Sept. 11, 2024, 2:48 a.m. UTC | #18
On Wed, 2024-09-11 at 13:17 +1200, Huang, Kai wrote:
> > is the VM-Enter
> > error uniquely identifiable, 
> 
> When zero-step mitigation is active in the module, TDH.VP.ENTER tries to 
> grab the SEPT lock thus it can fail with SEPT BUSY error.  But if it 
> does grab the lock successfully, it exits to VMM with EPT violation on 
> that GPA immediately.
> 
> In other words, TDH.VP.ENTER returning SEPT BUSY means "zero-step 
> mitigation" must have been active.  

I think this isn't true. A sept locking related busy, maybe. But there are other
things going on that return BUSY.

> A normal EPT violation _COULD_ mean 
> mitigation is already active, but AFAICT we don't have a way to tell 
> that in the EPT violation.
> 
> > and can KVM rely on HOST_PRIORITY to be set if KVM
> > runs afoul of the zero-step mitigation?
> 
> I think HOST_PRIORITY is always set if SEPT SEAMCALLs fails with BUSY.

What led you to think this? It seemed more limited to me.
Kai Huang Sept. 11, 2024, 10:55 p.m. UTC | #19
On 11/09/2024 2:48 pm, Edgecombe, Rick P wrote:
> On Wed, 2024-09-11 at 13:17 +1200, Huang, Kai wrote:
>>> is the VM-Enter
>>> error uniquely identifiable,
>>
>> When zero-step mitigation is active in the module, TDH.VP.ENTER tries to
>> grab the SEPT lock thus it can fail with SEPT BUSY error.  But if it
>> does grab the lock successfully, it exits to VMM with EPT violation on
>> that GPA immediately.
>>
>> In other words, TDH.VP.ENTER returning SEPT BUSY means "zero-step
>> mitigation" must have been active.
> 
> I think this isn't true. A sept locking related busy, maybe. But there are other
> things going on that return BUSY.

I thought we were talking about SEPT locking here.  For BUSY in general,
yeah, it tries to grab other locks too (e.g., the shared lock of
TDR/TDCS/TDVPS etc.), but those can't be contended in the current KVM TDX
implementation, I suppose?  Perhaps we need to look more closely to make
sure.

> 
>> A normal EPT violation _COULD_ mean
>> mitigation is already active, but AFAICT we don't have a way to tell
>> that in the EPT violation.
>>
>>> and can KVM rely on HOST_PRIORITY to be set if KVM
>>> runs afoul of the zero-step mitigation?
>>
>> I think HOST_PRIORITY is always set if SEPT SEAMCALLs fails with BUSY.
> 
> What led you to think this? It seemed more limited to me.

I interpreted it from the spec (chapter 18.1.4 Concurrency Restrictions with
Host Priority).  But looking at the module's public code, it seems
HOST_PRIORITY is only set on a failed host attempt when the lock can also be
contended from the guest (see acquire_sharex_lock_hp_ex() and
acquire_sharex_lock_hp_sh()), which makes sense anyway.
Yan Zhao Sept. 13, 2024, 8:36 a.m. UTC | #20
This is a lock status report of TDX module for current SEAMCALL retry issue
based on code in TDX module public repo https://github.com/intel/tdx-module.git
branch TDX_1.5.05.

TL;DR:
- tdh_mem_track() can contend with tdh_vp_enter().
- tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
- tdg_mem_page_accept() can contend with other tdh_mem*().

Proposal:
- Return -EAGAIN directly in ops link_external_spt/set_external_spte when
  tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.
- Kick off vCPUs at the beginning of page removal path, i.e. before the
  tdh_mem_range_block().
  Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.
  (one possible optimization:
   since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare,
   do not kick off vCPUs in normal conditions.
   When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow
   TD enter until page removal completes.)

Below is the detailed analysis:

=== Background ===
In TDX module, there are 4 kinds of locks:
1. sharex_lock:
   Normal read/write lock. (no host priority stuff)

2. sharex_hp_lock:
   Just like normal read/write lock, except that host can set host priority bit
   on failure.
   when guest tries to acquire the lock and sees host priority bit set, it will
   return "busy host priority" directly, letting host win.
   After host acquires the lock successfully, host priority bit is cleared.

3. sept entry lock:
   Lock utilizing software bits in SEPT entry.
   HP bit (Host priority): bit 52
   EL bit (Entry lock): bit 11, used as a bit lock.
   (Both bits are also spelled out as C constants in the sketch after this list.)

   - host sets HP bit when host fails to acquire EL bit lock;
   - host resets HP bit when host wins.
   - guest returns "busy host priority" if HP bit is found set when guest tries
     to acquire EL bit lock.

4. mutex lock:
   Lock with only 2 states: free, lock.
   (not the same as linux mutex, not re-scheduled, could pause() for debugging).
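
For reference, the two S-EPT entry software bits from (3) above written as C
constants (the macro names are invented here; only the bit positions come from
the description above):

/* S-EPT entry software bits used by the "sept entry lock" (names invented). */
#define SEPT_ENTRY_HOST_PRIORITY_BIT	BIT_ULL(52)	/* HP: host priority  */
#define SEPT_ENTRY_ENTRY_LOCK_BIT	BIT_ULL(11)	/* EL: entry bit lock */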

===Resources & users list===

Resources              SHARED  users              EXCLUSIVE users
------------------------------------------------------------------------
(1) TDR                tdh_mng_rdwr               tdh_mng_create
                       tdh_vp_create              tdh_mng_add_cx
                       tdh_vp_addcx               tdh_mng_init
		       tdh_vp_init                tdh_mng_vpflushdone
                       tdh_vp_enter               tdh_mng_key_config 
                       tdh_vp_flush               tdh_mng_key_freeid
                       tdh_vp_rd_wr               tdh_mr_extend
                       tdh_mem_sept_add           tdh_mr_finalize
                       tdh_mem_sept_remove        tdh_vp_init_apicid
                       tdh_mem_page_aug           tdh_mem_page_add
                       tdh_mem_page_remove
                       tdh_mem_range_block
                       tdh_mem_track
                       tdh_mem_range_unblock
                       tdh_phymem_page_reclaim
------------------------------------------------------------------------
(2) KOT                tdh_phymem_cache_wb        tdh_mng_create
                                                  tdh_mng_vpflushdone
                                                  tdh_mng_key_freeid
------------------------------------------------------------------------
(3) TDCS               tdh_mng_rdwr
                       tdh_vp_create
                       tdh_vp_addcx
                       tdh_vp_init
                       tdh_vp_init_apicid
                       tdh_vp_enter 
                       tdh_vp_rd_wr
                       tdh_mem_sept_add
                       tdh_mem_sept_remove
                       tdh_mem_page_aug
                       tdh_mem_page_remove
                       tdh_mem_range_block
                       tdh_mem_track
                       tdh_mem_range_unblock
------------------------------------------------------------------------
(4) TDVPR              tdh_vp_rd_wr                tdh_vp_create
                                                   tdh_vp_addcx
                                                   tdh_vp_init
                                                   tdh_vp_init_apicid
                                                   tdh_vp_enter 
                                                   tdh_vp_flush 
------------------------------------------------------------------------
(5) TDCS epoch         tdh_vp_enter                tdh_mem_track
------------------------------------------------------------------------
(6) secure_ept_lock    tdh_mem_sept_add            tdh_vp_enter
                       tdh_mem_page_aug            tdh_mem_sept_remove
                       tdh_mem_page_remove         tdh_mem_range_block
                                                   tdh_mem_range_unblock
------------------------------------------------------------------------
(7) SEPT entry                                     tdh_mem_sept_add
                                                   tdh_mem_sept_remove
                                                   tdh_mem_page_aug
                                                   tdh_mem_page_remove
                                                   tdh_mem_range_block
                                                   tdh_mem_range_unblock
                                                   tdg_mem_page_accept

Current KVM interested SEAMCALLs:
------------------------------------------------------------------------
  SEAMCALL                Lock Name        Lock Type        Resource       
tdh_mng_create          sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_lock        EXCLUSIVE        KOT

tdh_mng_add_cx          sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_hp_lock     EXCLUSIVE        page to add

tdh_mng_init            sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_hp_lock     NO_LOCK          TDCS

tdh_mng_vpflushdone     sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_lock        EXCLUSIVE        KOT

tdh_mng_key_config      sharex_hp_lock     EXCLUSIVE        TDR

tdh_mng_key_freeid      sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_lock        EXCLUSIVE        KOT

tdh_mng_rdwr            sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS

tdh_mr_extend           sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_hp_lock     NO_LOCK          TDCS

tdh_mr_finalize         sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_hp_lock     NO_LOCK          TDCS

tdh_vp_create           sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_hp_lock     EXCLUSIVE        TDVPR

tdh_vp_addcx            sharex_hp_lock     EXCLUSIVE        TDVPR
                        sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_hp_lock     EXCLUSIVE        page to add

tdh_vp_init             sharex_hp_lock     EXCLUSIVE        TDVPR
                        sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS

tdh_vp_init_apicid      sharex_hp_lock     EXCLUSIVE        TDVPR
                        sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_hp_lock     SHARED           TDCS

tdh_vp_enter(*)         sharex_hp_lock     EXCLUSIVE        TDVPR
                        sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        SHARED           TDCS epoch_lock
                        sharex_lock        EXCLUSIVE        TDCS secure_ept_lock

tdh_vp_flush            sharex_hp_lock     EXCLUSIVE        TDVPR
                        sharex_hp_lock     SHARED           TDR

tdh_vp_rd_wr            sharex_hp_lock     SHARED           TDVPR
                        sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS

tdh_mem_sept_add        sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        SHARED           TDCS secure_ept_lock
                        sept entry lock    HOST,EXCLUSIVE   SEPT entry to modify
                        sharex_hp_lock     EXCLUSIVE        page to add

tdh_mem_sept_remove     sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        EXCLUSIVE        TDCS secure_ept_lock
                        sept entry lock    HOST,EXCLUSIVE   SEPT entry to modify
                        sharex_hp_lock     EXCLUSIVE        page to remove 

tdh_mem_page_add        sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_hp_lock     NO_LOCK          TDCS
                        sharex_lock        NO_LOCK          TDCS secure_ept_lock
                        sharex_hp_lock     EXCLUSIVE        page to add

tdh_mem_page_aug        sharex_hp_lock     SHARED           TDR 
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        SHARED           TDCS secure_ept_lock 
                        sept entry lock    HOST,EXCLUSIVE   SEPT entry to modify
                        sharex_hp_lock     EXCLUSIVE        page to aug

tdh_mem_page_remove     sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        SHARED           TDCS secure_ept_lock
                        sept entry lock    HOST,EXCLUSIVE   SEPT entry to modify
                        sharex_hp_lock     EXCLUSIVE        page to remove

tdh_mem_range_block     sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        EXCLUSIVE        TDCS secure_ept_lock
                        sept entry lock    HOST,EXCLUSIVE   SEPT entry to modify

tdh_mem_track           sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        EXCLUSIVE        TDCS epoch_lock

tdh_mem_range_unblock   sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        EXCLUSIVE        TDCS secure_ept_lock
                        sept entry lock    HOST,EXCLUSIVE   SEPT entry to modify

tdh_phymem_page_reclaim sharex_hp_lock     EXCLUSIVE        page to reclaim
                        sharex_hp_lock     SHARED           TDR

tdh_phymem_cache_wb     mutex_lock                       per package wbt_entries 
                        sharex_lock        SHARED           KOT

tdh_phymem_page_wbinvd  sharex_hp_lock     SHARED           page to be wbinvd


Current KVM interested TDCALLs:
------------------------------------------------------------------------
tdg_mem_page_accept     sept entry lock    GUEST            SEPT entry to modify

TDCALLs like tdg_mr_rtmr_extend(), tdg_servtd_rd_wr(), tdg_mem_page_attr_wr()
tdg_mem_page_attr_rd() are not included.

*:(a) tdh_vp_enter() holds shared TDR lock and exclusive TDVPR lock, the two
      locks are released when exiting to VMM.
  (b) tdh_vp_enter() holds shared TDCS lock and shared TDCS epoch_lock lock,
      releases them before entering non-root mode.
  (c) tdh_vp_enter() holds shared epoch lock, contending with tdh_mem_track(). 
  (d) tdh_vp_enter() only holds EXCLUSIVE secure_ept_lock when 0-stepping is
      suspected, i.e. when last_epf_gpa_list is not empty.
      When an EPT violation happens, the TDX module checks if the guest RIP
      equals the guest RIP of the last TD entry. Only when this is true for 6
      consecutive times is the GPA recorded in last_epf_gpa_list. The list is
      reset once the guest RIP of an EPT violation differs from the RIP of the
      last TD entry.


=== Summary ===
For the 8 kinds of common resources protected in TDX module:

(1) TDR:
    There are only shared accesses to TDR during runtime (i.e. after TD is
    finalized and before TD tearing down), if we don't need to support calling
    tdh_vp_init_apicid() at runtime (e.g. for vCPU hotplug).
    tdh_vp_enter() holds shared TDR lock until exiting to VMM.
    TDCALLs do not acquire the TDR lock.

(2) KOT (Key Ownership Table)
    Current KVM code should have avoided contention to this resource.

(3) TDCS:
    Accessed shared, or accessed with no lock while TDR is exclusively locked.
    SEAMCALLs at runtime (after the TD is finalized and before TD teardown) do
    not contend with each other on TDCS.
    tdh_vp_enter() holds shared TDCS lock and releases it before entering
    non-root mode.
    Current TDCALLs for basic TDX do not acquire this lock.

(4) TDVPR:
    Per-vCPU, accessed exclusively except for tdh_vp_rd_wr().
    tdh_vp_enter() holds exclusive TDVPR lock until exiting to VMM.
    TDCALLs do not acquire the TDVPR lock.

(5) TDCS epoch:
    tdh_mem_track() requests exclusive access, and tdh_vp_enter() requests
    shared access.
    tdh_mem_track() can contend with tdh_vp_enter().

(6) SEPT tree:
    Protected by secure_ept_lock (sharex_lock).
    tdh_mem_sept_add()/tdh_mem_page_aug()/tdh_mem_page_remove() hold the shared
    lock; tdh_mem_sept_remove()/tdh_mem_range_block()/tdh_mem_range_unblock()
    hold the exclusive lock.
    tdh_vp_enter() requests exclusive access when 0-stepping is suspected,
    contending with all other tdh_mem*().
    Guest does not acquire this lock.

    So,
    kvm mmu_lock has prevented contentions between
    tdh_mem_sept_add()/tdh_mem_page_aug()/tdh_mem_page_remove() and
    tdh_mem_sept_remove()/tdh_mem_range_block().
    Though tdh_mem_sept_add()/tdh_mem_page_aug() races with tdh_vp_enter(),
    returning -EAGAIN directly is fine for them.
    The remaining issue is the contention between tdh_vp_enter() and
    tdh_mem_page_remove()/tdh_mem_sept_remove()/tdh_mem_range_block().

(7) SEPT entry:
    All exclusive access.
    tdg_mem_page_accept() may contend with other tdh_mem*() on a specific SEPT
    entry.

(8) PAMT entry for target pages (e.g. page to add/aug/remove/reclaim/wbinvd):
    Though they are all exclusively locked, no race should occur as long as
    the pages belong to different PAMT entries.

Conclusion:
Current KVM code should have avoided contention on resources (1)-(4),(8), while
contention on (5),(6),(7) is still possible.
- tdh_mem_track() can contend with tdh_vp_enter() for (5)
- tdh_vp_enter() contends with tdh_mem*() for (6) when 0-stepping is suspected.
- tdg_mem_page_accept() can contend with other tdh_mem*() for (7).
Sean Christopherson Sept. 13, 2024, 5:23 p.m. UTC | #21
On Fri, Sep 13, 2024, Yan Zhao wrote:
> This is a lock status report of TDX module for current SEAMCALL retry issue
> based on code in TDX module public repo https://github.com/intel/tdx-module.git
> branch TDX_1.5.05.
> 
> TL;DR:
> - tdh_mem_track() can contend with tdh_vp_enter().
> - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.

The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
whatever reason.

Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
hits the fault?

For non-TDX, resuming the guest and letting the vCPU retry the instruction is
desirable because in many cases, the winning task will install a valid mapping
before KVM can re-run the vCPU, i.e. the fault will be fixed before the
instruction is re-executed.  In the happy case, that provides optimal performance
as KVM doesn't introduce any extra delay/latency.

But for TDX, the math is different as the cost of a re-hitting a fault is much,
much higher, especially in light of the zero-step issues.

E.g. if the TDP MMU returns a unique error code for the frozen case, and
kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
then the TDX EPT violation path can safely retry locally, similar to the do-while
loop in kvm_tdp_map_page().

The only part I don't like about this idea is having two "retry" return values,
which creates the potential for bugs due to checking one but not the other.

Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
option better even though the out-param is a bit gross, because it makes it more
obvious that the "frozen_spte" is a special case that doesn't need attention for
most paths.

> - tdg_mem_page_accept() can contend with other tdh_mem*().
> 
> Proposal:
> - Return -EAGAIN directly in ops link_external_spt/set_external_spte when
>   tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.

What is the result of returning -EAGAIN?  E.g. does KVM redo tdh_vp_enter()?

Also tdh_mem_sept_add() is strictly pre-finalize, correct?  I.e. should never
contend with tdg_mem_page_accept() because vCPUs can't yet be run.

Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
The page isn't yet mapped, so why would the guest be allowed to take a lock on
the S-EPT entry?

> - Kick off vCPUs at the beginning of page removal path, i.e. before the
>   tdh_mem_range_block().
>   Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.

This is easy enough to do via a request, e.g. see KVM_REQ_MCLOCK_INPROGRESS.
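
A rough sketch of that shape (the request name/number and the helpers are
made up for illustration; KVM_REQ_MCLOCK_INPROGRESS is the existing example
of the pattern):

/* Purely illustrative; '33' is an arbitrary arch request number. */
#define KVM_REQ_TDX_SEPT_REMOVAL \
	KVM_ARCH_REQ_FLAGS(33, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)

static void tdx_block_vcpu_entry(struct kvm *kvm)
{
	/*
	 * IPI all vCPUs and wait for them to be outside the guest; the
	 * still-pending request then keeps them from re-entering.
	 */
	kvm_make_all_cpus_request(kvm, KVM_REQ_TDX_SEPT_REMOVAL);
}

static void tdx_unblock_vcpu_entry(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	unsigned long i;

	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_clear_request(KVM_REQ_TDX_SEPT_REMOVAL, vcpu);
}

i.e. call tdx_block_vcpu_entry() before tdh_mem_range_block(), and
tdx_unblock_vcpu_entry() after tdh_mem_page_remove() completes.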

>   (one possible optimization:
>    since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare,
>    do not kick off vCPUs in normal conditions.
>    When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow

Which SEAMCALL is this specifically?  tdh_mem_range_block()?

>    TD enter until page removal completes.)


Idea #1:
---
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b45258285c9c..8113c17bd2f6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4719,7 +4719,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
                        return -EINTR;
                cond_resched();
                r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
-       } while (r == RET_PF_RETRY);
+       } while (r == RET_PF_RETRY || r == RET_PF_RETRY_FROZEN);
 
        if (r < 0)
                return r;
@@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
                vcpu->stat.pf_spurious++;
 
        if (r != RET_PF_EMULATE)
-               return 1;
+               return r;
 
 emulate:
        return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 8d3fb3c8c213..690f03d7daae 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -256,12 +256,15 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * and of course kvm_mmu_do_page_fault().
  *
  * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
+ * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_RETRY: let CPU fault again on the address.
+ * RET_PF_RETRY_FROZEN: One or more SPTEs related to the address is frozen.
+ *                     Let the CPU fault again on the address, or retry the
+ *                     fault "locally", i.e. without re-entering the guest.
  * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
  * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
  *                         gfn and retry, or emulate the instruction directly.
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
- * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
  *
  * Any names added to this enum should be exported to userspace for use in
@@ -271,14 +274,18 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
  * will allow for efficient machine code when checking for CONTINUE, e.g.
  * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
+ *
+ * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
+ * scheme, where 1==success, translates '1' to RET_PF_FIXED.
  */
 enum {
        RET_PF_CONTINUE = 0,
+       RET_PF_FIXED    = 1,
        RET_PF_RETRY,
+       RET_PF_RETRY_FROZEN,
        RET_PF_EMULATE,
        RET_PF_WRITE_PROTECTED,
        RET_PF_INVALID,
-       RET_PF_FIXED,
        RET_PF_SPURIOUS,
 };
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5a475a6456d4..cbf9e46203f3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 retry:
        rcu_read_unlock();
+       if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte))
+               return RET_PF_RETRY_FROZEN;
        return ret;
 }
 
---


Idea #2:
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 12 ++++++------
 arch/x86/kvm/mmu/mmu_internal.h | 15 ++++++++++++---
 arch/x86/kvm/mmu/tdp_mmu.c      |  1 +
 arch/x86/kvm/svm/svm.c          |  2 +-
 arch/x86/kvm/vmx/vmx.c          |  4 ++--
 6 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 46e0a466d7fb..200fecd1de88 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2183,7 +2183,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
-		       void *insn, int insn_len);
+		       void *insn, int insn_len, bool *frozen_spte);
 void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
 void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
 void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b45258285c9c..207840a316d3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4283,7 +4283,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 		return;
 
 	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
-				  true, NULL, NULL);
+				  true, NULL, NULL, NULL);
 
 	/*
 	 * Account fixed page faults, otherwise they'll never be counted, but
@@ -4627,7 +4627,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 		trace_kvm_page_fault(vcpu, fault_address, error_code);
 
 		r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
-				insn_len);
+				       insn_len, NULL);
 	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
 		vcpu->arch.apf.host_apf_flags = 0;
 		local_irq_disable();
@@ -4718,7 +4718,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
 		if (signal_pending(current))
 			return -EINTR;
 		cond_resched();
-		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
+		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level, NULL);
 	} while (r == RET_PF_RETRY);
 
 	if (r < 0)
@@ -6073,7 +6073,7 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 }
 
 int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
-		       void *insn, int insn_len)
+				void *insn, int insn_len, bool *frozen_spte)
 {
 	int r, emulation_type = EMULTYPE_PF;
 	bool direct = vcpu->arch.mmu->root_role.direct;
@@ -6109,7 +6109,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 		vcpu->stat.pf_taken++;
 
 		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
-					  &emulation_type, NULL);
+					  &emulation_type, NULL, frozen_spte);
 		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
 			return -EIO;
 	}
@@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 		vcpu->stat.pf_spurious++;
 
 	if (r != RET_PF_EMULATE)
-		return 1;
+		return r;
 
 emulate:
 	return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 8d3fb3c8c213..5b1fc77695c1 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -247,6 +247,9 @@ struct kvm_page_fault {
 	 * is changing its own translation in the guest page tables.
 	 */
 	bool write_fault_to_shadow_pgtable;
+
+	/* Indicates the page fault needs to be retried due to a frozen SPTE. */
+	bool frozen_spte;
 };
 
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
@@ -256,12 +259,12 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * and of course kvm_mmu_do_page_fault().
  *
  * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
+ * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_RETRY: let CPU fault again on the address.
  * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
  * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
  *                         gfn and retry, or emulate the instruction directly.
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
- * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
  *
  * Any names added to this enum should be exported to userspace for use in
@@ -271,14 +274,17 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
  * will allow for efficient machine code when checking for CONTINUE, e.g.
  * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
+ *
+ * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
+ * scheme, where 1==success, translates '1' to RET_PF_FIXED.
  */
 enum {
 	RET_PF_CONTINUE = 0,
+	RET_PF_FIXED    = 1,
 	RET_PF_RETRY,
 	RET_PF_EMULATE,
 	RET_PF_WRITE_PROTECTED,
 	RET_PF_INVALID,
-	RET_PF_FIXED,
 	RET_PF_SPURIOUS,
 };
 
@@ -292,7 +298,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 					u64 err, bool prefetch,
-					int *emulation_type, u8 *level)
+					int *emulation_type, u8 *level,
+					bool *frozen_spte)
 {
 	struct kvm_page_fault fault = {
 		.addr = cr2_or_gpa,
@@ -341,6 +348,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
 	if (level)
 		*level = fault.goal_level;
+	if (frozen_spte)
+		*frozen_spte = fault.frozen_spte;
 
 	return r;
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5a475a6456d4..e7fc5ea4b437 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1174,6 +1174,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 retry:
 	rcu_read_unlock();
+	fault->frozen_spte = is_frozen_spte(iter.old_spte);
 	return ret;
 }
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 38723b0c435d..269de6a9eb13 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2075,7 +2075,7 @@ static int npf_interception(struct kvm_vcpu *vcpu)
 	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
 				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
 				svm->vmcb->control.insn_bytes : NULL,
-				svm->vmcb->control.insn_len);
+				svm->vmcb->control.insn_len, NULL);
 
 	if (rc > 0 && error_code & PFERR_GUEST_RMP_MASK)
 		sev_handle_rmp_fault(vcpu, fault_address, error_code);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 368acfebd476..fc2ff5d91a71 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5822,7 +5822,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
 		return kvm_emulate_instruction(vcpu, 0);
 
-	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0, NULL);
 }
 
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
@@ -5843,7 +5843,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
 		return kvm_skip_emulated_instruction(vcpu);
 	}
 
-	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
+	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0, NULL);
 }
 
 static int handle_nmi_window(struct kvm_vcpu *vcpu)

base-commit: bc87a2b4b5508d247ed2c30cd2829969d168adfe
--
Rick Edgecombe Sept. 13, 2024, 7:19 p.m. UTC | #22
On Fri, 2024-09-13 at 16:36 +0800, Yan Zhao wrote:
> 
Thanks Yan, this is great!

> This is a lock status report of TDX module for current SEAMCALL retry issue
> based on code in TDX module public repo
> https://github.com/intel/tdx-module.git
> branch TDX_1.5.05.
> 
> TL;DR:
> - tdh_mem_track() can contend with tdh_vp_enter().
> - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> - tdg_mem_page_accept() can contend with other tdh_mem*().
> 
> Proposal:
> - Return -EAGAIN directly in ops link_external_spt/set_external_spte when
>   tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.

Regarding the sept entry contention with the guest, I think KVM might not be
guaranteed to retry the same path and clear the sept entry host priority bit.
What if the first failure exited to userspace because of a pending signal or
something? Then the vcpu could reenter the guest, handle an NMI and go off in
another direction, never to trigger the EPT violation again. This would leave
the SEPT entry's HOST_PRIORITY bit set, locking the guest out of it.

That is a convoluted scenario that could probably be considered a buggy guest,
but what I am sort of pondering is that a retry solution that loops outside the
fault handler guts will have more complex failure modes around the host priority
bit. The N local retries solution really is a brown paper bag design, but the
more proper looking solution actually has two downsides compared to it:
1. It is based on locking behavior that is not in the spec (yes we can work with
TDX module folks to keep it workable)
2. Failure modes get complex

I think I'm still onboard. Just trying to stress the design a bit.

(BTW it looks like the Linux guest doesn't actually retry ACCEPT on host
priority busy, so it won't spin on it anyway. Probably any contention here
would be a buggy guest, for Linux TDs at least.)

> - Kick off vCPUs at the beginning of page removal path, i.e. before the
>   tdh_mem_range_block().
>   Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.
>   (one possible optimization:
>    since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare,
>    do not kick off vCPUs in normal conditions.
>    When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow
>    TD enter until page removal completes.)
> 
> Below are detailed analysis:
> 
> === Background ===
> In TDX module, there are 4 kinds of locks:
> 1. sharex_lock:
>    Normal read/write lock. (no host priority stuff)
> 
> 2. sharex_hp_lock:
>    Just like normal read/write lock, except that host can set host priority
> bit
>    on failure.
>    when guest tries to acquire the lock and sees host priority bit set, it
> will
>    return "busy host priority" directly, letting host win.
>    After host acquires the lock successfully, host priority bit is cleared.
> 
> 3. sept entry lock:
>    Lock utilizing software bits in SEPT entry.
>    HP bit (Host priority): bit 52 
>    EL bit (Entry lock): bit 11, used as a bit lock.
> 
>    - host sets HP bit when host fails to acquire EL bit lock;
>    - host resets HP bit when host wins.
>    - guest returns "busy host priority" if HP bit is found set when guest
> tries
>      to acquire EL bit lock.
> 
> 4. mutex lock:
>    Lock with only 2 states: free, lock.
>    (not the same as linux mutex, not re-scheduled, could pause() for
> debugging).
> 
> ===Resources & users list===
> 
> Resources              SHARED  users              EXCLUSIVE users
> ------------------------------------------------------------------------
> (1) TDR                tdh_mng_rdwr               tdh_mng_create
>                        tdh_vp_create              tdh_mng_add_cx
>                        tdh_vp_addcx               tdh_mng_init
>                        tdh_vp_init                tdh_mng_vpflushdone
>                        tdh_vp_enter               tdh_mng_key_config 
>                        tdh_vp_flush               tdh_mng_key_freeid
>                        tdh_vp_rd_wr               tdh_mr_extend
>                        tdh_mem_sept_add           tdh_mr_finalize
>                        tdh_mem_sept_remove        tdh_vp_init_apicid
>                        tdh_mem_page_aug           tdh_mem_page_add
>                        tdh_mem_page_remove
>                        tdh_mem_range_block
>                        tdh_mem_track
>                        tdh_mem_range_unblock
>                        tdh_phymem_page_reclaim

In pamt_walk() it calls promote_sharex_lock_hp() with the lock type passed into
pamt_walk(), and tdh_phymem_page_reclaim() passes TDX_LOCK_EXCLUSIVE. So that is
an exclusive lock. But we can ignore it because we only do reclaim at TD
teardown time?

Separately, I wonder if we should try to add this info as comments around the
SEAMCALL implementations. The locking is not part of the spec, but nevertheless
the kernel is being coded against these assumptions. So it can sort of be
like "the kernel assumes this" and we can at least record what the reason was.
Or maybe just comment the parts that KVM assumes.
Rick Edgecombe Sept. 13, 2024, 7:19 p.m. UTC | #23
On Fri, 2024-09-13 at 10:23 -0700, Sean Christopherson wrote:
> > TL;DR:
> > - tdh_mem_track() can contend with tdh_vp_enter().
> > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> 
> The zero-step logic seems to be the most problematic.  E.g. if KVM is trying
> to
> install a page on behalf of two vCPUs, and KVM resumes the guest if it
> encounters
> a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> whatever reason.

Can you explain more about what the concern is here? That the zero-step
mitigation activation will be a drag on the TD because of extra contention with
the TDH.MEM calls?

> 
> Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM
> retries
> the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU
> still
> hits the fault?

It seems like an optimization. To me, I would normally want to know how much it
helped before adding it. But if you think it's an obvious win I'll defer.

> 
> For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> desirable because in many cases, the winning task will install a valid mapping
> before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> instruction is re-executed.  In the happy case, that provides optimal
> performance
> as KVM doesn't introduce any extra delay/latency.
> 
> But for TDX, the math is different as the cost of a re-hitting a fault is
> much,
> much higher, especially in light of the zero-step issues.
> 
> E.g. if the TDP MMU returns a unique error code for the frozen case, and
> kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> then the TDX EPT violation path can safely retry locally, similar to the do-
> while
> loop in kvm_tdp_map_page().
> 
> The only part I don't like about this idea is having two "retry" return
> values,
> which creates the potential for bugs due to checking one but not the other.
> 
> Hmm, that could be avoided by passing a bool pointer as an out-param to
> communicate
> to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> option better even though the out-param is a bit gross, because it makes it
> more
> obvious that the "frozen_spte" is a special case that doesn't need attention
> for
> most paths.
Sean Christopherson Sept. 13, 2024, 10:18 p.m. UTC | #24
On Fri, Sep 13, 2024, Rick P Edgecombe wrote:
> On Fri, 2024-09-13 at 10:23 -0700, Sean Christopherson wrote:
> > > TL;DR:
> > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > 
> > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying
> > to

I am getting a feeling of deja vu.  Please fix your mail client to not generate
newlines in the middle of quoted text.

> > install a page on behalf of two vCPUs, and KVM resumes the guest if it
> > encounters a FROZEN_SPTE when building the non-leaf SPTEs, then one of the
> > vCPUs could trigger the zero-step mitigation if the vCPU that "wins" and
> > gets delayed for whatever reason.
> 
> Can you explain more about what the concern is here? That the zero-step
> mitigation activation will be a drag on the TD because of extra contention with
> the TDH.MEM calls?
> 
> > 
> > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow
> > slow-path, what if instead of resuming the guest if a page fault hits
> > FROZEN_SPTE, KVM retries the fault "locally", i.e. _without_ redoing
> > tdh_vp_enter() to see if the vCPU still hits the fault?
> 
> It seems like an optimization. To me, I would normally want to know how much it
> helped before adding it. But if you think it's an obvious win I'll defer.

I'm not worried about any performance hit with zero-step, I'm worried about KVM
not being able to differentiate between a KVM bug and guest interference.  The
goal with a local retry is to make it so that KVM _never_ triggers zero-step,
unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
report the error to userspace instead of trying to suppress guest activity, and
potentially from other KVM tasks too.

It might even be simpler overall too.  E.g. report status up the call chain and
let the top-level TDX S-EPT handler do its thing, versus adding various flags
and control knobs to ensure a vCPU can make forward progress.
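
E.g. the top-level handler could loop on the out-param, along the lines of
this sketch (names and the return-code plumbing are illustrative and assume
the "Idea #2" out-param variant of kvm_mmu_page_fault() from earlier in the
thread):

static int tdx_handle_ept_violation_sketch(struct kvm_vcpu *vcpu, gpa_t gpa,
					   u64 error_code)
{
	bool frozen_spte;
	int r;

	do {
		frozen_spte = false;
		r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0,
				       &frozen_spte);
		/*
		 * While another task holds the SPTE frozen, retry without
		 * re-entering the guest so the vCPU never re-faults on the
		 * same GPA, i.e. never trips the zero-step mitigation.
		 */
	} while (r > 0 && frozen_spte);

	return r;
}

Purely a sketch; a real patch would also need cond_resched()/pending-signal
handling, a la kvm_tdp_map_page().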
Yan Zhao Sept. 14, 2024, 9:27 a.m. UTC | #25
On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> On Fri, Sep 13, 2024, Yan Zhao wrote:
> > This is a lock status report of TDX module for current SEAMCALL retry issue
> > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > branch TDX_1.5.05.
> > 
> > TL;DR:
> > - tdh_mem_track() can contend with tdh_vp_enter().
> > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> 
> The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> whatever reason.
> 
> Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> hits the fault?
> 
> For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> desirable because in many cases, the winning task will install a valid mapping
> before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> instruction is re-executed.  In the happy case, that provides optimal performance
> as KVM doesn't introduce any extra delay/latency.
> 
> But for TDX, the math is different as the cost of a re-hitting a fault is much,
> much higher, especially in light of the zero-step issues.
> 
> E.g. if the TDP MMU returns a unique error code for the frozen case, and
> kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> then the TDX EPT violation path can safely retry locally, similar to the do-while
> loop in kvm_tdp_map_page().
> 
> The only part I don't like about this idea is having two "retry" return values,
> which creates the potential for bugs due to checking one but not the other.
> 
> Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> option better even though the out-param is a bit gross, because it makes it more
> obvious that the "frozen_spte" is a special case that doesn't need attention for
> most paths.
Good idea.
But could we extend it a bit more to allow TDX's EPT violation handler to also
retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?

> 
> > - tdg_mem_page_accept() can contend with other tdh_mem*().
> > 
> > Proposal:
> > - Return -EAGAIN directly in ops link_external_spt/set_external_spte when
> >   tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.
> What is the result of returning -EAGAIN? E.g. does KVM redo tdh_vp_enter()?
Sorry, I meant -EBUSY originally.

With the current code in kvm_tdp_map_page(), the vCPU should just retry without
tdh_vp_enter(), except when there are signals pending.
With a real EPT violation, tdh_vp_enter() should be called again.

I realized that this is not good enough.
So, is it better to return -EAGAIN in ops link_external_spt/set_external_spte
and have kvm_tdp_mmu_map() return RET_PF_RETRY_FROZEN for -EAGAIN?
(or maybe some other name for RET_PF_RETRY_FROZEN).

> Also tdh_mem_sept_add() is strictly pre-finalize, correct?  I.e. should never
> contend with tdg_mem_page_accept() because vCPUs can't yet be run.
tdh_mem_page_add() is pre-finalize, tdh_mem_sept_add() is not.
tdh_mem_sept_add() can be called at runtime by tdp_mmu_link_sp().

 
> Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
> The page isn't yet mapped, so why would the guest be allowed to take a lock on
> the S-EPT entry?
Before tdg_mem_page_accept() accepts a GPA and sets the RWX bits in a SPTE, if
a second tdh_mem_page_aug() is called on the same GPA, the second one may
contend with tdg_mem_page_accept().

But given that KVM does not allow the second tdh_mem_page_aug(), it looks like
the contention between tdh_mem_page_aug() and tdg_mem_page_accept() will not
happen.

> 
> > - Kick off vCPUs at the beginning of page removal path, i.e. before the
> >   tdh_mem_range_block().
> >   Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.
> 
> This is easy enough to do via a request, e.g. see KVM_REQ_MCLOCK_INPROGRESS.
Great!

> 
> >   (one possible optimization:
> >    since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare,
> >    do not kick off vCPUs in normal conditions.
> >    When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow
> 
> Which SEAMCALL is this specifically?  tdh_mem_range_block()?
Yes, they are:
- tdh_mem_range_block() contends with tdh_vp_enter() for secure_ept_lock.
- tdh_mem_track() contends with tdh_vp_enter() for TD epoch.
  (current code in MMU part 2 just retries tdh_mem_track() endlessly),
- tdh_mem_page_remove()/tdh_mem_range_block() contend with
  tdg_mem_page_accept() for SEPT entry lock.
  (this one should not happen on a sane guest).

 Resources              SHARED  users              EXCLUSIVE users      
------------------------------------------------------------------------
(5) TDCS epoch         tdh_vp_enter                tdh_mem_track
------------------------------------------------------------------------
(6) secure_ept_lock    tdh_mem_sept_add            tdh_vp_enter
                       tdh_mem_page_aug            tdh_mem_sept_remove
                       tdh_mem_page_remove         tdh_mem_range_block
                                                   tdh_mem_range_unblock
------------------------------------------------------------------------
(7) SEPT entry                                     tdh_mem_sept_add
                                                   tdh_mem_sept_remove
                                                   tdh_mem_page_aug
                                                   tdh_mem_page_remove
                                                   tdh_mem_range_block
                                                   tdh_mem_range_unblock
                                                   tdg_mem_page_accept


> 
> >    TD enter until page removal completes.)
> 
> 
> Idea #1:
> ---
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b45258285c9c..8113c17bd2f6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4719,7 +4719,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>                         return -EINTR;
>                 cond_resched();
>                 r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> -       } while (r == RET_PF_RETRY);
> +       } while (r == RET_PF_RETRY || r == RET_PF_RETRY_FOZEN);
>  
>         if (r < 0)
>                 return r;
> @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>                 vcpu->stat.pf_spurious++;
>  
>         if (r != RET_PF_EMULATE)
> -               return 1;
> +               return r;
>  
>  emulate:
>         return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 8d3fb3c8c213..690f03d7daae 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -256,12 +256,15 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * and of course kvm_mmu_do_page_fault().
>   *
>   * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
> + * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_RETRY: let CPU fault again on the address.
> + * RET_PF_RETRY_FROZEN: One or more SPTEs related to the address is frozen.
> + *                     Let the CPU fault again on the address, or retry the
> + *                     fault "locally", i.e. without re-entering the guest.
>   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
>   * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
>   *                         gfn and retry, or emulate the instruction directly.
>   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> - * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
>   *
>   * Any names added to this enum should be exported to userspace for use in
> @@ -271,14 +274,18 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
>   * will allow for efficient machine code when checking for CONTINUE, e.g.
>   * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
> + *
> + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
> + * scheme, where 1==success, translates '1' to RET_PF_FIXED.
>   */
It looks like "r > 0" represents success in vcpu_run()?
So, moving RET_PF_FIXED to 1 is not necessary?
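
For context, the relevant pattern in the vcpu_run() loop in
arch/x86/kvm/x86.c is roughly the following (paraphrased, not a verbatim
quote):

	for (;;) {
		/* ... */
		r = vcpu_enter_guest(vcpu);	/* or vcpu_block() when halted */

		/* Any r > 0 means "keep running"; r <= 0 exits to userspace. */
		if (r <= 0)
			break;

		/* ... pending events, signals, need_resched ... */
	}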

>  enum {
>         RET_PF_CONTINUE = 0,
> +       RET_PF_FIXED    = 1,
>         RET_PF_RETRY,
> +       RET_PF_RETRY_FROZEN,
>         RET_PF_EMULATE,
>         RET_PF_WRITE_PROTECTED,
>         RET_PF_INVALID,
> -       RET_PF_FIXED,
>         RET_PF_SPURIOUS,
>  };
>  
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..cbf9e46203f3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  
>  retry:
>         rcu_read_unlock();
> +       if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte))
> +               return RET_PF_RETRY_FOZEN;
>         return ret;
>  }
>  
> ---
> 
> 
> Idea #2:
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/kvm/mmu/mmu.c          | 12 ++++++------
>  arch/x86/kvm/mmu/mmu_internal.h | 15 ++++++++++++---
>  arch/x86/kvm/mmu/tdp_mmu.c      |  1 +
>  arch/x86/kvm/svm/svm.c          |  2 +-
>  arch/x86/kvm/vmx/vmx.c          |  4 ++--
>  6 files changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 46e0a466d7fb..200fecd1de88 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2183,7 +2183,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
>  
>  int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> -		       void *insn, int insn_len);
> +		       void *insn, int insn_len, bool *frozen_spte);
>  void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
>  void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
>  void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b45258285c9c..207840a316d3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4283,7 +4283,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  		return;
>  
>  	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
> -				  true, NULL, NULL);
> +				  true, NULL, NULL, NULL);
>  
>  	/*
>  	 * Account fixed page faults, otherwise they'll never be counted, but
> @@ -4627,7 +4627,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  		trace_kvm_page_fault(vcpu, fault_address, error_code);
>  
>  		r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
> -				insn_len);
> +				       insn_len, NULL);
>  	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
>  		vcpu->arch.apf.host_apf_flags = 0;
>  		local_irq_disable();
> @@ -4718,7 +4718,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>  		if (signal_pending(current))
>  			return -EINTR;
>  		cond_resched();
> -		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> +		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level, NULL);
>  	} while (r == RET_PF_RETRY);
>  
>  	if (r < 0)
> @@ -6073,7 +6073,7 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  }
>  
>  int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> -		       void *insn, int insn_len)
> +				void *insn, int insn_len, bool *frozen_spte)
>  {
>  	int r, emulation_type = EMULTYPE_PF;
>  	bool direct = vcpu->arch.mmu->root_role.direct;
> @@ -6109,7 +6109,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>  		vcpu->stat.pf_taken++;
>  
>  		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
> -					  &emulation_type, NULL);
> +					  &emulation_type, NULL, frozen_spte);
>  		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
>  			return -EIO;
>  	}
> @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>  		vcpu->stat.pf_spurious++;
>  
>  	if (r != RET_PF_EMULATE)
> -		return 1;
> +		return r;
>  
>  emulate:
>  	return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 8d3fb3c8c213..5b1fc77695c1 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -247,6 +247,9 @@ struct kvm_page_fault {
>  	 * is changing its own translation in the guest page tables.
>  	 */
>  	bool write_fault_to_shadow_pgtable;
> +
> +	/* Indicates the page fault needs to be retried due to a frozen SPTE. */
> +	bool frozen_spte;
>  };
>  
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> @@ -256,12 +259,12 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * and of course kvm_mmu_do_page_fault().
>   *
>   * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
> + * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_RETRY: let CPU fault again on the address.
>   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
>   * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
>   *                         gfn and retry, or emulate the instruction directly.
>   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> - * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
>   *
>   * Any names added to this enum should be exported to userspace for use in
> @@ -271,14 +274,17 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
>   * will allow for efficient machine code when checking for CONTINUE, e.g.
>   * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
> + *
> + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
> + * scheme, where 1==success, translates '1' to RET_PF_FIXED.
>   */
>  enum {
>  	RET_PF_CONTINUE = 0,
> +	RET_PF_FIXED    = 1,
>  	RET_PF_RETRY,
>  	RET_PF_EMULATE,
>  	RET_PF_WRITE_PROTECTED,
>  	RET_PF_INVALID,
> -	RET_PF_FIXED,
>  	RET_PF_SPURIOUS,
>  };
>  
> @@ -292,7 +298,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>  
>  static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  					u64 err, bool prefetch,
> -					int *emulation_type, u8 *level)
> +					int *emulation_type, u8 *level,
> +					bool *frozen_spte)
>  {
>  	struct kvm_page_fault fault = {
>  		.addr = cr2_or_gpa,
> @@ -341,6 +348,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
>  	if (level)
>  		*level = fault.goal_level;
> +	if (frozen_spte)
> +		*frozen_spte = fault.frozen_spte;
>  
>  	return r;
>  }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..e7fc5ea4b437 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  
>  retry:
>  	rcu_read_unlock();
> +	fault->frozen_spte = is_frozen_spte(iter.old_spte);
>  	return ret;
>  }
>  
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 38723b0c435d..269de6a9eb13 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2075,7 +2075,7 @@ static int npf_interception(struct kvm_vcpu *vcpu)
>  	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
>  				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
>  				svm->vmcb->control.insn_bytes : NULL,
> -				svm->vmcb->control.insn_len);
> +				svm->vmcb->control.insn_len, NULL);
>  
>  	if (rc > 0 && error_code & PFERR_GUEST_RMP_MASK)
>  		sev_handle_rmp_fault(vcpu, fault_address, error_code);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 368acfebd476..fc2ff5d91a71 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5822,7 +5822,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  	if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
>  		return kvm_emulate_instruction(vcpu, 0);
>  
> -	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0, NULL);
>  }
>  
>  static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> @@ -5843,7 +5843,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>  		return kvm_skip_emulated_instruction(vcpu);
>  	}
>  
> -	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
> +	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0, NULL);
>  }
>  
>  static int handle_nmi_window(struct kvm_vcpu *vcpu)
> 
> base-commit: bc87a2b4b5508d247ed2c30cd2829969d168adfe
> -- 
>
Yan Zhao Sept. 14, 2024, 10 a.m. UTC | #26
> > ===Resources & users list===
> > 
> > Resources              SHARED  users              EXCLUSIVE users
> > ------------------------------------------------------------------------
> > (1) TDR                tdh_mng_rdwr               tdh_mng_create
> >                        tdh_vp_create              tdh_mng_add_cx
> >                        tdh_vp_addcx               tdh_mng_init
> >                        tdh_vp_init                tdh_mng_vpflushdone
> >                        tdh_vp_enter               tdh_mng_key_config 
> >                        tdh_vp_flush               tdh_mng_key_freeid
> >                        tdh_vp_rd_wr               tdh_mr_extend
> >                        tdh_mem_sept_add           tdh_mr_finalize
> >                        tdh_mem_sept_remove        tdh_vp_init_apicid
> >                        tdh_mem_page_aug           tdh_mem_page_add
> >                        tdh_mem_page_remove
> >                        tdh_mem_range_block
> >                        tdh_mem_track
> >                        tdh_mem_range_unblock
> >                        tdh_phymem_page_reclaim
> 
> In pamt_walk() it calls promote_sharex_lock_hp() with the lock type passed into
> pamt_walk(), and tdh_phymem_page_reclaim() passed TDX_LOCK_EXCLUSIVE. So that is
> an exclusive lock. But we can ignore it because we only do reclaim at TD tear
> down time?
Hmm, if the page to reclaim is not a TDR page, lock_and_map_implicit_tdr() is
called to lock the page's corresponding TDR page with a SHARED lock.

If the page to reclaim is a TDR page, it's indeed locked EXCLUSIVE.

But in pamt_walk() it calls promote_sharex_lock_hp() for the passed in
TDX_LOCK_EXCLUSIVE only when

if ((pamt_1gb->pt == PT_REG) || (target_size == PT_1GB)) or
if ((pamt_2mb->pt == PT_REG) || (target_size == PT_2MB))

"pamt_1gb->pt == PT_REG" (or "pamt_2mb->pt == PT_REG") is true when it's
assigned (not PT_NDA) and is a normal page (i.e. not TDR, TDVPR...).
This is true only after tdh_mem_page_add()/tdh_mem_page_aug() assigns the page
to a TD with huge page size.

This will not happen for a TDR page.

For normal pages, when huge pages are supported in the future, it looks like we
need to update tdh_phymem_page_reclaim() to include the size info too.

> 
> Separately, I wonder if we should try to add this info as comments around the
> SEAMCALL implementations. The locking is not part of the spec, but never-the-
> less the kernel is being coded against these assumptions. So it can sort of be
> like "the kernel assumes this" and we can at least record what the reason was.
> Or maybe just comment the parts that KVM assumes.
Agreed.
Yan Zhao Sept. 15, 2024, 9:53 a.m. UTC | #27
On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
> > The page isn't yet mapped, so why would the guest be allowed to take a lock on
> > the S-EPT entry?
> Before tdg_mem_page_accept() accepts a gpa and set rwx bits in a SPTE, if second
> tdh_mem_page_aug() is called on the same gpa, the second one may contend with
> tdg_mem_page_accept().
> 
> But given KVM does not allow the second tdh_mem_page_aug(), looks the contention
> between tdh_mem_page_aug() and tdg_mem_page_accept() will not happen.
I withdraw the reply above.

tdh_mem_page_aug() and tdg_mem_page_accept() both attempt to modify the same
SEPT entry, leading to contention.
- tdg_mem_page_accept() first walks the SEPT tree with no lock to get the SEPT
  entry. It then acquires the guest side lock of the found SEPT entry before
  checking the entry state.
- tdh_mem_page_aug() first walks the SEPT tree with a shared lock to locate the
  SEPT entry to modify, then acquires the host side lock of the SEPT entry before
  checking the entry state.
Kai Huang Sept. 17, 2024, 1:31 a.m. UTC | #28
On 15/09/2024 9:53 pm, Zhao, Yan Y wrote:
> On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
>>> Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
>>> The page isn't yet mapped, so why would the guest be allowed to take a lock on
>>> the S-EPT entry?
>> Before tdg_mem_page_accept() accepts a gpa and set rwx bits in a SPTE, if second
>> tdh_mem_page_aug() is called on the same gpa, the second one may contend with
>> tdg_mem_page_accept().
>>
>> But given KVM does not allow the second tdh_mem_page_aug(), looks the contention
>> between tdh_mem_page_aug() and tdg_mem_page_accept() will not happen.
> I withdraw the reply above.
> 
> tdh_mem_page_aug() and tdg_mem_page_accept() both attempt to modify the same
> SEPT entry, leading to contention.
> - tdg_mem_page_accept() first walks the SEPT tree with no lock to get the SEPT
>    entry. It then acquire the guest side lock of the found SEPT entry before
>    checking entry state.
> - tdh_mem_page_aug() first walks the SEPT tree with shared lock to locate the
>    SEPT entry to modify, it then aquires host side lock of the SEPT entry before
>    checking entry state.

This seems like it can only happen when there are multiple threads in the guest 
trying to do tdg_mem_page_accept() on the same page.  This should be 
extremely rare, and if it happens, it will eventually result 
in another fault in KVM.

So now we set the SPTE to FROZEN_SPTE before doing AUG to prevent other 
threads from going on.  I think when tdh_mem_page_aug() fails with 
secure EPT "entry" busy, we can reset FROZEN_SPTE back to old_spte and 
return PF_RETRY so that this thread and the other faulting thread can both 
try to complete the AUG again?

The thread that fails the AUG can also go back to the guest though, but since 
the host priority bit is already set, the subsequent PAGE.ACCEPT will fail.  That 
is fine because another AUG in KVM will eventually resolve this and 
make progress for the guest.
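
A rough sketch of the "unfreeze and return PF_RETRY" idea, just to make the
flow concrete.  The function name, the set_external_spte() hook and the
FROZEN_SPTE/__kvm_tdp_mmu_write_spte() usage below are assumptions based on
this thread, not the actual patches:

static int tdp_mmu_install_mirror_spte(struct kvm *kvm, struct tdp_iter *iter,
				       u64 new_spte)
{
	int ret;

	/* Freeze the SPTE so concurrent faults back off while AUG runs. */
	if (!try_cmpxchg64(rcu_dereference(iter->sptep), &iter->old_spte,
			   FROZEN_SPTE))
		return -EBUSY;

	ret = set_external_spte(kvm, iter->gfn, iter->level, new_spte);
	if (ret) {
		/*
		 * AUG hit a BUSY S-EPT entry: restore old_spte so that this
		 * thread and the other faulting thread can both retry.
		 */
		__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte);
		return ret;
	}

	__kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
	return 0;
}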
Kai Huang Sept. 17, 2024, 2:11 a.m. UTC | #29
On 14/09/2024 5:23 am, Sean Christopherson wrote:
> On Fri, Sep 13, 2024, Yan Zhao wrote:
>> This is a lock status report of TDX module for current SEAMCALL retry issue
>> based on code in TDX module public repo https://github.com/intel/tdx-module.git
>> branch TDX_1.5.05.
>>
>> TL;DR:
>> - tdh_mem_track() can contend with tdh_vp_enter().
>> - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> 
> The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> whatever reason.
> 
> Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> hits the fault?
> 
> For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> desirable because in many cases, the winning task will install a valid mapping
> before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> instruction is re-executed.  In the happy case, that provides optimal performance
> as KVM doesn't introduce any extra delay/latency.
> 
> But for TDX, the math is different as the cost of a re-hitting a fault is much,
> much higher, especially in light of the zero-step issues.
> 
> E.g. if the TDP MMU returns a unique error code for the frozen case, and
> kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> then the TDX EPT violation path can safely retry locally, similar to the do-while
> loop in kvm_tdp_map_page().
> 
> The only part I don't like about this idea is having two "retry" return values,
> which creates the potential for bugs due to checking one but not the other.
> 
> Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> option better even though the out-param is a bit gross, because it makes it more
> obvious that the "frozen_spte" is a special case that doesn't need attention for
> most paths.
> 

[...]

>   
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..cbf9e46203f3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   
>   retry:
>          rcu_read_unlock();
> +       if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte))
> +               return RET_PF_RETRY_FOZEN;

Ack the whole "retry on frozen" approach, either with RET_PF_RETRY_FROZEN 
or fault->frozen_spte.

One minor side effect:

For normal VMs, the fault handler can also see a frozen spte, e.g, when 
kvm_tdp_mmu_map() checks the middle level SPTE:

	/*
          * If SPTE has been frozen by another thread, just give up and
          * retry, avoiding unnecessary page table allocation and free.
          */
         if (is_frozen_spte(iter.old_spte))
         	goto retry;

So for normal VMs this RET_PF_RETRY_FROZEN will change "go back to guest 
to retry" into "retry in KVM internally".

As you mentioned above, for normal VMs we probably always want to "go 
back to guest to retry" even for a FROZEN SPTE, but I guess this is a 
minor issue that we can barely even notice.

Or we can additionally add:

	if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte)
			&& is_mirrored_sptep(iter.sptep))
		return RET_PF_RETRY_FROZEN;

So it only applies to TDX.
Yan Zhao Sept. 25, 2024, 10:53 a.m. UTC | #30
On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > branch TDX_1.5.05.
> > > 
> > > TL;DR:
> > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > 
> > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > whatever reason.
> > 
> > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > hits the fault?
> > 
> > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > desirable because in many cases, the winning task will install a valid mapping
> > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > instruction is re-executed.  In the happy case, that provides optimal performance
> > as KVM doesn't introduce any extra delay/latency.
> > 
> > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > much higher, especially in light of the zero-step issues.
> > 
> > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > loop in kvm_tdp_map_page().
> > 
> > The only part I don't like about this idea is having two "retry" return values,
> > which creates the potential for bugs due to checking one but not the other.
> > 
> > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > option better even though the out-param is a bit gross, because it makes it more
> > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > most paths.
> Good idea.
> But could we extend it a bit more to allow TDX's EPT violation handler to also
> retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
FROZEN_SPTE might not be enough to prevent zero step mitigation.

E.g. in the selftest below, with a TD configured with pending_ve_disable=N,
zero step mitigation can be triggered on a vCPU that is stuck in EPT violation
VM exits more than 6 times (because user space does not do the memslot
conversion correctly).

So, if vCPU A wins the chance to call tdh_mem_page_aug(), the SEAMCALL may
contend with the zero step mitigation code in tdh_vp_enter() on vCPU B, which
is stuck in EPT violation VM exits.


#include <stdint.h>

#include "kvm_util.h"
#include "processor.h"
#include "tdx/tdcall.h"
#include "tdx/tdx.h"
#include "tdx/tdx_util.h"
#include "tdx/test_util.h"
#include "test_util.h"

/*
 * 0x80000000 is arbitrarily selected, but it should not overlap with selftest
 * code or boot page.
 */
#define ZERO_STEP_TEST_AREA_GPA (0x80000000)
/* Test area GPA is arbitrarily selected */
#define ZERO_STEP_AREA_GVA_PRIVATE (0x90000000)

/* The test area is 2MB in size */
#define ZERO_STEP_AREA_SIZE (2 << 20)

#define ZERO_STEP_ASSERT(x)                             \
        do {                                            \
                if (!(x))                               \
                        tdx_test_fatal(__LINE__);       \
        } while (0)


#define ZERO_STEP_ACCEPT_PRINT_PORT 0x87

#define ZERO_STEP_THRESHOLD 6
#define TRIGGER_ZERO_STEP_MITIGATION 1

static int convert_request_cnt;

static void guest_test_zero_step(void)
{
        void *test_area_gva_private = (void *)ZERO_STEP_AREA_GVA_PRIVATE;

        memset(test_area_gva_private, 1, 8);
        tdx_test_success();
}

static void guest_ve_handler(struct ex_regs *regs)
{
        uint64_t ret;
        struct ve_info ve;

        ret = tdg_vp_veinfo_get(&ve);
        ZERO_STEP_ASSERT(!ret);

        /* For this test, we will only handle EXIT_REASON_EPT_VIOLATION */
        ZERO_STEP_ASSERT(ve.exit_reason == EXIT_REASON_EPT_VIOLATION);


        tdx_test_send_64bit(ZERO_STEP_ACCEPT_PRINT_PORT, ve.gpa);

#define MEM_PAGE_ACCEPT_LEVEL_4K 0
#define MEM_PAGE_ACCEPT_LEVEL_2M 1
        ret = tdg_mem_page_accept(ve.gpa, MEM_PAGE_ACCEPT_LEVEL_4K);
        ZERO_STEP_ASSERT(!ret);
}

static void zero_step_test(void)
{
        struct kvm_vm *vm;
        struct kvm_vcpu *vcpu;
        void *guest_code;
        uint64_t test_area_npages;
        vm_vaddr_t test_area_gva_private;

        vm = td_create();
        td_initialize(vm, VM_MEM_SRC_ANONYMOUS, 0);
        guest_code = guest_test_zero_step;
        vcpu = td_vcpu_add(vm, 0, guest_code);
        vm_install_exception_handler(vm, VE_VECTOR, guest_ve_handler);

        test_area_npages = ZERO_STEP_AREA_SIZE / vm->page_size;
        vm_userspace_mem_region_add(vm,
                                    VM_MEM_SRC_ANONYMOUS, ZERO_STEP_TEST_AREA_GPA,
                                    3, test_area_npages, KVM_MEM_GUEST_MEMFD);
        vm->memslots[MEM_REGION_TEST_DATA] = 3;

        test_area_gva_private = ____vm_vaddr_alloc(
                vm, ZERO_STEP_AREA_SIZE, ZERO_STEP_AREA_GVA_PRIVATE,
                ZERO_STEP_TEST_AREA_GPA, MEM_REGION_TEST_DATA, true);
        TEST_ASSERT_EQ(test_area_gva_private, ZERO_STEP_AREA_GVA_PRIVATE);

        td_finalize(vm);
        handle_memory_conversion(vm, ZERO_STEP_TEST_AREA_GPA, ZERO_STEP_AREA_SIZE, false);
        for (;;) {
                vcpu_run(vcpu);
                if (vcpu->run->exit_reason == KVM_EXIT_IO &&
                        vcpu->run->io.port == ZERO_STEP_ACCEPT_PRINT_PORT) {
                        uint64_t gpa = tdx_test_read_64bit(
                                        vcpu, ZERO_STEP_ACCEPT_PRINT_PORT);
                        printf("\t ... guest accepting 1 page at GPA: 0x%lx\n", gpa);
                        continue;
                } else if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                        bool skip = TRIGGER_ZERO_STEP_MITIGATION &&
                                    (convert_request_cnt < ZERO_STEP_THRESHOLD -1);

                        convert_request_cnt++;

                        printf("guest request conversion of gpa 0x%llx - 0x%llx to %s, skip=%d\n",
                                vcpu->run->memory_fault.gpa, vcpu->run->memory_fault.size,
                                (vcpu->run->memory_fault.flags == KVM_MEMORY_EXIT_FLAG_PRIVATE) ? "private" : "shared", skip);


                        if (skip)
                                continue;

                        handle_memory_conversion(
                                vm, vcpu->run->memory_fault.gpa,
                                vcpu->run->memory_fault.size,
                                vcpu->run->memory_fault.flags == KVM_MEMORY_EXIT_FLAG_PRIVATE);
                        continue;
                }
                printf("exit reason %d\n", vcpu->run->exit_reason);
                break;
        }

        kvm_vm_free(vm);
}

int main(int argc, char **argv)
{
        /* Disable stdout buffering */
        setbuf(stdout, NULL);

        if (!is_tdx_enabled()) {
                printf("TDX is not supported by the KVM\n"
                       "Skipping the TDX tests.\n");
                return 0;
        }

        run_in_new_process(&zero_step_test);
}
Sean Christopherson Oct. 8, 2024, 2:51 p.m. UTC | #31
On Wed, Sep 25, 2024, Yan Zhao wrote:
> On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > branch TDX_1.5.05.
> > > > 
> > > > TL;DR:
> > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > 
> > > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > > whatever reason.
> > > 
> > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > hits the fault?
> > > 
> > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > desirable because in many cases, the winning task will install a valid mapping
> > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > instruction is re-executed.  In the happy case, that provides optimal performance
> > > as KVM doesn't introduce any extra delay/latency.
> > > 
> > > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > > much higher, especially in light of the zero-step issues.
> > > 
> > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > loop in kvm_tdp_map_page().
> > > 
> > > The only part I don't like about this idea is having two "retry" return values,
> > > which creates the potential for bugs due to checking one but not the other.
> > > 
> > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > > option better even though the out-param is a bit gross, because it makes it more
> > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > most paths.
> > Good idea.
> > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> FROZEN_SPTE might not be enough to prevent zero step mitigation.

The goal isn't to make it completely impossible for zero-step to fire, it's to
make it so that _if_ zero-step fires, KVM can report the error to userspace without
having to retry, because KVM _knows_ that advancing past the zero-step isn't
something KVM can solve.

 : I'm not worried about any performance hit with zero-step, I'm worried about KVM
 : not being able to differentiate between a KVM bug and guest interference.  The
 : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
 : unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 : report the error to userspace instead of trying to suppress guest activity, and
 : potentially from other KVM tasks too.

In other words, for the selftest you crafted, KVM reporting an error to userspace
due to zero-step would be working as intended.  

> E.g. in below selftest with a TD configured with pending_ve_disable=N,
> zero step mitigation can be triggered on a vCPU that is stuck in EPT violation
> vm exit for more than 6 times (due to that user space does not do memslot
> conversion correctly).
> 
> So, if vCPU A wins the chance to call tdh_mem_page_aug(), the SEAMCALL may
> contend with zero step mitigation code in tdh_vp_enter() in vCPU B stuck
> in EPT violation vm exits.
Yan Zhao Oct. 10, 2024, 5:23 a.m. UTC | #32
On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote:
> On Wed, Sep 25, 2024, Yan Zhao wrote:
> > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > > branch TDX_1.5.05.
> > > > > 
> > > > > TL;DR:
> > > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > > 
> > > > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > > > whatever reason.
> > > > 
> > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > > hits the fault?
> > > > 
> > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > > desirable because in many cases, the winning task will install a valid mapping
> > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > > instruction is re-executed.  In the happy case, that provides optimal performance
> > > > as KVM doesn't introduce any extra delay/latency.
> > > > 
> > > > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > > > much higher, especially in light of the zero-step issues.
> > > > 
> > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > > loop in kvm_tdp_map_page().
> > > > 
> > > > The only part I don't like about this idea is having two "retry" return values,
> > > > which creates the potential for bugs due to checking one but not the other.
> > > > 
> > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > > > option better even though the out-param is a bit gross, because it makes it more
> > > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > > most paths.
> > > Good idea.
> > > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> > FROZEN_SPTE might not be enough to prevent zero step mitigation.
> 
> The goal isn't to make it completely impossible for zero-step to fire, it's to
> make it so that _if_ zero-step fires, KVM can report the error to userspace without
> having to retry, because KVM _knows_ that advancing past the zero-step isn't
> something KVM can solve.
> 
>  : I'm not worried about any performance hit with zero-step, I'm worried about KVM
>  : not being able to differentiate between a KVM bug and guest interference.  The
>  : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
>  : unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  : report the error to userspace instead of trying to suppress guest activity, and
>  : potentially from other KVM tasks too.
> 
> In other words, for the selftest you crafted, KVM reporting an error to userspace
> due to zero-step would be working as intended.  
Hmm, but the selftest is an example to show that 6 consecutive EPT violations on
the same GPA could trigger zero-step.

For an extremely unlucky vCPU, is it still possible to fire zero-step when
nothing is wrong in either KVM or QEMU?
e.g.

1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
2nd-6th: try_cmpxchg64() fails on the SPTE at each level (5 levels in total)

 
> > E.g. in below selftest with a TD configured with pending_ve_disable=N,
> > zero step mitigation can be triggered on a vCPU that is stuck in EPT violation
> > vm exit for more than 6 times (due to that user space does not do memslot
> > conversion correctly).
> > 
> > So, if vCPU A wins the chance to call tdh_mem_page_aug(), the SEAMCALL may
> > contend with zero step mitigation code in tdh_vp_enter() in vCPU B stuck
> > in EPT violation vm exits.
Sean Christopherson Oct. 10, 2024, 5:33 p.m. UTC | #33
On Thu, Oct 10, 2024, Yan Zhao wrote:
> On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote:
> > On Wed, Sep 25, 2024, Yan Zhao wrote:
> > > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > > > branch TDX_1.5.05.
> > > > > > 
> > > > > > TL;DR:
> > > > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > > > 
> > > > > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > > > > whatever reason.
> > > > > 
> > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > > > hits the fault?
> > > > > 
> > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > > > desirable because in many cases, the winning task will install a valid mapping
> > > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > > > instruction is re-executed.  In the happy case, that provides optimal performance
> > > > > as KVM doesn't introduce any extra delay/latency.
> > > > > 
> > > > > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > > > > much higher, especially in light of the zero-step issues.
> > > > > 
> > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > > > loop in kvm_tdp_map_page().
> > > > > 
> > > > > The only part I don't like about this idea is having two "retry" return values,
> > > > > which creates the potential for bugs due to checking one but not the other.
> > > > > 
> > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > > > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > > > > option better even though the out-param is a bit gross, because it makes it more
> > > > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > > > most paths.
> > > > Good idea.
> > > > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> > > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> > > FROZEN_SPTE might not be enough to prevent zero step mitigation.
> > 
> > The goal isn't to make it completely impossible for zero-step to fire, it's to
> > make it so that _if_ zero-step fires, KVM can report the error to userspace without
> > having to retry, because KVM _knows_ that advancing past the zero-step isn't
> > something KVM can solve.
> > 
> >  : I'm not worried about any performance hit with zero-step, I'm worried about KVM
> >  : not being able to differentiate between a KVM bug and guest interference.  The
> >  : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
> >  : unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
> >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >  : report the error to userspace instead of trying to suppress guest activity, and
> >  : potentially from other KVM tasks too.
> > 
> > In other words, for the selftest you crafted, KVM reporting an error to userspace
> > due to zero-step would be working as intended.  
> Hmm, but the selftest is an example to show that 6 continuous EPT violations on
> the same GPA could trigger zero-step.
> 
> For an extremely unlucky vCPU, is it still possible to fire zero step when
> nothing is wrong both in KVM and QEMU?
> e.g.
> 
> 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)

Very technically, this shouldn't be possible.  The only way for there to be
contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. the
6th attempt should succeed, even if the faulting vCPU wasn't the one to create
the SPTE.

That said, a few thoughts:

1. Where did we end up on the idea of requiring userspace to pre-fault memory?

2. The zero-step logic really should have a slightly more conservative threshold.
   I have a hard time believing that e.g. 10 attempts would create a side channel,
   but 6 attempts is "fine".

3. This would be a good reason to implement a local retry in kvm_tdp_mmu_map().
   Yes, I'm being somewhat hypocritical since I'm so against retrying for the
   S-EPT case, but my objection to retrying for S-EPT is that it _should_ be easy
   for KVM to guarantee success.

E.g. for #3, the below (compile tested only) patch should make it impossible for
the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror SPTEs
should never trigger A/D assists, i.e. retry should always succeed.

---
 arch/x86/kvm/mmu/tdp_mmu.c | 47 ++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3b996c1fdaab..e47573a652a9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1097,6 +1097,18 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 				   struct kvm_mmu_page *sp, bool shared);
 
+static struct kvm_mmu_page *tdp_mmu_realloc_sp(struct kvm_vcpu *vcpu,
+					       struct kvm_mmu_page *sp)
+{
+	if (!sp)
+		return tdp_mmu_alloc_sp(vcpu);
+
+	memset(sp, 0, sizeof(*sp));
+	memset64(sp->spt, vcpu->arch.mmu_shadow_page_cache.init_value,
+		 PAGE_SIZE / sizeof(u64));
+	return sp;
+}
+
 /*
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
@@ -1104,9 +1116,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
+	struct kvm_mmu_page *sp = NULL;
 	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
-	struct kvm_mmu_page *sp;
 	int ret = RET_PF_RETRY;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1116,8 +1128,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	rcu_read_lock();
 
 	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
-		int r;
-
+		/*
+		 * Somewhat arbitrarily allow two local retries, e.g. to play
+		 * nice with the extremely unlikely case that KVM encounters a
+		 * huge SPTE an Access-assist _and_ a subsequent Dirty-assist.
+		 * Retrying is inexpensive, but if KVM fails to install a SPTE
+		 * three times, then a fourth attempt is likely futile and it's
+		 * time to back off.
+		 */
+		int r, retry_locally = 2;
+again:
 		if (fault->nx_huge_page_workaround_enabled)
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
@@ -1140,7 +1160,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * The SPTE is either non-present or points to a huge page that
 		 * needs to be split.
 		 */
-		sp = tdp_mmu_alloc_sp(vcpu);
+		sp = tdp_mmu_realloc_sp(vcpu, sp);
 		tdp_mmu_init_child_sp(sp, &iter);
 
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
@@ -1151,11 +1171,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
 
 		/*
-		 * Force the guest to retry if installing an upper level SPTE
-		 * failed, e.g. because a different task modified the SPTE.
+		 * If installing an upper level SPTE failed, retry the walk
+		 * locally before forcing the guest to retry.  If the SPTE was
+		 * modified by a different task, odds are very good the new
+		 * SPTE is usable as-is.  And if the SPTE was modified by the
+		 * CPU, e.g. to set A/D bits, then unless KVM gets *extremely*
+		 * unlucky, the CMPXCHG should succeed the second time around.
 		 */
 		if (r) {
-			tdp_mmu_free_sp(sp);
+			if (retry_locally--)
+				goto again;
 			goto retry;
 		}
 
@@ -1166,6 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 				track_possible_nx_huge_page(kvm, sp);
 			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 		}
+		sp = NULL;
 	}
 
 	/*
@@ -1180,6 +1206,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 retry:
 	rcu_read_unlock();
+
+	/*
+	 * Free the previously allocated MMU page if KVM retried locally and
+	 * ended up not using said page.
+	 */
+	if (sp)
+		tdp_mmu_free_sp(sp);
 	return ret;
 }
 

base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
--
Rick Edgecombe Oct. 10, 2024, 9:53 p.m. UTC | #34
On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
> > 
> > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)

Isn't there a more general scenario:

vcpu0                              vcpu1
1. Freezes PTE
2. External op to do the SEAMCALL
3.                                 Faults same PTE, hits frozen PTE
4.                                 Retries N times, triggers zero-step
5. Finally finishes external op

Am I missing something?

> 
> Very technically, this shouldn't be possible.  The only way for there to be
> contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e.
> the
> 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> the SPTE.
> 
> That said, a few thoughts:
> 
> 1. Where did we end up on the idea of requiring userspace to pre-fault memory?

For others' reference, I think you are referring to the idea to pre-fault the
entire S-EPT, even for GFNs that usually get AUGed, not the mirrored EPT pre-
faulting/PAGE.ADD dance we are already doing.

The last discussion with Paolo was to resume the retry solution discussion on
the v2 posting because it would be easier "with everything else already
addressed". There was also some discussion that it was not immediately
obvious how prefaulting everything would work for memory hotplug (i.e. memslots
added during runtime).
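
For reference, a minimal userspace sketch of what pre-faulting a range could
look like with the KVM_PRE_FAULT_MEMORY vCPU ioctl (assumes
KVM_CAP_PRE_FAULT_MEMORY is advertised; error handling is deliberately
minimal):

#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int prefault_range(int vcpu_fd, __u64 gpa, __u64 size)
{
	struct kvm_pre_fault_memory range;
	int ret;

	memset(&range, 0, sizeof(range));
	range.gpa = gpa;
	range.size = size;

	do {
		/* KVM updates range.gpa/range.size as it makes progress. */
		ret = ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range);
	} while (ret < 0 && (errno == EINTR || errno == EAGAIN) && range.size);

	return ret;
}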

> 
> 2. The zero-step logic really should have a slightly more conservative
> threshold.
>    I have a hard time believing that e.g. 10 attempts would create a side
> channel,
>    but 6 attempts is "fine".

No idea where the threshold came from. I'm not sure if it affects the KVM
design? We can look into it for curiosity's sake in either case.

> 
> 3. This would be a good reason to implement a local retry in
> kvm_tdp_mmu_map().
>    Yes, I'm being somewhat hypocritical since I'm so against retrying for the
>    S-EPT case, but my objection to retrying for S-EPT is that it _should_ be
> easy
>    for KVM to guarantee success.
> 
> E.g. for #3, the below (compile tested only) patch should make it impossible
> for
> the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror
> SPTEs
> should never trigger A/D assists, i.e. retry should always succeed.

I don't see how it addresses the scenario above. More retries could just make it
rarer, but never fix it. Very possible I'm missing something though.
Yan Zhao Oct. 11, 2024, 2:06 a.m. UTC | #35
On Thu, Oct 10, 2024 at 10:33:30AM -0700, Sean Christopherson wrote:
> On Thu, Oct 10, 2024, Yan Zhao wrote:
> > On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote:
> > > On Wed, Sep 25, 2024, Yan Zhao wrote:
> > > > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > > > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > > > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > > > > branch TDX_1.5.05.
> > > > > > > 
> > > > > > > TL;DR:
> > > > > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > > > > 
> > > > > > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > > > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > > > > > whatever reason.
> > > > > > 
> > > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > > > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > > > > hits the fault?
> > > > > > 
> > > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > > > > desirable because in many cases, the winning task will install a valid mapping
> > > > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > > > > instruction is re-executed.  In the happy case, that provides optimal performance
> > > > > > as KVM doesn't introduce any extra delay/latency.
> > > > > > 
> > > > > > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > > > > > much higher, especially in light of the zero-step issues.
> > > > > > 
> > > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > > > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > > > > loop in kvm_tdp_map_page().
> > > > > > 
> > > > > > The only part I don't like about this idea is having two "retry" return values,
> > > > > > which creates the potential for bugs due to checking one but not the other.
> > > > > > 
> > > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > > > > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > > > > > option better even though the out-param is a bit gross, because it makes it more
> > > > > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > > > > most paths.
> > > > > Good idea.
> > > > > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > > > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> > > > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> > > > FROZEN_SPTE might not be enough to prevent zero step mitigation.
> > > 
> > > The goal isn't to make it completely impossible for zero-step to fire, it's to
> > > make it so that _if_ zero-step fires, KVM can report the error to userspace without
> > > having to retry, because KVM _knows_ that advancing past the zero-step isn't
> > > something KVM can solve.
> > > 
> > >  : I'm not worried about any performance hit with zero-step, I'm worried about KVM
> > >  : not being able to differentiate between a KVM bug and guest interference.  The
> > >  : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
> > >  : unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
> > >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >  : report the error to userspace instead of trying to suppress guest activity, and
> > >  : potentially from other KVM tasks too.
> > > 
> > > In other words, for the selftest you crafted, KVM reporting an error to userspace
> > > due to zero-step would be working as intended.  
> > Hmm, but the selftest is an example to show that 6 continuous EPT violations on
> > the same GPA could trigger zero-step.
> > 
> > For an extremely unlucky vCPU, is it still possible to fire zero step when
> > nothing is wrong both in KVM and QEMU?
> > e.g.
> > 
> > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
> 
> Very technically, this shouldn't be possible.  The only way for there to be
> contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. the
> 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> the SPTE.
Hmm, the 7th EPT violation could still occur if the vCPU that sees the failure of
"try_cmpxchg64()" returns to the guest faster than the one that successfully
installs the SPTE.

> 
> That said, a few thoughts:
> 
> 1. Where did we end up on the idea of requiring userspace to pre-fault memory?
I didn't follow this question.
Do you want to disallow userspace from pre-faulting memory after TD finalization,
or do you want to suggest that userspace do it?

> 
> 2. The zero-step logic really should have a slightly more conservative threshold.
>    I have a hard time believing that e.g. 10 attempts would create a side channel,
>    but 6 attempts is "fine".
Don't know where the value 6 comes from. :)
We may need to ask. 

> 3. This would be a good reason to implement a local retry in kvm_tdp_mmu_map().
>    Yes, I'm being somewhat hypocritical since I'm so against retrying for the
>    S-EPT case, but my objection to retrying for S-EPT is that it _should_ be easy
>    for KVM to guarantee success.
It's reasonable.

But TDX code still needs to retry on RET_PF_RETRY_FROZEN without
re-entering the guest.

Would it be good for TDX code to retry whenever it sees RET_PF_RETRY or
RET_PF_RETRY_FROZEN?
We can have tdx_sept_link_private_spt()/tdx_sept_set_private_spte() return
-EBUSY on contention.
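
For illustration, a minimal sketch of the kind of local retry I have in mind
(this assumes Sean's suggestion that kvm_mmu_page_fault() returns the raw
RET_PF_* code; RET_PF_RETRY_FROZEN and the wrapper below are made-up names,
not existing code):

/* Purely illustrative; RET_PF_RETRY_FROZEN and this wrapper don't exist. */
static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
				    u64 error_code)
{
	int ret;

	for (;;) {
		ret = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
		if (ret != RET_PF_RETRY_FROZEN)
			break;

		/*
		 * Another task owns the in-flight S-EPT operation.  Wait it
		 * out here instead of re-executing TDH.VP.ENTER on the same
		 * GPA, which is what the zero-step logic counts.
		 */
		if (signal_pending(current))
			return -EINTR;	/* a real version would set an exit reason */
		cpu_relax();
	}

	return ret;
}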


> 
> E.g. for #3, the below (compile tested only) patch should make it impossible for
> the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror SPTEs
> should never trigger A/D assists, i.e. retry should always succeed.
> 
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 47 ++++++++++++++++++++++++++++++++------
>  1 file changed, 40 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3b996c1fdaab..e47573a652a9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1097,6 +1097,18 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>  static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>  				   struct kvm_mmu_page *sp, bool shared);
>  
> +static struct kvm_mmu_page *tdp_mmu_realloc_sp(struct kvm_vcpu *vcpu,
> +					       struct kvm_mmu_page *sp)
> +{
> +	if (!sp)
> +		return tdp_mmu_alloc_sp(vcpu);
> +
> +	memset(sp, 0, sizeof(*sp));
> +	memset64(sp->spt, vcpu->arch.mmu_shadow_page_cache.init_value,
> +		 PAGE_SIZE / sizeof(u64));
> +	return sp;
> +}
> +
>  /*
>   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>   * page tables and SPTEs to translate the faulting guest physical address.
> @@ -1104,9 +1116,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
> +	struct kvm_mmu_page *sp = NULL;
>  	struct kvm *kvm = vcpu->kvm;
>  	struct tdp_iter iter;
> -	struct kvm_mmu_page *sp;
>  	int ret = RET_PF_RETRY;
>  
>  	kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -1116,8 +1128,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	rcu_read_lock();
>  
>  	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> -		int r;
> -
> +		/*
> +		 * Somewhat arbitrarily allow two local retries, e.g. to play
> +		 * nice with the extremely unlikely case that KVM encounters a
> +		 * huge SPTE an Access-assist _and_ a subsequent Dirty-assist.
> +		 * Retrying is inexpensive, but if KVM fails to install a SPTE
> +		 * three times, then a fourth attempt is likely futile and it's
> +		 * time to back off.
> +		 */
> +		int r, retry_locally = 2;
> +again:
>  		if (fault->nx_huge_page_workaround_enabled)
>  			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
>  
> @@ -1140,7 +1160,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		 * The SPTE is either non-present or points to a huge page that
>  		 * needs to be split.
>  		 */
> -		sp = tdp_mmu_alloc_sp(vcpu);
> +		sp = tdp_mmu_realloc_sp(vcpu, sp);
>  		tdp_mmu_init_child_sp(sp, &iter);
>  
>  		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> @@ -1151,11 +1171,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
>  
>  		/*
> -		 * Force the guest to retry if installing an upper level SPTE
> -		 * failed, e.g. because a different task modified the SPTE.
> +		 * If installing an upper level SPTE failed, retry the walk
> +		 * locally before forcing the guest to retry.  If the SPTE was
> +		 * modified by a different task, odds are very good the new
> +		 * SPTE is usable as-is.  And if the SPTE was modified by the
> +		 * CPU, e.g. to set A/D bits, then unless KVM gets *extremely*
> +		 * unlucky, the CMPXCHG should succeed the second time around.
>  		 */
>  		if (r) {
> -			tdp_mmu_free_sp(sp);
> +			if (retry_locally--)
> +				goto again;
>  			goto retry;
>  		}
>  
> @@ -1166,6 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  				track_possible_nx_huge_page(kvm, sp);
>  			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>  		}
> +		sp = NULL;
>  	}
>  
>  	/*
> @@ -1180,6 +1206,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  
>  retry:
>  	rcu_read_unlock();
> +
> +	/*
> +	 * Free the previously allocated MMU page if KVM retried locally and
> +	 * ended up not using said page.
> +	 */
> +	if (sp)
> +		tdp_mmu_free_sp(sp);
>  	return ret;
>  }
>  
> 
> base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
> -- 
>
Yan Zhao Oct. 11, 2024, 2:30 a.m. UTC | #36
On Fri, Oct 11, 2024 at 05:53:29AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
> > > 
> > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
> 
> Isn't there a more general scenario:
> 
> vcpu0                              vcpu1
> 1. Freezes PTE
> 2. External op to do the SEAMCALL
> 3.                                 Faults same PTE, hits frozen PTE
> 4.                                 Retries N times, triggers zero-step
> 5. Finally finishes external op
> 
> Am I missing something?
Yes, it's a follow-up discussion of Sean's proposal [1] to have TDX code
retry on RET_PF_RETRY_FROZEN to avoid zero-step.
My worry is that merely avoiding entering the guest for vCPUs seeing FROZEN_SPTE
is not enough to prevent zero-step.
The two examples show that zero-step is possible without re-entering the guest
on FROZEN_SPTE:
- The selftest [2]: a single vCPU can fire zero-step when userspace does
  something wrong (though KVM is correct).
- The above case: Nothing wrong in KVM/QEMU, except an extremely unlucky vCPU.


[1] https://lore.kernel.org/all/ZuR09EqzU1WbQYGd@google.com/
[2] https://lore.kernel.org/all/ZvPrqMj1BWrkkwqN@yzhao56-desk.sh.intel.com/

> 
> > 
> > Very technically, this shouldn't be possible.  The only way for there to be
> > contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e.
> > the
> > 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> > the SPTE.
> > 
> > That said, a few thoughts:
> > 
> > 1. Where did we end up on the idea of requiring userspace to pre-fault memory?
> 
> For others reference, I think you are referring to the idea to pre-fault the
> entire S-EPT even for GFNs that usually get AUGed, not the mirrored EPT pre-
> faulting/PAGE.ADD dance we are already doing.
> 
> The last discussion with Paolo was to resume the retry solution discussion on
> the v2 posting because it would be easier "with everything else already
> addressed". Also, there was also some discussion that it was not immediately
> obvious how prefaulting everything would work for memory hot plug (i.e. memslots
> added during runtime).
> 
> > 
> > 2. The zero-step logic really should have a slightly more conservative
> > threshold.
> >    I have a hard time believing that e.g. 10 attempts would create a side
> > channel,
> >    but 6 attempts is "fine".
> 
> No idea where the threshold came from. I'm not sure if it affects the KVM
> design? We can look into it for curiosity sake in either case.
> 
> > 
> > 3. This would be a good reason to implement a local retry in
> > kvm_tdp_mmu_map().
> >    Yes, I'm being somewhat hypocritical since I'm so against retrying for the
> >    S-EPT case, but my objection to retrying for S-EPT is that it _should_ be
> > easy
> >    for KVM to guarantee success.
> > 
> > E.g. for #3, the below (compile tested only) patch should make it impossible
> > for
> > the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror
> > SPTEs
> > should never trigger A/D assists, i.e. retry should always succeed.
> 
> I don't see how it addresses the scenario above. More retires could just make it
> rarer, but never fix it. Very possible I'm missing something though.
I'm also not 100% sure that zero-step can't happen after this change, even when
KVM/QEMU do nothing wrong.
Kai Huang Oct. 14, 2024, 10:54 a.m. UTC | #37
On Thu, 2024-10-10 at 21:53 +0000, Edgecombe, Rick P wrote:
> On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
> > > 
> > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
> 
> Isn't there a more general scenario:
> 
> vcpu0                              vcpu1
> 1. Freezes PTE
> 2. External op to do the SEAMCALL
> 3.                                 Faults same PTE, hits frozen PTE
> 4.                                 Retries N times, triggers zero-step
> 5. Finally finishes external op
> 
> Am I missing something?

I must be missing something.  I thought KVM is going to retry internally for
step 4 (retries N times) because it sees the frozen PTE, and will not go back
to the guest until the fault is resolved?  How can step 4 trigger zero-step?
Rick Edgecombe Oct. 14, 2024, 5:36 p.m. UTC | #38
On Mon, 2024-10-14 at 10:54 +0000, Huang, Kai wrote:
> On Thu, 2024-10-10 at 21:53 +0000, Edgecombe, Rick P wrote:
> > On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
> > > > 
> > > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > > > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
> > 
> > Isn't there a more general scenario:
> > 
> > vcpu0                              vcpu1
> > 1. Freezes PTE
> > 2. External op to do the SEAMCALL
> > 3.                                 Faults same PTE, hits frozen PTE
> > 4.                                 Retries N times, triggers zero-step
> > 5. Finally finishes external op
> > 
> > Am I missing something?
> 
> I must be missing something.  I thought KVM is going to 
> 

"Is going to", as in "will be changed to"? Or "does today"?

> retry internally for
> step 4 (retries N times) because it sees the frozen PTE, and will not go back
> to the guest until the fault is resolved?  How can step 4 trigger zero-step?

Step 3-4 is saying it will go back to the guest and fault again.


As far as what KVM will do in the future, I think it is still open. I've not had
the chance to think about this for more than 30 min at a time, but the plan to
handle OPERAND_BUSY by taking an expensive path to break any contention (i.e.
kick+lock + whatever TDX module changes we come up with) seems to the leading
idea.

Retrying N times is too hacky. Retrying internally forever might be awkward to
implement: because of the signal_pending() check, you would have to handle
exiting to userspace and coming back to the EPT violation the next time the vCPU
tries to enter.
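
For the kick+lock direction, a very rough sketch of what I'm picturing (the
helper below is made up, and it hand-waves how vCPUs would be kept from
re-entering the guest before the retry):

/* Made-up helper; only sketches "break the contention, then retry once". */
static u64 tdx_seamcall_sept_kick(struct kvm *kvm, u64 op,
				  struct tdx_module_args *in)
{
	struct tdx_module_args saved = *in;
	u64 err;

	err = seamcall_ret(op, in);
	if (err != (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT))
		return err;

	/*
	 * Kick all vCPUs out of the guest so TDH.VP.ENTER can't be holding
	 * the S-EPT lock inside the TDX module, then retry exactly once.
	 */
	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);

	*in = saved;	/* seamcall_ret() clobbered the inputs */
	return seamcall_ret(op, in);
}

If the second attempt still failed, that would point at a KVM bug and could be
reported as such rather than retried forever.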
Kai Huang Oct. 14, 2024, 11:03 p.m. UTC | #39
On 15/10/2024 6:36 am, Edgecombe, Rick P wrote:
> On Mon, 2024-10-14 at 10:54 +0000, Huang, Kai wrote:
>> On Thu, 2024-10-10 at 21:53 +0000, Edgecombe, Rick P wrote:
>>> On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
>>>>>
>>>>> 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
>>>>> 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
>>>
>>> Isn't there a more general scenario:
>>>
>>> vcpu0                              vcpu1
>>> 1. Freezes PTE
>>> 2. External op to do the SEAMCALL
>>> 3.                                 Faults same PTE, hits frozen PTE
>>> 4.                                 Retries N times, triggers zero-step
>>> 5. Finally finishes external op
>>>
>>> Am I missing something?
>>
>> I must be missing something.  I thought KVM is going to
>>
> 
> "Is going to", as in "will be changed to"? Or "does today"?

Will be changed to (today's behaviour is to go back to guest to let the 
fault happen again to retry).

AFAICT this is what Sean suggested:

https://lore.kernel.org/all/ZuR09EqzU1WbQYGd@google.com/

The whole point is to let KVM loop internally but not go back to guest 
when the fault handler sees a frozen PTE.  And in this proposal this 
applies to both leaf and non-leaf PTEs IIUC, so it should handle the 
case where try_cmpxchg64() fails as mentioned by Yan.

> 
>> retry internally for
>> step 4 (retries N times) because it sees the frozen PTE, and will not go back
>> to the guest until the fault is resolved?  How can step 4 trigger zero-step?
> 
> Step 3-4 is saying it will go back to the guest and fault again.

As said above, the whole point is to make KVM loop internally when it 
sees a frozen PTE, but not go back to guest.
Rick Edgecombe Oct. 15, 2024, 1:24 a.m. UTC | #40
On Tue, 2024-10-15 at 12:03 +1300, Huang, Kai wrote:
> > "Is going to", as in "will be changed to"? Or "does today"?
> 
> Will be changed to (today's behaviour is to go back to guest to let the 
> fault happen again to retry).
> 
> AFAICT this is what Sean suggested:
> 
> https://lore.kernel.org/all/ZuR09EqzU1WbQYGd@google.com/
> 
> The whole point is to let KVM loop internally but not go back to guest 
> when the fault handler sees a frozen PTE.  And in this proposal this 
> applies to both leaf and non-leaf PTEs IIUC, so it should handle the 
> case where try_cmpxchg64() fails as mentioned by Yan.
> 
> > 
> > > retry internally for
> > > step 4 (retries N times) because it sees the frozen PTE, but will never go
> > > back
> > > to guest after the fault is resolved?  How can step 4 triggers zero-step?
> > 
> > Step 3-4 is saying it will go back to the guest and fault again.
> 
> As said above, the whole point is to make KVM loop internally when it 
> sees a frozen PTE, but not go back to guest.

Yeah, my point on that idea was that looping forever without checking for a
signal would be problematic. Then userspace could re-enter the TD. I don't
know if it's a show stopper.

In any case the discussion between these threads and LPC/KVM forum hallway
chatter has gotten a bit fragmented. I don't think there is any concrete
consensus solution at this point.
Yan Zhao Oct. 16, 2024, 2:13 p.m. UTC | #41
On Thu, Oct 10, 2024 at 10:33:30AM -0700, Sean Christopherson wrote:
> On Thu, Oct 10, 2024, Yan Zhao wrote:
> > On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote:
> > > On Wed, Sep 25, 2024, Yan Zhao wrote:
> > > > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > > > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > > > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > > > > branch TDX_1.5.05.
> > > > > > > 
> > > > > > > TL;DR:
> > > > > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > > > > 
> > > > > > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > > > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > > > > > whatever reason.
> > > > > > 
> > > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > > > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > > > > hits the fault?
> > > > > > 
> > > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > > > > desirable because in many cases, the winning task will install a valid mapping
> > > > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > > > > instruction is re-executed.  In the happy case, that provides optimal performance
> > > > > > as KVM doesn't introduce any extra delay/latency.
> > > > > > 
> > > > > > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > > > > > much higher, especially in light of the zero-step issues.
> > > > > > 
> > > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > > > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > > > > loop in kvm_tdp_map_page().
> > > > > > 
> > > > > > The only part I don't like about this idea is having two "retry" return values,
> > > > > > which creates the potential for bugs due to checking one but not the other.
> > > > > > 
> > > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > > > > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > > > > > option better even though the out-param is a bit gross, because it makes it more
> > > > > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > > > > most paths.
> > > > > Good idea.
> > > > > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > > > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> > > > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> > > > FROZEN_SPTE might not be enough to prevent zero step mitigation.
> > > 
> > > The goal isn't to make it completely impossible for zero-step to fire, it's to
> > > make it so that _if_ zero-step fires, KVM can report the error to userspace without
> > > having to retry, because KVM _knows_ that advancing past the zero-step isn't
> > > something KVM can solve.
> > > 
> > >  : I'm not worried about any performance hit with zero-step, I'm worried about KVM
> > >  : not being able to differentiate between a KVM bug and guest interference.  The
> > >  : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
> > >  : unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
> > >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >  : report the error to userspace instead of trying to suppress guest activity, and
> > >  : potentially from other KVM tasks too.
> > > 
> > > In other words, for the selftest you crafted, KVM reporting an error to userspace
> > > due to zero-step would be working as intended.  
> > Hmm, but the selftest is an example to show that 6 continuous EPT violations on
> > the same GPA could trigger zero-step.
> > 
> > For an extremely unlucky vCPU, is it still possible to fire zero step when
> > nothing is wrong both in KVM and QEMU?
> > e.g.
> > 
> > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
> 
> Very technically, this shouldn't be possible.  The only way for there to be
> contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. the
> 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> the SPTE.
You are right!
I just realized that if TDX code retries internally for FROZEN_SPTEs, the 6th
attempt should succeed.

But I found that the case below might be another way to return RET_PF_RETRY and
trigger zero-step:

Suppose GFNs 0x80000 - 0x80200 are shared,
with the corresponding HVA range starting at hva1 and covering 0x200 pages


     vCPU 0                                    vCPU 1
                                     1. Access GFN 0x80002
	                             2. convert GFN 0x80002 to private

3.munmap hva1 of size 0x200
  kvm_mmu_invalidate_begin
  mmu_invalidate_range_start=0x80000
  mmu_invalidate_range_end=0x80200

                                     4. kvm_faultin_pfn
	                                mmu_invalidate_retry_gfn_unsafe of
				        GFN 0x80002 and return RET_PF_RETRY!

5.kvm_mmu_invalidate_end


Until step 5 runs, step 4 will keep producing RET_PF_RETRY and re-entering the guest.
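
Roughly the kvm_faultin_pfn() check involved in step 4, paraphrased
(illustrative only, not an exact copy of the upstream code):

/* Paraphrased for illustration; see kvm_faultin_pfn() for the real thing. */
static int kvm_faultin_pfn_sketch(struct kvm_vcpu *vcpu,
				  struct kvm_page_fault *fault)
{
	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();

	/*
	 * While vCPU 0 sits between kvm_mmu_invalidate_begin() and
	 * kvm_mmu_invalidate_end() for 0x80000 - 0x80200, every fault on
	 * GFN 0x80002 bails out here with RET_PF_RETRY, re-enters the
	 * guest, and therefore counts toward the zero-step threshold.
	 */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq,
					    fault->gfn))
		return RET_PF_RETRY;

	return __kvm_faultin_pfn(vcpu, fault);
}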

Patch

diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 0363d8544f42..8ca3e252a6ed 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -31,6 +31,40 @@ 
 #define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8)	\
 	pr_tdx_error_N(__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8)
 
+/*
+ * The TDX module acquires internal locks to protect its resources.  It doesn't
+ * spin on those locks because of its execution time restrictions.  Instead, it
+ * returns TDX_OPERAND_BUSY with an operand ID.
+ *
+ * Multiple vCPUs can operate on the SEPT.  Also, as part of zero-step attack
+ * mitigation, TDH.VP.ENTER may rarely acquire and release the SEPT lock when a
+ * zero-step attack is suspected.  That results in TDX_OPERAND_BUSY |
+ * TDX_OPERAND_ID_SEPT from TDH.MEM.* operations (TDH.MEM.TRACK is an exception).
+ *
+ * Because the TDP MMU uses a read lock for scalability, wrapping the SEAMCALL
+ * in a spinlock would defeat that effort.  Retry several times, on the
+ * assumption that SEPT lock contention is rare, but don't loop forever to
+ * avoid a lockup; let the TDP MMU retry instead.
+ */
+#define TDX_ERROR_SEPT_BUSY    (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)
+
+static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
+{
+#define SEAMCALL_RETRY_MAX     16
+	struct tdx_module_args args_in;
+	int retry = SEAMCALL_RETRY_MAX;
+	u64 ret;
+
+	do {
+		args_in = *in;
+		ret = seamcall_ret(op, in);
+	} while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0);
+
+	*in = args_in;
+
+	return ret;
+}
+
 static inline u64 tdh_mng_addcx(struct kvm_tdx *kvm_tdx, hpa_t addr)
 {
 	struct tdx_module_args in = {
@@ -55,7 +89,7 @@  static inline u64 tdh_mem_page_add(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 	u64 ret;
 
 	clflush_cache_range(__va(hpa), PAGE_SIZE);
-	ret = seamcall_ret(TDH_MEM_PAGE_ADD, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_PAGE_ADD, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
@@ -76,7 +110,7 @@  static inline u64 tdh_mem_sept_add(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 
 	clflush_cache_range(__va(page), PAGE_SIZE);
 
-	ret = seamcall_ret(TDH_MEM_SEPT_ADD, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_SEPT_ADD, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
@@ -93,7 +127,7 @@  static inline u64 tdh_mem_sept_remove(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 	};
 	u64 ret;
 
-	ret = seamcall_ret(TDH_MEM_SEPT_REMOVE, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_SEPT_REMOVE, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
@@ -123,7 +157,7 @@  static inline u64 tdh_mem_page_aug(struct kvm_tdx *kvm_tdx, gpa_t gpa, hpa_t hpa
 	u64 ret;
 
 	clflush_cache_range(__va(hpa), PAGE_SIZE);
-	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_PAGE_AUG, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
@@ -140,7 +174,7 @@  static inline u64 tdh_mem_range_block(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 	};
 	u64 ret;
 
-	ret = seamcall_ret(TDH_MEM_RANGE_BLOCK, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_RANGE_BLOCK, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
@@ -335,7 +369,7 @@  static inline u64 tdh_mem_page_remove(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 	};
 	u64 ret;
 
-	ret = seamcall_ret(TDH_MEM_PAGE_REMOVE, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
@@ -361,7 +395,7 @@  static inline u64 tdh_mem_range_unblock(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 	};
 	u64 ret;
 
-	ret = seamcall_ret(TDH_MEM_RANGE_UNBLOCK, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_RANGE_UNBLOCK, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;