[3/7] KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY

Message ID 20250113021218.18922-1-yan.y.zhao@intel.com (mailing list archive)
State New, archived
Series KVM: TDX SEPT SEAMCALL retry

Commit Message

Yan Zhao Jan. 13, 2025, 2:12 a.m. UTC
Retry locally in the TDX EPT violation handler for private memory to reduce
the chances for tdh_mem_sept_add()/tdh_mem_page_aug() to contend with
tdh_vp_enter().

The TDX EPT violation handler installs private pages via tdh_mem_sept_add()
and tdh_mem_page_aug(). These two SEAMCALLs may contend with tdh_vp_enter()
or with TDCALLs.

Resources     SHARED users         EXCLUSIVE users
------------------------------------------------------------
SEPT tree     tdh_mem_sept_add     tdh_vp_enter (0-step mitigation)
              tdh_mem_page_aug
------------------------------------------------------------
SEPT entry                         tdh_mem_sept_add (Host lock)
                                   tdh_mem_page_aug (Host lock)
                                   tdg_mem_page_accept (Guest lock)
                                   tdg_mem_page_attr_rd (Guest lock)
                                   tdg_mem_page_attr_wr (Guest lock)

Though the contention between tdh_mem_sept_add()/tdh_mem_page_aug() and
TDCALLs may be removed in a future TDX module, their contention with
tdh_vp_enter() due to 0-step mitigation will still persist.

The TDX module may trigger 0-step mitigation in SEAMCALL TDH.VP.ENTER, which
works as follows (a rough C sketch follows the list):
0. Each TDH.VP.ENTER records the guest RIP on TD entry.
1. When the TDX module encounters a VM exit with reason EPT_VIOLATION, it
   checks if the guest RIP is the same as the last guest RIP on TD entry.
   -if yes, it means the EPT violation is caused by the same instruction
            that caused the last VM exit.
            Then, the TDX module increases the guest RIP no-progress count.
            When the count increases from 0 to the threshold (currently 6),
            the TDX module records the faulting GPA into a
            last_epf_gpa_list.
   -if no,  it means the guest RIP has made progress.
            So, the TDX module resets the RIP no-progress count and the
            last_epf_gpa_list.
2. On the next TDH.VP.ENTER, the TDX module (after saving the guest RIP on
   TD entry) checks if the last_epf_gpa_list is empty.
   -if yes, TD entry continues without acquiring the lock on the SEPT tree.
   -if no,  it triggers the 0-step mitigation by acquiring the exclusive
            lock on the SEPT tree and walking the EPT tree to check that
            all page faults caused by the GPAs in the last_epf_gpa_list
            have been resolved before continuing TD entry.
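
To make the bookkeeping concrete, below is a rough C sketch of the steps
above. It is an illustration only: all names (zero_step_state,
on_ept_violation_exit(), EPF_GPA_LIST_SIZE, ...) are hypothetical and the
real state lives inside the TDX module; only the threshold value of 6 comes
from this description.

#include <stdint.h>

#define NO_PROGRESS_THRESHOLD	6
#define EPF_GPA_LIST_SIZE	8	/* hypothetical; the real size is TDX-module internal */

struct zero_step_state {
	uint64_t last_entry_rip;	/* guest RIP saved on each TD entry (step 0) */
	uint32_t no_progress_count;
	uint64_t last_epf_gpa_list[EPF_GPA_LIST_SIZE];
	uint32_t nr_epf_gpas;
};

/* Step 1: bookkeeping on a VM exit with reason EPT_VIOLATION. */
static void on_ept_violation_exit(struct zero_step_state *s,
				  uint64_t guest_rip, uint64_t fault_gpa)
{
	if (guest_rip == s->last_entry_rip) {
		/* Same instruction faulted again, i.e. no progress was made. */
		if (++s->no_progress_count >= NO_PROGRESS_THRESHOLD &&
		    s->nr_epf_gpas < EPF_GPA_LIST_SIZE)
			s->last_epf_gpa_list[s->nr_epf_gpas++] = fault_gpa;
	} else {
		/* The guest RIP made progress: reset the tracking state. */
		s->no_progress_count = 0;
		s->nr_epf_gpas = 0;
	}
}

/* Step 2: check on the next TDH.VP.ENTER, after saving the entry RIP. */
static void on_td_enter(struct zero_step_state *s)
{
	if (!s->nr_epf_gpas)
		return;	/* common case: TD entry continues without the lock */

	/*
	 * 0-step mitigation: acquire the EXCLUSIVE lock on the SEPT tree
	 * (this is what contends with tdh_mem_sept_add()/tdh_mem_page_aug())
	 * and walk it to verify that the faults recorded in
	 * last_epf_gpa_list have been resolved before continuing TD entry.
	 */
}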

Since the KVM TDP MMU usually re-enters the guest whenever it exits to
userspace (e.g. for KVM_EXIT_MEMORY_FAULT) or encounters a BUSY status, it is
possible for tdh_vp_enter() to be called more than the threshold count of
times before a page fault is resolved, triggering contention when
tdh_vp_enter() attempts to acquire the exclusive lock on the SEPT tree.

Retry locally in the TDX EPT violation handler to reduce the number of
tdh_vp_enter() invocations, hence reducing the possibility of contention
with tdh_mem_sept_add()/tdh_mem_page_aug(). However, the 0-step mitigation
and the contention are still not eliminated, due to KVM_EXIT_MEMORY_FAULT,
signals/interrupts, and cases in which one instruction faults on more GFNs
than the threshold count.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

Comments

Sean Christopherson Jan. 17, 2025, 9:14 p.m. UTC | #1
On Mon, Jan 13, 2025, Yan Zhao wrote:
> @@ -1884,7 +1904,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
>  	}
>  
>  	trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
> -	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
> +
> +	while (1) {
> +		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> +
> +		if (ret != RET_PF_RETRY || !local_retry)
> +			break;
> +
> +		/*
> +		 * Break and keep the orig return value.

Wrap at 80.

> +		 * Signal & irq handling will be done later in vcpu_run()

Please don't use "&" as shorthand.  It saves all of two characters.  That said,
I don't see any point in adding this comment, if the reader can't follow the
logic of this code, these comments aren't going to help them.  And the comment
about vcpu_run() in particular is misleading, as posted interrupts aren't truly
handled by vcpu_run(), rather they're handled by hardware (although KVM does send
a self-IPI).

> +		 */
> +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)

This needs to check that the IRQ/NMI is actually allowed.  I guess it doesn't
matter for IRQs, but it does matter for NMIs.  Why not use kvm_vcpu_has_events()?
Ah, it's a local function.  At a glance, I don't see any harm in exposing that
to TDX.

> +			break;
> +
> +		cond_resched();
> +	}

Nit, IMO this reads better as:

	do {
		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
	} while (ret == RET_PF_RETRY && local_retry &&
		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));

> +	return ret;
>  }
>  
>  int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> -- 
> 2.43.2
>
Yan Zhao Jan. 20, 2025, 8:05 a.m. UTC | #2
On Fri, Jan 17, 2025 at 01:14:02PM -0800, Sean Christopherson wrote:
> On Mon, Jan 13, 2025, Yan Zhao wrote:
> > @@ -1884,7 +1904,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> >  	}
> >  
> >  	trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
> > -	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
> > +
> > +	while (1) {
> > +		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> > +
> > +		if (ret != RET_PF_RETRY || !local_retry)
> > +			break;
> > +
> > +		/*
> > +		 * Break and keep the orig return value.
> 
> Wrap at 80.
> 
> > +		 * Signal & irq handling will be done later in vcpu_run()
> 
> Please don't use "&" as shorthand.  It saves all of two characters.  That said,
Got it!

> I don't see any point in adding this comment, if the reader can't follow the
> logic of this code, these comments aren't going to help them.  And the comment
> about vcpu_run() in particular is misleading, as posted interrupts aren't truly
> handled by vcpu_run(), rather they're handled by hardware (although KVM does send
> a self-IPI).
What about the below version?

"
Bail out the local retry
- for pending signal, so that vcpu_run() --> xfer_to_guest_mode_handle_work()
  --> kvm_handle_signal_exit() can exit to userspace for signal handling.
- for pending interrupts, so that tdx_vcpu_enter_exit() --> tdh_vp_enter() will
  be re-executed for interrupt injection through posted interrupt.
- for pending nmi or KVM_REQ_NMI, so that vcpu_enter_guest() will be
  re-executed to process and pend NMI to the TDX module. KVM always regards NMI
  as allowed and the TDX module will inject it when NMI is allowed in the TD.
"

> > +		 */
> > +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> > +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
> 
> This needs to check that the IRQ/NMI is actually allowed.  I guess it doesn't
> matter for IRQs, but it does matter for NMIs.  Why not use kvm_vcpu_has_events()?
Yes. However, vt_nmi_allowed() is always true for TDs.
For interrupts, tdx_interrupt_allowed() is always true unless the exit reason is
EXIT_REASON_HLT. For the EPT violation handler, the exit reason should not be
EXIT_REASON_HLT.

> Ah, it's a local function.  At a glance, I don't see any harm in exposing that
> to TDX.
Besides kvm_vcpu_has_events() being a local function, the consideration behind
checking "pi_has_pending_interrupt() || kvm_test_request(KVM_REQ_NMI, vcpu) ||
vcpu->arch.nmi_pending" instead is that
(1) the two are effectively equivalent for TDs (as nested is not supported yet)
(2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
    pending. However, vt_inject_exception() is NULL for TDs.

> > +			break;
> > +
> > +		cond_resched();
> > +	}
> 
> Nit, IMO this reads better as:
> 
> 	do {
> 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> 	} while (ret == RET_PF_RETRY && local_retry &&
> 		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));
>
Hmm, the previous way can save one "cond_resched()" for the common cases, i.e.,
when ret != RET_PF_RETRY or when the gpa is shared.

> > +	return ret;
> >  }
> >  
> >  int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> > -- 
> > 2.43.2
> >
Sean Christopherson Jan. 25, 2025, 1:23 a.m. UTC | #3
On Mon, Jan 20, 2025, Yan Zhao wrote:
> On Fri, Jan 17, 2025 at 01:14:02PM -0800, Sean Christopherson wrote:
> > On Mon, Jan 13, 2025, Yan Zhao wrote:
> > I don't see any point in adding this comment, if the reader can't follow the
> > logic of this code, these comments aren't going to help them.  And the comment
> > about vcpu_run() in particular is misleading, as posted interrupts aren't truly
> > handled by vcpu_run(), rather they're handled by hardware (although KVM does send
> > a self-IPI).
> What about the below version?
> 
> "
> Bail out the local retry
> - for pending signal, so that vcpu_run() --> xfer_to_guest_mode_handle_work()
>   --> kvm_handle_signal_exit() can exit to userspace for signal handling.

Eh, pending signals should be self-explanatory.

> - for pending interrupts, so that tdx_vcpu_enter_exit() --> tdh_vp_enter() will
>   be re-executed for interrupt injection through posted interrupt.
> - for pending nmi or KVM_REQ_NMI, so that vcpu_enter_guest() will be
>   re-executed to process and pend NMI to the TDX module. KVM always regards NMI
>   as allowed and the TDX module will inject it when NMI is allowed in the TD.
> "
> 
> > > +		 */
> > > +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> > > +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
> > 
> > This needs to check that the IRQ/NMI is actually allowed.  I guess it doesn't
> > matter for IRQs, but it does matter for NMIs.  Why not use kvm_vcpu_has_events()?
> Yes. However, vt_nmi_allowed() is always true for TDs.
> For interrupts, tdx_interrupt_allowed() is always true unless the exit reason is
> EXIT_REASON_HLT. For the EPT violation handler, the exit reason should not be
> EXIT_REASON_HLT.
> 
> > Ah, it's a local function.  At a glance, I don't see any harm in exposing that
> > to TDX.
> Besides kvm_vcpu_has_events() being a local function, the consideration behind
> checking "pi_has_pending_interrupt() || kvm_test_request(KVM_REQ_NMI, vcpu) ||
> vcpu->arch.nmi_pending" instead is that

*sigh*

  PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one NMI in
  the TDX module. Also, TDX doesn't allow KVM to request NMI-window exit
  directly. When there is already one NMI pending in the TDX module, i.e. it
  has not been delivered to TDX guest yet, if there is NMI pending in KVM,
  collapse the pending NMI in KVM into the one pending in the TDX module.
  Such collapse is OK considering on X86 bare metal, multiple NMIs could
  collapse into one NMI, e.g. when NMI is blocked by SMI.  It's OS's
  responsibility to poll all NMI sources in the NMI handler to avoid missing
  handling of some NMI events. More details can be found in the changelog of
  the patch "KVM: TDX: Implement methods to inject NMI".

That's probably fine?  But it's still unfortunate that TDX manages to be different
at almost every opportunity :-(

> (1) the two are effectively equivalent for TDs (as nested is not supported yet)

If they're all equivalent, then *not* open coding is desirable, IMO.  Ah, but
they aren't equivalent.  tdx_protected_apic_has_interrupt() also checks whatever
TD_VCPU_STATE_DETAILS_NON_ARCH is.

	vcpu_state_details =
		td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);

	return tdx_vcpu_state_details_intr_pending(vcpu_state_details);

That code needs a comment, because depending on the behavior of that field, it
might not even be correct.

> (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
>     pending. However, vt_inject_exception() is NULL for TDs.

Wouldn't a pending exception be a KVM bug?

The bigger oddity, which I think is worth calling out, is that because KVM can't
determine if IRQs (or NMIs) are blocked at the time of the EPT violation, false
positives are inevitable.  I.e. KVM may re-enter the guest even if the IRQ/NMI
can't be delivered.  Call *that* out, and explain why it's fine.

> > > +			break;
> > > +
> > > +		cond_resched();
> > > +	}
> > 
> > Nit, IMO this reads better as:
> > 
> > 	do {
> > 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> > 	} while (ret == RET_PF_RETRY && local_retry &&
> > 		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));
> >
> Hmm, the previous way can save one "cond_resched()" for the common cases, i.e.,
> when ret != RET_PF_RETRY or when the gpa is shared.

Hrm, right.  Maybe this?  Dunno if that's any better.

	ret = 0;
	do {
		if (ret)
			cond_resched();

		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
	} while (...)
Yan Zhao Jan. 27, 2025, 9:24 a.m. UTC | #4
On Fri, Jan 24, 2025 at 05:23:36PM -0800, Sean Christopherson wrote:
> On Mon, Jan 20, 2025, Yan Zhao wrote:
> > On Fri, Jan 17, 2025 at 01:14:02PM -0800, Sean Christopherson wrote:
> > > On Mon, Jan 13, 2025, Yan Zhao wrote:
> > > I don't see any point in adding this comment, if the reader can't follow the
> > > logic of this code, these comments aren't going to help them.  And the comment
> > > about vcpu_run() in particular is misleading, as posted interrupts aren't truly
> > > handled by vcpu_run(), rather they're handled by hardware (although KVM does send
> > > a self-IPI).
> > What about the below version?
> > 
> > "
> > Bail out the local retry
> > - for pending signal, so that vcpu_run() --> xfer_to_guest_mode_handle_work()
> >   --> kvm_handle_signal_exit() can exit to userspace for signal handling.
> 
> Eh, pending signals should be self-explanatory.
Ok.

> 
> > - for pending interrupts, so that tdx_vcpu_enter_exit() --> tdh_vp_enter() will
> >   be re-executed for interrupt injection through posted interrupt.
> > - for pending nmi or KVM_REQ_NMI, so that vcpu_enter_guest() will be
> >   re-executed to process and pend NMI to the TDX module. KVM always regards NMI
> >   as allowed and the TDX module will inject it when NMI is allowed in the TD.
> > "
> > 
> > > > +		 */
> > > > +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> > > > +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
> > > 
> > > This needs to check that the IRQ/NMI is actually allowed.  I guess it doesn't
> > > matter for IRQs, but it does matter for NMIs.  Why not use kvm_vcpu_has_events()?
> > Yes. However, vt_nmi_allowed() is always true for TDs.
> > For interrupts, tdx_interrupt_allowed() is always true unless the exit reason is
> > EXIT_REASON_HLT. For the EPT violation handler, the exit reason should not be
> > EXIT_REASON_HLT.
> > 
> > > Ah, it's a local function.  At a glance, I don't see any harm in exposing that
> > > to TDX.
> > Besides kvm_vcpu_has_events() being a local function, the consideration behind
> > checking "pi_has_pending_interrupt() || kvm_test_request(KVM_REQ_NMI, vcpu) ||
> > vcpu->arch.nmi_pending" instead is that
> 
> *sigh*
> 
>   PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one NMI in
>   the TDX module. Also, TDX doesn't allow KVM to request NMI-window exit
>   directly. When there is already one NMI pending in the TDX module, i.e. it
>   has not been delivered to TDX guest yet, if there is NMI pending in KVM,
>   collapse the pending NMI in KVM into the one pending in the TDX module.
>   Such collapse is OK considering on X86 bare metal, multiple NMIs could
>   collapse into one NMI, e.g. when NMI is blocked by SMI.  It's OS's
>   responsibility to poll all NMI sources in the NMI handler to avoid missing
>   handling of some NMI events. More details can be found in the changelog of
>   the patch "KVM: TDX: Implement methods to inject NMI".
> 
> That's probably fine?  But it's still unfortunate that TDX manages to be different
> at almost every opportunity :-(
:(

> > (1) the two are effectively equivalent for TDs (as nested is not supported yet)
> 
> If they're all equivalent, then *not* open coding is desirable, IMO.  Ah, but
> they aren't equivalent.  tdx_protected_apic_has_interrupt() also checks whatever
> TD_VCPU_STATE_DETAILS_NON_ARCH is.
> 
> 	vcpu_state_details =
> 		td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
> 
> 	return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
Right. That's why I asked that in the PUCK.

The "tdx_vcpu_state_details_intr_pending(vcpu_state_details)" checks if there's
a pending interrupt that can be recognized.
As shown in this code snippet from the TDX module:

            case MD_TDVPS_VCPU_STATE_DETAILS_FIELD_CODE:
            {
                // Calculate virtual interrupt pending status
                vcpu_state_t vcpu_state_details;
                guest_interrupt_status_t interrupt_status;
                uint64_t interrupt_status_raw;

                set_vm_vmcs_as_active(md_ctx.tdvps_ptr, 0);

                ia32_vmread(VMX_GUEST_INTERRUPT_STATUS_ENCODE, &interrupt_status_raw);
                interrupt_status.raw = (uint16_t)interrupt_status_raw;
                vcpu_state_details.raw = 0ULL;
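                // vmxip ("virtual interrupt pending") is set below when the
                // priority class of the highest pending vector (RVI bits 7:4)
                // exceeds the current processor priority (PPR bits 7:4), i.e.
                // when a pending virtual interrupt is recognizable.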
                if ((interrupt_status.rvi & 0xF0UL) > (md_ctx.tdvps_ptr->vapic.apic[PPR_INDEX] & 0xF0UL))
                {
                    vcpu_state_details.vmxip = 1ULL;
                }
                read_value = vcpu_state_details.raw;

                break;
            }

My previous consideration was that when there's a pending interrupt that can
be recognized, given that the current VM Exit reason is EPT violation, the
next VM Entry will not deliver the interrupt, since the conditions to
recognize and deliver an interrupt are unchanged after the EPT violation VM
Exit.
So checking for a pending interrupt brings only false positives. This is
unlike checking the PID, where the vector in the PID could arrive after the
EPT violation VM Exit, and the PID would be cleared after VM Entry even if
the interrupts are not deliverable. So checking the PID may lead to true
positives and fewer false positives.

But I understand your point now. As checking the PID can also produce false
positives, there is no harm in introducing another source of them.

So using kvm_vcpu_has_events() looks like a kind of trade-off?
kvm_vcpu_has_events() can make TDX's code less special but might leave the
local vCPU more vulnerable to the 0-step mitigation, especially when
interrupts are disabled in the guest.


> 
> That code needs a comment, because depending on the behavior of that field, it
> might not even be correct.
> 
> > (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
> >     pending. However, vt_inject_exception() is NULL for TDs.
> 
> Wouldn't a pending exception be a KVM bug?
Hmm, yes, it should be.
Although kvm_vcpu_ioctl_x86_set_mce() can invoke kvm_queue_exception() to queue
an exception for TDs, this should not occur while VCPU_RUN is in progress.

> The bigger oddity, which I think is worth calling out, is that because KVM can't
> determine if IRQs (or NMIs) are blocked at the time of the EPT violation, false
> positives are inevitable.  I.e. KVM may re-enter the guest even if the IRQ/NMI
> can't be delivered.  Call *that* out, and explain why it's fine.
Will do.

It's fine because:
In the worst-case scenario where every break is a false positive, the local vCPU
becomes more vulnerable to 0-step mitigation due to the lack of local retries.
If the local vCPU triggers a 0-step mitigation in tdh_vp_enter(), the remote
vCPUs would either retry while contending for the SEPT tree lock if they are in
the page installation path, or retry after waiting for the local vCPU to be
kicked off and ensuring no tdh_vp_enter() is occurring in the local vCPU.

Moreover, even if there's no break-out for interrupt injection in the local
vCPU, it could still suffer 0-step mitigation, e.g. when one RIP faults on
more than 6 GFNs.

> > > > +			break;
> > > > +
> > > > +		cond_resched();
> > > > +	}
> > > 
> > > Nit, IMO this reads better as:
> > > 
> > > 	do {
> > > 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> > > 	} while (ret == RET_PF_RETRY && local_retry &&
> > > 		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));
> > >
> > Hmm, the previous way can save one "cond_resched()" for the common cases, i.e.,
> > when ret != RET_PF_RETRY or when the gpa is shared.
> 
> Hrm, right.  Maybe this?  Dunno if that's any better.
> 
> 	ret = 0;
> 	do {
> 		if (ret)
> 			cond_resched();
Not sure either :)
This is clearer; however, it brings one more check to the normal path.

> 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> 	} while (...)
Sean Christopherson Jan. 27, 2025, 5:04 p.m. UTC | #5
On Mon, Jan 27, 2025, Yan Zhao wrote:
> On Fri, Jan 24, 2025 at 05:23:36PM -0800, Sean Christopherson wrote:
> My previous consideration was that when there's a pending interrupt that can
> be recognized, given that the current VM Exit reason is EPT violation, the
> next VM Entry will not deliver the interrupt, since the conditions to
> recognize and deliver an interrupt are unchanged after the EPT violation VM
> Exit.
> So checking for a pending interrupt brings only false positives. This is
> unlike checking the PID, where the vector in the PID could arrive after the
> EPT violation VM Exit, and the PID would be cleared after VM Entry even if
> the interrupts are not deliverable. So checking the PID may lead to true
> positives and fewer false positives.
> 
> But I understand your point now. As checking the PID can also produce false
> positives, there is no harm in introducing another source of them.

Yep.  FWIW, I agree that checking VMXIP is theoretically "worse", in the sense
that it's much more likely to be a false positive.  Off the top of my head, the
only time VMXIP will be set with RFLAGS.IF=1 on an EPT Violation is if the EPT
violation happens in an STI or MOV_SS/POP_SS shadow.

> So using kvm_vcpu_has_events() looks like a kind of trade-off?
> kvm_vcpu_has_events() can make TDX's code less special but might leave the
> local vCPU more vulnerable to the 0-step mitigation, especially when
> interrupts are disabled in the guest.

Ya.  I think it's worth using kvm_vcpu_has_events() though.  In practice,
observing VMXIP=1 with RFLAGS.IF=0 on an EPT Violation means the guest is
accessing memory for the first time in atomic kernel context.  That alone
seems highly unlikely.  Add in that triggering retry requires an uncommon
race, and the overall probability becomes minuscule.

> > That code needs a comment, because depending on the behavior of that field, it
> > might not even be correct.
> > 
> > > (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
> > >     pending. However, vt_inject_exception() is NULL for TDs.
> > 
> > Wouldn't a pending exception be a KVM bug?
> Hmm, yes, it should be.
> Although kvm_vcpu_ioctl_x86_set_mce() can invoke kvm_queue_exception() to queue
> an exception for TDs, 

I thought TDX didn't support synthetic #MCs?

> this should not occur while VCPU_RUN is in progress.

Not "should not", _cannot_, because they're mutually exclusive.
Yan Zhao Feb. 5, 2025, 7:34 a.m. UTC | #6
On Mon, Jan 27, 2025 at 09:04:59AM -0800, Sean Christopherson wrote:
> On Mon, Jan 27, 2025, Yan Zhao wrote:
> > On Fri, Jan 24, 2025 at 05:23:36PM -0800, Sean Christopherson wrote:
> > My previous consideration was that when there's a pending interrupt that can
> > be recognized, given that the current VM Exit reason is EPT violation, the
> > next VM Entry will not deliver the interrupt, since the conditions to
> > recognize and deliver an interrupt are unchanged after the EPT violation VM
> > Exit.
> > So checking for a pending interrupt brings only false positives. This is
> > unlike checking the PID, where the vector in the PID could arrive after the
> > EPT violation VM Exit, and the PID would be cleared after VM Entry even if
> > the interrupts are not deliverable. So checking the PID may lead to true
> > positives and fewer false positives.
> > 
> > But I understand your point now. As checking the PID can also produce false
> > positives, there is no harm in introducing another source of them.
> 
> Yep.  FWIW, I agree that checking VMXIP is theoretically "worse", in the sense
> that it's much more likely to be a false positive.  Off the top of my head, the
> only time VMXIP will be set with RFLAGS.IF=1 on an EPT Violation is if the EPT
> violation happens in an STI or MOV_SS/POP_SS shadow.
> 
> > So using kvm_vcpu_has_events() looks like a kind of trade-off?
> > kvm_vcpu_has_events() can make TDX's code less special but might leave the
> > local vCPU more vulnerable to the 0-step mitigation, especially when
> > interrupts are disabled in the guest.
> 
> Ya.  I think it's worth using kvm_vcpu_has_events() though.  In practice,
> observing VMXIP=1 with RFLAGS.IF=0 on an EPT Violation means the guest is
> accessing memory for the first time in atomic kernel context.  That alone
> seems highly unlikely.  Add in that triggering retry requires an uncommon
> race, and the overall probability becomes minuscule.
Ok. Will use kvm_vcpu_has_events() instead.

> 
> > > That code needs a comment, because depending on the behavior of that field, it
> > > might not even be correct.
> > > 
> > > > (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
> > > >     pending. However, vt_inject_exception() is NULL for TDs.
> > > 
> > > Wouldn't a pending exception be a KVM bug?
> > Hmm, yes, it should be.
> > Although kvm_vcpu_ioctl_x86_set_mce() can invoke kvm_queue_exception() to queue
> > an exception for TDs, 
> 
> I thought TDX didn't support synthetic #MCs?
Hmm, it's just an example to show that kvm_queue_exception() can be invoked
for a TD, as it's not disallowed :)
Another example is kvm_arch_vcpu_ioctl_set_guest_debug() when
vcpu->arch.guest_state_protected is false for a DEBUG TD.

But as you said, they cannot be called during vcpu_run().

> > this should not occur while VCPU_RUN is in progress.
> 
> Not "should not", _cannot_, because they're mutually exclusive.
Right, "cannot" is the accurate term.

Patch

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1cf3ef0faff7..bb9d914765fc 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1854,6 +1854,8 @@  static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa = tdexit_gpa(vcpu);
 	unsigned long exit_qual;
+	bool local_retry = false;
+	int ret;
 
 	if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
 		if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
@@ -1872,6 +1874,24 @@  static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 		 * due to aliasing a single HPA to multiple GPAs.
 		 */
 		exit_qual = EPT_VIOLATION_ACC_WRITE;
+
+		/*
+		 * Mapping of private memory may return RET_PF_RETRY due to
+		 * SEAMCALL contention, e.g.
+		 * - TDH.MEM.PAGE.AUG/TDH.MEM.SEPT.ADD on local vCPU may
+		 *   contend with TDH.VP.ENTER (due to 0-step mitigation)
+		 *   on a remote vCPU.
+		 * - TDH.MEM.PAGE.AUG/TDH.MEM.SEPT.ADD on local vCPU may
+		 *   contend with TDG.MEM.PAGE.ACCEPT on a remote vCPU.
+		 *
+		 * Retry internally in TDX to prevent exacerbating the
+		 * activation of 0-step mitigation on local vCPU.
+		 * However, despite these retries, the 0-step mitigation on the
+		 * local vCPU may still be triggered due to:
+		 * - Exiting on signals, interrupts.
+		 * - KVM_EXIT_MEMORY_FAULT.
+		 */
+		local_retry = true;
 	} else {
 		exit_qual = tdexit_exit_qual(vcpu);
 		/*
@@ -1884,7 +1904,24 @@  static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 	}
 
 	trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
-	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+
+	while (1) {
+		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
+
+		if (ret != RET_PF_RETRY || !local_retry)
+			break;
+
+		/*
+		 * Break and keep the orig return value.
+		 * Signal & irq handling will be done later in vcpu_run()
+		 */
+		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
+		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
+			break;
+
+		cond_resched();
+	}
+	return ret;
 }
 
 int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)