Message ID | 20250113021218.18922-1-yan.y.zhao@intel.com (mailing list archive)
---|---
State | New, archived
Series | KVM: TDX SEPT SEAMCALL retry
On Mon, Jan 13, 2025, Yan Zhao wrote:
> @@ -1884,7 +1904,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> 	}
> 
> 	trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
> -	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
> +
> +	while (1) {
> +		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> +
> +		if (ret != RET_PF_RETRY || !local_retry)
> +			break;
> +
> +		/*
> +		 * Break and keep the orig return value.

Wrap at 80.

> +		 * Signal & irq handling will be done later in vcpu_run()

Please don't use "&" as shorthand.  It saves all of two characters.  That said,
I don't see any point in adding this comment, if the reader can't follow the
logic of this code, these comments aren't going to help them.  And the comment
about vcpu_run() in particular is misleading, as posted interrupts aren't truly
handled by vcpu_run(), rather they're handled by hardware (although KVM does send
a self-IPI).

> +		 */
> +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)

This needs to check that the IRQ/NMI is actually allowed.  I guess it doesn't
matter for IRQs, but it does matter for NMIs.  Why not use kvm_vcpu_has_events()?

Ah, it's a local function.  At a glance, I don't see any harm in exposing that
to TDX.

> +			break;
> +
> +		cond_resched();
> +	}

Nit, IMO this reads better as:

	do {
		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
	} while (ret == RET_PF_RETRY && local_retry &&
		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));

> +	return ret;
> }
> 
> int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> --
> 2.43.2
>
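For reference, "exposing" kvm_vcpu_has_events() is a small visibility change; a
minimal sketch, assuming the helper stays in arch/x86/kvm/x86.c and the
declaration lands in arch/x86/kvm/x86.h (the exact header placement is an
assumption, not necessarily what was merged):

	/* arch/x86/kvm/x86.c: drop "static" so vmx/tdx.c can call the helper. */
	-static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
	+bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)

	/* arch/x86/kvm/x86.h (assumed location): add a declaration for TDX. */
	+bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu);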
On Fri, Jan 17, 2025 at 01:14:02PM -0800, Sean Christopherson wrote:
> On Mon, Jan 13, 2025, Yan Zhao wrote:
> > @@ -1884,7 +1904,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> > 	}
> > 
> > 	trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
> > -	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
> > +
> > +	while (1) {
> > +		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> > +
> > +		if (ret != RET_PF_RETRY || !local_retry)
> > +			break;
> > +
> > +		/*
> > +		 * Break and keep the orig return value.
> 
> Wrap at 80.
> 
> > +		 * Signal & irq handling will be done later in vcpu_run()
> 
> Please don't use "&" as shorthand.  It saves all of two characters.  That said,
Got it!

> I don't see any point in adding this comment, if the reader can't follow the
> logic of this code, these comments aren't going to help them.  And the comment
> about vcpu_run() in particular is misleading, as posted interrupts aren't truly
> handled by vcpu_run(), rather they're handled by hardware (although KVM does send
> a self-IPI).
What about the below version?

"
Bail out the local retry
- for a pending signal, so that vcpu_run() --> xfer_to_guest_mode_handle_work()
  --> kvm_handle_signal_exit() can exit to userspace for signal handling.
- for pending interrupts, so that tdx_vcpu_enter_exit() --> tdh_vp_enter() will
  be re-executed for interrupt injection through posted interrupt.
- for a pending NMI or KVM_REQ_NMI, so that vcpu_enter_guest() will be
  re-executed to process and pend the NMI to the TDX module.  KVM always regards
  NMI as allowed and the TDX module will inject it when NMI is allowed in the TD.
"

> > +		 */
> > +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> > +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
> 
> This needs to check that the IRQ/NMI is actually allowed.  I guess it doesn't
> matter for IRQs, but it does matter for NMIs.  Why not use kvm_vcpu_has_events()?
Yes.  However, vt_nmi_allowed() is always true for TDs.
For interrupts, tdx_interrupt_allowed() is always true unless the exit reason is
EXIT_REASON_HLT.  For the EPT violation handler, the exit reason should not be
EXIT_REASON_HLT.

> Ah, it's a local function.  At a glance, I don't see any harm in exposing that
> to TDX.
Besides kvm_vcpu_has_events() being a local function, the considerations for
checking "pi_has_pending_interrupt() || kvm_test_request(KVM_REQ_NMI, vcpu) ||
vcpu->arch.nmi_pending" instead are that
(1) the two are effectively equivalent for TDs (as nested is not supported yet)
(2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
pending.  However, vt_inject_exception() is NULL for TDs.

> > +			break;
> > +
> > +		cond_resched();
> > +	}
> 
> Nit, IMO this reads better as:
> 
> 	do {
> 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> 	} while (ret == RET_PF_RETRY && local_retry &&
> 		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));
> 
Hmm, the previous way can save one "cond_resched()" for the common cases, i.e.,
when ret != RET_PF_RETRY or when the gpa is shared.

> > +	return ret;
> > }
> > 
> > int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> > --
> > 2.43.2
> >
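For context on the HLT special case mentioned above: the TD reports its
interrupt-blocking state only via TDVMCALL(HLT), so KVM cannot see it on other
exits.  A hedged sketch of what tdx_interrupt_allowed() looks like in the posted
series; the tdexit_exit_reason()/tdvmcall_a0_read() accessor names are
reproduced from memory and may not match the series exactly:

	/*
	 * Sketch: KVM cannot read a TD's interrupt-blocking state directly, so
	 * assume interrupts are allowed unless the TD executed TDVMCALL(HLT)
	 * and reported interrupts as blocked in its first argument.
	 */
	bool tdx_interrupt_allowed(struct kvm_vcpu *vcpu)
	{
		return tdexit_exit_reason(vcpu).basic != EXIT_REASON_HLT ||
		       !tdvmcall_a0_read(vcpu);
	}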
On Mon, Jan 20, 2025, Yan Zhao wrote:
> On Fri, Jan 17, 2025 at 01:14:02PM -0800, Sean Christopherson wrote:
> > On Mon, Jan 13, 2025, Yan Zhao wrote:
> > I don't see any point in adding this comment, if the reader can't follow the
> > logic of this code, these comments aren't going to help them.  And the comment
> > about vcpu_run() in particular is misleading, as posted interrupts aren't truly
> > handled by vcpu_run(), rather they're handled by hardware (although KVM does send
> > a self-IPI).
> What about the below version?
> 
> "
> Bail out the local retry
> - for a pending signal, so that vcpu_run() --> xfer_to_guest_mode_handle_work()
>   --> kvm_handle_signal_exit() can exit to userspace for signal handling.

Eh, pending signals should be self-explanatory.

> - for pending interrupts, so that tdx_vcpu_enter_exit() --> tdh_vp_enter() will
>   be re-executed for interrupt injection through posted interrupt.
> - for a pending NMI or KVM_REQ_NMI, so that vcpu_enter_guest() will be
>   re-executed to process and pend the NMI to the TDX module.  KVM always regards
>   NMI as allowed and the TDX module will inject it when NMI is allowed in the TD.
> "
> 
> > > +		 */
> > > +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> > > +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
> > 
> > This needs to check that the IRQ/NMI is actually allowed.  I guess it doesn't
> > matter for IRQs, but it does matter for NMIs.  Why not use kvm_vcpu_has_events()?
> Yes.  However, vt_nmi_allowed() is always true for TDs.
> For interrupts, tdx_interrupt_allowed() is always true unless the exit reason is
> EXIT_REASON_HLT.  For the EPT violation handler, the exit reason should not be
> EXIT_REASON_HLT.
> 
> > Ah, it's a local function.  At a glance, I don't see any harm in exposing that
> > to TDX.
> Besides kvm_vcpu_has_events() being a local function, the considerations for
> checking "pi_has_pending_interrupt() || kvm_test_request(KVM_REQ_NMI, vcpu) ||
> vcpu->arch.nmi_pending" instead are that

*sigh*

  PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one NMI in
  the TDX module.  Also, TDX doesn't allow KVM to request an NMI-window exit
  directly.  When there is already one NMI pending in the TDX module, i.e. it
  has not been delivered to the TDX guest yet, if there is an NMI pending in
  KVM, collapse the pending NMI in KVM into the one pending in the TDX module.
  Such a collapse is OK considering that on x86 bare metal, multiple NMIs can
  collapse into one NMI, e.g. when NMI is blocked by SMI.  It's the OS's
  responsibility to poll all NMI sources in the NMI handler to avoid missing
  handling of some NMI events.  More details can be found in the changelog of
  the patch "KVM: TDX: Implement methods to inject NMI".

That's probably fine?  But it's still unfortunate that TDX manages to be
different at almost every opportunity :-(

> (1) the two are effectively equivalent for TDs (as nested is not supported yet)

If they're all equivalent, then *not* open coding is desirable, IMO.  Ah, but
they aren't equivalent.  tdx_protected_apic_has_interrupt() also checks whatever
TD_VCPU_STATE_DETAILS_NON_ARCH is.

	vcpu_state_details =
		td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);

	return tdx_vcpu_state_details_intr_pending(vcpu_state_details);

That code needs a comment, because depending on the behavior of that field, it
might not even be correct.

> (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
> pending.  However, vt_inject_exception() is NULL for TDs.

Wouldn't a pending exception be a KVM bug?

The bigger oddity, which I think is worth calling out, is that because KVM can't
determine if IRQs (or NMIs) are blocked at the time of the EPT violation, false
positives are inevitable.  I.e. KVM may re-enter the guest even if the IRQ/NMI
can't be delivered.  Call *that* out, and explain why it's fine.

> > > +			break;
> > > +
> > > +		cond_resched();
> > > +	}
> > 
> > Nit, IMO this reads better as:
> > 
> > 	do {
> > 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> > 	} while (ret == RET_PF_RETRY && local_retry &&
> > 		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));
> > 
> Hmm, the previous way can save one "cond_resched()" for the common cases, i.e.,
> when ret != RET_PF_RETRY or when the gpa is shared.

Hrm, right.  Maybe this?  Dunno if that's any better.

	ret = 0;
	do {
		if (ret)
			cond_resched();

		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
	} while (...)
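For readers without the referenced patch handy, the collapse described above
boils down to roughly the following sketch, based on the posted "KVM: TDX:
Implement methods to inject NMI" patch; the td_management_write8() and
TD_VCPU_PEND_NMI names are reproduced from memory, so treat them as
illustrative:

	/*
	 * Sketch: PEND_NMI is a single bit in the TDVPS, so at most one NMI
	 * can be pended in the TDX module; any additional NMIs pending in KVM
	 * collapse into that one bit.
	 */
	static void tdx_inject_nmi(struct kvm_vcpu *vcpu)
	{
		++vcpu->stat.nmi_injections;
		td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
		/*
		 * From KVM's perspective, injection is complete once PEND_NMI
		 * is set; the TDX module delivers the NMI when the TD allows.
		 */
		vcpu->arch.nmi_injected = false;
	}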
On Fri, Jan 24, 2025 at 05:23:36PM -0800, Sean Christopherson wrote:
> On Mon, Jan 20, 2025, Yan Zhao wrote:
> > On Fri, Jan 17, 2025 at 01:14:02PM -0800, Sean Christopherson wrote:
> > > On Mon, Jan 13, 2025, Yan Zhao wrote:
> > > I don't see any point in adding this comment, if the reader can't follow the
> > > logic of this code, these comments aren't going to help them.  And the comment
> > > about vcpu_run() in particular is misleading, as posted interrupts aren't truly
> > > handled by vcpu_run(), rather they're handled by hardware (although KVM does send
> > > a self-IPI).
> > What about the below version?
> > 
> > "
> > Bail out the local retry
> > - for a pending signal, so that vcpu_run() --> xfer_to_guest_mode_handle_work()
> >   --> kvm_handle_signal_exit() can exit to userspace for signal handling.
> 
> Eh, pending signals should be self-explanatory.
Ok.

> > - for pending interrupts, so that tdx_vcpu_enter_exit() --> tdh_vp_enter() will
> >   be re-executed for interrupt injection through posted interrupt.
> > - for a pending NMI or KVM_REQ_NMI, so that vcpu_enter_guest() will be
> >   re-executed to process and pend the NMI to the TDX module.  KVM always regards
> >   NMI as allowed and the TDX module will inject it when NMI is allowed in the TD.
> > "
> > 
> > > > +		 */
> > > > +		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
> > > > +		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
> > > 
> > > This needs to check that the IRQ/NMI is actually allowed.  I guess it doesn't
> > > matter for IRQs, but it does matter for NMIs.  Why not use kvm_vcpu_has_events()?
> > Yes.  However, vt_nmi_allowed() is always true for TDs.
> > For interrupts, tdx_interrupt_allowed() is always true unless the exit reason is
> > EXIT_REASON_HLT.  For the EPT violation handler, the exit reason should not be
> > EXIT_REASON_HLT.
> > 
> > > Ah, it's a local function.  At a glance, I don't see any harm in exposing that
> > > to TDX.
> > Besides kvm_vcpu_has_events() being a local function, the considerations for
> > checking "pi_has_pending_interrupt() || kvm_test_request(KVM_REQ_NMI, vcpu) ||
> > vcpu->arch.nmi_pending" instead are that
> 
> *sigh*
> 
>   PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one NMI in
>   the TDX module.  Also, TDX doesn't allow KVM to request an NMI-window exit
>   directly.  When there is already one NMI pending in the TDX module, i.e. it
>   has not been delivered to the TDX guest yet, if there is an NMI pending in
>   KVM, collapse the pending NMI in KVM into the one pending in the TDX module.
>   Such a collapse is OK considering that on x86 bare metal, multiple NMIs can
>   collapse into one NMI, e.g. when NMI is blocked by SMI.  It's the OS's
>   responsibility to poll all NMI sources in the NMI handler to avoid missing
>   handling of some NMI events.  More details can be found in the changelog of
>   the patch "KVM: TDX: Implement methods to inject NMI".
> 
> That's probably fine?  But it's still unfortunate that TDX manages to be
> different at almost every opportunity :-(
:(

> > (1) the two are effectively equivalent for TDs (as nested is not supported yet)
> 
> If they're all equivalent, then *not* open coding is desirable, IMO.  Ah, but
> they aren't equivalent.  tdx_protected_apic_has_interrupt() also checks whatever
> TD_VCPU_STATE_DETAILS_NON_ARCH is.
> 
> 	vcpu_state_details =
> 		td_state_non_arch_read64(to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
> 
> 	return tdx_vcpu_state_details_intr_pending(vcpu_state_details);
Right. That's why I asked about that at the PUCK.

The "tdx_vcpu_state_details_intr_pending(vcpu_state_details)" checks if there's
a pending interrupt that can be recognized, as in this code snippet from the
TDX module:

    case MD_TDVPS_VCPU_STATE_DETAILS_FIELD_CODE:
    {
        // Calculate virtual interrupt pending status
        vcpu_state_t vcpu_state_details;
        guest_interrupt_status_t interrupt_status;
        uint64_t interrupt_status_raw;

        set_vm_vmcs_as_active(md_ctx.tdvps_ptr, 0);
        ia32_vmread(VMX_GUEST_INTERRUPT_STATUS_ENCODE, &interrupt_status_raw);
        interrupt_status.raw = (uint16_t)interrupt_status_raw;
        vcpu_state_details.raw = 0ULL;
        if ((interrupt_status.rvi & 0xF0UL) >
            (md_ctx.tdvps_ptr->vapic.apic[PPR_INDEX] & 0xF0UL))
        {
            vcpu_state_details.vmxip = 1ULL;
        }
        read_value = vcpu_state_details.raw;
        break;
    }

My previous consideration is that
when there's a pending interrupt that can be recognized, given the current VM
Exit reason is EPT violation, the next VM Entry will not deliver the interrupt
since the condition to recognize and deliver interrupt is unchanged after the
EPT violation VM Exit.
So checking pending interrupt brings only false positive, which is unlike
checking PID that the vector in the PID could arrive after the EPT violation VM
Exit and PID would be cleared after VM Entry even if the interrupts are not
deliverable. So checking PID may lead to true positive and less false positive.

But I understand your point now. As checking PID can also be false positive, it
would be no harm to introduce another source of false positive.

So using kvm_vcpu_has_events() looks like a kind of trade-off?
kvm_vcpu_has_events() can make TDX's code less special but might lead to the
local vCPU more vulnerable to the 0-step mitigation, especially when interrupts
are disabled in the guest.

> That code needs a comment, because depending on the behavior of that field, it
> might not even be correct.
> 
> > (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
> > pending.  However, vt_inject_exception() is NULL for TDs.
> 
> Wouldn't a pending exception be a KVM bug?
Hmm, yes, it should be.
Although kvm_vcpu_ioctl_x86_set_mce() can invoke kvm_queue_exception() to queue
an exception for TDs, this should not occur while VCPU_RUN is in progress.

> The bigger oddity, which I think is worth calling out, is that because KVM can't
> determine if IRQs (or NMIs) are blocked at the time of the EPT violation, false
> positives are inevitable.  I.e. KVM may re-enter the guest even if the IRQ/NMI
> can't be delivered.  Call *that* out, and explain why it's fine.
Will do. It's fine because:
In the worst-case scenario where every break is a false positive, the local
vCPU becomes more vulnerable to 0-step mitigation due to the lack of local
retries.  If the local vCPU triggers a 0-step mitigation in tdh_vp_enter(), the
remote vCPUs would either retry while contending for the SEPT tree lock if they
are in the page installation path, or retry after waiting for the local vCPU to
be kicked off and ensuring no tdh_vp_enter() is occurring on the local vCPU.
Moreover, even if there's no break-out for interrupt injection on the local
vCPU, the local vCPU could still suffer 0-step mitigation, e.g. when a single
RIP faults on more than 6 GFNs.

> > > > +			break;
> > > > +
> > > > +		cond_resched();
> > > > +	}
> > > 
> > > Nit, IMO this reads better as:
> > > 
> > > 	do {
> > > 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> > > 	} while (ret == RET_PF_RETRY && local_retry &&
> > > 		 !kvm_vcpu_has_events(vcpu) && !signal_pending(current));
> > > 
> > Hmm, the previous way can save one "cond_resched()" for the common cases, i.e.,
> > when ret != RET_PF_RETRY or when the gpa is shared.
> 
> Hrm, right.  Maybe this?  Dunno if that's any better.
> 
> 	ret = 0;
> 	do {
> 		if (ret)
> 			cond_resched();
Not sure either :)
This is clearer; however, it brings one more check to the normal path.

> 		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
> 	} while (...)
On Mon, Jan 27, 2025, Yan Zhao wrote:
> On Fri, Jan 24, 2025 at 05:23:36PM -0800, Sean Christopherson wrote:
> My previous consideration is that
> when there's a pending interrupt that can be recognized, given the current VM
> Exit reason is EPT violation, the next VM Entry will not deliver the interrupt
> since the condition to recognize and deliver interrupt is unchanged after the
> EPT violation VM Exit.
> So checking pending interrupt brings only false positive, which is unlike
> checking PID that the vector in the PID could arrive after the EPT violation VM
> Exit and PID would be cleared after VM Entry even if the interrupts are not
> deliverable. So checking PID may lead to true positive and less false positive.
> 
> But I understand your point now. As checking PID can also be false positive, it
> would be no harm to introduce another source of false positive.

Yep.  FWIW, I agree that checking VMXIP is theoretically "worse", in the sense
that it's much more likely to be a false positive.  Off the top of my head, the
only time VMXIP will be set with RFLAGS.IF=1 on an EPT Violation is if the EPT
violation happens in an STI or MOV_SS/POP_SS shadow.

> So using kvm_vcpu_has_events() looks like a kind of trade-off?
> kvm_vcpu_has_events() can make TDX's code less special but might lead to the
> local vCPU more vulnerable to the 0-step mitigation, especially when interrupts
> are disabled in the guest.

Ya.  I think it's worth using kvm_vcpu_has_events() though.  In practice,
observing VMXIP=1 with RFLAGS=0 on an EPT Violation means the guest is accessing
memory for the first time in atomic kernel context.  That alone seems highly
unlikely.  Add in that triggering retry requires an uncommon race, and the
overall probability becomes minuscule.

> > That code needs a comment, because depending on the behavior of that field, it
> > might not even be correct.
> > 
> > > (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
> > > pending.  However, vt_inject_exception() is NULL for TDs.
> > 
> > Wouldn't a pending exception be a KVM bug?
> Hmm, yes, it should be.
> Although kvm_vcpu_ioctl_x86_set_mce() can invoke kvm_queue_exception() to queue
> an exception for TDs,

I thought TDX didn't support synthetic #MCs?

> this should not occur while VCPU_RUN is in progress.

Not "should not", _cannot_, because they're mutually exclusive.
On Mon, Jan 27, 2025 at 09:04:59AM -0800, Sean Christopherson wrote:
> On Mon, Jan 27, 2025, Yan Zhao wrote:
> > On Fri, Jan 24, 2025 at 05:23:36PM -0800, Sean Christopherson wrote:
> > My previous consideration is that
> > when there's a pending interrupt that can be recognized, given the current VM
> > Exit reason is EPT violation, the next VM Entry will not deliver the interrupt
> > since the condition to recognize and deliver interrupt is unchanged after the
> > EPT violation VM Exit.
> > So checking pending interrupt brings only false positive, which is unlike
> > checking PID that the vector in the PID could arrive after the EPT violation VM
> > Exit and PID would be cleared after VM Entry even if the interrupts are not
> > deliverable. So checking PID may lead to true positive and less false positive.
> > 
> > But I understand your point now. As checking PID can also be false positive, it
> > would be no harm to introduce another source of false positive.
> 
> Yep.  FWIW, I agree that checking VMXIP is theoretically "worse", in the sense
> that it's much more likely to be a false positive.  Off the top of my head, the
> only time VMXIP will be set with RFLAGS.IF=1 on an EPT Violation is if the EPT
> violation happens in an STI or MOV_SS/POP_SS shadow.
> 
> > So using kvm_vcpu_has_events() looks like a kind of trade-off?
> > kvm_vcpu_has_events() can make TDX's code less special but might lead to the
> > local vCPU more vulnerable to the 0-step mitigation, especially when interrupts
> > are disabled in the guest.
> 
> Ya.  I think it's worth using kvm_vcpu_has_events() though.  In practice,
> observing VMXIP=1 with RFLAGS=0 on an EPT Violation means the guest is accessing
> memory for the first time in atomic kernel context.  That alone seems highly
> unlikely.  Add in that triggering retry requires an uncommon race, and the
> overall probability becomes minuscule.
Ok. Will use kvm_vcpu_has_events() instead.

> > > That code needs a comment, because depending on the behavior of that field, it
> > > might not even be correct.
> > > 
> > > > (2) kvm_vcpu_has_events() may lead to unnecessary breaks due to exception
> > > > pending.  However, vt_inject_exception() is NULL for TDs.
> > > 
> > > Wouldn't a pending exception be a KVM bug?
> > Hmm, yes, it should be.
> > Although kvm_vcpu_ioctl_x86_set_mce() can invoke kvm_queue_exception() to queue
> > an exception for TDs,
> 
> I thought TDX didn't support synthetic #MCs?
Hmm, it's just an example to show that kvm_queue_exception() can be invoked for
a TD, as it's not disallowed :)
Another example is kvm_arch_vcpu_ioctl_set_guest_debug() when
vcpu->arch.guest_state_protected is false for a DEBUG TD.
But as you said, it's not possible for them to be called during vcpu_run().

> > this should not occur while VCPU_RUN is in progress.
> 
> Not "should not", _cannot_, because they're mutually exclusive.
Right, "cannot" is the accurate term.
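Taken together, the direction agreed above — keep the while loop so the common
case avoids an extra cond_resched(), but switch the bail-out condition to
kvm_vcpu_has_events() — would look roughly like this sketch (the agreed shape,
not the posted patch):

	while (1) {
		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);

		if (ret != RET_PF_RETRY || !local_retry)
			break;

		/*
		 * Bail if any event is pending, so the normal exit paths can
		 * process it.  False positives are possible (and fine): KVM
		 * can't tell here whether IRQs/NMIs are actually blocked.
		 */
		if (kvm_vcpu_has_events(vcpu) || signal_pending(current))
			break;

		cond_resched();
	}
	return ret;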
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1cf3ef0faff7..bb9d914765fc 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1854,6 +1854,8 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 {
 	gpa_t gpa = tdexit_gpa(vcpu);
 	unsigned long exit_qual;
+	bool local_retry = false;
+	int ret;
 
 	if (vt_is_tdx_private_gpa(vcpu->kvm, gpa)) {
 		if (tdx_is_sept_violation_unexpected_pending(vcpu)) {
@@ -1872,6 +1874,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 		 * due to aliasing a single HPA to multiple GPAs.
 		 */
 		exit_qual = EPT_VIOLATION_ACC_WRITE;
+
+		/*
+		 * Mapping of private memory may return RET_PF_RETRY due to
+		 * SEAMCALL contention, e.g.
+		 * - TDH.MEM.PAGE.AUG/TDH.MEM.SEPT.ADD on local vCPU may
+		 *   contend with TDH.VP.ENTER (due to 0-step mitigation)
+		 *   on a remote vCPU.
+		 * - TDH.MEM.PAGE.AUG/TDH.MEM.SEPT.ADD on local vCPU may
+		 *   contend with TDG.MEM.PAGE.ACCEPT on a remote vCPU.
+		 *
+		 * Retry internally in TDX to prevent exacerbating the
+		 * activation of 0-step mitigation on local vCPU.
+		 * However, despite these retries, the 0-step mitigation on the
+		 * local vCPU may still be triggered due to:
+		 * - Exiting on signals, interrupts.
+		 * - KVM_EXIT_MEMORY_FAULT.
+		 */
+		local_retry = true;
 	} else {
 		exit_qual = tdexit_exit_qual(vcpu);
 		/*
@@ -1884,7 +1904,24 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 	}
 
 	trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
-	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+
+	while (1) {
+		ret = __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
+
+		if (ret != RET_PF_RETRY || !local_retry)
+			break;
+
+		/*
+		 * Break and keep the orig return value.
+		 * Signal & irq handling will be done later in vcpu_run()
+		 */
+		if (signal_pending(current) || pi_has_pending_interrupt(vcpu) ||
+		    kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending)
+			break;
+
+		cond_resched();
+	}
+	return ret;
 }
 
 int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
Retry locally in the TDX EPT violation handler for private memory to reduce
the chances for tdh_mem_sept_add()/tdh_mem_page_aug() to contend with
tdh_vp_enter().

TDX EPT violation installs private pages via tdh_mem_sept_add() and
tdh_mem_page_aug().  The two may have contention with tdh_vp_enter() or
TDCALLs.

Resources    SHARED users          EXCLUSIVE users
------------------------------------------------------------
SEPT tree    tdh_mem_sept_add      tdh_vp_enter (0-step mitigation)
             tdh_mem_page_aug
------------------------------------------------------------
SEPT entry                         tdh_mem_sept_add (Host lock)
                                   tdh_mem_page_aug (Host lock)
                                   tdg_mem_page_accept (Guest lock)
                                   tdg_mem_page_attr_rd (Guest lock)
                                   tdg_mem_page_attr_wr (Guest lock)

Though the contention between tdh_mem_sept_add()/tdh_mem_page_aug() and
TDCALLs may be removed in a future TDX module, their contention with
tdh_vp_enter() due to 0-step mitigation still persists.

The TDX module may trigger 0-step mitigation in SEAMCALL TDH.VP.ENTER,
which works as follows:
0. Each TDH.VP.ENTER records the guest RIP on TD entry.
1. When the TDX module encounters a VM exit with reason EPT_VIOLATION, it
   checks if the guest RIP is the same as the last guest RIP on TD entry.
   - if yes, it means the EPT violation is caused by the same instruction
     that caused the last VM exit.  Then, the TDX module increases the
     guest RIP no-progress count.  When the count increases from 0 to the
     threshold (currently 6), the TDX module records the faulting GPA into
     a last_epf_gpa_list.
   - if no, it means the guest RIP has made progress.  So, the TDX module
     resets the RIP no-progress count and the last_epf_gpa_list.
2. On the next TDH.VP.ENTER, the TDX module (after saving the guest RIP on
   TD entry) checks if the last_epf_gpa_list is empty.
   - if yes, TD entry continues without acquiring the lock on the SEPT
     tree.
   - if no, it triggers the 0-step mitigation by acquiring the exclusive
     lock on the SEPT tree and walking the EPT tree to check if all page
     faults caused by the GPAs in the last_epf_gpa_list have been resolved
     before continuing TD entry.

Since the KVM TDP MMU usually re-enters the guest whenever it exits to
userspace (e.g. for KVM_EXIT_MEMORY_FAULT) or encounters a BUSY, it is
possible for tdh_vp_enter() to be called more than the threshold count
before a page fault is addressed, triggering contention when
tdh_vp_enter() attempts to acquire the exclusive lock on the SEPT tree.

Retry locally in the TDX EPT violation handler to reduce the count of
invoking tdh_vp_enter(), hence reducing the possibility of its contention
with tdh_mem_sept_add()/tdh_mem_page_aug().  However, the 0-step
mitigation and the contention are still not eliminated due to
KVM_EXIT_MEMORY_FAULT, signals/interrupts, and cases when one instruction
faults on more GFNs than the threshold count.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)
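To make the 0-step sequence in the changelog easier to follow, here is a rough
C-style paraphrase of the module-side tracking it describes; all names, types,
and list helpers are illustrative inventions, not the TDX module's actual
identifiers:

	#define NO_PROGRESS_THRESHOLD	6	/* current threshold, per the changelog */

	/* Step 1: on an EPT-violation VM exit. */
	if (guest_rip == td->last_entry_rip) {
		/* Same instruction faulted again: count the lack of progress. */
		if (++td->no_progress_count == NO_PROGRESS_THRESHOLD)
			gpa_list_record(&td->last_epf_gpa_list, fault_gpa);
	} else {
		/* The guest RIP advanced: reset the tracking state. */
		td->no_progress_count = 0;
		gpa_list_clear(&td->last_epf_gpa_list);
	}

	/* Step 2: on the next TDH.VP.ENTER, after saving the entry RIP. */
	td->last_entry_rip = guest_rip;
	if (!gpa_list_empty(&td->last_epf_gpa_list)) {
		/*
		 * 0-step mitigation: take the SEPT tree lock exclusively --
		 * this is what contends with tdh_mem_sept_add() and
		 * tdh_mem_page_aug() -- and verify the recorded faults have
		 * been resolved before continuing TD entry.
		 */
		sept_tree_lock_exclusive(td);
		check_faults_resolved(td, &td->last_epf_gpa_list);
		sept_tree_unlock(td);
	}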