[v19,085/130] KVM: TDX: Complete interrupts after tdexit

Message ID aa6a927214a5d29d5591a0079f4374b05a82a03f.1708933498.git.isaku.yamahata@intel.com (mailing list archive)
State New, archived
Series [v19,001/130] x86/virt/tdx: Rename _offset to _member for TD_SYSINFO_MAP() macro

Commit Message

Isaku Yamahata Feb. 26, 2024, 8:26 a.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

This corresponds to VMX __vmx_complete_interrupts().  Because TDX
virtualize vAPIC, KVM only needs to care NMI injection.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
---
v19:
- move tdvps_management_check() to this patch
- typo: complete -> Complete in short log
---
 arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
 arch/x86/kvm/vmx/tdx.h |  4 ++++
 2 files changed, 14 insertions(+)

Comments

Reinette Chatre April 16, 2024, 6:23 p.m. UTC | #1
Hi Isaku,

(In shortlog "tdexit" can be "TD exit" to be consistent with
documentation.)

On 2/26/2024 12:26 AM, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> This corresponds to VMX __vmx_complete_interrupts().  Because TDX
> virtualize vAPIC, KVM only needs to care NMI injection.

This seems to be the first appearance of NMI and the changelog
is very brief. How about expanding it with:

"This corresponds to VMX __vmx_complete_interrupts().  Because TDX
 virtualize vAPIC, KVM only needs to care about NMI injection.

 KVM can request TDX to inject an NMI into a guest TD vCPU when the
 vCPU is not active. TDX will attempt to inject an NMI as soon as
 possible on TD entry. NMI injection is managed by writing to (to
 inject NMI) and reading from (to get status of NMI injection)
 the PEND_NMI field within the TDX vCPU scope metadata (Trust
 Domain Virtual Processor State (TDVPS)).

 Update KVM's NMI status on TD exit by checking whether a requested
 NMI has been injected into the TD. Reading the metadata via SEAMCALL
 is expensive so only perform the check if an NMI was injected.

 This is the first need to access vCPU scope metadata in the
 "management" class. Ensure that needed accessor is available. 
"

> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> v19:
> - move tdvps_management_check() to this patch
> - typo: complete -> Complete in short log
> ---
>  arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
>  arch/x86/kvm/vmx/tdx.h |  4 ++++
>  2 files changed, 14 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 83dcaf5b6fbd..b8b168f74dfe 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -535,6 +535,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  	 */
>  }
>  
> +static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
> +{
> +	/* Avoid costly SEAMCALL if no nmi was injected */

	/* Avoid costly SEAMCALL if no NMI was injected. */

> +	if (vcpu->arch.nmi_injected)
> +		vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
> +							      TD_VCPU_PEND_NMI);
> +}
> +
>  struct tdx_uret_msr {
>  	u32 msr;
>  	unsigned int slot;
> @@ -663,6 +671,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>  	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
>  	trace_kvm_exit(vcpu, KVM_ISA_VMX);
>  
> +	tdx_complete_interrupts(vcpu);
> +
>  	return EXIT_FASTPATH_NONE;
>  }
>  
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 44eab734e702..0d8a98feb58e 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -142,6 +142,8 @@ static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
>  			 "Invalid TD VMCS access for 16-bit field");
>  }
>  
> +static __always_inline void tdvps_management_check(u64 field, u8 bits) {}

Is this intended to be a stub or is it expected to be fleshed out with
some checks?

> +
>  #define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass)				\
>  static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx,	\
>  							u32 field)		\
> @@ -200,6 +202,8 @@ TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
>  TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
>  TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
>  
> +TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
> +
>  static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
>  {
>  	struct tdx_module_args out;

Reinette
Isaku Yamahata April 17, 2024, 6:56 a.m. UTC | #2
On Tue, Apr 16, 2024 at 11:23:01AM -0700,
Reinette Chatre <reinette.chatre@intel.com> wrote:

> Hi Isaku,
> 
> (In shortlog "tdexit" can be "TD exit" to be consistent with
> documentation.)
> 
> On 2/26/2024 12:26 AM, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > This corresponds to VMX __vmx_complete_interrupts().  Because TDX
> > virtualize vAPIC, KVM only needs to care NMI injection.
> 
> This seems to be the first appearance of NMI and the changelog
> is very brief. How about expanding it with:
> 
> "This corresponds to VMX __vmx_complete_interrupts().  Because TDX
>  virtualize vAPIC, KVM only needs to care about NMI injection.
> 
>  KVM can request TDX to inject an NMI into a guest TD vCPU when the
>  vCPU is not active. TDX will attempt to inject an NMI as soon as
>  possible on TD entry. NMI injection is managed by writing to (to
>  inject NMI) and reading from (to get status of NMI injection)
>  the PEND_NMI field within the TDX vCPU scope metadata (Trust
>  Domain Virtual Processor State (TDVPS)).
> 
>  Update KVM's NMI status on TD exit by checking whether a requested
>  NMI has been injected into the TD. Reading the metadata via SEAMCALL
>  is expensive so only perform the check if an NMI was injected.
> 
>  This is the first need to access vCPU scope metadata in the
>  "management" class. Ensure that needed accessor is available. 
> "
> 
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> > Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> > ---
> > v19:
> > - move tdvps_management_check() to this patch
> > - typo: complete -> Complete in short log
> > ---
> >  arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
> >  arch/x86/kvm/vmx/tdx.h |  4 ++++
> >  2 files changed, 14 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 83dcaf5b6fbd..b8b168f74dfe 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -535,6 +535,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> >  	 */
> >  }
> >  
> > +static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
> > +{
> > +	/* Avoid costly SEAMCALL if no nmi was injected */
> 
> 	/* Avoid costly SEAMCALL if no NMI was injected. */
> 
> > +	if (vcpu->arch.nmi_injected)
> > +		vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
> > +							      TD_VCPU_PEND_NMI);
> > +}
> > +
> >  struct tdx_uret_msr {
> >  	u32 msr;
> >  	unsigned int slot;
> > @@ -663,6 +671,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> >  	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> >  	trace_kvm_exit(vcpu, KVM_ISA_VMX);
> >  
> > +	tdx_complete_interrupts(vcpu);
> > +
> >  	return EXIT_FASTPATH_NONE;
> >  }
> >  
> > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > index 44eab734e702..0d8a98feb58e 100644
> > --- a/arch/x86/kvm/vmx/tdx.h
> > +++ b/arch/x86/kvm/vmx/tdx.h
> > @@ -142,6 +142,8 @@ static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
> >  			 "Invalid TD VMCS access for 16-bit field");
> >  }
> >  
> > +static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
> 
> Is this intended to be a stub or is it expected to be fleshed out with
> some checks?

It was used to check if the field id matches the access size in bits.
We should make tdvps_vmcs_check() common for vmcs, management and
state_non_arch.
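
A possible shape for that common check, as a sketch only (the
tdvps_field_width() helper below is hypothetical; it would decode the
access size from the metadata field ID, whose real encoding comes from
the TDX module ABI and is not shown in this patch):

	#define TDX_BUILD_TDVPS_CHECK(uclass, lclass)					\
	static __always_inline void tdvps_##lclass##_check(u64 field, u8 bits)		\
	{										\
		BUILD_BUG_ON_MSG(tdvps_field_width(field) &&				\
				 tdvps_field_width(field) != bits,			\
				 "Invalid TD " #uclass " access width");		\
	}

	TDX_BUILD_TDVPS_CHECK(VMCS, vmcs);
	TDX_BUILD_TDVPS_CHECK(MANAGEMENT, management);
	TDX_BUILD_TDVPS_CHECK(STATE_NON_ARCH, state_non_arch);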
Binbin Wu April 23, 2024, 1:15 p.m. UTC | #3
On 4/17/2024 2:23 AM, Reinette Chatre wrote:
> Hi Isaku,
>
> (In shortlog "tdexit" can be "TD exit" to be consistent with
> documentation.)
>
> On 2/26/2024 12:26 AM, isaku.yamahata@intel.com wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> This corresponds to VMX __vmx_complete_interrupts().  Because TDX
>> virtualize vAPIC, KVM only needs to care NMI injection.
> This seems to be the first appearance of NMI and the changelog
> is very brief. How about expanding it with:
>
> "This corresponds to VMX __vmx_complete_interrupts().  Because TDX
>   virtualize vAPIC, KVM only needs to care about NMI injection.
   ^
   virtualizes

Also, does it need to mention that non-NMI interrupts are handled by the
posted-interrupt mechanism?

For example:

"This corresponds to VMX __vmx_complete_interrupts().  Because TDX
  virtualizes vAPIC, and non-NMI interrupts are delivered using the
  posted-interrupt mechanism, KVM only needs to care about NMI injection.
...
"

>
>   KVM can request TDX to inject an NMI into a guest TD vCPU when the
>   vCPU is not active. TDX will attempt to inject an NMI as soon as
>   possible on TD entry. NMI injection is managed by writing to (to
>   inject NMI) and reading from (to get status of NMI injection)
>   the PEND_NMI field within the TDX vCPU scope metadata (Trust
>   Domain Virtual Processor State (TDVPS)).
>
>   Update KVM's NMI status on TD exit by checking whether a requested
>   NMI has been injected into the TD. Reading the metadata via SEAMCALL
>   is expensive so only perform the check if an NMI was injected.
>
>   This is the first need to access vCPU scope metadata in the
>   "management" class. Ensure that needed accessor is available.
> "
>
Reinette Chatre April 23, 2024, 2:48 p.m. UTC | #4
On 4/23/2024 6:15 AM, Binbin Wu wrote:
> 
> 
> On 4/17/2024 2:23 AM, Reinette Chatre wrote:
>> Hi Isaku,
>>
>> (In shortlog "tdexit" can be "TD exit" to be consistent with
>> documentation.)
>>
>> On 2/26/2024 12:26 AM, isaku.yamahata@intel.com wrote:
>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>
>>> This corresponds to VMX __vmx_complete_interrupts().  Because TDX
>>> virtualize vAPIC, KVM only needs to care NMI injection.
>> This seems to be the first appearance of NMI and the changelog
>> is very brief. How about expanding it with:
>>
>> "This corresponds to VMX __vmx_complete_interrupts().  Because TDX
>>   virtualize vAPIC, KVM only needs to care about NMI injection.
>   ^
>   virtualizes
> 
> Also, does it need to mention that non-NMI interrupts are handled by the posted-interrupt mechanism?
> 
> For example:
> 
> "This corresponds to VMX __vmx_complete_interrupts().  Because TDX
>  virtualizes vAPIC, and non-NMI interrupts are delivered using the
>  posted-interrupt mechanism, KVM only needs to care about NMI injection.
> ...
> "
> 

Thank you Binbin. Looks good to me.

Reinette
Yuan Yao June 17, 2024, 8:07 a.m. UTC | #5
On Mon, Feb 26, 2024 at 12:26:27AM -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> This corresponds to VMX __vmx_complete_interrupts().  Because TDX
> virtualize vAPIC, KVM only needs to care NMI injection.
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> v19:
> - move tdvps_management_check() to this patch
> - typo: complete -> Complete in short log
> ---
>  arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
>  arch/x86/kvm/vmx/tdx.h |  4 ++++
>  2 files changed, 14 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 83dcaf5b6fbd..b8b168f74dfe 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -535,6 +535,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  	 */
>  }
>
> +static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
> +{
> +	/* Avoid costly SEAMCALL if no nmi was injected */
> +	if (vcpu->arch.nmi_injected)
> +		vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
> +							      TD_VCPU_PEND_NMI);
> +}

Looks like this leads to an NMI injection delay, or the NMI even won't be
reinjected if KVM_REQ_EVENT is not set on the target cpu
when more than one NMI is pending there.

On a normal VM, KVM uses the NMI-window vmexit in the
successful-injection case to raise KVM_REQ_EVENT again for the remaining
pending NMIs, see handle_nmi_window(). KVM also checks the
vectoring info after VMEXIT for the case where the NMI was not
injected successfully in this vmentry/vmexit round, and
raises KVM_REQ_EVENT to try again, see __vmx_complete_interrupts().

In TDX, considering there's no way to get the vectoring info or
handle an NMI-window vmexit, the check below should cover both
scenarios for NMI injection:

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e9c9a185bb7b..9edf446acd3b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -835,9 +835,12 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
 {
        /* Avoid costly SEAMCALL if no nmi was injected */
-       if (vcpu->arch.nmi_injected)
+       if (vcpu->arch.nmi_injected) {
                vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
                                                              TD_VCPU_PEND_NMI);
+               if (vcpu->arch.nmi_injected || vcpu->arch.nmi_pending)
+                       kvm_make_request(KVM_REQ_EVENT, vcpu);
+       }
 }

> +
>  struct tdx_uret_msr {
>  	u32 msr;
>  	unsigned int slot;
> @@ -663,6 +671,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>  	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
>  	trace_kvm_exit(vcpu, KVM_ISA_VMX);
>
> +	tdx_complete_interrupts(vcpu);
> +
>  	return EXIT_FASTPATH_NONE;
>  }
>
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 44eab734e702..0d8a98feb58e 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -142,6 +142,8 @@ static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
>  			 "Invalid TD VMCS access for 16-bit field");
>  }
>
> +static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
> +
>  #define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass)				\
>  static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx,	\
>  							u32 field)		\
> @@ -200,6 +202,8 @@ TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
>  TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
>  TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
>
> +TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
> +
>  static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
>  {
>  	struct tdx_module_args out;
> --
> 2.25.1
>
>
Binbin Wu June 17, 2024, 9:07 a.m. UTC | #6
On 6/17/2024 4:07 PM, Yuan Yao wrote:
> On Mon, Feb 26, 2024 at 12:26:27AM -0800, isaku.yamahata@intel.com wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> This corresponds to VMX __vmx_complete_interrupts().  Because TDX
>> virtualize vAPIC, KVM only needs to care NMI injection.
>>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
>> ---
>> v19:
>> - move tdvps_management_check() to this patch
>> - typo: complete -> Complete in short log
>> ---
>>   arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
>>   arch/x86/kvm/vmx/tdx.h |  4 ++++
>>   2 files changed, 14 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index 83dcaf5b6fbd..b8b168f74dfe 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -535,6 +535,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>>   	 */
>>   }
>>
>> +static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
>> +{
>> +	/* Avoid costly SEAMCALL if no nmi was injected */
>> +	if (vcpu->arch.nmi_injected)
>> +		vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
>> +							      TD_VCPU_PEND_NMI);
>> +}
> Looks like this leads to an NMI injection delay, or the NMI even won't be
> reinjected if KVM_REQ_EVENT is not set on the target cpu
> when more than one NMI is pending there.
>
> On a normal VM, KVM uses the NMI-window vmexit in the
> successful-injection case to raise KVM_REQ_EVENT again for the remaining
> pending NMIs, see handle_nmi_window(). KVM also checks the
> vectoring info after VMEXIT for the case where the NMI was not
> injected successfully in this vmentry/vmexit round, and
> raises KVM_REQ_EVENT to try again, see __vmx_complete_interrupts().
>
> In TDX, considering there's no way to get the vectoring info or
> handle an NMI-window vmexit, the check below should cover both
> scenarios for NMI injection:
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e9c9a185bb7b..9edf446acd3b 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -835,9 +835,12 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>   static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
>   {
>          /* Avoid costly SEAMCALL if no nmi was injected */
> -       if (vcpu->arch.nmi_injected)
> +       if (vcpu->arch.nmi_injected) {
>                  vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
>                                                                TD_VCPU_PEND_NMI);
> +               if (vcpu->arch.nmi_injected || vcpu->arch.nmi_pending)
> +                       kvm_make_request(KVM_REQ_EVENT, vcpu);

For nmi_injected, it should be OK because TD_VCPU_PEND_NMI is still set.
But for nmi_pending, it should be checked and the event raised.

I remember there was a discussion in the following link:
https://lore.kernel.org/kvm/20240402065254.GY2444378@ls.amr.corp.intel.com/
It said tdx_vcpu_run() will ignore force_immediate_exit.
If force_immediate_exit is ignored for TDX, then the nmi_pending handling
could still be delayed if the previous NMI was injected successfully.


> +       }
>   }
>
>> +
>>   struct tdx_uret_msr {
>>   	u32 msr;
>>   	unsigned int slot;
>> @@ -663,6 +671,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>>   	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
>>   	trace_kvm_exit(vcpu, KVM_ISA_VMX);
>>
>> +	tdx_complete_interrupts(vcpu);
>> +
>>   	return EXIT_FASTPATH_NONE;
>>   }
>>
>> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
>> index 44eab734e702..0d8a98feb58e 100644
>> --- a/arch/x86/kvm/vmx/tdx.h
>> +++ b/arch/x86/kvm/vmx/tdx.h
>> @@ -142,6 +142,8 @@ static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
>>   			 "Invalid TD VMCS access for 16-bit field");
>>   }
>>
>> +static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
>> +
>>   #define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass)				\
>>   static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx,	\
>>   							u32 field)		\
>> @@ -200,6 +202,8 @@ TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
>>   TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
>>   TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
>>
>> +TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
>> +
>>   static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
>>   {
>>   	struct tdx_module_args out;
>> --
>> 2.25.1
>>
>>
Yuan Yao June 18, 2024, 3:28 a.m. UTC | #7
On Mon, Jun 17, 2024 at 05:07:56PM +0800, Binbin Wu wrote:
>
>
> On 6/17/2024 4:07 PM, Yuan Yao wrote:
> > On Mon, Feb 26, 2024 at 12:26:27AM -0800, isaku.yamahata@intel.com wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > >
> > > This corresponds to VMX __vmx_complete_interrupts().  Because TDX
> > > virtualize vAPIC, KVM only needs to care NMI injection.
> > >
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> > > Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> > > ---
> > > v19:
> > > - move tdvps_management_check() to this patch
> > > - typo: complete -> Complete in short log
> > > ---
> > >   arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
> > >   arch/x86/kvm/vmx/tdx.h |  4 ++++
> > >   2 files changed, 14 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 83dcaf5b6fbd..b8b168f74dfe 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -535,6 +535,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > >   	 */
> > >   }
> > >
> > > +static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
> > > +{
> > > +	/* Avoid costly SEAMCALL if no nmi was injected */
> > > +	if (vcpu->arch.nmi_injected)
> > > +		vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
> > > +							      TD_VCPU_PEND_NMI);
> > > +}
> > Looks like this leads to an NMI injection delay, or the NMI even won't be
> > reinjected if KVM_REQ_EVENT is not set on the target cpu
> > when more than one NMI is pending there.
> >
> > On a normal VM, KVM uses the NMI-window vmexit in the
> > successful-injection case to raise KVM_REQ_EVENT again for the remaining
> > pending NMIs, see handle_nmi_window(). KVM also checks the
> > vectoring info after VMEXIT for the case where the NMI was not
> > injected successfully in this vmentry/vmexit round, and
> > raises KVM_REQ_EVENT to try again, see __vmx_complete_interrupts().
> >
> > In TDX, considering there's no way to get the vectoring info or
> > handle an NMI-window vmexit, the check below should cover both
> > scenarios for NMI injection:
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index e9c9a185bb7b..9edf446acd3b 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -835,9 +835,12 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> >   static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
> >   {
> >          /* Avoid costly SEAMCALL if no nmi was injected */
> > -       if (vcpu->arch.nmi_injected)
> > +       if (vcpu->arch.nmi_injected) {
> >                  vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
> >                                                                TD_VCPU_PEND_NMI);
> > +               if (vcpu->arch.nmi_injected || vcpu->arch.nmi_pending)
> > +                       kvm_make_request(KVM_REQ_EVENT, vcpu);
>
> For nmi_injected, it should be OK because TD_VCPU_PEND_NMI is still set.
> But for nmi_pending, it should be checked and the event raised.

Right, I just forgot the tdx module can do more than "hardware":

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e9c9a185bb7b..3530a4882efc 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -835,9 +835,16 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
 {
        /* Avoid costly SEAMCALL if no nmi was injected */
-       if (vcpu->arch.nmi_injected)
+       if (vcpu->arch.nmi_injected) {
                vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
                                                              TD_VCPU_PEND_NMI);
+               /*
+                * tdx module will retry injection in case of TD_VCPU_PEND_NMI,
+                * so don't need to set KVM_REQ_EVENT for it again.
+                */
+               if (!vcpu->arch.nmi_injected && vcpu->arch.nmi_pending)
+                       kvm_make_request(KVM_REQ_EVENT, vcpu);
+       }
 }

>
> I remember there was a discussion in the following link:
> https://lore.kernel.org/kvm/20240402065254.GY2444378@ls.amr.corp.intel.com/
> It said tdx_vcpu_run() will ignore force_immediate_exit.
> If force_immediate_exit is ignored for TDX, then the nmi_pending handling
> could still be delayed if the previous NMI was injected successfully.

Yes, not sure about the possibility of meeting this in a real use
case, but I know it happens in some testing, e.g. the kvm
unit test's multiple NMI testing.

>
>
> > +       }
> >   }
> >
> > > +
> > >   struct tdx_uret_msr {
> > >   	u32 msr;
> > >   	unsigned int slot;
> > > @@ -663,6 +671,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> > >   	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> > >   	trace_kvm_exit(vcpu, KVM_ISA_VMX);
> > >
> > > +	tdx_complete_interrupts(vcpu);
> > > +
> > >   	return EXIT_FASTPATH_NONE;
> > >   }
> > >
> > > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > > index 44eab734e702..0d8a98feb58e 100644
> > > --- a/arch/x86/kvm/vmx/tdx.h
> > > +++ b/arch/x86/kvm/vmx/tdx.h
> > > @@ -142,6 +142,8 @@ static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
> > >   			 "Invalid TD VMCS access for 16-bit field");
> > >   }
> > >
> > > +static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
> > > +
> > >   #define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass)				\
> > >   static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx,	\
> > >   							u32 field)		\
> > > @@ -200,6 +202,8 @@ TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
> > >   TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
> > >   TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
> > >
> > > +TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
> > > +
> > >   static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
> > >   {
> > >   	struct tdx_module_args out;
> > > --
> > > 2.25.1
> > >
> > >
>
Binbin Wu July 8, 2024, 6:11 a.m. UTC | #8
On 6/18/2024 11:28 AM, Yuan Yao wrote:
> On Mon, Jun 17, 2024 at 05:07:56PM +0800, Binbin Wu wrote:
>>
>> On 6/17/2024 4:07 PM, Yuan Yao wrote:
>>> On Mon, Feb 26, 2024 at 12:26:27AM -0800, isaku.yamahata@intel.com wrote:
>>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>>
>>>> This corresponds to VMX __vmx_complete_interrupts().  Because TDX
>>>> virtualize vAPIC, KVM only needs to care NMI injection.
>>>>
>>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>>>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>>> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
>>>> ---
>>>> v19:
>>>> - move tdvps_management_check() to this patch
>>>> - typo: complete -> Complete in short log
>>>> ---
>>>>    arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
>>>>    arch/x86/kvm/vmx/tdx.h |  4 ++++
>>>>    2 files changed, 14 insertions(+)
>>>>
>>>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>>>> index 83dcaf5b6fbd..b8b168f74dfe 100644
>>>> --- a/arch/x86/kvm/vmx/tdx.c
>>>> +++ b/arch/x86/kvm/vmx/tdx.c
>>>> @@ -535,6 +535,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>>>>    	 */
>>>>    }
>>>>
>>>> +static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	/* Avoid costly SEAMCALL if no nmi was injected */
>>>> +	if (vcpu->arch.nmi_injected)
>>>> +		vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
>>>> +							      TD_VCPU_PEND_NMI);
>>>> +}
>>> Looks like this leads to an NMI injection delay, or the NMI even won't be
>>> reinjected if KVM_REQ_EVENT is not set on the target cpu
>>> when more than one NMI is pending there.
>>>
>>> On a normal VM, KVM uses the NMI-window vmexit in the
>>> successful-injection case to raise KVM_REQ_EVENT again for the remaining
>>> pending NMIs, see handle_nmi_window(). KVM also checks the
>>> vectoring info after VMEXIT for the case where the NMI was not
>>> injected successfully in this vmentry/vmexit round, and
>>> raises KVM_REQ_EVENT to try again, see __vmx_complete_interrupts().
>>>
>>> In TDX, considering there's no way to get the vectoring info or
>>> handle an NMI-window vmexit, the check below should cover both
>>> scenarios for NMI injection:
>>>
>>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>>> index e9c9a185bb7b..9edf446acd3b 100644
>>> --- a/arch/x86/kvm/vmx/tdx.c
>>> +++ b/arch/x86/kvm/vmx/tdx.c
>>> @@ -835,9 +835,12 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>>>    static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
>>>    {
>>>           /* Avoid costly SEAMCALL if no nmi was injected */
>>> -       if (vcpu->arch.nmi_injected)
>>> +       if (vcpu->arch.nmi_injected) {
>>>                   vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
>>>                                                                 TD_VCPU_PEND_NMI);
>>> +               if (vcpu->arch.nmi_injected || vcpu->arch.nmi_pending)
>>> +                       kvm_make_request(KVM_REQ_EVENT, vcpu);
>> For nmi_injected, it should be OK because TD_VCPU_PEND_NMI is still set.
>> But for nmi_pending, it should be checked and the event raised.
> Right, I just forgot the tdx module can do more than "hardware":
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e9c9a185bb7b..3530a4882efc 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -835,9 +835,16 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>   static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
>   {
>          /* Avoid costly SEAMCALL if no nmi was injected */
> -       if (vcpu->arch.nmi_injected)
> +       if (vcpu->arch.nmi_injected) {
>                  vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
>                                                                TD_VCPU_PEND_NMI);
> +               /*
> +                * tdx module will retry injection in case of TD_VCPU_PEND_NMI,
> +                * so don't need to set KVM_REQ_EVENT for it again.
> +                */
> +               if (!vcpu->arch.nmi_injected && vcpu->arch.nmi_pending)
> +                       kvm_make_request(KVM_REQ_EVENT, vcpu);
> +       }
>   }
>
>> I remember there was a discussion in the following link:
>> https://lore.kernel.org/kvm/20240402065254.GY2444378@ls.amr.corp.intel.com/
>> It said tdx_vcpu_run() will ignore force_immediate_exit.
>> If force_immediate_exit is ignored for TDX, then the nmi_pending handling
>> could still be delayed if the previous NMI was injected successfully.
> Yes, not sure about the possibility of meeting this in a real use
> case, but I know it happens in some testing, e.g. the kvm
> unit test's multiple NMI testing.

Delaying the pending NMI to the next VM exit will cause a problem.
In the current Linux kernel NMI handling code, back-to-back NMIs are
checked when handling an unknown NMI.
Here are the comments in arch/x86/kernel/nmi.c:
         /*
          * Only one NMI can be latched at a time.  To handle
          * this we may process multiple nmi handlers at once to
          * cover the case where an NMI is dropped.  The downside
          * to this approach is we may process an NMI prematurely,
          * while its real NMI is sitting latched.  This will cause
          * an unknown NMI on the next run of the NMI processing.
          *
          * We tried to flag that condition above, by setting the
          * swallow_nmi flag when we process more than one event.
          * This condition is also only present on the second half
          * of a back-to-back NMI, so we flag that condition too.
          *
          * If both are true, we assume we already processed this
          * NMI previously and we swallow it. ...
          */
Assume there are two NMIs pending in KVM, i.e. nmi_pending is 2.
KVM injects one NMI by setting the TD_VCPU_PEND_NMI field and
nmi_pending is decreased to 1.
The remaining pending NMI will be delayed until the next VM exit, so it
will not be detected as the second half of a back-to-back NMI in the guest.
It will then be considered a real unknown NMI, and no one may handle it
(because its work could already have been done in the previous NMI handler).
In the end, the guest kernel will fire an error message for the "unhandled"
unknown NMI, and even panic if unknown_nmi_panic or
panic_on_unrecovered_nmi is set to true.


Since KVM doesn't have the capability to get the NMI blocking status or
request an NMI-window exit for TDX, how about limiting nmi_pending to 1
for TDX?
I.e. if TD_VCPU_PEND_NMI is not set, limit nmi_pending to 1 in
process_nmi();
     if TD_VCPU_PEND_NMI is set, limit nmi_pending to 0 in process_nmi().
A sketch of this idea follows.
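
A rough sketch, modeled on KVM's process_nmi() in arch/x86/kvm/x86.c
(is_td_vcpu() and the tdx_pend_nmi_is_set() helper are assumptions here,
and the non-TDX limit logic is simplified):

	static void process_nmi(struct kvm_vcpu *vcpu)
	{
		unsigned int limit;

		if (is_td_vcpu(vcpu))
			/* PEND_NMI holds at most one virtual NMI at a time. */
			limit = tdx_pend_nmi_is_set(vcpu) ? 0 : 1;
		else
			limit = 2;

		vcpu->arch.nmi_pending += atomic_xchg(&vcpu->arch.nmi_queued, 0);
		vcpu->arch.nmi_pending = min(vcpu->arch.nmi_pending, limit);

		if (vcpu->arch.nmi_pending)
			kvm_make_request(KVM_REQ_EVENT, vcpu);
	}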

I checked some history about when the nmi_pending limit was changed to 2.
The discussion in the
link https://lore.kernel.org/all/4E723A8A.7050405@redhat.com/ said:
" ... the NMI handlers are now being reworked to handle
just one NMI source (hopefully the cheapest) in the handler, and if we
detect a back-to-back NMI, handle all possible NMI sources."
IIUC, the change in NMI handlers described above refers to the
patch set "x86, nmi: new NMI handling routines"
https://lore.kernel.org/all/1317409584-23662-1-git-send-email-dzickus@redhat.com/

I noticed that in v6 of the patch series there was an optimization, but
it was removed in v7.
v6 link: https://lore.kernel.org/all/1316805435-14832-5-git-send-email-dzickus@redhat.com/
v7 link: https://lore.kernel.org/all/1317409584-23662-5-git-send-email-dzickus@redhat.com/
The optimization code in v6, removed in v7:
-static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)
+static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs, bool b2b)
 {
 	struct nmi_desc *desc = nmi_to_desc(type);
 	struct nmiaction *a;
@@ -89,6 +89,15 @@ static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)

 		handled += a->handler(type, regs);

+		/*
+		 * Optimization: only loop once if this is not a
+		 * back-to-back NMI.  The idea is nothing is dropped
+		 * on the first NMI, only on the second of a back-to-back
+		 * NMI.  No need to waste cycles going through all the
+		 * handlers.
+		 */
+		if (!b2b && handled)
+			break;
 	}

In the end, the back-to-back NMI optimization is not used in the Linux kernel.
So the kernel is able to handle all NMI sources even if we drop later NMIs
when there is already one virtual NMI pending for TDX.

Patch

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 83dcaf5b6fbd..b8b168f74dfe 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -535,6 +535,14 @@  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	 */
 }
 
+static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
+{
+	/* Avoid costly SEAMCALL if no nmi was injected */
+	if (vcpu->arch.nmi_injected)
+		vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
+							      TD_VCPU_PEND_NMI);
+}
+
 struct tdx_uret_msr {
 	u32 msr;
 	unsigned int slot;
@@ -663,6 +671,8 @@  fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
 	trace_kvm_exit(vcpu, KVM_ISA_VMX);
 
+	tdx_complete_interrupts(vcpu);
+
 	return EXIT_FASTPATH_NONE;
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 44eab734e702..0d8a98feb58e 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -142,6 +142,8 @@  static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
 			 "Invalid TD VMCS access for 16-bit field");
 }
 
+static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
+
 #define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass)				\
 static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx,	\
 							u32 field)		\
@@ -200,6 +202,8 @@  TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
 TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
 TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
 
+TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
+
 static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
 {
 	struct tdx_module_args out;