[v19,044/130] KVM: TDX: Do TDX specific vcpu initialization

Message ID	d6a21fe6ea9eb53c24b6527ef8e5a07f0c2e8806.1708933498.git.isaku.yamahata@intel.com (mailing list archive)
State	New, archived

Headers

From: isaku.yamahata@intel.com
To: kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org
Cc: isaku.yamahata@intel.com,
	isaku.yamahata@gmail.com,
	Paolo Bonzini <pbonzini@redhat.com>,
	erdemaktas@google.com,
	Sean Christopherson <seanjc@google.com>,
	Sagi Shahar <sagis@google.com>,
	Kai Huang <kai.huang@intel.com>,
	chen.bo@intel.com,
	hang.yuan@intel.com,
	tina.zhang@intel.com,
	Sean Christopherson <sean.j.christopherson@intel.com>
Subject: [PATCH v19 044/130] KVM: TDX: Do TDX specific vcpu initialization
Date: Mon, 26 Feb 2024 00:25:46 -0800
Message-Id: 
 <d6a21fe6ea9eb53c24b6527ef8e5a07f0c2e8806.1708933498.git.isaku.yamahata@intel.com>
In-Reply-To: <cover.1708933498.git.isaku.yamahata@intel.com>
References: <cover.1708933498.git.isaku.yamahata@intel.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

[v19,001/130] x86/virt/tdx: Rename _offset to _member for TD_SYSINFO_MAP() macro

Commit Message

Isaku Yamahata Feb. 26, 2024, 8:25 a.m. UTC

From: Isaku Yamahata <isaku.yamahata@intel.com>

A TD guest vcpu needs TDX-specific initialization before it can run.  Make
KVM_MEMORY_ENCRYPT_OP usable at vcpu scope, add a new sub-command,
KVM_TDX_INIT_VCPU, and implement the callback for it.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
v18:
- Use tdh_sys_rd() instead of struct tdsysinfo_struct.
- Rename tdx_reclaim_td_page() => tdx_reclaim_control_page()
- Remove the change of tools/arch/x86/include/uapi/asm/kvm.h.
---
 arch/x86/include/asm/kvm-x86-ops.h |   1 +
 arch/x86/include/asm/kvm_host.h    |   1 +
 arch/x86/include/uapi/asm/kvm.h    |   1 +
 arch/x86/kvm/vmx/main.c            |   9 ++
 arch/x86/kvm/vmx/tdx.c             | 184 ++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h             |   8 ++
 arch/x86/kvm/vmx/x86_ops.h         |   4 +
 arch/x86/kvm/x86.c                 |   6 +
 8 files changed, 211 insertions(+), 3 deletions(-)

Comments

Chao Gao March 21, 2024, 5:43 a.m. UTC | #1

>+/* VMM can pass one 64bit auxiliary data to vcpu via RCX for guest BIOS. */
>+static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
>+{
>+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>+	struct vcpu_tdx *tdx = to_tdx(vcpu);
>+	unsigned long *tdvpx_pa = NULL;
>+	unsigned long tdvpr_pa;
>+	unsigned long va;
>+	int ret, i;
>+	u64 err;
>+
>+	if (is_td_vcpu_created(tdx))
>+		return -EINVAL;
>+
>+	/*
>+	 * vcpu_free method frees allocated pages.  Avoid a partial setup that
>+	 * the method can't handle.
>+	 */
>+	va = __get_free_page(GFP_KERNEL_ACCOUNT);
>+	if (!va)
>+		return -ENOMEM;
>+	tdvpr_pa = __pa(va);
>+
>+	tdvpx_pa = kcalloc(tdx_info->nr_tdvpx_pages, sizeof(*tdx->tdvpx_pa),
>+			   GFP_KERNEL_ACCOUNT);
>+	if (!tdvpx_pa) {
>+		ret = -ENOMEM;
>+		goto free_tdvpr;
>+	}
>+	for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
>+		va = __get_free_page(GFP_KERNEL_ACCOUNT);
>+		if (!va) {
>+			ret = -ENOMEM;
>+			goto free_tdvpx;
>+		}
>+		tdvpx_pa[i] = __pa(va);
>+	}
>+
>+	err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
>+	if (KVM_BUG_ON(err, vcpu->kvm)) {
>+		ret = -EIO;
>+		pr_tdx_error(TDH_VP_CREATE, err, NULL);
>+		goto free_tdvpx;
>+	}
>+	tdx->tdvpr_pa = tdvpr_pa;
>+
>+	tdx->tdvpx_pa = tdvpx_pa;
>+	for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {

Can you merge the for-loop above into this one? then ...

>+		err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
>+		if (KVM_BUG_ON(err, vcpu->kvm)) {
>+			pr_tdx_error(TDH_VP_ADDCX, err, NULL);

>+			for (; i < tdx_info->nr_tdvpx_pages; i++) {
>+				free_page((unsigned long)__va(tdvpx_pa[i]));
>+				tdvpx_pa[i] = 0;
>+			}

... no need to free remaining pages.
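
For illustration, a minimal userspace sketch of the merged loop, under stated assumptions: calloc() stands in for __get_free_page(), the hypothetical mock_addcx() stands in for tdh_vp_addcx(), and NR_TDVPX_PAGES replaces tdx_info->nr_tdvpx_pages. None of these names are from the patch. The point is that when allocation and donation happen in the same iteration, only the page from the failing iteration is still owned by the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical constant; the patch reads tdx_info->nr_tdvpx_pages. */
#define NR_TDVPX_PAGES 4

/*
 * Hypothetical stand-in for tdh_vp_addcx(): fails for the page index stored
 * in fail_at, and never fails while fail_at is negative.
 */
static int fail_at = -1;

static int mock_addcx(int idx)
{
	return idx == fail_at ? -1 : 0;
}

/*
 * Allocate and donate each page in the same iteration.  On failure, only the
 * page allocated in the failing iteration is freed here; pages donated in
 * earlier iterations stay in pages[] for the teardown path to reclaim.
 * Returns 0 or a negative errno; *donated counts the pages handed over.
 */
static int add_tdvpx_pages(void *pages[NR_TDVPX_PAGES], int *donated)
{
	assert(pages && donated);

	*donated = 0;
	for (int i = 0; i < NR_TDVPX_PAGES; i++) {
		void *page = calloc(1, 4096);

		if (!page)
			return -ENOMEM;
		if (mock_addcx(i)) {
			free(page);
			return -EIO;
		}
		pages[i] = page;
		(*donated)++;
	}
	return 0;
}
```

With this shape the inner "free the remaining pages" loop disappears, because no not-yet-donated pages exist beyond the current one.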

>+			/* vcpu_free method frees TDVPX and TDR donated to TDX */
>+			return -EIO;
>+		}
>+	}
>+
>+	err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
>+	if (KVM_BUG_ON(err, vcpu->kvm)) {
>+		pr_tdx_error(TDH_VP_INIT, err, NULL);
>+		return -EIO;
>+	}
>+
>+	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>+	tdx->td_vcpu_created = true;
>+	return 0;
>+
>+free_tdvpx:
>+	for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
>+		if (tdvpx_pa[i])
>+			free_page((unsigned long)__va(tdvpx_pa[i]));
>+		tdvpx_pa[i] = 0;
>+	}
>+	kfree(tdvpx_pa);
>+	tdx->tdvpx_pa = NULL;
>+free_tdvpr:
>+	if (tdvpr_pa)
>+		free_page((unsigned long)__va(tdvpr_pa));
>+	tdx->tdvpr_pa = 0;
>+
>+	return ret;
>+}
>+
>+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>+{
>+	struct msr_data apic_base_msr;
>+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
>+	struct vcpu_tdx *tdx = to_tdx(vcpu);
>+	struct kvm_tdx_cmd cmd;
>+	int ret;
>+
>+	if (tdx->initialized)
>+		return -EINVAL;
>+
>+	if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))

These checks look random; e.g., I am not sure why is_td_created() isn't checked here.

A few helper functions and boolean variables are added to track which stage the
TD or TD vCPU is in. e.g.,

is_hkid_assigned()
is_td_finalized()
is_td_created()
tdx->initialized
td_vcpu_created

Instead of doing this, I am wondering if adding two state machines, one for the
TD and one for the TD vCPU, would make the implementation clearer and easier to extend.
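
As a rough sketch of the idea, assuming linear lifecycles: the enum names and the helper below are hypothetical and do not appear in the patch. The range check mirrors the quoted condition `!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx)`, and the vCPU check mirrors `tdx->initialized`.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical TD lifecycle states, ordered by progress. */
enum td_state {
	TD_UNINIT,
	TD_HKID_ASSIGNED,
	TD_INITIALIZED,
	TD_FINALIZED,
};

/* Hypothetical TD vCPU lifecycle states, ordered by progress. */
enum td_vcpu_state {
	TD_VCPU_UNINIT,
	TD_VCPU_INITIALIZED,
};

/*
 * Move a vCPU to TD_VCPU_INITIALIZED if both state machines allow it:
 * the TD must have an HKID and must not be finalized yet, and the vCPU
 * must not have been initialized before.  Returns 0 or -EINVAL.
 */
static int td_vcpu_mark_initialized(enum td_state td, enum td_vcpu_state *vcpu)
{
	assert(vcpu);

	if (*vcpu != TD_VCPU_UNINIT)
		return -EINVAL;
	if (td < TD_HKID_ASSIGNED || td >= TD_FINALIZED)
		return -EINVAL;

	*vcpu = TD_VCPU_INITIALIZED;
	return 0;
}
```

A single ordered state per object replaces the scattered booleans, so each ioctl can state its precondition as one range comparison.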

>+		return -EINVAL;
>+
>+	if (copy_from_user(&cmd, argp, sizeof(cmd)))
>+		return -EFAULT;
>+
>+	if (cmd.error)
>+		return -EINVAL;
>+
>+	/* Currently only KVM_TDX_INIT_VCPU is defined for vcpu operation. */
>+	if (cmd.flags || cmd.id != KVM_TDX_INIT_VCPU)
>+		return -EINVAL;

Even though KVM_TDX_INIT_VCPU is the only supported command, it is worthwhile to
use a switch-case statement. New commands can be added easily without the need
to refactor this function first.
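
A minimal sketch of such a dispatcher, in userspace C. The enum mirrors the one in the quoted uapi hunk; `struct tdx_cmd` is a simplified stand-in for struct kvm_tdx_cmd (its field widths are an assumption), and mock_vcpu_init() is a hypothetical placeholder for tdx_td_vcpu_init().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirrors the enum in the quoted arch/x86/include/uapi/asm/kvm.h hunk. */
enum kvm_tdx_cmd_id {
	KVM_TDX_CAPABILITIES = 0,
	KVM_TDX_INIT_VM,
	KVM_TDX_INIT_VCPU,

	KVM_TDX_CMD_NR_MAX,
};

/* Simplified stand-in for struct kvm_tdx_cmd; field widths are assumed. */
struct tdx_cmd {
	uint32_t id;
	uint32_t flags;
	uint64_t data;
	uint64_t error;
};

/* Hypothetical placeholder for tdx_td_vcpu_init(). */
static int mock_vcpu_init(uint64_t vcpu_rcx)
{
	(void)vcpu_rcx;
	return 0;
}

/* Validate the common fields once, then dispatch on the sub-command. */
static int vcpu_cmd_dispatch(const struct tdx_cmd *cmd)
{
	assert(cmd);

	if (cmd->error || cmd->flags)
		return -EINVAL;

	switch (cmd->id) {
	case KVM_TDX_INIT_VCPU:
		return mock_vcpu_init(cmd->data);
	default:
		return -EINVAL;
	}
}
```

Adding another vcpu-scoped sub-command then only needs a new case label.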

>+
>+	/*
>+	 * As TDX requires X2APIC, set local apic mode to X2APIC.  User space
>+	 * VMM, e.g. qemu, is required to set CPUID[0x1].ecx.X2APIC=1 by
>+	 * KVM_SET_CPUID2.  Otherwise kvm_set_apic_base() will fail.
>+	 */
>+	apic_base_msr = (struct msr_data) {
>+		.host_initiated = true,
>+		.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
>+		(kvm_vcpu_is_reset_bsp(vcpu) ? MSR_IA32_APICBASE_BSP : 0),
>+	};
>+	if (kvm_set_apic_base(vcpu, &apic_base_msr))
>+		return -EINVAL;
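
To make the MSR value above concrete, here is a small userspace sketch that composes it. The constants are reproduced from the x86 IA32_APIC_BASE layout as an assumption (verify against the kernel headers before relying on them), and tdx_apic_base() is a hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, following the IA32_APIC_BASE MSR bit layout. */
#define APIC_DEFAULT_PHYS_BASE    0xfee00000ULL
#define MSR_IA32_APICBASE_BSP     (1ULL << 8)
#define X2APIC_ENABLE             (1ULL << 10)
#define MSR_IA32_APICBASE_ENABLE  (1ULL << 11)
#define LAPIC_MODE_X2APIC         (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE)

/* Hypothetical helper: build the APIC base value for a TD vCPU. */
static uint64_t tdx_apic_base(bool is_bsp)
{
	uint64_t base = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;

	if (is_bsp)
		base |= MSR_IA32_APICBASE_BSP;

	/* x2APIC mode requires both the enable bit and the x2APIC bit. */
	assert((base & LAPIC_MODE_X2APIC) == LAPIC_MODE_X2APIC);
	return base;
}
```

Under these assumed constants, the boot processor gets 0xfee00d00 and every other vCPU gets 0xfee00c00.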

Exporting kvm_vcpu_is_reset_bsp() and kvm_set_apic_base() should be done
here (rather than in a previous patch).

>+
>+	ret = tdx_td_vcpu_init(vcpu, (u64)cmd.data);
>+	if (ret)
>+		return ret;
>+
>+	tdx->initialized = true;
>+	return 0;
>+}
>+

>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index c002761bb662..2bd4b7c8fa51 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -6274,6 +6274,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> 	case KVM_SET_DEVICE_ATTR:
> 		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
> 		break;
>+	case KVM_MEMORY_ENCRYPT_OP:
>+		r = -ENOTTY;

Maybe -EINVAL is better, because previously trying to call this on a vCPU fd
failed with -EINVAL given ...

>+		if (!kvm_x86_ops.vcpu_mem_enc_ioctl)
>+			goto out;
>+		r = kvm_x86_ops.vcpu_mem_enc_ioctl(vcpu, argp);
>+		break;
> 	default:
> 		r = -EINVAL;

... this.
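
A small sketch of the behaviour being asked for, with hypothetical names: `struct mock_ops` stands in for struct kvm_x86_ops, and the opaque pointers stand in for the vcpu and argp arguments. When the optional callback is absent, the caller sees the same -EINVAL that the default: label produced before this case existed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the optional hook in struct kvm_x86_ops. */
struct mock_ops {
	int (*vcpu_mem_enc_ioctl)(void *vcpu, void *argp);
};

/* Hypothetical backend callback that always succeeds. */
static int mock_backend(void *vcpu, void *argp)
{
	(void)vcpu;
	(void)argp;
	return 0;
}

/*
 * Dispatch to the optional callback.  Returning -EINVAL when it is missing
 * keeps the errno userspace saw before the ioctl was wired up for vCPU fds.
 */
static int vcpu_mem_enc_op(const struct mock_ops *ops, void *vcpu, void *argp)
{
	assert(ops);

	if (!ops->vcpu_mem_enc_ioctl)
		return -EINVAL;
	return ops->vcpu_mem_enc_ioctl(vcpu, argp);
}
```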

> 	}
>-- 
>2.25.1
>
>

Isaku Yamahata March 21, 2024, 8:43 p.m. UTC | #2

On Thu, Mar 21, 2024 at 01:43:14PM +0800,
Chao Gao <chao.gao@intel.com> wrote:

> >+/* VMM can pass one 64bit auxiliary data to vcpu via RCX for guest BIOS. */
> >+static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> >+{
> >+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> >+	struct vcpu_tdx *tdx = to_tdx(vcpu);
> >+	unsigned long *tdvpx_pa = NULL;
> >+	unsigned long tdvpr_pa;
> >+	unsigned long va;
> >+	int ret, i;
> >+	u64 err;
> >+
> >+	if (is_td_vcpu_created(tdx))
> >+		return -EINVAL;
> >+
> >+	/*
> >+	 * vcpu_free method frees allocated pages.  Avoid a partial setup that
> >+	 * the method can't handle.
> >+	 */
> >+	va = __get_free_page(GFP_KERNEL_ACCOUNT);
> >+	if (!va)
> >+		return -ENOMEM;
> >+	tdvpr_pa = __pa(va);
> >+
> >+	tdvpx_pa = kcalloc(tdx_info->nr_tdvpx_pages, sizeof(*tdx->tdvpx_pa),
> >+			   GFP_KERNEL_ACCOUNT);
> >+	if (!tdvpx_pa) {
> >+		ret = -ENOMEM;
> >+		goto free_tdvpr;
> >+	}
> >+	for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> >+		va = __get_free_page(GFP_KERNEL_ACCOUNT);
> >+		if (!va) {
> >+			ret = -ENOMEM;
> >+			goto free_tdvpx;
> >+		}
> >+		tdvpx_pa[i] = __pa(va);
> >+	}
> >+
> >+	err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
> >+	if (KVM_BUG_ON(err, vcpu->kvm)) {
> >+		ret = -EIO;
> >+		pr_tdx_error(TDH_VP_CREATE, err, NULL);
> >+		goto free_tdvpx;
> >+	}
> >+	tdx->tdvpr_pa = tdvpr_pa;
> >+
> >+	tdx->tdvpx_pa = tdvpx_pa;
> >+	for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> 
> Can you merge the for-loop above into this one? then ...
> 
> >+		err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
> >+		if (KVM_BUG_ON(err, vcpu->kvm)) {
> >+			pr_tdx_error(TDH_VP_ADDCX, err, NULL);
> 
> >+			for (; i < tdx_info->nr_tdvpx_pages; i++) {
> >+				free_page((unsigned long)__va(tdvpx_pa[i]));
> >+				tdvpx_pa[i] = 0;
> >+			}
> 
> ... no need to free remaining pages.

Makes sense. Let me clean this up.


> >+			/* vcpu_free method frees TDVPX and TDR donated to TDX */
> >+			return -EIO;
> >+		}
> >+	}
> >+
> >+	err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
> >+	if (KVM_BUG_ON(err, vcpu->kvm)) {
> >+		pr_tdx_error(TDH_VP_INIT, err, NULL);
> >+		return -EIO;
> >+	}
> >+
> >+	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> >+	tdx->td_vcpu_created = true;
> >+	return 0;
> >+
> >+free_tdvpx:
> >+	for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> >+		if (tdvpx_pa[i])
> >+			free_page((unsigned long)__va(tdvpx_pa[i]));
> >+		tdvpx_pa[i] = 0;
> >+	}
> >+	kfree(tdvpx_pa);
> >+	tdx->tdvpx_pa = NULL;
> >+free_tdvpr:
> >+	if (tdvpr_pa)
> >+		free_page((unsigned long)__va(tdvpr_pa));
> >+	tdx->tdvpr_pa = 0;
> >+
> >+	return ret;
> >+}
> >+
> >+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> >+{
> >+	struct msr_data apic_base_msr;
> >+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> >+	struct vcpu_tdx *tdx = to_tdx(vcpu);
> >+	struct kvm_tdx_cmd cmd;
> >+	int ret;
> >+
> >+	if (tdx->initialized)
> >+		return -EINVAL;
> >+
> >+	if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
> 
> These checks look random; e.g., I am not sure why is_td_created() isn't checked here.
> 
> A few helper functions and boolean variables are added to track which stage the
> TD or TD vCPU is in. e.g.,
> 
> is_hkid_assigned()
> is_td_finalized()
> is_td_created()
> tdx->initialized
> td_vcpu_created
> 
> Instead of doing this, I am wondering if adding two state machines, one for the
> TD and one for the TD vCPU, would make the implementation clearer and easier to extend.

Let me look into a state machine. Originally I hoped we wouldn't need one, but
this seems to deserve it.


> >+		return -EINVAL;
> >+
> >+	if (copy_from_user(&cmd, argp, sizeof(cmd)))
> >+		return -EFAULT;
> >+
> >+	if (cmd.error)
> >+		return -EINVAL;
> >+
> >+	/* Currently only KVM_TDX_INIT_VCPU is defined for vcpu operation. */
> >+	if (cmd.flags || cmd.id != KVM_TDX_INIT_VCPU)
> >+		return -EINVAL;
> 
> Even though KVM_TDX_INIT_VCPU is the only supported command, it is worthwhile to
> use a switch-case statement. New commands can be added easily without the need
> to refactor this function first.

Yes. For KVM_MAP_MEMORY, I will make KVM_TDX_INIT_MEM_REGION a vcpu ioctl instead
of a vm ioctl because that is consistent and scalable.  We'll have a switch
statement in the next respin.

> >+
> >+	/*
> >+	 * As TDX requires X2APIC, set local apic mode to X2APIC.  User space
> >+	 * VMM, e.g. qemu, is required to set CPUID[0x1].ecx.X2APIC=1 by
> >+	 * KVM_SET_CPUID2.  Otherwise kvm_set_apic_base() will fail.
> >+	 */
> >+	apic_base_msr = (struct msr_data) {
> >+		.host_initiated = true,
> >+		.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
> >+		(kvm_vcpu_is_reset_bsp(vcpu) ? MSR_IA32_APICBASE_BSP : 0),
> >+	};
> >+	if (kvm_set_apic_base(vcpu, &apic_base_msr))
> >+		return -EINVAL;
> 
> Exporting kvm_vcpu_is_reset_bsp() and kvm_set_apic_base() should be done
> here (rather than in a previous patch).

Sure.


> >+
> >+	ret = tdx_td_vcpu_init(vcpu, (u64)cmd.data);
> >+	if (ret)
> >+		return ret;
> >+
> >+	tdx->initialized = true;
> >+	return 0;
> >+}
> >+
> 
> >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >index c002761bb662..2bd4b7c8fa51 100644
> >--- a/arch/x86/kvm/x86.c
> >+++ b/arch/x86/kvm/x86.c
> >@@ -6274,6 +6274,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> > 	case KVM_SET_DEVICE_ATTR:
> > 		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
> > 		break;
> >+	case KVM_MEMORY_ENCRYPT_OP:
> >+		r = -ENOTTY;
> 
> Maybe -EINVAL is better, because previously trying to call this on a vCPU fd
> failed with -EINVAL given ...

Oh, ok. Will change it.  I followed the VM ioctl case for the default value, but
the vcpu ioctl seems to use -EINVAL as its default.

Edgecombe, Rick P March 27, 2024, 12:27 a.m. UTC | #3

On Mon, 2024-02-26 at 00:25 -0800, isaku.yamahata@intel.com wrote:
> +/* VMM can pass one 64bit auxiliary data to vcpu via RCX for guest BIOS. */
> +static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> +{
> +       struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +       struct vcpu_tdx *tdx = to_tdx(vcpu);
> +       unsigned long *tdvpx_pa = NULL;
> +       unsigned long tdvpr_pa;


I think we could drop these local variables and just use tdx->tdvpr_pa and tdx->tdvpx_pa. Then we
don't have to have the assignments later.

> +       unsigned long va;
> +       int ret, i;
> +       u64 err;
> +
> +       if (is_td_vcpu_created(tdx))
> +               return -EINVAL;
> +
> +       /*
> +        * vcpu_free method frees allocated pages.  Avoid a partial setup that
> +        * the method can't handle.
> +        */
> +       va = __get_free_page(GFP_KERNEL_ACCOUNT);
> +       if (!va)
> +               return -ENOMEM;
> +       tdvpr_pa = __pa(va);
> +
> +       tdvpx_pa = kcalloc(tdx_info->nr_tdvpx_pages, sizeof(*tdx->tdvpx_pa),
> +                          GFP_KERNEL_ACCOUNT);
> +       if (!tdvpx_pa) {
> +               ret = -ENOMEM;
> +               goto free_tdvpr;
> +       }
> +       for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> +               va = __get_free_page(GFP_KERNEL_ACCOUNT);
> +               if (!va) {
> +                       ret = -ENOMEM;
> +                       goto free_tdvpx;
> +               }
> +               tdvpx_pa[i] = __pa(va);
> +       }
> +
> +       err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
> +       if (KVM_BUG_ON(err, vcpu->kvm)) {
> +               ret = -EIO;
> +               pr_tdx_error(TDH_VP_CREATE, err, NULL);
> +               goto free_tdvpx;
> +       }
> +       tdx->tdvpr_pa = tdvpr_pa;
> +
> +       tdx->tdvpx_pa = tdvpx_pa;

Or alternatively let's move these to right before they are used. (in the current branch 

> +       for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> +               err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
> +               if (KVM_BUG_ON(err, vcpu->kvm)) {
> +                       pr_tdx_error(TDH_VP_ADDCX, err, NULL);
> +                       for (; i < tdx_info->nr_tdvpx_pages; i++) {
> +                               free_page((unsigned long)__va(tdvpx_pa[i]));
> +                               tdvpx_pa[i] = 0;
> +                       }
> +                       /* vcpu_free method frees TDVPX and TDR donated to TDX */
> +                       return -EIO;
> +               }
> +       }
> 
> 
In the current branch tdh_vp_init() takes struct vcpu_tdx, so they would be moved right here.

What do you think?

> +
> +       err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
> +       if (KVM_BUG_ON(err, vcpu->kvm)) {
> +               pr_tdx_error(TDH_VP_INIT, err, NULL);
> +               return -EIO;
> +       }
> +
> +       vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> +       tdx->td_vcpu_created = true;
> +       return 0;
> +
> +free_tdvpx:
> +       for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> +               if (tdvpx_pa[i])
> +                       free_page((unsigned long)__va(tdvpx_pa[i]));
> +               tdvpx_pa[i] = 0;
> +       }
> +       kfree(tdvpx_pa);
> +       tdx->tdvpx_pa = NULL;
> +free_tdvpr:
> +       if (tdvpr_pa)
> +               free_page((unsigned long)__va(tdvpr_pa));
> +       tdx->tdvpr_pa = 0;
> +
> +       return ret;
> +}

Isaku Yamahata March 27, 2024, 10:56 p.m. UTC | #4

On Wed, Mar 27, 2024 at 12:27:03AM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Mon, 2024-02-26 at 00:25 -0800, isaku.yamahata@intel.com wrote:
> > +/* VMM can pass one 64bit auxiliary data to vcpu via RCX for guest BIOS. */
> > +static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> > +{
> > +       struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > +       struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +       unsigned long *tdvpx_pa = NULL;
> > +       unsigned long tdvpr_pa;
> 
> 
> I think we could drop these local variables and just use tdx->tdvpr_pa and tdx->tdvpx_pa. Then we
> don't have to have the assignments later.

Yes, let me clean it up. The old version acquired a spin lock in the middle; the
current one doesn't.


> > +       unsigned long va;
> > +       int ret, i;
> > +       u64 err;
> > +
> > +       if (is_td_vcpu_created(tdx))
> > +               return -EINVAL;
> > +
> > +       /*
> > +        * vcpu_free method frees allocated pages.  Avoid a partial setup that
> > +        * the method can't handle.
> > +        */
> > +       va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > +       if (!va)
> > +               return -ENOMEM;
> > +       tdvpr_pa = __pa(va);
> > +
> > +       tdvpx_pa = kcalloc(tdx_info->nr_tdvpx_pages, sizeof(*tdx->tdvpx_pa),
> > +                          GFP_KERNEL_ACCOUNT);
> > +       if (!tdvpx_pa) {
> > +               ret = -ENOMEM;
> > +               goto free_tdvpr;
> > +       }
> > +       for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> > +               va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > +               if (!va) {
> > +                       ret = -ENOMEM;
> > +                       goto free_tdvpx;
> > +               }
> > +               tdvpx_pa[i] = __pa(va);
> > +       }
> > +
> > +       err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
> > +       if (KVM_BUG_ON(err, vcpu->kvm)) {
> > +               ret = -EIO;
> > +               pr_tdx_error(TDH_VP_CREATE, err, NULL);
> > +               goto free_tdvpx;
> > +       }
> > +       tdx->tdvpr_pa = tdvpr_pa;
> > +
> > +       tdx->tdvpx_pa = tdvpx_pa;
> 
> Or alternatively let's move these to right before they are used. (in the current branch 
> 
> > +       for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
> > +               err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
> > +               if (KVM_BUG_ON(err, vcpu->kvm)) {
> > +                       pr_tdx_error(TDH_VP_ADDCX, err, NULL);
> > +                       for (; i < tdx_info->nr_tdvpx_pages; i++) {
> > +                               free_page((unsigned long)__va(tdvpx_pa[i]));
> > +                               tdvpx_pa[i] = 0;
> > +                       }
> > +                       /* vcpu_free method frees TDVPX and TDR donated to TDX */
> > +                       return -EIO;
> > +               }
> > +       }
> > 
> > 
> In the current branch tdh_vp_init() takes struct vcpu_tdx, so they would be moved right here.
> 
> What do you think?

Yes, I should revise the error recovery path.

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index f78200492a3d..a8e96804a252 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -129,6 +129,7 @@  KVM_X86_OP(leave_smm)
 KVM_X86_OP(enable_smi_window)
 #endif
 KVM_X86_OP(mem_enc_ioctl)
+KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl)
 KVM_X86_OP_OPTIONAL(mem_enc_register_region)
 KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
 KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0e2408a4707e..5da3c211955d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1778,6 +1778,7 @@  struct kvm_x86_ops {
 #endif
 
 	int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
+	int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
 	int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 9ac0246bd974..4000a2e087a8 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -571,6 +571,7 @@  struct kvm_pmu_event_filter {
 enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
+	KVM_TDX_INIT_VCPU,
 
 	KVM_TDX_CMD_NR_MAX,
 };
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 5796fb45433f..d0f75020579f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -131,6 +131,14 @@  static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 	return tdx_vm_ioctl(kvm, argp);
 }
 
+static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+	if (!is_td_vcpu(vcpu))
+		return -EINVAL;
+
+	return tdx_vcpu_ioctl(vcpu, argp);
+}
+
 #define VMX_REQUIRED_APICV_INHIBITS				\
 	(BIT(APICV_INHIBIT_REASON_DISABLE)|			\
 	 BIT(APICV_INHIBIT_REASON_ABSENT) |			\
@@ -291,6 +299,7 @@  struct kvm_x86_ops vt_x86_ops __initdata = {
 	.get_untagged_addr = vmx_get_untagged_addr,
 
 	.mem_enc_ioctl = vt_mem_enc_ioctl,
+	.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 51283d2cd011..aa1da51b8af7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -49,6 +49,7 @@  struct tdx_info {
 	u64 xfam_fixed1;
 
 	u8 nr_tdcs_pages;
+	u8 nr_tdvpx_pages;
 
 	u16 num_cpuid_config;
 	/* This must the last member. */
@@ -104,6 +105,11 @@  static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
 	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
 }
 
+static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
+{
+	return tdx->td_vcpu_created;
+}
+
 static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
 {
 	return kvm_tdx->tdr_pa;
@@ -121,6 +127,11 @@  static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
 	return kvm_tdx->hkid > 0;
 }
 
+static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
+{
+	return kvm_tdx->finalized;
+}
+
 static void tdx_clear_page(unsigned long page_pa)
 {
 	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -399,7 +410,32 @@  int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 
 void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 {
-	/* This is stub for now.  More logic will come. */
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	int i;
+
+	/*
+	 * This method can be called when vcpu allocation/initialization
+	 * failed, so it's possible that hkid, tdvpx and tdvpr are not
+	 * assigned yet.
+	 */
+	if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm))) {
+		WARN_ON_ONCE(tdx->tdvpx_pa);
+		WARN_ON_ONCE(tdx->tdvpr_pa);
+		return;
+	}
+
+	if (tdx->tdvpx_pa) {
+		for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
+			if (tdx->tdvpx_pa[i])
+				tdx_reclaim_control_page(tdx->tdvpx_pa[i]);
+		}
+		kfree(tdx->tdvpx_pa);
+		tdx->tdvpx_pa = NULL;
+	}
+	if (tdx->tdvpr_pa) {
+		tdx_reclaim_control_page(tdx->tdvpr_pa);
+		tdx->tdvpr_pa = 0;
+	}
 }
 
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -408,8 +444,13 @@  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	/* Ignore INIT silently because TDX doesn't support INIT event. */
 	if (init_event)
 		return;
+	if (KVM_BUG_ON(is_td_vcpu_created(to_tdx(vcpu)), vcpu->kvm))
+		return;
 
-	/* This is stub for now. More logic will come here. */
+	/*
+	 * Don't update mp_state to runnable because more initialization
+	 * is needed by KVM_TDX_INIT_VCPU.
+	 */
 }
 
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
@@ -904,6 +945,137 @@  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	return r;
 }
 
+/* VMM can pass one 64-bit auxiliary value to a vcpu via RCX for guest BIOS. */
+static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	unsigned long *tdvpx_pa = NULL;
+	unsigned long tdvpr_pa;
+	unsigned long va;
+	int ret, i;
+	u64 err;
+
+	if (is_td_vcpu_created(tdx))
+		return -EINVAL;
+
+	/*
+	 * vcpu_free method frees allocated pages.  Avoid partial setup
+	 * that the method can't handle.
+	 */
+	va = __get_free_page(GFP_KERNEL_ACCOUNT);
+	if (!va)
+		return -ENOMEM;
+	tdvpr_pa = __pa(va);
+
+	tdvpx_pa = kcalloc(tdx_info->nr_tdvpx_pages, sizeof(*tdx->tdvpx_pa),
+			   GFP_KERNEL_ACCOUNT);
+	if (!tdvpx_pa) {
+		ret = -ENOMEM;
+		goto free_tdvpr;
+	}
+	for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
+		va = __get_free_page(GFP_KERNEL_ACCOUNT);
+		if (!va) {
+			ret = -ENOMEM;
+			goto free_tdvpx;
+		}
+		tdvpx_pa[i] = __pa(va);
+	}
+
+	err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
+	if (KVM_BUG_ON(err, vcpu->kvm)) {
+		ret = -EIO;
+		pr_tdx_error(TDH_VP_CREATE, err, NULL);
+		goto free_tdvpx;
+	}
+	tdx->tdvpr_pa = tdvpr_pa;
+
+	tdx->tdvpx_pa = tdvpx_pa;
+	for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
+		err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
+		if (KVM_BUG_ON(err, vcpu->kvm)) {
+			pr_tdx_error(TDH_VP_ADDCX, err, NULL);
+			for (; i < tdx_info->nr_tdvpx_pages; i++) {
+				free_page((unsigned long)__va(tdvpx_pa[i]));
+				tdvpx_pa[i] = 0;
+			}
+			/* vcpu_free frees TDVPR and TDVPX donated to TDX */
+			return -EIO;
+		}
+	}
+
+	err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
+	if (KVM_BUG_ON(err, vcpu->kvm)) {
+		pr_tdx_error(TDH_VP_INIT, err, NULL);
+		return -EIO;
+	}
+
+	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+	tdx->td_vcpu_created = true;
+	return 0;
+
+free_tdvpx:
+	for (i = 0; i < tdx_info->nr_tdvpx_pages; i++) {
+		if (tdvpx_pa[i])
+			free_page((unsigned long)__va(tdvpx_pa[i]));
+		tdvpx_pa[i] = 0;
+	}
+	kfree(tdvpx_pa);
+	tdx->tdvpx_pa = NULL;
+free_tdvpr:
+	if (tdvpr_pa)
+		free_page((unsigned long)__va(tdvpr_pa));
+	tdx->tdvpr_pa = 0;
+
+	return ret;
+}
+
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+	struct msr_data apic_base_msr;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	struct kvm_tdx_cmd cmd;
+	int ret;
+
+	if (tdx->initialized)
+		return -EINVAL;
+
+	if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
+		return -EINVAL;
+
+	if (copy_from_user(&cmd, argp, sizeof(cmd)))
+		return -EFAULT;
+
+	if (cmd.error)
+		return -EINVAL;
+
+	/* Currently only KVM_TDX_INIT_VCPU is defined for vcpu operation. */
+	if (cmd.flags || cmd.id != KVM_TDX_INIT_VCPU)
+		return -EINVAL;
+
+	/*
+	 * As TDX requires X2APIC, set local apic mode to X2APIC.  User space
+	 * VMM, e.g. qemu, is required to set CPUID[0x1].ecx.X2APIC=1 by
+	 * KVM_SET_CPUID2.  Otherwise kvm_set_apic_base() will fail.
+	 */
+	apic_base_msr = (struct msr_data) {
+		.host_initiated = true,
+		.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
+		(kvm_vcpu_is_reset_bsp(vcpu) ? MSR_IA32_APICBASE_BSP : 0),
+	};
+	if (kvm_set_apic_base(vcpu, &apic_base_msr))
+		return -EINVAL;
+
+	ret = tdx_td_vcpu_init(vcpu, (u64)cmd.data);
+	if (ret)
+		return ret;
+
+	tdx->initialized = true;
+	return 0;
+}
+
 #define TDX_MD_MAP(_fid, _ptr)			\
 	{ .fid = MD_FIELD_ID_##_fid,		\
 	  .ptr = (_ptr), }
@@ -953,13 +1125,14 @@  static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)
 
 static int __init tdx_module_setup(void)
 {
-	u16 num_cpuid_config, tdcs_base_size;
+	u16 num_cpuid_config, tdcs_base_size, tdvps_base_size;
 	int ret;
 	u32 i;
 
 	struct tdx_md_map mds[] = {
 		TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
 		TDX_MD_MAP(TDCS_BASE_SIZE, &tdcs_base_size),
+		TDX_MD_MAP(TDVPS_BASE_SIZE, &tdvps_base_size),
 	};
 
 	struct tdx_metadata_field_mapping fields[] = {
@@ -1013,6 +1186,11 @@  static int __init tdx_module_setup(void)
 	}
 
 	tdx_info->nr_tdcs_pages = tdcs_base_size / PAGE_SIZE;
+	/*
+	 * TDVPS = TDVPR(4K page) + TDVPX(multiple 4K pages).
+	 * -1 for TDVPR.
+	 */
+	tdx_info->nr_tdvpx_pages = tdvps_base_size / PAGE_SIZE - 1;
 
 	/*
 	 * Make TDH.VP.ENTER preserve RBP so that the stack unwinder
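
The TDVPX page count computed in tdx_module_setup() follows from the TDVPS layout described in the comment: one 4K TDVPR page plus the remaining 4K TDVPX pages. A minimal standalone sketch of that arithmetic, assuming 4 KiB pages. SKETCH_PAGE_SIZE and nr_tdvpx_pages_for() are hypothetical names used only for illustration, and the zero guard is an addition that the patch itself does not have:

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, matching the TDVPS comment in the patch. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical helper mirroring the patch's arithmetic:
 * TDVPS = TDVPR (one page) + TDVPX (the rest), so subtract one page.
 * Returns 0 when the size does not even cover TDVPR; this guard is
 * not part of the patch and is only here to avoid wrap-around.
 */
static inline uint8_t nr_tdvpx_pages_for(uint16_t tdvps_base_size)
{
	uint16_t pages = tdvps_base_size / SKETCH_PAGE_SIZE;

	return pages ? (uint8_t)(pages - 1) : 0;
}
```

With a TDVPS size of six pages the helper yields five TDVPX pages. The sizes used here are illustrative, not values reported by any particular TDX module.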
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 173ed19207fb..d3077151252c 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -17,12 +17,20 @@  struct kvm_tdx {
 	u64 xfam;
 	int hkid;
 
+	bool finalized;
+
 	u64 tsc_offset;
 };
 
 struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
 
+	unsigned long tdvpr_pa;
+	unsigned long *tdvpx_pa;
+	bool td_vcpu_created;
+
+	bool initialized;
+
 	/*
 	 * Dummy to make pmu_intel not corrupt memory.
 	 * TODO: Support PMU for TDX.  Future work.
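
The new tdvpr_pa and tdvpx_pa fields are only populated once every page has been allocated, so that tdx_vcpu_free() never sees a half-built vcpu. A rough userspace model of that allocate-then-publish pattern with goto-based unwinding is sketched below. Everything in it (struct vcpu_sketch, sketch_alloc(), the fail_at hook) is hypothetical and uses calloc() in place of kernel page allocation; it does not model the SEAMCALLs:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the vcpu_tdx fields added by the patch. */
struct vcpu_sketch {
	void *tdvpr;	/* models tdvpr_pa */
	void **tdvpx;	/* models tdvpx_pa[] */
};

/* Test hook (assumption): fail the Nth allocation; 0 disables it. */
static int fail_at;
static int alloc_count;

static void *sketch_alloc(size_t n, size_t size)
{
	if (fail_at && ++alloc_count == fail_at)
		return NULL;
	return calloc(n, size);
}

/*
 * Allocate everything first, publish into *v only on full success,
 * and unwind on failure.  Assumes nr_tdvpx >= 1.
 */
static int vcpu_sketch_alloc(struct vcpu_sketch *v, int nr_tdvpx)
{
	void *tdvpr, **tdvpx;
	int i;

	tdvpr = sketch_alloc(1, 4096);
	if (!tdvpr)
		return -ENOMEM;

	tdvpx = sketch_alloc(nr_tdvpx, sizeof(*tdvpx));
	if (!tdvpx)
		goto free_tdvpr;

	for (i = 0; i < nr_tdvpx; i++) {
		tdvpx[i] = sketch_alloc(1, 4096);
		if (!tdvpx[i])
			goto free_tdvpx;
	}

	/* Nothing is visible to the caller until this point. */
	v->tdvpr = tdvpr;
	v->tdvpx = tdvpx;
	return 0;

free_tdvpx:
	for (i = 0; i < nr_tdvpx; i++)
		free(tdvpx[i]);	/* free(NULL) is a no-op */
	free(tdvpx);
free_tdvpr:
	free(tdvpr);
	return -ENOMEM;
}
```

In the patch the picture changes after TDH.VP.CREATE succeeds: pages handed to the TDX module can no longer be freed directly and are left for tdx_vcpu_free() to reclaim, which this model deliberately leaves out.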
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index bb73a9b5b354..f5820f617b2e 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -150,6 +150,8 @@  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return -EOPNOTSUPP; }
 static inline void tdx_hardware_unsetup(void) {}
@@ -169,6 +171,8 @@  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
 static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+
+static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c002761bb662..2bd4b7c8fa51 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6274,6 +6274,12 @@  long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_SET_DEVICE_ATTR:
 		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
 		break;
+	case KVM_MEMORY_ENCRYPT_OP:
+		r = -ENOTTY;
+		if (!kvm_x86_ops.vcpu_mem_enc_ioctl)
+			goto out;
+		r = kvm_x86_ops.vcpu_mem_enc_ioctl(vcpu, argp);
+		break;
 	default:
 		r = -EINVAL;
 	}
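
For userspace, the checks in tdx_vcpu_ioctl() imply that KVM_TDX_INIT_VCPU must be issued after the TD has an HKID and before it is finalized, at most once per vcpu, with the flags and error fields zeroed. The sketch below models only that validation order as a pure function. The struct layouts and SK_-prefixed names are hypothetical; the patch shows the field names id, flags, data and error but not their widths, which are assumed here:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout; only the field names are taken from the patch. */
struct tdx_cmd_sketch {
	uint32_t id;
	uint32_t flags;
	uint64_t data;	/* passed to the vcpu via RCX for INIT_VCPU */
	uint64_t error;
};

/* Same ordering as enum kvm_tdx_cmd_id in the uapi hunk. */
enum { SK_TDX_CAPABILITIES = 0, SK_TDX_INIT_VM, SK_TDX_INIT_VCPU };

/* Hypothetical snapshot of the state tdx_vcpu_ioctl() inspects. */
struct td_state_sketch {
	bool vcpu_initialized;	/* models vcpu_tdx.initialized */
	bool hkid_assigned;	/* models is_hkid_assigned() */
	bool finalized;		/* models kvm_tdx.finalized */
};

/* Returns 0 if the command would pass the patch's up-front checks. */
static int check_vcpu_cmd(const struct td_state_sketch *s,
			  const struct tdx_cmd_sketch *cmd)
{
	if (s->vcpu_initialized)
		return -EINVAL;
	if (!s->hkid_assigned || s->finalized)
		return -EINVAL;
	if (cmd->error)
		return -EINVAL;
	if (cmd->flags || cmd->id != SK_TDX_INIT_VCPU)
		return -EINVAL;
	return 0;
}
```

A real caller would issue the command through the KVM_MEMORY_ENCRYPT_OP ioctl on the vcpu file descriptor, which this patch wires up in kvm_arch_vcpu_ioctl(). The x2APIC setup and the SEAMCALLs that follow these checks are not modeled.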
